Get rid of the muff: pre-processing PDPC Decisions

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

The life of a budding data science enthusiast.

You need data to work on, so you look all around you for something that has meaning to you. Anyone who has read the latest ten posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC’s enforcement cases. Fortunately for me, the listing of published cases is complete, and they are not exactly hindered by things like stare decisis. We begin by scraping the website.

The Problem

Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.

It’s right there. They are PDFs. Notwithstanding the name, PDFs do not actually represent text well. They represent pages or documents, which means you get what you see, not what you read.

I used PDFMiner to extract the text from the PDFs, and this is a sample of the output I got:

The operator, in the mistaken belief that the three statements belonged to the same individual, removed the envelope from the reject bin and moved it to the main bin. Further, the operator completed the QC form in a way that showed that the number of “successful” and rejected Page 3 of 7 B. (i) (ii) 9. (iii) envelopes tallied with the expected total from the run. As the envelope was no longer in the reject bin, the second and third layers of checks were by-passed, and the envelope was sent out without anyone realising that it contained two extra statements. The manual completion of the QC form by the operator to show that the number of successful and rejected envelopes tallied allowed this to go undetected.
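Before going into what is wrong with it, here is roughly how text like this gets extracted in the first place. Take this as a minimal sketch assuming the pdfminer.six high-level API; the exact invocation may differ, and the file name is just a placeholder.

from pdfminer.high_level import extract_text

# "decision.pdf" stands in for one of the scraped PDFs.
raw_text = extract_text("decision.pdf")

# The clean-up functions later in this post work on the text line by line.
source = raw_text.split("\n")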

Looking at the raw output, notice the following:

- The page header (“Page 3 of 7”) is embedded in the middle of a sentence.
- Stray paragraph and sub-paragraph markers (“B.”, “(i)”, “(ii)”, “9.”, “(iii)”) are jumbled together out of place.
- The line breaks and paragraph structure of the original document are lost.

If you fed this to your computer for training, you are going to get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix the errors ourselves before letting the computer do more with it.

I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time-consuming. (Duh!) My computer had to do the work for me. Somehow!

The Preprocessing Solution

Although I decided I would let the computer do the correcting for me, I still had to do some work on my own. Instead of looking for errors, this time I was looking for patterns. This involved comparing the output of the PDF-to-text converter with the actual document. Once I figured out how these errors came about, I could get down to fixing them.

Deciding what to take out, what to leave behind

Not so fast. Unfortunately, correcting errors was not the only decision I had to make. Like many legal reports, PDPC decisions have paragraph numbers. These numbers are important: they are used for cross-referencing. In the raw output, the paragraph numbers are present as plain numbers in the text. They may be useful to a human reader who knows they are meta-information, but to a computer they are probably just noise.

I decided to take them out. I don’t doubt that one day we can engineer a solution that makes them useful, but for now, they are just distractions.
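As a rough sketch of what that removal can look like, assuming the paragraph numbers show up as a bare number (with an optional trailing period) on a line of their own; the function name and pattern here are mine, and the actual code in pdpc-decisions may differ:

import re

def remove_paragraph_numbers(source):
    # Drop lines that consist of nothing but a paragraph number like "9" or "12.".
    return [x for x in source if not re.search(r'^\s*\d+\.?\s*$', x)]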

Put Regular Expressions to work!

As mentioned earlier, I looked for patterns in the text so that the computer could find and correct the errors. I found regular expressions to be a very powerful way to express such patterns in a form the computer can search for. A regular expression is sort of like a mini-language for defining a search pattern.

For example, the code below looks for form feeds in the text (a form feed marks a page break; it is the character equivalent of feeding a fresh sheet of paper into a typewriter):

import re

def removefeedcarriage(source):
    # Keep only lines that do not contain a form feed (\f) character.
    return [x for x in source if not re.search(r'\f', x)]

This Python code tells the computer not to include any line that contains a form feed. This eliminates the multiple blank lines created by the PDF converter (usually blank space in the PDF).
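To make the expected input concrete: the function works on the text as a list of lines, so a toy run with made-up lines looks something like this:

lines = ["removed the envelope from the reject bin", "\f", "and moved it to the main bin."]
print(removefeedcarriage(lines))
# ['removed the envelope from the reject bin', 'and moved it to the main bin.']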

A well-crafted regular expression can find a lot of things. For example, the expression below looks for citations (“[2016] SGPDPC 15” and the like) and removes them.

def remove_citations(source):
    # Keep only lines that are not bare citations like "[2016] SGPDPC 15".
    return [x for x in source if not re.search(r'^\[\d{4}\]\s+(?:\d\s+)?[A-Z|()]+\s+\d+[\s.]?$', x)]

Figuring out the “language” of regular expressions does take some time, but it pays dividends. To help the journey along, I test my expressions using freely available websites that let you try out patterns and provide a reference for regular expressions. For Python, this is one of the sites I used.
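Besides the websites, a pattern can also be sanity-checked straight in a Python shell before it goes into the pipeline. A quick check of the citation pattern, with made-up sample strings, might look like this:

import re

citation = re.compile(r'^\[\d{4}\]\s+(?:\d\s+)?[A-Z|()]+\s+\d+[\s.]?$')

print(bool(citation.search("[2016] SGPDPC 15")))          # True: this line would be removed
print(bool(citation.search("The operator removed it.")))  # False: this line is kept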

Getting some results!

Besides removing citations, form feeds and paragraph numbers, I also tried to join broken sentences together. In all, the code manages to remove around 90% of the extra line breaks. Most of the paragraphs in the text now read like sentences, and I feel much more confident training a model on this text.
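The joining step is not shown here, but one simple heuristic is to merge a line into the previous one whenever that previous line does not end with sentence-ending punctuation. A rough sketch of that idea (not necessarily the exact code in pdpc-decisions):

def join_broken_lines(source):
    # Merge a line into the previous one if the previous line does not
    # end with sentence-ending punctuation.
    joined = []
    for line in source:
        stripped = line.strip()
        if not stripped:
            continue
        if joined and not joined[-1].endswith(('.', '!', '?', ':')):
            joined[-1] = joined[-1] + ' ' + stripped
        else:
            joined.append(stripped)
    return joined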

It ain’t perfect of course. The text gets really mushed up once a table is involved, and the headers and footers are not exactly removed all the time. But as Voltaire said, “Perfect is the enemy of the good”. For now, this will do.

Concluding remarks

Hopefully this quick rundown of how I pre-processed the pdpc-decisions will give you some idea of what to do in your own projects. Now that I have the text, I have got to find something to use it for! :O Are there any other ways to improve the code to catch even more errors? Feel free to comment and let me know!

GitHub: houfu/pdpc-decisions – Data Protection Enforcement Cases in Singapore.

#PDPC-Decisions #PDFMiner #Python #Programming

Love.Law.Robots. – A blog by Ang Hou Fu