Get rid of the muff: pre-processing PDPC Decisions

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

The life of a budding data science enthusiast.

You need data to work on, so you look all around you for something that has meaning to you. Anyone who has read the last 10 posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC’s enforcement cases. Fortunately for me, the listing of published cases is complete, and they are not exactly hindered by things like stare decisis. We begin by scraping the website.

The Problem

Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.

It’s right there. They are PDFs. Notwithstanding the name, PDFs do not actually represent text well. They represent pages or documents, which means you get what you see, not what you read.

I used PDFminer to extract the text from the PDFs.
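Getting the text out is the easy part. Here is a minimal sketch of the extraction, assuming the pdfminer.six package (the file name is just a placeholder):

    from pdfminer.high_level import extract_text

    # Pull the raw text out of one decision PDF as a single string.
    # "decision.pdf" stands in for any file downloaded by the scraper.
    text = extract_text("decision.pdf")
    print(text)

This is a sample of the output I get: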

 The operator, in the mistaken belief that the three statements belonged
 to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
 moved it to the main bin. Further, the operator completed the QC form
 in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
 Page 3 of 7
 
 B.
 (i)
 (ii)
 9.
 (iii)
 envelopes tallied with the expected total from the run. As the envelope
 was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
 were by-passed, and the envelope was sent out without anyone realising
 that it contained two extra statements. The manual completion of the QC
 form by the operator to show that the number of successful and rejected
 envelopes tallied allowed this to go undetected. 

Notice the following:

  • There are line breaks in the middle of sentences. These are where the lines broke in the original document. The computer would read “The operator, in the mistaken belief that the three statements belonged” and then go “What? What happened?”
  • Page footers and headers appear in the document. They make sense when you are viewing a page, but are meaningless in plain text.
  • Orphan bullet and paragraph numbers. They used to belong to some text, but now nobody knows which. Table contents are also seriously borked.

If you fed this to your computer for training, you would get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix these errors ourselves before letting the computer do more with the text.

I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time-consuming. (Duh!) My computer had to do the work for me. Somehow!
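To give a flavour of what that automation looks like, here is a rough sketch (not the actual pipeline) of the kind of rule-based clean-up the list above suggests. The function name and the regular expressions are mine, for illustration only:

    import re

    def clean_extracted_text(raw: str) -> str:
        kept = []
        for line in raw.splitlines():
            stripped = line.strip()
            # Drop page footers such as "Page 3 of 7".
            if re.fullmatch(r"Page \d+ of \d+", stripped):
                continue
            # Drop orphan paragraph and bullet markers such as "B.", "9." or "(iii)".
            if re.fullmatch(r"\(?[ivx]+\)|[A-Z]\.|\d{1,3}\.", stripped):
                continue
            if stripped:
                kept.append(stripped)
        # Join the surviving lines so sentences are no longer broken mid-way,
        # and squeeze the double spaces the extraction leaves between words.
        return re.sub(r"\s{2,}", " ", " ".join(kept))

A real clean-up has to go further than this (the borked tables, for one, need special handling), but this is the general idea.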
