Get rid of the muff: pre-processing PDPC Decisions

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!
The life of a budding data science enthusiast.
You need data to work on, so you look all around you for something that has meaning to you. Anyone who reads the latest 10 posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC’s enforcement cases. Fortunately for me, the listing of published cases is complete, and the cases are not exactly hindered by things like stare decisis. We begin by scraping the website.
The Problem
Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.
It’s right there. They are PDFs. Notwithstanding their name, PDFs do not actually represent text well. They represent pages or documents, which means you get what you see and not what you read.
I used PDFminer to extract the text from the PDF, and this is a sample output that I get:
The operator, in the mistaken belief that the three statements belonged
to the same individual, removed the envelope from the reject bin and
moved it to the main bin. Further, the operator completed the QC form
in a way that showed that the number of “successful” and rejected
Page 3 of 7
B.
(i)
(ii)
9.
(iii)
envelopes tallied with the expected total from the run. As the envelope
was no longer in the reject bin, the second and third layers of checks
were by-passed, and the envelope was sent out without anyone realising
that it contained two extra statements. The manual completion of the QC
form by the operator to show that the number of successful and rejected
envelopes tallied allowed this to go undetected.
Notice the following:
- There are line breaks in the middle of sentences. These are the points where a sentence wrapped onto a new line in the document. The computer would read “The operator, in the mistaken belief that the three statements belonged” and then go “What? What happened?”
- Page footers and headers appear in the document. They make sense when you are viewing a document, but are meaningless in a text.
- Orphan bullet and paragraph numbers. They used to belong to some text, but now nobody knows which. Table contents are also seriously borked.
If you feed text like this to your computer for training, you are going to get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix the errors ourselves before letting the computer do more with it.
I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time-consuming. (Duh!) My computer had to do the work for me. Somehow!
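To give a rough idea of what "the computer doing the work" could look like, here is a minimal sketch in Python using only the standard library. It is not the method used in this series, just an illustration of the three problems listed above. The function name clean_extracted_text and both regular expressions are hypothetical, and the patterns are tuned only to the sample output shown earlier, so they would probably need adjusting for other decisions.

```python
import re

# Assumption: page footers always look like "Page 3 of 7" on their own line.
PAGE_FOOTER = re.compile(r"^Page \d+ of \d+$")

# Assumption: an orphan marker is a line holding nothing but a list label,
# such as "B.", "9.", "(i)" or "(iii)".
ORPHAN_MARKER = re.compile(r"^(?:\([a-z]+\)|[A-Za-z]\.|\d+\.)$")


def clean_extracted_text(raw: str) -> str:
    """Drop footers and orphan markers, then rejoin wrapped lines."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if PAGE_FOOTER.match(stripped) or ORPHAN_MARKER.match(stripped):
            continue  # skip layout debris
        kept.append(stripped)
    # Joining with a space undoes the mid-sentence line breaks.
    return " ".join(kept)


sample = """in a way that showed that the number of “successful” and rejected
Page 3 of 7
B.
(i)
(ii)
9.
(iii)
envelopes tallied with the expected total from the run."""

print(clean_extracted_text(sample))
```

Run on that fragment of the sample, the footer and the five orphan markers disappear and the two halves of the sentence are stitched back together. The sketch has obvious limits: it flattens everything into a single paragraph, it would also delete a genuine line that happens to consist only of something like "A.", and it does nothing for the broken tables.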