
Mining PDFs to obtain better text from Decisions

After several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner.
This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

This post is the latest one dealing with creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. If I don’t turn out to be an ardent perfectionist, I might have a working corpus.

Problem Statement

I must have written this a dozen times here already. Law reports in Singapore are, unfortunately, generally available only in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. You have to work with PDFs to create a solution that is probably not burdened by copyright.

However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer. Thus a PDF comprises a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor merely reads the text boxes in the order it finds them, the results will not look pretty. See an example below:

The operator, in the mistaken belief that the three statements belonged
 to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
 moved it to the main bin. Further, the operator completed the QC form
 in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
 Page 3 of 7
 
 B.
 (i)
 (ii)
 9.
 (iii)
 envelopes tallied with the expected total from the run. As the envelope
 was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
 were by-passed, and the envelope was sent out without anyone realising
 that it contained two extra statements. The manual completion of the QC
 form by the operator to show that the number of successful and rejected
 envelopes tallied allowed this to go undetected. 

Those lines are broken up with new lines, by the way. So the computer reads the line “in a way that showed that the number of ‘successful’ and rejected”, and then wonders: “rejected” what?! The rest of the sentence continues about seven lines away. None of this makes sense to a human, let alone a computer.

Previous workarounds were... unimpressive

Most beginner data science books advise programmers to use regular expressions to clean the text. This allowed me to pick what to keep and what to reject in the text. I then joined up the lines to form a paragraph. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.
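To make the regular expression approach concrete, here is a minimal sketch of the idea. The patterns and the function name `clean_lines` are illustrative assumptions based on the sample extract above, not the code from the earlier post.

```python
import re

# Hypothetical patterns for lines that are layout noise rather than content.
NOISE_PATTERNS = [
    re.compile(r"^Page \d+ of \d+$"),       # page footers such as "Page 3 of 7"
    re.compile(r"^\([ivxlc]+\)$", re.I),    # stray list markers such as "(i)", "(ii)"
    re.compile(r"^[A-Z]\.$"),               # stray section letters such as "B."
    re.compile(r"^\d+\.$"),                 # stray paragraph numbers such as "9."
]


def clean_lines(raw_text):
    """Drop noise lines, collapse repeated spaces and join the rest into one paragraph."""
    kept = []
    for line in raw_text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue  # blank line left over from the layout
        if any(pattern.match(line) for pattern in NOISE_PATTERNS):
            continue  # matches one of the noise patterns
        kept.append(line)
    return " ".join(kept)
```

Note that the last pattern throws away the paragraph number along with the noise, and every new kind of stray line needs a new pattern.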

As mentioned in that post, I was very bothered about removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack-a-mole, figuring out which regular expression to use to remove a line. The results were not good enough for me, as several errors persisted in the text.

Not wanting to play whack-a-mole, I decided to train a model to read lines and guess which lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.

However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting the results I would have obtained if I had worked harder at regular expressions. This method was still not satisfactory for me.

I needed a new approach.

A breakthrough — focus on the layout rather than the content

I was browsing GitHub issues on PDFMiner when it hit me like a brick. The issue author had asked how to get the font data of a piece of text in PDFMiner.

pdfminer/pdfminer.six
Community maintained fork of pdfminer - we fathom PDF - pdfminer/pdfminer.six

I then realised that I had another option other than removing text more effectively or efficiently.
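As a rough illustration of what font data looks like in pdfminer.six, the sketch below walks the layout tree and reports the dominant font name and size of each text line. It assumes pdfminer.six is installed. The function names `dominant_font` and `lines_with_fonts` are my own for illustration, and this is not necessarily the approach the rest of this post takes.

```python
from collections import Counter


def dominant_font(fonts):
    """Return the most common (fontname, size) pair in a list of such pairs."""
    return Counter(fonts).most_common(1)[0][0]


def lines_with_fonts(pdf_path):
    """Yield (text, (fontname, size)) for every text line in the PDF."""
    # Imported here so the helper above works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, rules and other non-text objects
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # Each LTChar carries the font name and size it was drawn with.
                fonts = [
                    (char.fontname, round(char.size, 1))
                    for char in line
                    if isinstance(char, LTChar)
                ]
                if fonts:
                    yield line.get_text().strip(), dominant_font(fonts)
```

Calling `lines_with_fonts("decision.pdf")` (a hypothetical file name) and printing each pair should show whether body text, headers and page footers are set in different fonts or sizes. If they are, that offers a way to tell them apart without looking at the words at all.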
