Mining PDFs to obtain better text from Decisions
After several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner.
This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural languge processing, data extraction and processing!
This post is the latest one dealing with creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. If I don’t turn out to be an ardent perfectionist, I might have a working corpus.
I must have written this a dozen of times here already. Law reports in Singapore unfortunately are generally available in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. You have to work with PDFs to create a solution that is probably not burdened with copyrights.
However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer. Thus a PDF comprises of a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor is merely reading the text boxes, the results will not look pretty. See an example below:
The operator, in the mistaken belief that the three statements belonged to the same individual, removed the envelope from the reject bin and moved it to the main bin. Further, the operator completed the QC form in a way that showed that the number of “successful” and rejected Page 3 of 7 B. (i) (ii) 9. (iii) envelopes tallied with the expected total from the run. As the envelope was no longer in the reject bin, the second and third layers of checks were by-passed, and the envelope was sent out without anyone realising that it contained two extra statements. The manual completion of the QC form by the operator to show that the number of successful and rejected envelopes tallied allowed this to go undetected.
Those lines are broken up with new lines by the way. So the computer reads the line
in a way that showed that the number of “successful” and rejected, and then wonders “rejected” what?! The rest of the story continues about seven lines away. None of this makes sense to a human, let alone a computer.
Previous workarounds were... unimpressive
Most beginning data science books advise programmers to using regular expressions as a way to clean the text. This allowed me to pick what to keep and what to reject in the text. I then joined up the lines to form a paragraph. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.
As mentioned in that post, I was very bothered with removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack a mole figuring out which regular expression to use to remove a line. The results were not good enough for me, as several errors continued to persist in the text.
Not wanting to play whack a mole, I decided to train a model to read lines and make a guess as to what lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.
However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting results I would have obtained if I worked harder at regular expressions. This method was still not satisfactory for me.
I needed a new approach.
A breakthrough — focus on the layout rather than the content
I was browsing Github issues on PDFMiner when it hit me like a brick. The issue author had asked how to get the font data of a text on PDFMiner.
I then realised that I had another option other than removing text more effectively or efficiently.