Love.Law.Robots. by Ang Hou Fu

PDFMiner


Introduction

Over the course of 2019 and 2020, I embarked on a quest to apply the new things I was learning in data science to my field of work in law.

The dataset I chose was the enforcement decisions of the Personal Data Protection Commission (PDPC) in Singapore. The reason was quite simple: I wanted a small dataset that covered a limited number of issues and was largely self-contained (not affected by stare decisis or extensive references to legislation or other cases). Furthermore, during that period, the PDPC was issuing decisions at a furious pace.

This experiment proved to be largely successful, and I learned a lot from the experience. This post gathers everything I wrote on the subject at the time. Afterwards, I felt confident enough to move on to more complicated datasets like Supreme Court decisions, which feature several of the same problems found in the PDPC dataset.

Since then, the dataset has changed a lot (for one, the website has changed), so your extraction methods would have to be different. I haven't really maintained the code, so it isn't suitable for building your own dataset and analysis today. However, the techniques are still relevant, and I hope they point you in a good direction.

Extracting Judgement Data


The first step in any data science journey is to extract data from a source. In Singapore, judgements from the courts are available for free on various websites, and you can use these websites as the source of your data. API access is usually unavailable, so you have to work with the web pages themselves to get your data.

It's possible to download everything by clicking through the pages manually. However, you wouldn't want to keep doing that for an extended period of time. Automate the process by scraping it!

Automate Boring Stuff: Get Python and your Web Browser to download your judgements

I used Python and Selenium to access the website and download the data I wanted. This included the actual judgements. Metadata, such as the hearing date, is also conveniently available from the website, so you should grab it at the same time. In Automate Boring Stuff, I discussed my ideas on how to obtain such data.
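
To give a flavour of the approach, here is a minimal sketch using Selenium 4 and the requests library; the listing URL and CSS selector are placeholders rather than the actual code from that post:

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    LISTING_URL = "https://example.com/pdpc-decisions"  # placeholder listing page

    driver = webdriver.Chrome()
    driver.get(LISTING_URL)

    # Collect the links to judgement PDFs on the listing page
    links = [a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")]
    driver.quit()

    # Download each PDF, naming it after the last part of its URL
    for url in links:
        response = requests.get(url)
        with open(url.split("/")[-1], "wb") as f:
            f.write(response.content)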

Processing Judgement Data in PDF


Judgements available online are usually in #PDF format. They look great on your screen but are very difficult for robots to process. You will have to transform this data into a format that you can use for natural language processing.

I spent a lot of time on this because I wanted the judgements to read like text. The raw text that most (free) PDF tools provide is produced by joining up the various text boxes the tool can find. This works all right for the most part, but when the text runs across a page break, it gets mixed up with the headers and footers. Furthermore, the extraction produces lines of text, not paragraphs. As such, additional work was required.

Firstly, I used regular expressions. This allowed me to detect and remove unwanted data in the raw text, such as carriage returns, headers and footers, by matching them against search patterns.

I then decided to use machine learning to train my computer to decide whether to keep a line or reject it. This required me to create a training dataset and tag which lines should be kept as text. It was probably the fastest machine-learning exercise I have ever put together.
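
For illustration, here is a minimal sketch of such a line classifier, assuming scikit-learn and a hypothetical lines.csv of tagged lines; it is not the exact pipeline I used:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # lines.csv is a hypothetical file with columns "line" and "keep" (1 = keep, 0 = remove)
    data = pd.read_csv("lines.csv")
    X_train, X_test, y_train, y_test = train_test_split(data["line"], data["keep"], test_size=0.2)

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))

    # Keep only the lines the model classifies as body text
    kept_lines = [line for line in X_test if model.predict([line])[0] == 1]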

However, I didn't feel I was getting significant improvements from these methods. The final solution was actually fairly obvious. Using the formatting information about how the text boxes were laid out in the PDF, I could make reasonable conclusions about whether a piece of text was a header or footer, a quote or the start of a paragraph. It was great!

Natural Language Processing + PDPC Decisions = 💕


With a dataset ready to be processed, I could finally use some of the cutting-edge libraries I had been raring to try, such as #spaCy and #HuggingFace.

One of the first experiments was to use spaCy's rule-based Matcher to extract enforcement information from the summaries provided by the authorities. As the summaries were fairly formulaic, it was possible to extract whether the authorities imposed a penalty or took other enforcement action.
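
As an illustration (the pattern and example sentence below are made up, not my actual rules), a minimal sketch with spaCy's Matcher could look like this:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Match phrases like "financial penalty of $5,000"
    pattern = [
        {"LOWER": "financial"},
        {"LOWER": "penalty"},
        {"LOWER": "of"},
        {"IS_CURRENCY": True},
        {"LIKE_NUM": True},
    ]
    matcher.add("PENALTY", [pattern])

    doc = nlp("The Commissioner imposed a financial penalty of $5,000 on the organisation.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # financial penalty of $5,000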

I also wanted to undertake key NLP tasks using my prepared data. This included tasks like named entity recognition (does the sentence contain any special entities?), summarisation (extract the key points of the decision) and question answering (if you ask the machine a question, can it find the answer in the source?). To experiment, I used the default pipelines from Hugging Face and evaluated the results. There are clear limitations, but the results were exciting as well!
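
A minimal sketch of those experiments with the default pipelines might look like the following; the file name is a placeholder, and the models are whatever Hugging Face picks by default:

    from transformers import pipeline

    decision_text = open("decision.txt").read()  # placeholder: the processed text of one decision

    summariser = pipeline("summarization")
    print(summariser(decision_text[:2000], max_length=150, min_length=40)[0]["summary_text"])

    qa = pipeline("question-answering")
    print(qa(question="What penalty was imposed?", context=decision_text)["answer"])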

Visualisations


Visualisations are very often the end product of a data science journey. Extracting and processing data can be very rewarding, but you also want to show others why your work is useful.

One of my first aims in 2019 was to show how PDPC decisions had changed since the first ones were issued in 2016. Decisions became greater in number, more frequent and shorter in length. There was clearly a shift towards, and an intensifying of, enforcement effort.

I also wanted to visualise how the PDPC referred to its own decisions. Such a visualisation would allow one to see which decisions the PDPC relied on to explain its reasoning, which definitely helps to narrow down which decisions are worth reading in a deluge of information. As such, I created a network graph and visualised it. I called the result my “Star Map”.
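
A minimal sketch of how such a citation graph can be built with networkx; the edge list here is made up for illustration:

    import networkx as nx
    import matplotlib.pyplot as plt

    # Each pair is (citing decision, cited decision)
    citations = [
        ("Decision A", "Decision C"),
        ("Decision B", "Decision C"),
        ("Decision D", "Decision A"),
    ]

    G = nx.DiGraph()
    G.add_edges_from(citations)

    # Size each node by how often it is cited, so heavily cited decisions stand out
    sizes = [300 * (G.in_degree(node) + 1) for node in G.nodes]
    nx.draw_networkx(G, node_size=sizes, with_labels=True)
    plt.show()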

Data also proved very useful in supporting the conclusions I drew about the enforcement situation in Singapore. For example, how great an impact would the increase in maximum penalties in the latest amendments to the law have? Short answer: probably not much, but they still have a symbolic effect.

What's Next?

As mentioned, I have been focusing on other priorities, so I haven't been working on PDPC-Decisions for a while. However, my next steps were:

  • I wanted to train a machine to process judgements for named entity recognition and summarization. For the second task, one probably needs to use a transformer in a pipeline and experiment with what works best.
  • Instead of using Selenium and Beautiful Soup, I wanted to use scrapy to create a sustainable solution to extract information regularly.

Feel free to let me know if you have any comments!

#Features #PDPC-Decisions #PersonalDataProtectionAct #PersonalDataProtectionCommission #Decisions #Law #NaturalLanguageProcessing #PDFMiner #Programming #Python #spaCy #tech


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

This post is the latest in a series on creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. Unless my perfectionism gets the better of me, I might finally have a working corpus.

Problem Statement

I must have written this a dozen times here already: law reports in Singapore are unfortunately generally available only in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. You have to work with PDFs if you want a solution that is not burdened by copyright.

However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer, so a PDF consists of a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor merely reads out the text boxes, the results will not look pretty. See an example below:

    The operator, in the mistaken belief that the three statements belonged
     to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
     moved it to the main bin. Further, the operator completed the QC form
     in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
     Page 3 of 7
     
     B.
     (i)
     (ii)
     9.
     (iii)
     envelopes tallied with the expected total from the run. As the envelope
     was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
     were by-passed, and the envelope was sent out without anyone realising
     that it contained two extra statements. The manual completion of the QC
     form by the operator to show that the number of successful and rejected
     envelopes tallied allowed this to go undetected. 

Those lines are broken up with new lines, by the way. So the computer reads “in a way that showed that the number of “successful” and rejected” and then wonders: “rejected” what?! The rest of the sentence continues about seven lines away. None of this makes sense to a human, let alone a computer.

Previous workarounds were... unimpressive

Most introductory data science books advise programmers to use regular expressions to clean the text. This allowed me to choose what to keep and what to reject in the text, and I then joined up the lines to form paragraphs. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.

As mentioned in that post, I was very bothered about removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack-a-mole to figure out which regular expression would remove a given line. The results were not good enough for me, as several errors persisted in the text.

Not wanting to play whack-a-mole, I decided to train a model to read lines and make a guess as to what lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.

However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting the results I would have obtained had I worked harder at regular expressions. This method was still not satisfactory for me.

I needed a new approach.

A breakthrough — focus on the layout rather than the content

I was browsing GitHub issues on PDFMiner when it hit me like a brick. The issue’s author had asked how to get the font data of a text in PDFMiner.

I then realised that I had another option besides removing text more effectively or efficiently.

Readers don’t read documents by reading lines and deciding whether to ignore them or not. The layout — the way the text is arranged and the way it looks visually — informs the reader about the structure of the document.

Therefore, if I knew how the document was laid out, I could determine its structure based on observing its rules.

Useful layout information to analyse the structure of a PDF.

Rather than relying only on the content of the text to decide whether to keep or remove text, you now also have access to information about what it looks like to the reader. The document’s structure is now available to you!

Information on the structure also allows you to improve the meaning of the text. For example, I replaced footnote markers with the actual footnote text, so that the footnote sits close to where it was meant to be read rather than at the bottom of the page. This makes the text more meaningful to a computer.

Using PDFMiner to access layout information

To access the layout information of the text in a PDF, unfortunately, you need to understand the inner workings of a PDF document. Fortunately, PDFMiner simplifies this and provides it in a Python-friendly manner.

Your first port of call is to extract each page of the PDF as an LTPage. You can use the high-level function extract_pages for this.

Once you extract the page, you will be able to access the text objects as a list of containers. That’s right: using list comprehensions and generators will allow you to access the containers themselves.
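
Here is a short sketch of that; the file name is a placeholder:

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("decision.pdf"):
        containers = [element for element in page_layout if isinstance(element, LTTextContainer)]
        for container in containers:
            # Each container carries its text as well as its position on the page
            print(round(container.x0), round(container.y0), container.get_text()[:40])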

Once you have access to each container in PDFMiner, the metadata can be found in its properties.

The properties of an LTPage reveal its layout information.

This is not apparent from the documentation, so studying the source code itself can be very useful.

Code Highlights

Here are some highlights and tips from my experience so far implementing this method using PDFMiner.

Consider adjusting LAParams first

Before rummaging through the text boxes in your PDF, pay attention to the layout parameters which PDFMiner uses to extract text. Parameters like the line, word and char margins determine how PDFMiner groups text together. Setting these parameters well can save you from doing much of the clean-up yourself.

Notwithstanding this, I did not use LAParams much for this project. The PDFs in my collection can be quite arbitrary in their layout, and asking PDFMiner to generalise in this situation did not lead to satisfactory results. For example, PDFMiner would join lines together in some instances and fail to in others. As such, processing the text boxes line by line was safer.

Retrieving text margin information as a list

As mentioned in the diagram above, margin information can be used to separate the paragraph numbers from their text. The original method of using a regular expression to detect paragraph numbers had the risk of raising false positives.

The x0 property of an LTTextContainer represents its left co-ordinate, which is its left margin. Assuming you have a list of LTTextContainers (perhaps extracted from a page), a simple list comprehension will get you all the left margins of the text.

    from collections import Counter
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams

    pdf = "decision.pdf"  # placeholder: path to the PDF you want to process
    limit = 1  # set a limit if you want to remove uncommon margins

    # extract_pages yields LTPage objects; take the first (and only) page requested
    first_page = next(extract_pages(pdf, page_numbers=[0], laparams=LAParams(line_margin=0.1, char_margin=3.5)))
    containers = [container for container in first_page if isinstance(container, LTTextContainer)]
    text_margins = [round(container.x0) for container in containers]
    c = Counter(text_margins)
    sorted_text_margins = sorted([margin for margin, count in c.most_common() if count > limit])

Counting the number of times a text margin occurs is also useful to eliminate elements that do not belong to the main text, such as titles and sometimes headers and footers.

You also get a hierarchy by sorting the margins from left to right. This is useful for separating first-level text from the second-level, and so on.
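
For example, a hypothetical helper (not from the original code) can map a container's left edge to an indentation level using the sorted margins:

    def indent_level(container, sorted_text_margins, tolerance=1):
        margin = round(container.x0)
        for level, text_margin in enumerate(sorted_text_margins):
            if abs(margin - text_margin) <= tolerance:
                return level  # 0 is the leftmost margin, e.g. paragraph numbers
        return None  # does not line up with any common margin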

Using the gaps between lines to determine if there is a paragraph break

As mentioned in the diagram above, the gap between lines can be used to determine if the next line is a new paragraph.

In PDFMiner, as in PDF, the x/y coordinate system of a page starts from its lower-left corner (0,0). The gap between the current container and the one immediately before it (the one above) is the distance between the previous container’s y0 (its base) and the current container’s y1 (its top). Conversely, the gap between the current container and the one immediately after it (the one below) is the distance between the current container’s y0 (its base) and the next container’s y1 (its top).

Therefore, if we have a list of LTTextContainers, we can write a function to determine whether the gap after a container is bigger than the gap before it.

    from typing import List
    from pdfminer.layout import LTTextContainer

    def check_gap_before_after_container(containers: List[LTTextContainer], index: int) -> bool:
        index_before = index - 1
        index_after = index + 1
        # y increases upwards, so the gap between two containers is the upper
        # container's base (y0) minus the lower container's top (y1)
        gap_before = round(containers[index_before].y0 - containers[index].y1)
        gap_after = round(containers[index].y0 - containers[index_after].y1)
        return gap_after >= gap_before

If the function returns True, we know that the current container is probably the last line of a paragraph. I would then save the paragraph and start a new one, keeping the paragraph content as close to the original as possible.

Retrieve footnotes with the help of the footnote separator

As mentioned in the diagram, footnotes are generally smaller than the main text, so you could get footnotes by comparing the height of the main text with that of the footnotes. However, this method is not so reliable because some small text may not be footnotes (figure captions?).

For this document, the footnotes are contained below a separator, which looks like a line. As such, our first task is to locate this line on the page. In my PDF file, another layout object, LTRect, describes lines. An LTRect with a height of less than 1 point appears not as a rectangle, but as a line!

    # `page` is an LTPage (such as first_page above); LTRect comes from pdfminer.layout
    # and text_margins comes from the earlier snippet
    if footnote_line_container := [container for container in page if all([
        isinstance(container, LTRect),
        container.height < 1,
        container.y0 < 350,
        container.width < 150,
        round(container.x0) == text_margins[0]
    ])]:
        footnote_line_container = sorted(footnote_line_container, key=lambda container: container.y0)
        footnote_line = footnote_line_container[0].y0

Once we have obtained the y0 coordinate of the LTRect, we know that any text box under this line is a footnote. This accords with the intention of the maker of the document, which is sometimes Microsoft Word!
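
As a sketch, assuming the containers and footnote_line variables from the snippets above:

    # Anything whose top sits below the separator is footnote text; the rest is body text
    footnotes = [container for container in containers if container.y1 < footnote_line]
    body_text = [container for container in containers if container.y0 > footnote_line]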

You might notice that I have placed several other conditions to determine whether a line is indeed a footnote marker. It turns out that LTRects are also used to mark underlined text. The extra conditions I added are the length of the line (container.width < 150), whether the line is in the bottom half of the page (container.y0 < 350), and whether it lines up with the leftmost margin of the text (round(container.x0) == text_margins[0]).

For Python programming, I also found the built-in all() and any() useful for improving the readability of the code when I have several conditions.

I also liked using a feature new in Python 3.8: the walrus operator! The code above might not work for you if you are on an older version.

Conclusion

Note that we are reverse-engineering the PDF document. This means that getting perfect results is very difficult, and you will probably need several tries to deal with the side effects of your rules and conditions. The code I developed for pdpc-decisions can look quite complicated, and it is still under development!

Given a choice, I would prefer using documents where the document’s structure is apparent (like web pages). However, such an option may not be available depending on your sources. For complicated materials like court judgements, studying the structure of a PDF can pay dividends. Hopefully, some of the ideas in this post will get you thinking about how to exploit such information more effectively.

#PDPC-Decisions #NaturalLanguageProcessing #PDFMiner #Python #Programming


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

The life of a budding data science enthusiast.

You need data to work on, so you look all around you for something that has meaning to you. Anyone who has read the latest ten posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC’s enforcement cases. Fortunately for me, the listing of published cases is complete, and they are not exactly hindered by things like stare decisis. We begin by scraping the website.

The Problem

Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.

It’s right there. They are PDFs. Notwithstanding its name, PDFs do not actually represent text well. They do represent pages or documents, which means you what get what you see and not what you read.

I used PDFMiner to extract the text from the PDF, and this is a sample of the output I got:

The operator, in the mistaken belief that the three statements belonged to the same individual, removed the envelope from the reject bin and moved it to the main bin. Further, the operator completed the QC form in a way that showed that the number of “successful” and rejected Page 3 of 7 B. (i) (ii) 9. (iii) envelopes tallied with the expected total from the run. As the envelope was no longer in the reject bin, the second and third layers of checks were by-passed, and the envelope was sent out without anyone realising that it contained two extra statements. The manual completion of the QC form by the operator to show that the number of successful and rejected envelopes tallied allowed this to go undetected.

Notice the following:

  • There are line breaks in the middle of sentences. This is where a sentence broke across lines in the document. The computer would read “The operator, in the mistaken belief that the three statements belonged” and then go “What? What happened?”
  • Page footers and headers appear in the document. They make sense when you are viewing the document, but are meaningless in the text.
  • Orphan bullet and paragraph numbers. They used to belong to some text, but now nobody knows which. Table contents are also seriously borked.

If you fed this to your computer for training, you would get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix the errors ourselves before letting the computer do more with the text.

I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time-consuming. (Duh!) My computer had to do the work for me. Somehow!

The Preprocessing Solution

Although I decided I would let the computer do the correcting for me, I still had to do some work on my own. Instead of looking for errors, this time I was looking for patterns. This involved comparing the output of the PDF-to-text converter with the actual document. Once I figured out how these errors came about, I could get down to fixing them.

Deciding what to take out, what to leave behind

Not so fast. Unfortunately, correcting errors was not the only decision I had to make. Like many legal reports, PDPC decisions have paragraph numbers. These numbers are important: they are used for cross-referencing. In the raw output, the paragraph numbers are present as plain numbers in the text. They may be useful to a human reader who knows they are meta-information, but to a computer they are probably just noise.

I decided to take them out. I don’t doubt that one day we can engineer a solution that makes them useful, but for now, they are just distractions.

Put Regular Expressions to work!

As mentioned earlier, I looked for patterns in the text so the computer could find and correct the errors. I found regular expressions to be a very powerful way to express such patterns in a form the computer can search for. A regular expression is sort of like a language for defining a search pattern.

For example, the code below looks for form feed characters in the text. A form feed is the character that tells a printer to start a new page; it is a cousin of the carriage return, which is what happens when you press ‘Enter’ on your keyboard and is much cooler on a typewriter.

    import re

    def removefeedcarriage(source):
        return [x for x in source if not re.search(r'\f', x)]

This Python code tells the computer to drop any line which contains a form feed. This eliminates the multiple blank lines created by the PDF converter (usually from blank space in the PDF).

A well-crafted regular expression can find a lot of things. For example, the expression below looks for citations (“[2016] SGPDPC 15” and the like) and removes them.

    def remove_citations(source):
        return [x for x in source if not re.search(r'^\[\d{4}\]\s+(?:\d\s+)?[A-Z|()]+\s+\d+[\s.]?$', x)]

Figuring out the “language” of regular expressions does take some time, but it pays dividends. To help the journey along, I test my expressions using freely available websites that let you try patterns out and provide a reference. For Python, this is one of the sites I used.

Getting some results!

Besides removing citations, form feeds and paragraph numbers, I also tried to join broken sentences together. In all, the code manages to remove around 90% of the extra line breaks. Most of the paragraphs in the text now read like sentences, and I feel much more confident training a model on this text.
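
For illustration, here is a much-simplified sketch of the line-joining step; the actual code in pdpc-decisions handles more edge cases than this:

    def join_lines(lines):
        # Naive rule: treat a line ending with sentence-final punctuation as the end of a paragraph
        paragraphs, current = [], []
        for line in lines:
            current.append(line.strip())
            if line.rstrip().endswith((".", ":", "?")):
                paragraphs.append(" ".join(current))
                current = []
        if current:
            paragraphs.append(" ".join(current))
        return paragraphs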

It ain’t perfect of course. The text gets really mushed up once a table is involved, and the headers and footers are not exactly removed all the time. But as Voltaire said, “Perfect is the enemy of the good”. For now, this will do.

Concluding remarks

Hopefully, this quick rundown of how I pre-processed the pdpc-decisions corpus will give you some ideas for your own projects. Now that I have the text, I have got to find something to use it for! :O Are there any other ways to improve the code to catch even more errors? Feel free to comment and let me know!

houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub)

#PDPC-Decisions #PDFMiner #Python #Programming
