8 min read

Data Science with Judgement Data – My PDPC Decisions Journey

An interesting experiment to apply what I learnt in Data Science to the area of law.
Data Science with Judgement Data – My PDPC Decisions Journey
This Features article features many articles which may require a free subscription to read. Become a subscriber today and get access to all the articles!

Introduction

Over the course of 2019 and 2020, I embarked on a quest to apply the new things I was learning in data science to my field of work in law.

The dataset I chose was the enforcement decisions from the Personal Data Protection Commission in Singapore. The reasons I chose it was quite simple. I wanted a simple dataset covering a limited number of issues and is pretty much independent (not affected by stare decisis or extensive references to legislation or other cases). Furthermore, during that period, the PDPC was furiously issuing several decisions.

This experiment proved to be largely successful, and I learned a lot from the experience. This post gathers all that I have written on the subject at the time. I felt more confident to move on to more complicated datasets like the Supreme Court Decisions, which feature several of the same problems faced in the PDPC dataset.

Since then, the dataset has changed a lot, such as the website has changed, so your extraction methods would be different. I haven't really maintained the code, so they are not intended to create your own dataset and analysis today. However, techniques are timeless, and I hope they still point you in a good direction.

Introducing — pdpc-decisions!
It’s a milestone today! I wrote something that I felt is worthy of v 1.0. That version number is magical, because it means the software works. Yup pdpc-decisions is one! houfu/pdpc-decisionsData Protection Enforcement Cases in Singapore. Contribute to houfu/pdpc-decisions development by creating an …
This post explains the project at the time. The code hasn't been maintained so no warranties. 

Extracting Judgement Data

Dog & Baltic Sea
Photo by Janusz Maniak / Unsplash

The first step in any data science journey is to extract data from a source. In Singapore, one can find judgements from courts on websites for free. You can use such websites as the source of your data. API access is usually unavailable, so you have to look at the webpage to get your data.

It's still possible to download everything by clicking on it. However, you wouldn't be able to do this for an extended period of time. Automate the process by scraping it!

Automate Boring Stuff: Get Python and your Web Browser to download your judgements
Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote this post, the structure of the PDPC’s website was drastically altered. The concepts and the ideas in this post haven’t changed, but the examples are outdated. This gives

I used Python and Selenium to access the website and download the data I want. This included the actual judgement. Metadata, such as the hearing date etc., are also available conveniently from the website, so you should try and grab them simultaneously. In Automate Boring Stuff, I discussed my ideas on how to obtain such data.

Processing Judgement Data in PDF

Photo by Pablo Lancaster Jones / Unsplash

Many judgements which are available online are usually in PDF format. They look great on your screen but are very difficult for robots to process. You will have to transform this data into a format that you can use for natural language processing.

I took a lot of time on this as I wanted the judgements to read like a text. The raw text that most (free) PDF tools can provide you consists of joining up various text boxes the PDF tool can find. This worked all right for the most part, but if the text ran across the page, it would get mixed up with the headers and footers. Furthermore, the extraction revealed lines of text, not paragraphs. As such, additional work was required.

Firstly, I used regular expressions. This allowed me to detect unwanted data such as carriage returns, headers and footers in the raw text matched by the search term.

Get rid of the muff: pre-processing PDPC Decisions
The life of a budding data science enthusiast. You need data to work on, so you look all around you for something that has meaning to you. Anyone who reads the latest 10 posts of this blog knows I have a keen interest in data protection in Singapore. So naturally,
Regular expressions are cool.

I then decided to use machine learning to train my computer to decide whether to keep a line or reject it. This required me to create a training dataset and tag which lines should be kept as the text. This was probably the fastest machine learning exercise I ever came up with.

First forays into natural language processing — get rid of a line or keep it?
Avid followers of Love Law Robots will know that I have been hard at creating a corpus of Personal Data Protection Commission decisions. Downloading them and pre-processing them has taken a lot of work! However, it has managed to help me create interesting charts that shows insight at a macro
But machine learning is even cooler.

However, I didn't believe I was getting significant improvements from these methods. The final solution was actually fairly obvious. Using the formatting information of how the text boxes were laid out in the PDF, I could make reasonable conclusions of which text was a header or footer, a quote or a start of a paragraph. It was great!

Mining PDFs to obtain better text from Decisions
After several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner.
A true eureka moment.

Natural Language Processing + PDPC Decisions = 💕

Photo by Moritz Kindler / Unsplash

With a dataset ready to be processed, I decided that I can finally use some of the cutting edge libraries I have been raring to use, such as spacy and Hugging Face.

One of the first experiments was to use spaCy's RuleMatcher to extract enforcement information from the summary provided by the authorities. As the summary was fairly formulaic, it was possible to extract whether the authorities imposed a penalty or other enforcement actions were taken by the authority.

New features for pdpc-decisions!
As mentioned in my previous post, I have not been able to spend time writing as much code as I wanted. I had to rewrite a lot of code due to the layout change in the PDPC Website. That was not the post I wanted to write. I have finally
I mention RuleMatcher in this post. 

I also wanted to undertake key NLP tasks using the data I had prepared. This included tasks like Named Entity Recognition (does the sentence contain any special entities), summarisation (extract key points in the decision) and question answering (if you ask the machine a question, can it find the answer in the source?). To experiment, I used the default pipelines from Hugging Face to evaluate the results. There are clearly limitations, but very exciting as well!

Toying around with Hugging Face transformers and PDPC decisions
The default hugging face transformers are given a run on three samples of PDPC decisions on NER, summarisation and QA.

Visualisations

Photo by Annie Spratt / Unsplash

Visualisations are very often the result of the data science journey. Extracting and processing data can be very rewarding in itself, but you would like to show others how your work is useful as well.

One of my first aims in 2019 was to show how PDPC decisions are different since they were issued in 2016. Decisions became greater in number, more frequent, and shorter in length. There was clearly a shift and an intensifying of effort in enforcement.

My review of Data Protection in Singapore in one graph
It’s that time of the year! I wonderfully tackle trying to compress an entire year into a few paragraphs. If you are an ardent fan of data protection here, you will already have several events in mind. Here is just a few — the decision on the SingHealth data breach,

I also wanted to visualise how the PDPC was referring to its own decisions. Such visualisation would allow one to see which decisions the PDPC was relying on to explain its decisions. This would definitely help to narrow down which decisions are worth reading in a deluge of information. As such, I created a network graph and visualised it. I called the result my "Star Map".

Presenting: The PDPC Decision Star Map (Version 2)
Networks are one of the most straightforward ways to analyse judgements and cases. We can establish relationships between cases and transform them into data. Computers crunch data. Computers produce a beautiful graph. Now I have the latest iteration of the network data of the Personal Data Protectio…

Data continued to be very useful in leading the conclusion I made about the enforcement situation in Singapore. For example, how great an impact would the increase in maximum penalties in the latest amendments to the law have? Short answer: Probably not much, but they still have a symbolic effect.

Will Increased Penalties Lead to Greater Compliance With the PDPA?
This post is part of a series relating to the amendments to the Personal Data Protection Act in Singapore in 2020. Check out the main post for more articles!When the GDPR made its star turn in 2018, the jaw-dropping penalties drew a lot of attention. Up to €20 million,
A “big data science chart” was used.

What's Next?

As mentioned, I have been focusing on other priorities, so I haven't been working on PDPC-Decisions for a while. However, my next steps were:

  • I wanted to train a machine to process judgements for named entity recognition and summarization. For the second task, one probably needs to use a transformer in a pipeline and experiment with what works best.
  • Instead of using Selenium and Beautiful Soup, I wanted to use scrapy to create a sustainable solution to extract information regularly.

Feel free to let me know if you have any comments!