New features for pdpc-decisions!


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

As mentioned in my previous post, I have not been able to spend as much time writing code as I wanted, because a layout change on the PDPC website forced me to rewrite a lot of it. That was not the post I wanted to write. Now I can finally write about my newest forays for this project.

Enforcement information

I had noticed that the summary the Personal Data Protection Commission publishes for each decision is an easy place to glean basic information. So, I have added enforcement information. Decisions now tell you whether a financial penalty or a warning was meted out.

Information is extracted from the summaries using spaCy's rule-based Matcher. It isn’t perfect: some text does not really fit the mould. However, because the summaries are written in a fairly consistent way, information is mostly extracted accurately.
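To give a flavour of what rule-based extraction looks like, here is a minimal sketch using spaCy's `Matcher`. It is not the actual code in pdpc-decisions: the two patterns, the labels `FINANCIAL_PENALTY` and `WARNING`, the helper `extract_enforcement` and the sample sentences are all made up for illustration, and real summaries would need more (and messier) rules. It uses the list-of-patterns form of `matcher.add`, and a blank English pipeline, since these patterns only look at token text and lexical flags.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: the patterns below rely only on
# token text and lexical flags, so no trained model has to be downloaded.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical rules for illustration, not the project's actual ones.
matcher.add("FINANCIAL_PENALTY", [[
    {"LOWER": "financial"},
    {"LOWER": "penalty"},
    {"LOWER": "of"},
    {"IS_CURRENCY": True},   # e.g. "$"
    {"LIKE_NUM": True},      # e.g. "5,000"
]])
matcher.add("WARNING", [[
    {"LOWER": "warning"},
    {"LOWER": "was"},
    {"LOWER": "issued"},
]])


def extract_enforcement(summary):
    """Return (label, matched text) pairs found in a decision summary."""
    doc = nlp(summary)
    return [
        (nlp.vocab.strings[match_id], doc[start:end].text)
        for match_id, start, end in matcher(doc)
    ]


# Invented sample sentences, written in the style of a summary.
penalty_summary = (
    "A financial penalty of $5,000 was imposed on the organisation "
    "for failing to protect personal data."
)
warning_summary = (
    "A warning was issued to the organisation for breaching "
    "the Protection Obligation."
)

print(extract_enforcement(penalty_summary))
print(extract_enforcement(warning_summary))
```

A summary that phrases things differently simply returns an empty list, which is where the "does not really fit the mould" cases come from.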

Visualising the parts of speech in a typical sentence helps you write rules to extract information.

This is the first time I have used spaCy or any natural language processing for this purpose. Remarkably, it has been fast. Extracting this information (as well as the other extra features) only added about two hundred seconds to building a database from scratch. I would like to find more avenues to use these newfound techniques!

spaCy · Industrial-strength Natural Language Processing in Python: spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.


Court decisions are special in that they often require references to leading cases. This is because they are either binding (stare decisis) or persuasive to the decision maker. Of course, previous PDPC Decisions are not binding on the PDPC. Lately, respondents have been referring to the body of cases to argue they should be treated alike. I have not read a decision where this argument has worked.

Nevertheless, the network of cases referring to, and referred to by, one another offers remarkably interesting insights. In effect, we are looking at a social network of cases. The Personal Data Protection Commission does refer to its earlier cases to establish a point, so, all things being equal, a case that is referred to more often is more influential.

pdpc-decisions now reads the text of each decision to create a list of the decisions it refers to (“referring to”). From those lists, we can also create, for each decision, a list of the decisions which refer to it (“referred by”). Because of the haphazard way the PDPC writes its decisions and citations, this is also not perfect, but it is still reasonably accurate.
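The idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not how pdpc-decisions actually does it: I assume citations appear in the neutral form "[2019] SGPDPC 3", and the function `build_citation_maps` and the four toy decisions are invented. Real decisions cite each other far less consistently, which is exactly why the real results are imperfect.

```python
import re
from collections import defaultdict

# Assumed citation format: neutral citations such as "[2019] SGPDPC 3".
CITATION = re.compile(r"\[(\d{4})\]\s+SGPDPC\s+(\d+)")


def build_citation_maps(decisions):
    """Build "referring to" and "referred by" maps.

    `decisions` maps each decision's own citation to its full text.
    """
    referring_to = {}
    referred_by = defaultdict(set)
    for citation, text in decisions.items():
        # Normalise whitespace so differently spaced citations compare equal.
        found = {f"[{year}] SGPDPC {num}" for year, num in CITATION.findall(text)}
        found.discard(citation)  # a decision normally states its own citation
        referring_to[citation] = found
        for target in found:
            referred_by[target].add(citation)
    return referring_to, dict(referred_by)


# Invented toy corpus, for illustration only.
decisions = {
    "[2017] SGPDPC 14": "Re Example A [2017] SGPDPC 14. The organisation failed to protect data.",
    "[2018] SGPDPC 4": "Re Example B [2018] SGPDPC 4. As held in [2017] SGPDPC 14, consent is needed.",
    "[2019] SGPDPC 3": "Re Example C [2019] SGPDPC 3. See [2017]  SGPDPC 14 and [2018] SGPDPC 4.",
    "[2019] SGPDPC 9": "Re Example D [2019] SGPDPC 9. No earlier decision is mentioned.",
}

referring_to, referred_by = build_citation_maps(decisions)

# Rank decisions by how many other decisions refer to them.
ranking = sorted(referred_by, key=lambda c: len(referred_by[c]), reverse=True)

# Share of decisions that refer to, or are referred to by, another decision.
connected = [c for c in decisions if referring_to[c] or c in referred_by]
share_connected = len(connected) / len(decisions)

print(ranking)
print(share_connected)
```

Keying everything on the citation rather than the respondent's name keeps two decisions involving the same organisation apart.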

As I mentioned, compiling a network of decisions can offer some interesting insights. So here it is: the social network of PDPC decisions.

I guess this is the real PDPC decisions in one chart

Update (24/4/2020): The chart was lumping together the Aviva case in 2018 with the Aviva case in 2017. The graph has been updated. Not much has changed in the big picture though.

Of course, a more advanced visualisation tool would allow you to drill down to see which cases are more influential. However, a big diagram like the above shows you which are the big boys in this social network.

Before I leave this section, here’s a fun fact to take home. Based on the computer’s analysis, over 68% of PDPC decisions refer to one another. That’s a lot of chatter!

Moving On

I keep thinking I have finished my work here, but there always seem to be new things coming up. Here is some interesting information I would like to find out:

You will just have to keep watching this space! What kind of information would interest you?

#PDPC-Decisions #NaturalLanguageProcessing #spaCy

Love.Law.Robots. – A blog by Ang Hou Fu