Introducing — pdpc-decisions!

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

It’s a milestone today! I wrote something that I feel is worthy of v 1.0. That version number is magical, because it means the software works. Yup, pdpc-decisions is one!

Data Protection Enforcement Cases in Singapore. Contribute to houfu/pdpc-decisions development by creating an account on GitHub.

What’s the Problem?

What does pdpc-decisions do? Basically, it goes through the PDPC Enforcement Decisions site and does three things:

  1. Creates a table of basic information on every decision published on the site
  2. Downloads every single decision on that site as a PDF
  3. Converts each PDF into a plain text file which can be used as a data set.
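To make the first step concrete, here is a minimal sketch of building a table of decisions from a listing page. It is not the actual pdpc-decisions code (which uses selenium and beautiful soup against the live site). It assumes a made-up HTML snippet and a placeholder base URL, uses only the Python standard library, and leaves out the download and PDF-to-text steps. The names `Decision`, `DecisionLinkParser`, `build_table` and `to_csv` are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class Decision:
    title: str
    pdf_url: str


class DecisionLinkParser(HTMLParser):
    """Collects the link text and absolute URL of every link ending in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.decisions = []
        self._href = None   # set while we are inside a PDF link
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse stray whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            self.decisions.append(Decision(title, self._href))
            self._href = None


def build_table(html, base_url):
    """Return one Decision per PDF link found in the HTML."""
    parser = DecisionLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.decisions


def to_csv(decisions):
    """Render the table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "pdf_url"])
    for decision in decisions:
        writer.writerow([decision.title, decision.pdf_url])
    return buffer.getvalue()


# Made-up listing page; the real site's markup is different.
SAMPLE_HTML = """
<ul>
  <li><a href="/files/decision-a.pdf">Breach of the Protection Obligation by Org A</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="/files/decision-b.PDF">No Breach by Org B</a></li>
</ul>
"""

decisions = build_table(SAMPLE_HTML, "https://example.org/")
print(to_csv(decisions))
```

Each `pdf_url` in the resulting table is what a download step would then fetch, and each downloaded PDF is what a conversion step would turn into plain text.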

This means that you can get your own copy of the PDPC Enforcement Decisions by running the code! Refer to the Readme for technical instructions on how to get going.

Why would you like to get your own copy of every decision ever delivered by the PDPC? Even if you are not going to do anything else with the data, I can see some uses already:

  1. If you don’t subscribe to LawNet, downloading your own library is the best way to read and review the decisions anytime you like.
  2. The table of basic information mentioned above can be really useful in letting you see all the decisions at a glance. Let’s face it: there are many decisions now and it is difficult to keep up with them.
  3. Although it’s great that the PDPC provides a search function, being able to view more than five decisions at one time is pretty nifty.

Unlike other jurisdictions with a Legal Information Institute, like Hong Kong and New Zealand, Singapore is not an easy place to get legal information for free. Furthermore, the legal profession’s obsession with PDF makes such information difficult for computers to access. This tool makes that access much easier. The results have already allowed me to compare decisions over time pretty easily.

Show the number and average length of PDPC Decisions since April 2016
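As a rough illustration of that kind of comparison, the sketch below counts decisions and averages their length in words, grouped by year. The data is made up for the example, and grouping by year and measuring length in words are assumptions, not necessarily how the chart above was produced.

```python
from collections import defaultdict
from statistics import mean

# Made-up (year, text) pairs standing in for the converted text files.
corpus = [
    (2016, "one two three four"),
    (2016, "one two"),
    (2017, "one two three four five six"),
]


def summarise_by_year(corpus):
    """Map each year to (number of decisions, average length in words)."""
    lengths = defaultdict(list)
    for year, text in corpus:
        lengths[year].append(len(text.split()))
    return {
        year: (len(words), mean(words))
        for year, words in sorted(lengths.items())
    }


summary = summarise_by_year(corpus)
for year, (count, avg_words) in summary.items():
    print(year, count, avg_words)
```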

Things I learnt

pdpc-decisions uses Python, which is remarkable because it is a language I picked up less than a year ago with very little offline or online training. Besides dealing with a new programming language, I also had to figure out how to use web scraping tools like Selenium and Beautiful Soup, as well as Python testing tools such as pytest. (Coverage is 94%!)
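For readers who have not seen pytest before, a test is just a function whose name starts with `test_` and which uses plain `assert` statements. The sketch below is illustrative only: `clean_title` is a hypothetical helper, not a function from pdpc-decisions.

```python
def clean_title(raw):
    """Collapse runs of whitespace in a scraped title (hypothetical helper)."""
    return " ".join(raw.split())


def test_clean_title_collapses_whitespace():
    messy = "  Breach of  the\nProtection Obligation "
    assert clean_title(messy) == "Breach of the Protection Obligation"


def test_clean_title_keeps_clean_input():
    assert clean_title("No Breach") == "No Breach"
```

Saved in a file whose name starts with `test_`, these functions would be collected and run by the `pytest` command.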

I also got to experiment with distributing software, primarily via Docker. This tool is especially well suited to being run as an image, since it is best run only once. Not only did I try and succeed at getting automated builds done, I also managed to set up continuous integration through Travis CI.

So, the unit tests and the automated testing and builds work. Hopefully I have made code that is easier to maintain. Having reached v 1.0, I will be leaving this code alone for a while.

What’s Next?

Of course the code is not perfect. I have spotted a few typos here and there. I might want to leave it alone to collect a few more bugs before I create a new version.

Furthermore, the site changes, so I expect the code to break soon. During the course of writing this software, I have already noticed some subtle changes to the website. Since I do use this package from time to time, I will be able to maintain it as and when the site changes.

The ultimate goal of this code, however, leads to my slow-going super-project, which I call zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be the last post I make on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases, like those of the Supreme Court, the State Courts, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with the PDPC. Nevertheless, someone might find it useful, so if there is interest, please let me know!

For now though, I am going to sit back and enjoy my code. Let’s run it again! Haha!