Love.Law.Robots. by Ang Hou Fu

PDPC


I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases like the Supreme Court, the State Court, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with PDPC first. Nevertheless, someone might find it helpful so if there is an interest, please let me know!

What has happened since then?

For one, Personal Data Protection Commission decisions are no longer interesting enough for me. Since I worked on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is far more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the whole thing, and the bog of hacked-up code made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve settled on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A JavaScript framework like Next.js would be more effective for developing websites.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr is not being developed anymore, my project can slow down as well). I think this is not so bad because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from making one single package of code to thinking about the work as a process. There are discrete parts to the work, and each piece of code is suited to one particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy for me to follow!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides using the right tool for the job, it is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF, but you don’t need to do that for a digital PDF. If the workflow is a pipeline, you can take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse over several pipelines, you can save much coding time.
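To make the idea concrete, here is a hedged sketch (illustrative stubs, not zeeker’s actual code) of how such a pipeline might be composed from small, reusable steps:

def scrape(url):
    """Take the information that is already present on the web page."""
    ...

def download(item):
    """Fetch the PDF or web page for an item."""
    ...

def ocr(path):
    """Convert a scanned PDF to text; only needed for scanned documents."""
    ...

def extract_text(path):
    """Pull text straight out of a digital PDF."""
    ...

def store(item, text):
    """Save the result to the database."""
    ...

def run_pipeline(url, scanned=False):
    # Each stage does one small task; a given source only uses the stages it needs.
    for item in scrape(url):
        path = download(item)
        text = ocr(path) if scanned else extract_text(path)
        store(item, text)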

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue

Photo by Alex Kondratiev / Unsplash

I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though it has taken me quite some time.

On the other hand, my leisurely pace also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I complete thinking through the issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping

Love.Law.Robots. – A blog by Ang Hou Fu



Key takeaways:

  • Web scraping is a useful and unique project that is good for beginners.
  • Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing reusable code and features that are useful for web scraping.
  • Making the transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There’s a lot of information out there on the world wide web, but very little of it is presented in a neat interface that lets you simply take it. By scraping information off websites, you get structured data you can work with. It’s a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

See my earlier post, Automate Boring Stuff: Get Python and your Web Browser to download your judgements, which is part of a series on my Data Science journey with PDPC Decisions. (Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote that, the PDPC did exactly that.)

After spending gobs of time plying through books and web articles, I created a web scraper that did the job right.

The code repository of the original web scraper is available on GitHub: houfu/pdpc-decisions (Data Protection Enforcement Cases in Singapore).

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies, or throttling and randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age is screaming at you when it recommends that you install it in a “virtual environment” (who installs anything in Python outside a virtual environment these days?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial

This reminds me of Angular and the ng command. (Do people still do that these days?)

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you are used to just diving in and coding, you will not find this intuitive.

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest


class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1"
        }

        response = requests.post(CASE_LISTING_URL, data=default_form_data)

        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with requests and yields items, the standard data format in scrapy.

    def parse(self, response, **kwargs):
        response_json = response.json()
        for item in response_json["items"]:
            from datetime import datetime
            nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] if item["nature"] else "None"
            decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] if item["decision"] else "None"
            yield CommissionDecisionItem(
                title=item["title"],
                summary_url=f"https://www.pdpc.gov.sg{item['url']}",
                published_date=datetime.strptime(item["date"], '%d %b %Y'),
                nature=nature,
                decision=decision
            )

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because scrapy already handles requests concurrently, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Item pipelines deal with what happens to an item after the spider has yielded it. The most common pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program handle concurrency and asynchronous operations effectively.

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"
        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also that the item has a field called file_urls. I did not invent this data field: it’s the standard field that tells scrapy’s files pipeline which files to download from the web.
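For context, here is a hedged sketch of what the item definition might look like (not the project’s actual items.py); the field names follow the snippets above, and file_urls and files are the fields scrapy’s FilesPipeline looks for:

import scrapy

class CommissionDecisionItem(scrapy.Item):
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()
    summary = scrapy.Field()
    respondent = scrapy.Field()
    decision_url = scrapy.Field()
    file_urls = scrapy.Field()  # URLs for FilesPipeline to download
    files = scrapy.Field()      # FilesPipeline records the download results here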

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. Given a priority of 300, the CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.

Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their seniority in a settings file isn’t very intuitive if you’re not sure what’s going on. Still, I am grateful that I did not have to write my own file-downloading code.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting to define the seniority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, the sheer number of settings isn’t user-friendly. Still, it hints at the number of things you can do with scrapy, including randomising the delay between requests to the same site, logging, and storage adapters for common backends like AWS S3 or FTP.
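As a flavour of what’s available, here is a hedged sketch of the kind of settings you might put in settings.py; the values (and the S3 bucket) are illustrative, not my actual configuration:

DOWNLOAD_DELAY = 2                # wait a couple of seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay so requests look less robotic
AUTOTHROTTLE_ENABLED = True       # adapt the crawl rate to how the server responds
LOG_LEVEL = "INFO"
FILES_STORE = "s3://my-bucket/pdpc/"   # hypothetical bucket for FilesPipeline downloads
FEEDS = {
    "output.csv": {"format": "csv"},   # same effect as the -o flag used earlier
}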

As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.

There are lots to explore here!

Conclusion

Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate the special considerations involved in web scraping and how scrapy helps address them.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy affords power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Read more interesting adventures in Data Science with Judgement Data – My PDPC Decisions Journey, an experiment applying what I learnt in data science to the area of law.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy

Love.Law.Robots. – A blog by Ang Hou Fu


Regular readers might have noticed the disappearance of articles relating to the Personal Data Protection Commission’s decisions lately. However, as news of the “largest” data breach in Singapore came out, I decided to look into this area again.

My lack of interest paralleled changes in the environment that had allowed me to keep up to date on them:

  1. The PDPC removed their RSS feed for the latest updates;
  2. I am not about to monitor their website manually; and
  3. The PDPC started issuing shorter summaries of their decisions, which makes their work more opaque and less interesting.

Looking at this area again, I wanted to see whether the insights I gleaned from my earlier data project might hold and what would still be relevant going forward.

Data Science with Judgement Data – My PDPC Decisions Journey: an interesting experiment to apply what I learnt in data science to the area of law.

Something big struck, well, actually not much.

Photo by Francesca Saraco / Unsplash

The respondent in the case that had attracted media attention is Reddoorz, which operates a hotel booking platform in the budget hotel space. The cause of the breach is as sad as it is unremarkable — they had left the keys to their production database in the code of a disused but still available version of their mobile app. Using those keys, bad actors probably exfiltrated the data. This is yet another example of how lazy practices in developing apps can translate to real-world harm. They even missed the vulnerability when they performed penetration tests, because the app version was an old one.

Read the PDPC’s enforcement decision here: Breach of the Protection Obligation by Commeasure.

The data breach is the “largest” because it involved nearly 6 million customers. Given that the resident population in Singapore is roughly 5.5 million, this probably includes people from around our region.

The PDPC penalised the respondent with a $74,000 fine. This roughly works out to be about 1 cent per person. Even though this is the “largest” data breach handled under the PDPA, the PDPC did not use its full power to issue a penalty of up to $1 million. Under the latest amendments, which have yet to take effect, the potential might of the PDPC can be even greater than that.

The decision states that the PDPC took into account the COVID-19 situation and its impact on the hospitality industry in reducing the penalty amount. It would have been helpful to know by how much this factor reduced the penalty, to get an accurate view of its weight.

In any case, this is consistent with several PDPC decisions. Using the filters on the PDPC’s website, only three decisions doled out more than $75,000 in penalties, and a further four doled out more than $50,000, out of more than 100 decisions with a financial penalty. Even among those rare few, only one case exercised more than 25% of the current limit of the penalty. The next case on the list only amounts to $120,000 (a high-profile health-related case, too!).

The top of the financial penalty list (As of November 2021). Take note of the financial penalty filters at the bottom left corner.

This suggests that the penalties are, in practice, quite limited. What would it take for the PDPC to penalise an offender? Probably not the number of records breached. Maybe public disquiet?

In a world without data breaches

Photo by Parker Burchfield / Unsplash

While the media focuses on financial penalties, I am not a big fan of them.

While doling out “meaningful” penalties strikes a balance between compliance with the law and business interests, there are limits to this approach. As mentioned above, the risk of a $5,000 fine may not be enough to push a company to hire a team of specialists or even a professional Data Protection Officer. If a company’s best strategy is simply not to get caught, this does not promote compliance with the law at all.

Unfortunately, we don’t live in a world without data breaches. The decisions, including those mentioned above, are filled with human errors. Waiting to get caught for such mistakes is not a responsible strategy. Luckily, the PDPA doesn’t require the organisation to provide bulletproof security measures, only reasonable ones. Then, the crux is figuring out what the PDPC thinks is enough to be reasonable.

So while all these data protection decisions and financial penalties are interesting in showing how others get it wrong, the real gem for the data protection professional in Singapore is finding someone who got it right.

And here’s the gem: Giordano. Now I am sorry I haven’t bought a shirt from them in decades.

There was a data breach, and the suspected cause was compromised credentials. However, the perpetrator did not get far:

  • The organisation deployed various endpoint solutions
  • The organisation implemented real-time system monitoring of web traffic abnormalities
  • Data was regularly and automatically backed up and encrypted anyway

Kudos to the IT and data protection team!

Compared to other “Not in Breach” decisions, this decision is the only one I know to directly link to one of the many guides made by the PDPC for organisations. “How to Guard Against Common Types of Data Breaches” makes a headline appearance in the Summary when introducing the reasonable measures that Giordano implemented.

The close reference to the guides signals that organisations following them can have a better chance of being in the “No Breach” category.

An approach that promotes best practices is arguably more beneficial to society than one that penalises others for making a mistake. Reasonable industry practices must include encrypting essential data and other recommendations from the PDPC. It would need leaders like Giordano, an otherwise ordinary clothing apparel store in many shopping malls, to make a difference.

A call from the undertaking

Photo by Nicola Fioravanti / Unsplash

The final case in this post isn’t found in the regular enforcement decisions section of the PDPC’s website — undertakings.

If you view a penalty as recognising a failure of data protection and no breach as an indicator of its success, the undertaking is that weird creature in between. It rewards organisations that have a data protection system in place for taking the initiative to settle with the PDPC early, while recognising that there are still gaps in its implementation.

I was excited about undertakings and called them the “teeth of the accountability principle”. However, I haven’t found much substance in my excitement, and the parallel with US anti-corruption practices appears unfounded.

Between February 2021, when the undertaking procedure was given legislative force, and November 2021, 10 organisations spanning different industries went through this procedure. In the meantime, the PDPC delivered 21 decisions with a financial penalty, direction or warning. I reckon roughly 30% is a good indicator that organisations use this procedure when they can.

My beef is that very little information is provided on these undertakings, which appear even shorter than the summaries of enforcement decisions. With so little information, it isn’t clear why these organisations get undertakings rather than penalties.

Take the instant case in November as an example. Do they have superior data protection structures in their organisations? (The organisation didn’t have any and had to undertake to implement something.) Are they all Data Protection Trust Mark organisations? (Answer: No.) Are they minor breaches? (On the surface, I can’t tell. 2,771 users were affected in this case.)

My hunch is that (like the Guide to Active Enforcement says) these organisations voluntarily notified the PDPC with a remediation plan that the PDPC could accept. This is not as easy as it sounds, as you would probably need to engage lawyers and other professionals to navigate your way to that remediation plan.

With very little media attention and even a separate section away from the good and the ugly on the PDPC’s website, the undertaking is likely to be practically the best way for organisations to deal with the consequences of a data breach. Whether the balance goes too far in shielding organisations from them remains to be seen.

Conclusion

Having peeked back at this area, I am still not sure I like what I find. There was a time when there was excitement about data protection in Singapore, and becoming a data protection professional was seen as a viable career. It will be fascinating to see how much this industry develops. Whether it does or doesn’t, I believe that the PDPC’s actions and its approach to organisations with data breaches will be a fundamental cause.

Until there is information on how many data protection professionals there are in Singapore and what they are doing, I don’t think you will find many more articles in this area on this blog.

#Privacy #PersonalDataProtectionCommission #PersonalDataProtectionAct #Penalties #Undertakings #Benchmarking #DataBreach #DataProtectionOfficer #Enforcement #Law #PDPAAmendment2020 #PDPC-Decisions #Singapore #Decisions

Love.Law.Robots. – A blog by Ang Hou Fu


Introduction

Over the course of 2019 and 2020, I embarked on a quest to apply the new things I was learning in data science to my field of work in law.

The dataset I chose was the enforcement decisions from the Personal Data Protection Commission in Singapore. The reason I chose it was quite simple. I wanted a simple dataset that covers a limited number of issues and is pretty much independent (not affected by stare decisis or extensive references to legislation or other cases). Furthermore, during that period, the PDPC was furiously issuing several decisions.

This experiment proved to be largely successful, and I learned a lot from the experience. This post gathers all that I had written on the subject at the time. I now feel confident enough to move on to more complicated datasets, like Supreme Court decisions, which feature several of the same problems found in the PDPC dataset.

Since then, the dataset has changed a lot: for example, the website has changed, so your extraction methods would be different. I haven’t really maintained the code, so it isn’t intended for creating your own dataset and analysis today. However, the techniques are still relevant, and I hope they still point you in a good direction.

Extracting Judgement Data

Photo by Janusz Maniak / Unsplash

The first step in any data science journey is to extract data from a source. In Singapore, one can find judgements from courts on websites for free. You can use such websites as the source of your data. API access is usually unavailable, so you have to look at the webpage to get your data.

It's still possible to download everything by clicking through the pages. However, you wouldn't want to do this for an extended period of time. Automate the process by scraping it!

Automate Boring Stuff: Get Python and your Web Browser to download your judgements

I used Python and Selenium to access the website and download the data I wanted. This included the actual judgements. Metadata, such as the hearing date, is also conveniently available from the website, so you should try and grab it at the same time. In Automate Boring Stuff, I discussed my ideas on how to obtain such data.
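To give a flavour of the approach, here is a hedged sketch (not the original script); the URL and the CSS selector are placeholders, since the PDPC website has changed since then.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.pdpc.gov.sg/All-Commissions-Decisions")  # illustrative URL
cases = []
for row in driver.find_elements(By.CSS_SELECTOR, ".press-item"):  # hypothetical selector
    link = row.find_element(By.TAG_NAME, "a")
    cases.append({"title": link.text, "url": link.get_attribute("href")})
driver.quit()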

Processing Judgement Data in PDF

Photo by Pablo Lancaster Jones / Unsplash

Many judgements which are available online are usually in #PDF format. They look great on your screen but are very difficult for robots to process. You will have to transform this data into a format that you can use for natural language processing.

I took a lot of time on this as I wanted the judgements to read like text. The raw text that most (free) PDF tools can provide you consists of joining up various text boxes the PDF tool can find. This worked all right for the most part, but if the text ran across pages, it would get mixed up with the headers and footers. Furthermore, the extraction produced lines of text, not paragraphs. As such, additional work was required.

Firstly, I used regular expressions. This allowed me to detect and remove unwanted data, such as stray line breaks, headers and footers, from the raw text.

I then decided to use machine learning to train my computer to decide whether to keep a line or reject it. This required me to create a training dataset and tag which lines should be kept as the text. This was probably the fastest machine-learning exercise I ever came up with.

However, I didn't believe I was getting significant improvements from these methods. The final solution was actually fairly obvious. Using the formatting information of how the text boxes were laid out in the PDF, I could make reasonable conclusions about which text was a header or footer, a quote or a start of a paragraph. It was great!

Natural Language Processing + PDPC Decisions = 💕

Photo by Moritz Kindler / Unsplash

With a dataset ready to be processed, I decided that I could finally use some of the cutting-edge libraries I have been raring to use, such as #spaCy and #HuggingFace.

One of the first experiments was to use spaCy's RuleMatcher to extract enforcement information from the summary provided by the authorities. As the summary was fairly formulaic, it was possible to extract whether the authorities imposed a penalty or the authority took other enforcement actions.

I also wanted to undertake key NLP tasks using my prepared data. This included tasks like Named Entity Recognition (does the sentence contain any special entities?), summarisation (extract the key points of the decision) and question answering (if you ask the machine a question, can it find the answer in the source?). To experiment, I used the default pipelines from Hugging Face to evaluate the results. There are clearly limitations, but it was very exciting as well!
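For illustration, here is a hedged sketch of what trying the default Hugging Face pipelines looks like; the text and the calls below are an example rather than my actual experiment, and the models are downloaded on first use.

from transformers import pipeline

text = ("The organisation failed to put in place reasonable security arrangements "
        "to protect the personal data in its possession.")

ner = pipeline("ner", aggregation_strategy="simple")
summariser = pipeline("summarization")
qa = pipeline("question-answering")

print(ner(text))
print(summariser(text))
print(qa(question="What did the organisation fail to do?", context=text))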

Visualisations

Photo by Annie Spratt / Unsplash

Visualisations are very often the result of the data science journey. Extracting and processing data can be very rewarding, but you would like to show others how your work is also useful.

One of my first aims in 2019 was to show how PDPC decisions have changed since they were first issued in 2016. Decisions became greater in number, more frequent, and shorter in length. There was clearly a shift and an intensifying of effort in enforcement.

I also wanted to visualise how the PDPC was referring to its own decisions. Such visualisation would allow one to see which decisions the PDPC was relying on to explain its decisions. This would definitely help to narrow down which decisions are worth reading in a deluge of information. As such, I created a network graph and visualised it. I called the result my “Star Map”.

Data continued to be very useful in supporting the conclusions I made about the enforcement situation in Singapore. For example, how great an impact would the increase in maximum penalties in the latest amendments to the law have? Short answer: Probably not much, but they still have a symbolic effect.

What's Next?

As mentioned, I have been focusing on other priorities, so I haven't been working on PDPC-Decisions for a while. However, my next steps were:

  • I wanted to train a machine to process judgements for named entity recognition and summarisation. For the second task, one probably needs to use a transformer in a pipeline and experiment with what works best.
  • Instead of using Selenium and Beautiful Soup, I wanted to use scrapy to create a sustainable solution to extract information regularly.

Feel free to let me know if you have any comments!

#Features #PDPC-Decisions #PersonalDataProtectionAct #PersonalDataProtectionCommission #Decisions #Law #NaturalLanguageProcessing #PDFMiner #Programming #Python #spaCy #tech

Love.Law.Robots. – A blog by Ang Hou Fu

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Networks are one of the most straightforward ways to analyse judgements and cases. We can establish relationships between cases and transform them into data. Computers crunch data. Computers produce a beautiful graph. Now I have the latest iteration of the network data of the Personal Data Protection Commission of Singapore’s decisions. I call it a star map, because it looks like a constellation, and it shows us the “stars” of the universe of PDPC decisions.

Making the star map

One of the earliest features of pdpc-decisions is to grab citations and make a table of which decisions cite which. In the broadest sense of the word, the automated process works: running a regular expression search on a document is going to find citations, except when the citations do not follow the expected format. I had to go through each decision to check its accuracy.

Notwithstanding, the first version of the star map was produced fairly quickly.

The first version of the star map.

Here are the biggest differences between the first version and the second version:

  • The graph is directed instead of undirected. As such, the data now records that Case A cites Case B; previously, it only knew that Case A is connected to Case B.
  • PageRank determined the size of the nodes and labels. In short, PageRank determines how likely a decision is to be visited when following citations randomly. It is very similar to how search engines work. The nodes in the previous version were sized based on the number of connections they enjoyed.

I also tweaked the visualisation so that it looks more appealing. You can see now that there are many cases which don’t cite any decision. Out of curiosity, I asked the computer to partition the nodes and edges; the colours of the nodes and edges show what the computer thinks are distinct groups of decisions.

I used Gephi from start to end. The pdpc-decisions package produces the citation relationships in CSV format, which I then imported into Gephi. While sight-checking the decisions, I made any revisions directly in Gephi (there were about 1 to 3 fixes overall). Gephi was then used to produce the visualisation you see as the final product.
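For readers who prefer staying in Python, here is a hedged sketch of computing PageRank from such a citation CSV with networkx (I used Gephi for the real thing); it assumes a headerless file of source,target rows where the source decision cites the target.

import csv
import networkx as nx

graph = nx.DiGraph()
with open("citations.csv", newline="") as f:
    for source, target in csv.reader(f):
        graph.add_edge(source, target)  # an edge means "source cites target"

ranks = nx.pagerank(graph)
# Print the ten highest-ranked decisions
for decision, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{score:.4f}  {decision}")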

How effective is the star map?

PageRank makes a difference

As a result of using directed references and PageRank, you can see some subtle differences between the two versions of the star map. Some decisions (nodes) have changed their size as a result of applying a different metric.

If you can’t see it (and I don’t blame you), you can look for the mammoth SingHealth decision. SingHealth was one of the largest nodes in version 1 but seems to have disappeared in the new version. (It’s actually between Tutor City and Genki Sushi in the orange zone.)

Location of SingHealth decision on the star map.

The connection-count metric, which only counts how many connections a node has, made SingHealth pretty large because it cites several cases, and some cases cite it. This isn’t the whole picture, though. An analogy helps to explain better. You might have a friend who name-drops all the important people they know, but nobody important knows them. It is part of my case that SingHealth was an outlier — it might have cited several decisions as the basis of its decision, but few decisions thereafter have cited it.

Changing the metric does not remove all biases. The top decisions — Aviva 2017, NUS, Toh Shi Printing Singapore, M Star Movers and The Cellar Door — were all published in 2016 and the early part of 2017. This isn’t exactly unexpected. Older decisions have more time to accumulate citations. Ceteris paribus, an older decision is more likely to be cited than a newer one.

Looking at Citations is still useful

One should not overstate the power of citations in PDPC decisions. Unlike court judgements, PDPC decisions are not binding precedents, so the PDPC does not have to justify its reasoning by reference to previous cases. This point is sometimes lost on respondents, who try to “distinguish” their case by referring to previous decisions.

Once we understand that the PDPC’s use of citations is purely voluntary, it follows that the PDPC cites a decision when it thinks that decision helps to explain its own reasoning. The most significant nodes in the map relate to a series of decisions that cite several previous decisions and also cite information and practices from other jurisdictions like the UK, Canada and Hong Kong. These include M Star Movers, Aviva (2017), NUS, Orchard Turn and Social Metric.

Besides the size of a node and the density of its connections, the node’s position in the map is also useful. The general position of the nodes is not random. The closer a node is to the centre, the more likely it is that the node is a hub. As such, besides SingHealth, Tutor City in 2019 received quite a lot of focus as well.

Being able to visualise the relationships between decisions in a network map helps us to understand the whole landscape of decisions in aggregate. The larger nodes represent important decisions. The nodes near the centre represent hubs. We can now summarise the landscape of decisions.

Next steps

This work tries to answer a practical question. If I only have enough time to read a few PDPC decisions, which decisions should I read? Of course, there is room for professional judgement or opinion. However, we should not ignore a computational method either. Arguably, it’s more objective as well. Hopefully, I get around to creating a document using such research.

Having allowed this project to go over a few iterations also allowed me to understand its limitations. The use of decisions is far too narrow in this case because a lot of interesting information can also be found in the Commission’s guidelines. This is not captured in studying how decisions cite each other.

However, the obstacle to implementing this was actually architectural. As the structure was created for decisions, its data structures can’t accommodate other reference materials. A date of publication or complaint isn’t present in a guideline. The code needs to be rewritten, possibly around a graph database.

Urrgh, the project never ends.

#PDPC-Decisions

Love.Law.Robots. – A blog by Ang Hou Fu


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

This post is the latest one dealing with creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. If I don’t turn out to be an ardent perfectionist, I might have a working corpus.

Problem Statement

I must have written this a dozen times here already. Law reports in Singapore, unfortunately, are generally available in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. If you want a solution that isn’t burdened with copyright issues, you have to work with the PDFs.

However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer. Thus a PDF comprises a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor is merely reading the text boxes, the results will not look pretty. See an example below:

    The operator, in the mistaken belief that the three statements belonged
     to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
     moved it to the main bin. Further, the operator completed the QC form
     in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
     Page 3 of 7
     
     B.
     (i)
     (ii)
     9.
     (iii)
     envelopes tallied with the expected total from the run. As the envelope
     was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
     were by-passed, and the envelope was sent out without anyone realising
     that it contained two extra statements. The manual completion of the QC
     form by the operator to show that the number of successful and rejected
     envelopes tallied allowed this to go undetected. 

Those lines are broken up by new lines, by the way. So the computer reads the line “in a way that showed that the number of ‘successful’ and rejected”, and then wonders: rejected what?! The rest of the story continues about seven lines away. None of this makes sense to a human, let alone a computer.

Previous workarounds were... unimpressive

Most beginner data science books advise programmers to use regular expressions as a way to clean the text. This allowed me to choose what to keep and reject in the text. I then joined up the lines to form paragraphs. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.

As mentioned in that post, I was very bothered by removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack-a-mole figuring out which regular expression to use to remove a line. The results were not good enough for me, as several errors continued to persist in the text.

Not wanting to play whack-a-mole, I decided to train a model to read lines and make a guess as to what lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.

However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting results I would have obtained if I had worked harder at regular expressions. This method was still not satisfactory for me.

I needed a new approach.

A breakthrough — focus on the layout rather than the content

I was browsing GitHub issues on PDFMiner when it hit me like a brick. The issue author had asked how to get the font data of a text in PDFMiner.

I then realised that I had another option other than removing text more effectively or efficiently.

Readers don’t read documents by reading lines and deciding whether to ignore them or not. The layout — the way the text is arranged and the way it looks visually — informs the reader about the structure of the document.

Therefore, if I knew how the document was laid out, I could determine its structure based on observing its rules.

Useful layout information to analyse the structure of a PDF.

Rather than relying only on the content of the text to decide whether to keep or remove text, you now also have access to information about what it looks like to the reader. The document’s structure is now available to you!

Information on the structure also allows you to improve the meaning of the text. For example, I replaced a footnote marker with the footnote’s actual text, so that the footnote sits close to where it is supposed to be read rather than at the bottom of the page. This makes the text more meaningful to a computer.

Using PDFMiner to access layout information

To access the layout information of the text in a PDF, unfortunately, you need to understand the inner workings of a PDF document. Fortunately, PDFMiner simplifies this and provides it in a Python-friendly manner.

Your first port of call is to extract the page of the PDF as an LTPage. You can use the high level function extract_pages for this.

Once you extract the page, you will be able to access the text objects as a list of containers. That’s right — using list comprehension and generators will allow you to access the containers themselves.

Once you have access to each container in PDFMiner, the metadata can be found in its properties.
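A minimal sketch of the idea (the filename is a placeholder, and this is not the project’s actual code):

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages("decision.pdf"):  # each page is an LTPage
        containers = [element for element in page_layout if isinstance(element, LTTextContainer)]
        for container in containers:
            # Layout metadata lives in the container's properties.
            print(round(container.x0), round(container.y0), round(container.height),
                  repr(container.get_text()[:40]))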

The properties of a LTPage reveal its layout information.

It is not apparent from the documentation, so studying the source code itself can be very useful.

Code Highlights

Here are some highlights and tips from my experience so far implementing this method using PDFMiner.

Consider adjusting LAParams first

Before trying to rummage through your text boxes in the PDF, pay attention to the layout parameters which PDFMiner uses to extract text. Parameters like line, word and char margin determine how PDFMiner groups text together. Setting these parameters effectively can sometimes get you usable text without further processing.

Notwithstanding, I did not use LAParams as much for this project. This is because the PDFs in my collection can be very arbitrary in terms of layout. Asking PDFMiner to generalise in this situation did not lead to satisfactory results. For example, PDFMiner would try to join lines together in some instances and not be able to in others. As such, processing the text boxes line by line was safer.

Retrieving text margin information as a list

As mentioned in the diagram above, margin information can be used to separate the paragraph numbers from their text. The original method of using a regular expression to detect paragraph numbers had the risk of raising false positives.

The x0 property of an LTTextContainer represents its left coordinate, which is its left margin. Assuming you had a list of LTTextContainers (perhaps extracted from a page), a simple list comprehension will get you all the left margins of the text.

    from collections import Counter
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams

    limit = 1  # Set a limit if you want to remove uncommon margins
    # extract_pages yields LTPage objects; take the first (and only) page requested
    first_page = next(extract_pages(pdf, page_numbers=[0], laparams=LAParams(line_margin=0.1, char_margin=3.5)))
    containers = [container for container in first_page if isinstance(container, LTTextContainer)]
    text_margins = [round(container.x0) for container in containers]
    c = Counter(text_margins)
    sorted_text_margins = sorted([margin for margin, count in c.most_common() if count > limit])

Counting the number of times a text margin occurs is also useful to eliminate elements that do not belong to the main text, such as titles and sometimes headers and footers.

You also get a hierarchy by sorting the margins from left to right. This is useful for separating first-level text from the second-level, and so on.
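Here is a hedged sketch of how that hierarchy might be used (assuming the containers and sorted_text_margins from the snippet above): assign each container a level based on which common margin its x0 sits at.

    def margin_level(container, sorted_text_margins, tolerance=2):
        """Return 0 for the leftmost common margin, 1 for the next, and so on."""
        for level, margin in enumerate(sorted_text_margins):
            if abs(round(container.x0) - margin) <= tolerance:
                return level
        return None  # the container sits outside the common margins (e.g. a title)

    levels = [margin_level(container, sorted_text_margins) for container in containers]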

Using the gaps between lines to determine if there is a paragraph break

As mentioned in the diagram above, the gap between lines can be used to determine if the next line is a new paragraph.

In PDFMiner, as in PDF, the x/y coordinate system of a page starts from its lower-left corner (0,0). The gap between the current container and the one immediately before it is between the current container’s y1 (the current container’s top) and the previous container’s y0 (the previous container’s base). Conversely, the gap between the current container and the one immediately after is between the current container’s y0 (the current container’s base) and the next container’s y1 (the next container’s top).

Therefore, if we have a list of LTTextContainers, we can write a function to determine whether the gap after a line is bigger than the gap before it.

    from typing import List

    def check_gap_before_after_container(containers: List[LTTextContainer], index: int) -> bool:
        index_before = index - 1
        index_after = index + 1
        # Page coordinates start at the lower-left corner, so the gap above this container
        # is previous.y0 - current.y1, and the gap below is current.y0 - next.y1.
        gap_before = round(containers[index_before].y0 - containers[index].y1)
        gap_after = round(containers[index].y0 - containers[index_after].y1)
        return gap_after >= gap_before

Therefore, if the function returns true, we know that the current container is the last line of a paragraph. I would then save the paragraph and start a new one, keeping the paragraph content as close to the original as possible.
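A hedged usage sketch (not the project’s actual code) of walking the containers in reading order and closing off a paragraph whenever the gap check says the current line is the last one:

    paragraphs = []
    current_lines = []
    for index, container in enumerate(containers):
        current_lines.append(container.get_text().strip())
        boundary = (
            index == len(containers) - 1  # the last container always closes a paragraph
            or (index > 0 and check_gap_before_after_container(containers, index))
        )
        if boundary:
            paragraphs.append(" ".join(current_lines))
            current_lines = []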

Retrieve footnotes with the help of the footnote separator

As mentioned in the diagram, footnotes are generally smaller than the main text, so you could get footnotes by comparing the height of the main text with that of the footnotes. However, this method is not so reliable because some small text may not be footnotes (figure captions?).

For this document, the footnotes are contained below a separator, which looks like a line. As such, our first task is to locate this line on the page. In my PDF file, another layout object, LTRect, describes lines. An LTRect with a height of less than 1 point appears not as a rectangle, but as a line!

    if footnote_line_container := [container for container in page if all([ 
        isinstance(container, LTRect),
        container.height < 1,
        container.y0 < 350,
        container.width < 150,
        round(container.x0) == text_margins[0]
    ])]:
        footnote_line_container = sorted(footnote_line_container, key=lambda container: container.y0)
        footnote_line = footnote_line_container[0].y0

Once we have obtained the y0 coordinate of the LTRect, we know that any text box under this line is a footnote. This accords with the intention of the maker of the document, which is sometimes Microsoft Word!

You might notice that I have placed several other conditions to determine whether a line is indeed a footnote marker. It turns out that LTRects are also used to mark underlined text. The extra conditions I added are the length of the line (container.width < 150), whether the line is at the top or bottom half of the page (container.y0 < 350), and that it is in line with the leftmost margin of the text (round(container.x0) == text_margins[0]).

For Python programming, I also found using the built-in all() and any() to be useful in improving the readability of the code if I have several conditions.

I also liked using a new feature in the latest 3.8 version of Python: the walrus operator! The code above might not work for you if you are on an older version.

Conclusion

Note that we are reverse-engineering the PDF document. This means that getting perfect results is very difficult, and you would probably need to try several times to deal with the side effects of your rules and conditions. The code which I developed for pdpc-decisions can look quite complicated and it is still under development!

Given a choice, I would prefer using documents where the document’s structure is apparent (like web pages). However, such an option may not be available depending on your sources. For complicated materials like court judgements, studying the structure of a PDF can pay dividends. Hopefully, some of the ideas in this post will get you thinking about how to exploit such information more effectively.

#PDPC-Decisions #NaturalLanguageProcessing #PDFMiner #Python #Programming

Love.Law.Robots. – A blog by Ang Hou Fu


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Avid followers of Love Law Robots will know that I have been hard at work creating a corpus of Personal Data Protection Commission decisions. Downloading them and pre-processing them has taken a lot of work! However, it has helped me create interesting charts that show insights at a macro level. How many decisions are released in a year, and how long are they? Which decisions refer to each other in a network?

Unfortunately, what I would really like to do is natural language processing. A robot should analyse text and draw conclusions from it. This is much closer to the bread and butter of what lawyers do. I have been poking around spaCy, but its regular expression functions don’t really cut it.

This is not going to be the post where I say I trained a model to tell me what the ratio decidendi of a decision is. Part of the difficulty is finding a problem that is solvable given my current learning. So I have picked something that is useful and can be implemented fast.

The Problem

The biggest problem I have is that the decisions, like many other judgements produced by Singapore courts, are in PDF. PDFs look great on paper but are gibberish to a computer. I explained this problem in an earlier post about pre-processing.

See the earlier post, Get rid of the muff: pre-processing PDPC Decisions, from the same series on my Data Science journey with PDPC Decisions.

Having seen how the PDF extraction tool does its work, you can figure out which lines you want or don’t want. You don’t want empty lines. You don’t want lines with just numbers on them (these are usually page numbers). Citations? One-word indexes? The commissioner’s name. You can’t exactly think up all the various permutations and then brainstorm regular expression rules to cover all of them.

It becomes whack-a-mole.

Training a Model for the win

It was during one of those rage-filled “how many more things do I have to do to improve this” nights when it hit me.

“I know what lines I do not want to keep. Why don’t I just tell the computer what they are instead of abstracting the problem with regular expressions?!”

Then I suddenly remembered machine learning. After learning which lines I would keep or discard, the robot could make a statistical guess. If the robot can guess right most of the time, I would no longer need to cover every case with regular expressions.

So, I got off my chair, selected dozens of PDFs and converted them into text. Then I split the text into lines in a CSV file and started classifying them.

Classification of lines for training

I managed to compile a list of over five thousand lines for my training and test data. After that, I lifted the training code from spaCy’s documentation to train the model. My MacBook Pro’s fans got noisy, but it was done in a matter of minutes.
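For the curious, the training loop looks roughly like the sketch below. It follows the spaCy 2.x text classification example (the version current at the time); the actual code was lifted from spaCy’s documentation, and the real training set had over five thousand labelled lines.

```python
# A rough sketch of training a "Keep or Remove" text classifier with spaCy 2.x;
# spaCy 3.x uses a different, config-driven training workflow.
import random
import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat")
nlp.add_pipe(textcat)
textcat.add_label("KEEP")
textcat.add_label("REMOVE")

# A handful of labelled lines for illustration; the real data set was much larger.
train_data = [
    ("YEONG ZEE KIN", {"cats": {"KEEP": 0.0, "REMOVE": 1.0}}),
    ("[2019] SGPDPC 18", {"cats": {"KEEP": 0.0, "REMOVE": 1.0}}),
    ("Regulation 10(2) provides that a contract must:", {"cats": {"KEEP": 1.0, "REMOVE": 0.0}}),
]

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(epoch, losses)

print(nlp("Hello.").cats)  # e.g. {'KEEP': 0.7, 'REMOVE': 0.3}
```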

Asking the model to classify sentences gave me the following results:

Text | Remove or Keep
Hello. | Keep
Regulation 10(2) provides that a contract referred to in regulation 10(1) must: | Keep
YEONG ZEE KIN | Remove
[2019] SGPDPC 18 | Remove
transferred under the contract”. | Keep
There were various internal frameworks, policies and standards which apply to | Keep
(vi) | Remove

Applying it to text extracted from the PDF gives us a resulting document which can be used in the corpus. You can check out the code used for this in the GitHub repository, under the branch “line_categoriser”; a rough sketch of the idea follows the link below.

houfu/pdpc-decisions on GitHub: Data Protection Enforcement Cases in Singapore.
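Here is a rough sketch of how the filtering step could work once the model is trained and saved; the model path and the decision rule are illustrative and not the actual line_categoriser code.

```python
# A minimal sketch of applying a trained text classifier to keep or drop
# lines extracted from a PDF. The model path below is hypothetical.
import spacy

nlp = spacy.load("path/to/line_categoriser_model")

def filter_lines(lines):
    kept = []
    for doc in nlp.pipe(lines):
        # keep a line when the model is more confident in "KEEP" than "REMOVE"
        if doc.cats.get("KEEP", 0.0) >= doc.cats.get("REMOVE", 0.0):
            kept.append(doc.text)
    return kept

raw_lines = ["[2019] SGPDPC 18", "The Organisation did not have any written policies."]
print(filter_lines(raw_lines))
```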

Conclusion

Will I use this for production purposes? Nope. When I ran some decisions through this process, the effectiveness was, unfortunately, about the same as using regular expressions. The model, which weighs nearly 19 MB, also took noticeably longer to process a series of strings.

My current thinking on this specific problem points to a different approach. It would involve studying the PDF internals and observing features like font size and style to determine whether a line is a header or a footnote. It would also make it easier to join lines of the same style into a paragraph. Unfortunately, that is homework for another day.
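To make the idea concrete, here is a rough sketch using pdfminer.six to read the average font size of each line, which could later be used to tag headers and footnotes or to join lines of the same style into paragraphs. The file name is a placeholder.

```python
# A sketch of inspecting PDF internals with pdfminer.six: report each text
# line together with its average character size.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

def line_styles(pdf_path):
    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [char.size for char in line if isinstance(char, LTChar)]
                if sizes:
                    # lines sharing a size (and font) could be joined into one paragraph
                    yield round(sum(sizes) / len(sizes), 1), line.get_text().strip()

for size, text in line_styles("decision.pdf"):  # "decision.pdf" is a placeholder
    print(size, text[:60])
```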

Was it a wasted adventure? I do not think so. Ultimately, I wanted to experiment, and embarking on a task I could complete in a week of nights was insightful in determining whether I could do it, and what the limitations of machine learning are for certain tasks.

So, hold on to your horses, I will be getting there much sooner now.

#PDPC-Decisions #spaCy #NaturalLanguageProcessing

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

As mentioned in my previous post, I have not been able to spend as much time writing code as I wanted. I had to rewrite a lot of code due to the layout change on the PDPC website. That was not the post I wanted to write. I have finally been able to write about my newest forays in this project.

Enforcement information

I had noticed that the summary provided by the Personal Data Protection Commission provided an easy place to cull basic information. So, I have added enforcement information. Decisions now tell you whether a financial penalty or a warning was meted out.

Information is extracted from the summaries using spaCy’s rule-based matching. It isn’t perfect: some text does not really fit the mould. However, due to the way the summaries are written, the information is mostly extracted accurately.

Visualising the parts of speech in a typical sentence can allow you to write rules to extract information.
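As an illustration only, here is what a rule for picking out a financial penalty might look like, assuming a summary phrased roughly like the example sentence below; the actual patterns in pdpc-decisions are more involved.

```python
# A minimal sketch of spaCy's rule-based Matcher; the pattern and sentence
# are illustrative, not the actual rules used in pdpc-decisions.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# spaCy v3 calling convention; v2 uses matcher.add(name, None, pattern)
matcher.add("FINANCIAL_PENALTY", [[
    {"LOWER": "financial"},
    {"LOWER": "penalty"},
    {"LOWER": "of"},
    {"ORTH": "$"},
    {"LIKE_NUM": True},
]])

doc = nlp("A financial penalty of $90,000 was imposed on the organisation.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # financial penalty of $90,000
```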

This is the first time I have used spaCy or any natural language processing for this purpose. Remarkably, it has been fast. Culling this information (as well as the other extra features) only added about two hundred seconds to building a database from scratch. I would like to find more avenues to use these newfound techniques!

spaCy · Industrial-strength Natural Language Processing in Python: a free, open-source library featuring NER, POS tagging, dependency parsing, word vectors and more.

References

Court decisions are special in that they often require references to leading cases. This is because they are either binding (stare decisis) or persuasive to the decision maker. Of course, previous PDPC Decisions are not binding on the PDPC. Lately, respondents have been referring to the body of cases to argue they should be treated alike. I have not read a decision where this argument has worked.

Nevertheless, the network of cases referring to and referred to by one another offers remarkably interesting insights. Imagine that we are looking at a social network of cases. To establish a point, the Personal Data Protection Commission does refer to earlier cases. All things being equal, a case with more references is more influential.

pdpc-decisions now reads the text of each decision to create a list of the decisions it refers to (“referring to”). From that list, we can also create, for each decision, a list of the decisions which make references to it (“referred by”). Because of the haphazard way the PDPC has been writing its decisions and its citations, this is not perfect either, but it is still reasonably accurate.
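A rough sketch of the idea, using an illustrative citation pattern rather than the actual pdpc-decisions code, looks like this:

```python
# Build "referring to" and "referred by" lists from citations of the
# form [2019] SGPDPC 18 found in the decision texts.
import re
from collections import defaultdict

CITATION = re.compile(r"\[\d{4}\] SGPDPC \d+")

def build_reference_lists(decisions):
    """decisions maps each decision's own citation to its full text."""
    referring_to = {}
    referred_by = defaultdict(list)
    for citation, text in decisions.items():
        refs = sorted(set(CITATION.findall(text)) - {citation})
        referring_to[citation] = refs
        for ref in refs:
            referred_by[ref].append(citation)
    return referring_to, dict(referred_by)

decisions = {
    "[2019] SGPDPC 18": "... following [2017] SGPDPC 1 and [2016] SGPDPC 5 ...",
    "[2017] SGPDPC 1": "... no earlier cases are cited here ...",
}
print(build_reference_lists(decisions))
```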

As I mentioned, compiling a network of decisions can offer some interesting insights. So here it is — the social network of PDPC decisions.

I guess this is the real pdpc decisions in one chart

Update (24/4/2020): The chart was lumping together the Aviva case in 2018 with the Aviva case in 2017. The graph has been updated. Not much has changed in the big picture though.

Of course, a more advanced visualisation tool would allow you to drill down to see which cases are more influential. However, a big diagram like the above shows you which are the big boys in this social network.
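For instance, a small sketch using networkx (which is not part of pdpc-decisions) could rank decisions by how often they are cited, once the reference lists exist:

```python
# Rank decisions by in-degree (how many later decisions cite them) as a
# crude measure of influence. The reference lists here are made up.
import networkx as nx

referring_to = {
    "[2019] SGPDPC 18": ["[2017] SGPDPC 1", "[2016] SGPDPC 5"],
    "[2018] SGPDPC 3": ["[2016] SGPDPC 5"],
}

graph = nx.DiGraph()
for case, refs in referring_to.items():
    for ref in refs:
        graph.add_edge(case, ref)  # edge from the citing case to the cited case

for case, citations in sorted(graph.in_degree, key=lambda pair: pair[1], reverse=True):
    print(case, citations)
```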

Before I leave this section, here’s a fun fact to take home. Based on the computer’s analysis, over 68% of PDPC decisions refer to one another. That’s a lot of chatter!

Moving On

I keep thinking I have finished my work here, but there always seem to be new things coming up. Here is some interesting information I would like to find out:

  • In a breach, how many data subjects were affected? How exactly does this affect the penalty given by the PDPC?
  • What kinds of information are most often involved in a data breach? How does this affect the penalty given by the PDPC?
  • How long does it take for the PDPC to complete investigations? Can we create a timeline from the information in a decision?

You will just have to keep watching this space! What kind of information would be interesting to you?

#PDPC-Decisions #NaturalLanguageProcessing #spaCy

At first, I thought I would have some free time to do some projects while being cooped up at home. Turns out, being cooped up at home with a child on a 24/7 basis leaves time for not much else.

Among the less noticeable things happening in the privacy space in Singapore is that the PDPC has sort of revamped its website. It looks far more modern now. For one, it doesn’t use a crappy system font like Verdana; it now uses “Avenir”, which is French for “future”. There is also a much better use of color across the website. You can compare it if you like.

Overall, it looks great. If I wanted to quibble, it could use a lot fewer words on screen.

Actually, I have a lot more to quibble about. After updating the site, they removed the RSS feed for their announcements. This was the primary way I obtained updates from the site.

Furthermore, pdpc-decisions was also wrecked by the new layout changes. Since I am still using the scraper for various personal projects, I’m working overtime to get it working again. To date, 484 lines of code have been affected. Oh my gosh!

So pardon the lack of updates. I am just really out of time. Groan.

#PDPC-Decisions

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

It’s a milestone today! I wrote something that I felt was worthy of v1.0. That version number is magical because it means the software works. Yup, pdpc-decisions is one!

houfu/pdpc-decisions on GitHub: Data Protection Enforcement Cases in Singapore.

What’s the Problem?

What does pdpc-decisions do? Basically, it goes through the PDPC enforcement decisions site and does three things:

  1. Creates a table of basic information for every decision published on the site
  2. Downloads every decision on that site as a PDF
  3. Converts each PDF into a plain text file which can be used as a data set

This means that you can get your own copy of the PDPC enforcement decisions by running the code! Refer to the Readme for technical instructions on how to get going.

Why would you want to get your own copy of every decision ever delivered by the PDPC? Even if you are not going to do anything else with the data, I can see some uses already:

  1. If you don’t subscribe to LawNet, downloading your own library is the best way to read and review the decisions anytime you like.
  2. The table of basic information is really useful for getting a glance at all the decisions in one place. Let’s face it: there are many decisions now, and it is difficult to keep up with them.
  3. Although it’s great that the PDPC provides search functionality, being able to view more than five decisions at a time is pretty nifty.

Unlike jurisdictions with a Legal Information Institute, such as Hong Kong and New Zealand, Singapore is not an easy place to get legal information for free. Furthermore, the legal profession’s obsession with PDF makes accessing such information difficult for computers. This tool makes such information much easier for computers to access. The results have already allowed me to make a time comparison of decisions pretty easily.

Chart showing the number and average length of PDPC decisions since April 2016

Things I learnt

pdpc-decisions uses Python, which is remarkable because it is a language I picked up less than a year ago with very little offline or online training. Besides dealing with a new programming language, I also had to figure out how to use web scraping tools like selenium and Beautiful Soup, as well as Python testing tools such as pytest. (Coverage is 94%!)
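For readers unfamiliar with those tools, here is a generic sketch of how selenium and Beautiful Soup fit together; the URL is only a placeholder, and the actual scraper is considerably more involved.

```python
# Render a page with selenium (a real browser, so JavaScript runs), then
# parse the resulting HTML with Beautiful Soup.
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://www.pdpc.gov.sg/"  # placeholder: the decisions listing page has moved over time

driver = webdriver.Chrome()
driver.get(URL)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Beautiful Soup makes it easy to pull links out of the rendered HTML
for link in soup.find_all("a", href=True):
    print(link["href"])
```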

I also got to experiment with distributing software, primarily via Docker. This tool is especially well suited to being run as an image, since it is best run only once. Not only did I succeed at getting automated builds done, I also managed to set up continuous integration through Travis-CI.

So the unit tests, automated testing and builds all work. Hopefully, I have written code that will be easier to maintain. Having reached v1.0, I will be leaving this code alone for a while.

What’s Next?

Of course the code is not perfect. I have spotted a few typos here and there. I might want to leave it alone to collect a few more bugs before I create a new version.

Furthermore, the site changes, so I expect the code to break eventually. During the course of writing this software, I already noticed some subtle changes to the website. Since I do use this package from time to time, I will be able to maintain it as and when the site changes.

The ultimate goal of this code, however, leads to my slow-going super-project, which I call zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be the last post I make on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases like the Supreme Court, the State Court, or even the Strata Titles Board. However, given my interests in using enforcement decisions as a dataset, I started with PDPC first. Nevertheless, someone might find it useful so if there is an interest, please let me know!

For now though, I am going to sit back and enjoy my code. Let’s run it again! Haha!

#PDPC-Decisions #OpenSource
