Ready to mine free online legal materials in Singapore? Not so fast!
I have been mulling over developing an extensive online database of free legal materials in the flavour of OpenLawNZ or an LII for the longest time. Free access to such materials is one problem to solve, but I'm also hoping to compile a dataset to develop AI solutions. I have tried and demonstrated this with PDPC's data previously, and I am itching to expand the project sustainably.
However, being a lawyer, I am concerned about the legal implications of scraping government websites. Would using these materials be a breach of copyright law? In other countries, people accept that the public should generally be allowed to use such public materials. However, I am not very sure of this here.
Clearer guidelines for scraping legal materials?
I was thus genuinely excited about the amendments to the Copyright Act in Singapore this year. According to the press release, they will be operational in November, so they will be here soon.
Copyright Bill – Singapore Statutes OnlineSingapore Statutes Online is provided by the Legislation Division of the Singapore Attorney-General’s ChambersSingapore Statutes OnlineThe Copyright Bill is expected to be operationalised in November 2021.
[ Update 21 November 2021: The bill has, for the most part, been operationalised.]
Two amendments are particularly relevant in my context:
Using publicly disclosed materials from the government is allowed
In sections 280 to 282 of the Bill, it is now OK to copy or communicate public materials to facilitate more convenient viewing or hearing of the material. It should be noted that this is limited to copying and communicating it. Presumably, this means that I can share the materials I collected on my website as a collection.
Computational data analysis is allowed.
The amendments expressly say that using a computer to extract data from a work is now permitted. This is great! At some level, the extraction of the material is to perform some analysis or computation on it — searching or summarising a decision etc. I think some limits are reasonable, such as not communicating the material itself or using it for any other purpose.
A recent journal article points to “fair use” as the way forward
I was amazed to find an article in the SAL Journal titled “Copying Right in Copyright Law” by Prof David Tan and Mr Thomas Lee, which focused on the issue that was bothering me. The article focuses on data mining and predictive analytics, and it substantially concerns robots and scrapers.
Singapore Academy of Law Journale-First MenuLink to the journal article on E-First at SAL Journals Online.
On the new exception for computational data analysis, the article argues that the two illustrations I mentioned earlier were “inadequate and there is significant ambiguity of what lawful access means in many situations”. Furthermore, because the illustrations were not illuminating, it might create a situation where justified uses are prohibited. With much sadness, I agree.
More interestingly, based on some mathematics and a survey, the authors argue that an open-ended general fair use defence for data mining is the best way forward. As opposed to a rule-based exception, such a defence can adapt to changes better. Stakeholders (including owners) also prefer it because it appeals to their understanding of the economic basis of data mining.
You can quibble with the survey methodology and the mathematics (which I think is very brave for a law journal article). I guess it served its purpose in showing the opinion of stakeholders in the law and the cost analysis very well. I don’t suspect it will be cited in a court judgement soon, but hopefully, it sways someone influential.
We could use a more developer-friendly approach.
There was a time when web scraping was dangerous for a website. In those times, websites can be inundated with requests by automated robots, leading them to crash. Since then, web infrastructure has improved, and techniques to defeat malicious actors have been developed. The great days of “slashdotting” a website has not been heard of for a while. We’ve mostly migrated to more resilient infrastructure, and any serious website on the internet understands the value of having such infrastructure.
In any case, it is possible to scrape responsibly. Scrapy, for example, allows you to queue requests regularly or identify yourself as a robot or scraper, respecting robots.txt. If I agreed not to degrade a website’s performance, which seems quite reasonable, shouldn’t I be allowed to use it?
Being more developer-friendly would also help government agencies find more uses for their works. For now, most legal resources appear to cater exclusively for lawyers. Lawyers will, of course, find them most valuable because it’s part of their job. However, others may also need such resources because they can’t afford lawyers or have a different perspective on how information can be helpful. It’s not easy catering to a broader or other audience. If a government agency doesn’t have the resources to make something more useful, shouldn’t someone else have a go? Everyone benefits.
In all, I identified three broad categories of terms.
Totally Progressive: Singapore Statutes Online 👍👍👍
Source: https://sso.agc.gov.sg/Help/FAQ#FAQ_8 (Accessed 20 October 2021)
Things I like:
- They expressly mention the use of “automated means”. It looks like they were prepared for robots!
- Conditions appear reasonable. There’s a window for extraction and guidelines to help properly cite and identify the extracted materials.
Things I don’t like:
- The Singapore Statutes Online website is painful to extract from and doesn’t feature any API.
- Knowing what they expect scrapers to do gives me confidence in further exploring this resource.
Totally Bonkers: Personal Data Protection Commission 😖😖😖
Source: https://www.pdpc.gov.sg/Terms-and-Conditions (Accessed 20 October 2021)
Things I like:
- They expressly mention the use of “robots” and “spiders”. It looks like they were prepared!
Things I don’t like:
- It doesn’t allow you to use a “manual process” to monitor its Contents. You can’t visit our website to see if we have any updates!
- What is an automatic device? Like a feed reader? (Fun fact: The PDPC obliterated their news feed in the latest update to their website. The best way to keep track of their activities is to follow their LinkedIn)
- PDPC suggests that you get written permission but doesn’t tell you what circumstances they will give you such permission.
- I have no idea what an unreasonable or disproportionately large load is. It looks like I have to crash the server to find out! (Just kidding, I will not do that, OK.)
- I have no idea what happened to the PDPC, such that it had to impose such unreasonable conditions on this activity (I hope I am not involved in any way 😇). It might be possible that someone with little knowledge went a long way.
- At around paragraph 6, there is a somewhat complex set of terms allowing a visitor to share and use the contents of the PDPC website for non-commercial purposes. This, however, still does not gel with this paragraph 20, and the confusion is not user or developer-friendly, to say the least.
- You can’t contract out fair use or the computational data analysis exception, so forget it.
Totally Clueless: Strata Titles Board 🎈🎈🎈
Materials, including source code, pages, documents and online graphics, audio and video in The Website are protected by law. The intellectual property rights in the materials is owned by or licensed to us. All rights reserved. (Government of Singapore © 2006).
Apart from any fair dealings for the purposes of private study, research, criticism or review, as permitted in law, no part of The Website may be reproduced or reused for any commercial purposes whatsoever without our prior written permission.
Source: https://www.stratatb.gov.sg/terms-of-use.html# (Accessed 20 October 2021)
Things I like:
- Mentions fair dealing as permitted by law. However, they have to update to “fair use” or “permitted use” once the new Copyright Act is effective.
Things I don’t like:
- You can use the information for “commercial purposes” if you get written permission. It doesn’t tell you in what circumstances they will give you such permission. (This is less upsetting than PDPC’s terms.)
- It doesn’t mention robots, spiders or “automatic devices”.
- The owner probably has no idea what data mining, predictive analytics etc., are. They need to buy the new “Law and Technology” book.
One might be surprised to find that terms of using a website, even when supposedly managed by lawyers, feature unclear, problematic, misleading, and unreasonable terms. As I mentioned, very little thought goes into drafting such terms most of the time. However, they provide obstacles to others who may want to explore new uses of a website or resource. Hopefully, more owners will proactively clean up their sites once the new Copyright Act becomes effective. In the meantime, this area provides lots of risks for a developer.
Love.Law.Robots. – A blog by Ang Hou Fu