Love.Law.Robots. by Ang Hou Fu


Key takeaways:
Web scraping is a useful and unique project that is good for beginners.
Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing reusable code and the features that web scrapers commonly need.
Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There's a lot of information out there on the world wide web, but very little of it is presented in a way that lets you simply take it. By scraping websites, you turn that information into structured data. It's a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

Automate Boring Stuff: Get Python and your Web Browser to download your judgements — This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing! Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote it. (Love.Law.Robots., Houfu)

After spending gobs of time plying through books and web articles, I created a web scraper that did the job right.

GitHub – houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore. The code repository of the original web scraper is available on GitHub.

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age shows when it recommends installing it in a “virtual environment” (who installs anything in Python outside a virtual environment these days?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.
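For the record, installation is the usual affair, virtual environment included:

python -m venv venv
source venv/bin/activate
pip install scrapy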

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial

This reminds me of Angular and the ng command. (Do people still do that these days?)

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you started coding, you would not find this intuitive.

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest

class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1"
        }

        response = requests.post(CASE_LISTING_URL, data=default_form_data)

        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with requests and yields items, the standard data format in scrapy.

def parse(self, response, **kwargs):
    response_json = response.json()
    for item in response_json["items"]:
        from datetime import datetime
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] \
            if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] \
            if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"https://www.pdpc.gov.sg{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision
        )
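The CommissionDecisionItem the spider yields is defined elsewhere in the project. Its definition isn’t reproduced in this post, but a sketch using scrapy’s Item and Field classes, matching the fields used above, would look like this:

import scrapy

class CommissionDecisionItem(scrapy.Item):
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()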

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because scrapy makes its web requests concurrently and asynchronously, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines handle the processing of items after your spider has scraped them. The most common pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.
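For illustration, here is the duplicates filter from scrapy’s own documentation, which assumes each item carries an id field:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            # We have processed this item before, so drop it
            raise DropItem(f"Duplicate item found: {item!r}")
        self.ids_seen.add(adapter['id'])
        return item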

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program implement multithreading and asynchronous operations effectively.

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

import re

import bs4
import requests
from itemadapter import ItemAdapter

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, flags=re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"
        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also the item has a field called file_urls. I did not create this data field. It’s a field used to tell scrapy to download a file from the web.

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. With the lower priority number of 300, CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.
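The actual component isn’t reproduced here, but subclassing FilesPipeline can be as thin as overriding file_path to control how downloaded files are named — a minimal sketch, assuming you want each PDF’s original file name instead of a hash:

from scrapy.pipelines.files import FilesPipeline

class PDPCDecisionDownloadFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each download after the last segment of its URL
        return request.url.split('/')[-1]

The files pipeline also needs the FILES_STORE setting to know where to put the downloads.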

It’s odd at first not to write code to download files. Furthermore, writing components for your pipeline and deciding on their priority in a settings file isn’t very intuitive if you’re not sure what’s going on. Even so, I am grateful that I did not have to write the downloading code myself.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting that defines the priority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, the sheer number of settings isn’t user-friendly. Still, it hints at how much you can do with scrapy, including randomising the delay before downloading files from the same site, logging, and storage adapters for common backends like AWS or FTP.
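A few illustrative lines from a settings file (the values here are examples, not the project’s actual configuration):

# settings.py
DOWNLOAD_DELAY = 2               # base delay (in seconds) between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # wait between 0.5x and 1.5x of DOWNLOAD_DELAY
LOG_LEVEL = 'INFO'               # how chatty the logs should be
FILES_STORE = 's3://my-bucket/decisions/'  # storage backend for FilesPipeline (hypothetical bucket)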

As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.

There are lots to explore here!

Conclusion

Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate what special considerations are involved in web scraping and how scrapy was helping me to do that.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy affords power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Data Science with Judgement Data – My PDPC Decisions Journey: An interesting experiment to apply what I learnt in Data Science to the area of law. Read more interesting adventures.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy


Introduction

In January 2022, the 2020 Revised Edition of over 500 Acts of Parliament (the primary legislation in Singapore) was released. It’s a herculean effort to update so many laws in one go. A significant part of that effort is to “ensure Singapore’s laws are understandable and accessible to the public” and came out of an initiative named Plain Laws Understandable by Singaporeans (or PLUS).

Keeping Singapore laws accessible to all – AGC, together with the Law Revision Committee, has completed a universal revision of Singapore’s Acts of Parliament! pic.twitter.com/76TnrNCMUq

— Attorney-General's Chambers Singapore (@agcsingapore) December 21, 2021

After reviewing the list of changes they made, such as replacing “notwithstanding” with “despite”, I frankly felt underwhelmed. An earlier draft of this article was titled “PLUS is LAME”. The revolution is not forthcoming.

I was bemused by my strong reaction to a harmless effort with noble intentions. It led me to wonder how to evaluate a claim, such as whether and how much changing a bunch of words would lead to a more readable statute. Did PLUS achieve its goals of creating plain laws that Singaporeans understand?

In this article, you will be introduced to well-known readability statistics such as Flesch Reading Ease and see them applied to laws in Singapore. If you like to code, you will also be treated to some Streamlit, Altair-viz and Python Kung Fu, and all the code involved can be found in my GitHub repository.

GitHub – houfu/plus-explorer: A streamlit app to explore changes made by PLUS. The code used in this project is accessible in this public repository.

How would we evaluate the readability of legislation?

Photo by Jamie Street / Unsplash

When we say a piece of legislation is “readable”, we are saying that a certain class of people will be able to understand it when they read it. It also means that a person encountering the text will be able to read it with little pain. Thus, “Plain Laws Understandable by Singaporeans” suggests that most Singaporeans, not just lawyers, should be able to understand our laws.

In this light, I am not aware of any tool in Singapore or elsewhere which evaluates or computes how “understandable” or readable laws are. Many, especially in the common law world, seem to believe in their gut that laws are hard and out of reach for anyone except lawyers.

In the meantime, we have to rely on readability formulas such as Flesch Reading Ease to evaluate the text. These formulas rely on semantic and syntactic features to calculate a score or index, which shows how readable a text is. Like Gunning FOG and Dale-Chall, some of these formulas map their scores to US Grade levels. Very approximately, these translate to years of formal education. A US Grade 10 student would, for example, be equivalent to a Secondary four student in Singapore.

After months of mulling about, I decided to write a pair of blog posts about readability: one that's more facts oriented: (https://t.co/xbgoDFKXXt) and one that's more personal observations (https://t.co/U4ENJO5pMs)

— brycew (@wowitisbryce) February 21, 2022

I found these articles to be a good summary and valuable evaluation of how readability scores work.

These formulas were created a long time ago and for different fields. Flesch Reading Ease, for example, dates back to the 1940s, and its grade-level cousin, Flesch-Kincaid, was developed under contract to the US Navy in 1975 to assess training materials. In particular, using a readability statistic like FRE, you can tell whether a book is suitable for your kid.

I first considered using these formulas when writing interview questions for docassemble. Sometimes, some feedback can help me avoid writing rubbish when working for too long in the playground. An interview question is entirely different from a piece of legislation, but hopefully, the scores will still act as a good proxy for readability.
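The scores themselves are easy to compute in Python. Here’s a minimal sketch using the textstat package (an assumption on my part; any readability library would do, and the section text is made up):

import textstat

section_text = (
    "Where any premises are let and the landlord fails to keep them in repair, "
    "the tenant may, after giving notice in writing, carry out the repairs."
)

print(textstat.flesch_reading_ease(section_text))           # higher = more readable
print(textstat.gunning_fog(section_text))                   # approximate US grade level
print(textstat.automated_readability_index(section_text))   # approximate US grade level
print(textstat.dale_chall_readability_score(section_text))  # based on a familiar-words list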

Selecting the Sample

Browsing vinyl music at a fair. Photo by Artificial Photography / Unsplash

To evaluate the claim, two pieces of information regarding any particular section of legislation are needed – the section before the 2020 Edition and the section in the 2020 Edition. This would allow me to compare them and compute differences in scores when various formulas are applied.

I reckon it’s possible to scrape the entire website of statutes online, create a list of sections, select a random sample and then delve into their legislative history to pick out the sections I need to compare. However, since there is no API to access statutes in Singapore, it would be a humongous and risky task to parse HTML programmatically and hope it is created consistently throughout the website.

Mining PDFs to obtain better text from Decisions — After several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner. In one of my favourite programming posts, I extracted information from PDFs, even though the PDPC used at least three different formats to publish their decisions. Isn’t Microsoft Word fantastic?

I decided on an alternative method which I shall now present with more majesty:

The author visited the subject website and surveyed various acts of Parliament. When a particular act is chosen by the author through his natural curiosity, he evaluates the list of sections presented for novelty, variety and fortuity. Upon recognising his desired section, the author collects the 2020 Edition of the section and compares it with the last version immediately preceding the 2020 Edition. All this is performed using a series of mouse clicks, track wheel scrolling, control-Cs and control-Vs, as well as visual evaluation and checking on a computer screen by the author. When the author grew tired, he called it a day.

I collected over 150 sections as a sample and calculated and compared the readability scores and some linguistic features for them. I organised them using a pandas data frame and saved them to a CSV file so you can download them yourself if you want to play with them too.
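Saving the collection is a pandas one-liner. A sketch, with hypothetical values:

import pandas as pd

sections = [
    {"section": "Civil Law Act 1909 Section 6", "current_flesch_reading_ease": 35.2},
    # ... one dict per surveyed section
]
df = pd.DataFrame(sections)
df.to_csv('data.csv.gz', index=False)  # pandas infers gzip compression from the extension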

data.csv.gz (76 KB) — gzipped file containing the source data of 152 sections, their content in the 2020 Rev Edn, etc.

Exploring the Data with Streamlit

You can explore the data associated with each section yourself using my PLUS Explorer! If you don’t know which section to start with, you can always click the Random button a few times to survey the different changes made and how they affect the readability scores.

Screenshot of PLUS Section Explorer: https://share.streamlit.io/houfu/plus-explorer/main/explorer.py

You can use my graph explorer to get a macro view of the data. For the readability scores, you will find two graphs:

  1. A graph that shows the distribution of the value changes amongst the sample
  2. A graph that shows an ordered list of the readability scores (from most readable to least readable) and the change in score (if any) that the section underwent in the 2020 Edition.

You can even click on a data point to go directly to its page on the section explorer.

Screenshot of PLUS graph explorer: https://share.streamlit.io/houfu/plus-explorer/main/graphs.py

This project allowed me to revisit Streamlit, and I am proud to report that it’s still easy and fun to use. I still like it more than Jupyter Notebooks. I tried using ipywidgets to create the form to input data for this project, but I found it downright ugly and not user-friendly. If my organisation forced me to use Jupyter, I might reconsider it, but I wouldn’t be using it for myself.

Streamlit — works out of the box and is pretty too. Here are some features that were new to me since I last used Streamlit probably a year ago:

Pretty Metric Display

Metric display from Streamlit

My dear friends, this is why Streamlit is awesome. You might not be able to create a complicated web app or a game using Streamlit. However, Streamlit’s creators know what is essential or useful for a data scientist and provide it with a simple function.

The code to make the wall of stats (including their changes) is pretty straightforward:

st.subheader('Readability Statistics')

# Create three columns
flesch, fog, ari = st.columns(3)

# Fill each column with a metric
flesch.metric("Flesch Reading Ease",
              dataset["current_flesch_reading_ease"][section_explorer_select],
              dataset["current_flesch_reading_ease"][section_explorer_select]
              - dataset["previous_flesch_reading_ease"][section_explorer_select])

# For Fog and ARI, the lower the better, so the delta colour is inverse
fog.metric("Fog Scale",
           dataset["current_gunning_fog"][section_explorer_select],
           dataset["current_gunning_fog"][section_explorer_select]
           - dataset["previous_gunning_fog"][section_explorer_select],
           delta_color="inverse")

ari.metric("Automated Readability Index",
           dataset["current_ari"][section_explorer_select],
           dataset["current_ari"][section_explorer_select]
           - dataset["previous_ari"][section_explorer_select],
           delta_color="inverse")

Don’t lawyers deserve their own tools?

Now Accepting Arguments

Streamlit apps are very interactive (I came close to creating a board game using Streamlit). Streamlit used to suffer from a significant limitation — except for the consumption of external data, you can’t interact with it from outside the app.

It’s still experimental, but you can now pass arguments in the app’s address, just like an HTML form submitted via a URL. Streamlit has also made this simple, so you don’t have to bother too much about encoding the query string correctly.

I used it to communicate between the graphs and the section explorer. Each section has its address, and the section explorer gets the name of the act from the arguments to direct the visitor to the right section.

# Get and parse the HTTP request
query_params = st.experimental_get_query_params()

# If the keyword is in the address, use it!
if "section" in query_params:
    section_explorer_select = query_params.get("section")[0]
else:
    section_explorer_select = 'Civil Law Act 1909 Section 6'

You can also set the address within the Streamlit app to reduce the complexity of your app.

# Once this callback is triggered, update the address
def on_select():
    st.experimental_set_query_params(section=st.session_state.selectbox)

# Select box to choose a section as an alternative.
# Note that the key keyword is used to specify
# where the widget's value is stored in session state.
st.selectbox("Select a Section to explore",
             dataset.index,
             on_change=on_select,
             key='selectbox')

So all you need is a properly formed address for the page, and you can link it using a URL on any webpage. Sweet!

Key Takeaways

Changes? Not so much.

From the list of changes, most of the revisions amount to swapping words for others. For word count, most sections experienced a slight increase or decrease of up to 5 words, and a significant number of sections had no change at all. The word count heatmap lays this out visually.

Unsurprisingly, this produced little to no effect on the readability of the section as computed by the formulas. For Flesch Reading Ease, a vast majority fell within a band of ten points of change, which is roughly a grade or a year of formal education. This is shown in the graph showing the distribution of changes. Many sections are centred around no change in the score, and most are bound within the band as delimited by the red horizontal rulers.

This was similar across all the readability formulas used in this survey (Automated Readability Index, Gunning FOG and Dale Chall).

On the face of it, the 2020 Revision Edition of the laws had little to no effect on the readability of the legislation, as calculated by the readability formulas.

Laws remain out of reach to most people

I was also interested in the raw readability score of each section. This would show how readable a section is.

Since the readability formulas we are considering use years of formal schooling as a gauge, we can use the same measure to locate our target audience. If we use secondary school education as the minimum level of education (In 2020, this would cover over 75% of the resident population) or US Grade 10 for simplicity, we can see which sections fall in or out of this threshold.
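With the scores mapped to grade levels, the check is a simple filter on the data frame — a sketch, reusing the dataset and column names from the explorer code:

# Sections out of reach of a US Grade 10 reader, by the Automated Readability Index
hard = dataset[dataset["current_ari"] > 10]
print(f"{len(hard)} of {len(dataset)} sections exceed Grade 10")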

Most if not all of the sections in my survey are out of reach for a US Grade 10 student or a person who attained secondary school education. This, I guess, proves the gut feeling of most lawyers that our laws are not readable to the general public in Singapore, and PLUS doesn’t change this.

Take readability scores with a pinch of salt

Suppose you are going to use the Automated Readability Index. In that case, you will need nearly 120 years of formal education to understand an interpretation section of the Point-to-Point Passenger Transport Industry Act.

Section 3 of the Point-to-Point Passenger Transport Industry Act makes for ridiculous reading.

We are probably stretching the limits of tools made for processing prose in the late 60s. It turns out that many formulas lean heavily on the average number of words per sentence — they are based on the not-so-absurd notion that long sentences are hard to read. Unfortunately, many sections cram a great many words into one interminable sentence. This skews the scores significantly and makes the mapping to particular audiences unreliable.

The fact that some scores don’t make sense when applied to legislation doesn’t invalidate the point that legislation is hard to read. Whatever historical reasons legislation has for being expressed the way it is, the style harms the people who have to use it.

In my opinion, the scores are useful for telling whether a person with a secondary school education can understand a piece of writing. That was, after all, what the scores were made for. However, I am more doubtful whether we can derive any meaning from a score of, for example, ARI 120 compared to a score of ARI 40.

Improving readability scores can be easy. Should it?

Singaporean students know that there is no point in studying hard; you have to study smart.

Having realised that the number of words per sentence features heavily in readability formulas, the easiest thing to do to improve a score is to break long sentences up into several sentences.
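You can see the mechanics with a toy example (the text is made up, and textstat is again my assumption):

import textstat

long_version = (
    "Where any premises are let and the landlord fails to keep them in repair, "
    "the tenant may, after giving notice in writing to the landlord, carry out "
    "the repairs and deduct the cost from the rent."
)
split_version = (
    "Where any premises are let, the landlord must keep them in repair. "
    "If the landlord fails to do so, the tenant may carry out the repairs after "
    "giving notice in writing. The tenant may then deduct the cost from the rent."
)

# Same substance, shorter sentences: the split version should score noticeably higher
print(textstat.flesch_reading_ease(long_version))
print(textstat.flesch_reading_ease(split_version))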

True enough, breaking up one long sentence into two seems to affect the score profoundly: see Section 32 of the Defence Science and Technology Agency Act 2000. The detailed mark changes section shows that when the final part of subsection (3) is broken off into subsection (4), the scores improved by nearly one grade.

It’s curious why more sections were not broken up this way in the 2020 Revised Edition.

However, breaking long sentences into several short ones doesn’t always improve reading. It’s important to note that such scores focus on linguistic features, not content or meaning. So in trying to game the score, you might be losing sight of what you are writing for in the first place.

Here’s another reason why readability scores should not be the ultimate goal. One of PLUS’s revisions is to remove gendered nouns — chairperson instead of chairman, his or her instead of his only. Trying to replace “his” with “his or her” harms readability by generally increasing the length of the sentence. See, for example, section 32 of the Weights and Measures Act 1975.

You can agree or disagree whether legislation should reflect our values such as a society that doesn't discriminate between genders. (It's interesting to note that in 2013, frequent legislation users were not enthusiastic about this change.) I wouldn't suggest though that readability scores should be prioritised over such goals.

Here’s another point which shouldn’t be missed. Readability scores focus on linguistic features. They don’t consider things like the layout or even graphs or pictures.

A striking example of this is the interpretation section found in legislation. These sections aren’t perfect, but most legislation users are okay with them: you use the various indents to find the term you need.

Example of an interpretation section and the use of indents to assist reading.

However, the indents are ignored because white space is not visible to the formula. To the computer, the section appears as one long sentence, and readability is computed accordingly, read: terrible. This was the provision that required 120 years of formal education to read.

I am not satisfied that readability should be ignored in this context, though. Interpretation sections, despite the creative layout, remain very difficult to read. That’s because it is still text-heavy, and even when read alone, the definition is still a very long sentence.

A design that relies more on graphics and diagrams would probably use fewer words than this. Even though the scores might be meaningless in this context, they would still show up as an improvement.

Conclusion

PLUS might have a noble aim of making laws understandable to Singaporeans, but the survey of the clauses here shows that its effect is minimal. It would be great if drafters refer to readability scores in the future to get a good sense of whether the changes they are making will impact the text. Even if such scores have limitations, they still present a sound and objective proxy of the readability of the text.

I felt that the changes were too conservative this time. An opportunity to look back and revise old legislation will not return for a while (the last time such a project was undertaken was in 1985). Given the scarcity of opportunity, I am not convinced that we should (a) try to preserve historical nuances which very few people can appreciate, or (b) avoid superficial changes in meaning given the advances in statutory interpretation in the last few decades in Singapore.

Beyond using readability scores that focus heavily on text, it would be helpful to consider more legal design — I sincerely believe pictures and diagrams will help Singaporeans understand laws more than endlessly tweaking words and sentence structures.

This study also reveals that it might be helpful to have a readability score designed for legal documents. You would have to create a study group comprising people with varying education levels, test them on various texts or legislation, then create a machine-learning model that predicts what level of difficulty a piece of legislation might be. A tool like that could probably use models that observe several linguistic features: see this, for example.

Finally, while this represents a lost opportunity for making laws more understandable to Singaporeans, the 2020 Revised Edition includes changes that improve the quality of life for frequent legislation users. These include citing all the Acts of Parliament by year rather than by their historic and quaint chapter numbers, and removing information that is no longer relevant today, such as provisions relating to the commencement of the legislation. As a frequent legislation user, I did look forward to these changes.

It’s just that I wouldn’t be showing them off to my mother any time soon.

#Features #DataScience #Law #Benchmarking #Government #LegalTech #NaturalLanguageProcessing #Python #Programming #Streamlit #JupyterNotebook #Visualisation #Legislation #AGC #Readability #AccesstoJustice #Singapore


Way back in December 2021, I caught wind of the 2020 Revised Edition of the statutes in Singapore law:

The AGC highlighted that the revised legislation now uses “simpler language”. I was curious about this claim and looked over their list of changes. I was not very impressed with them.

However, I did not want to rely only on my subjective intuition to make that conclusion. I wanted to test it using data science. This meant I had to compare the texts, calculate readability statistics for the changes, and see what changed.

Read more...


In one of my more popular posts last year, I remarked glibly that turning the outcome of 5 million random Monopoly JR games into a truth was magical. It wasn't funny because there was magic involved (there's none). It was funny because as a lawyer I couldn't wrap my head around it.

That's because this profession is very averse to numbers and data. I don't know the reasons why, but you can witness the dismissive attitude towards them in a recent case heard at the US Supreme Court:

Roberts: Is there any evidence that 15 weeks is so much worse than viability?
Reproductive Rights lawyer: [data data data]
Roberts: “Putting the data aside…”

— Elie Mystal (@ElieNYC) December 1, 2021

Or the uproar when the Supreme Court of Canada tried to describe its reasons in a diagram:

I stand by my concerns! ;)

— Amy Salyzyn (@AmySalyzyn) November 23, 2021

A disturbing statistic fails to convince

The city. Photo by Tamara Gore / Unsplash

There's nothing funny about the death penalty in Singapore, though. A group of 17 Malays on death row for drug offences challenged their sentences. They don't allege that anything in particular happened to them. Instead, they point to statistics cobbled together from public sources showing that Malays were overrepresented in the death row — Malays made up 77% of Singaporeans on death row for drug offences, even though they only form 13.5% of the general population.

They thus alleged that the investigation and prosecution of drug offences discriminated against them, even if it was unconscious or not deliberate.

Unsurprisingly, the case was dismissed late last year. The judgement displays all the hallmarks of the scepticism the law holds towards statistics. Take this critical part of the judgement at [71] as an example:

Further, even if the plaintiffs’ statistical data is accepted as complete and accurate, the only variables reflected are the ethnic group and nationality of each offender. No account is taken of the multitude of other variables that would have contributed to the convictions and sentences in each case. The manner in which the plaintiffs’ statistics are presented therefore presupposes that all these offenders were equally situated and that the sole reason for differential treatment was their ethnicity, which are the very facts the plaintiffs bear the burden of showing.

Any statistics presented as evidence will always have these problems because it is in the nature of statistics. Take the simple linear regression below as an example. The blue dots are samples, and the red line is a linear regression, calculated by minimising the vertical distances between the line and all the samples. Only two variables are presented, and the majority of the samples do not “fit” the line exactly. This might be caused by some circumstance unique to each sample. “Common sense and logic” still tell us that there is a trend.

Source: https://simple.wikipedia.org/wiki/File:Linear_regression.svg
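To make the point concrete, here is a small simulation: a real underlying trend, noisy samples, and a least-squares fit that recovers the trend even though almost no sample sits exactly on the line.

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 3, 50)  # true trend y = 2x + 1, plus noise

slope, intercept = np.polyfit(x, y, 1)  # least-squares linear regression
print(slope, intercept)  # close to 2 and 1, despite most points missing the line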

As such, the fact that not all accused are given death sentences or some get reduced sentences does not invalidate the trend that the cases are showing. If there was no discrimination, we would see a random distribution, not a trend.

Even if we recognise that there is a trend, or in the context of the case that there is an overrepresentation of a particular community in sentencing, it doesn’t tell us why this is happening.

The problem starkly illustrates the conundrum that correlation does not imply causation.

Source: xkcd

We know how many people are given death sentences under the law, but there may be several reasons why there may be idiosyncrasies:

  • Police are over-policing a particular community
  • Prosecutors are less “lenient” towards a particular community
  • Courts are inclined to give particular sentences
  • A particular community is more “prone” to this type of criminal activity
  • A particular community is less able to fight charges due to fewer resources (e.g. access to good legal advice)

A statistic alone would not be able to differentiate the cause or how much.

Without saying as much, the court appeared to have a lot of difficulty grappling with what exactly is causing the trend. At once, it isn’t sure whether the plaintiff’s case of discrimination is direct or indirect (see paragraph 62). Earlier in the judgement, we are treated to a scintillating report of double-crossing witnesses and a potential smoking gun, which was ultimately excluded (see paragraphs 5 to 15). In conclusion, the statistic by itself was not sufficient to prove or ground any case in discrimination under constitutional law.

The prosecution also went over a list of complaints that are commonly associated with statistical data (see paragraph 33):

  • The makeup of the data does not explain itself — why from 2010? How is a particular offender considered as part of the Malay community or some other community based on the reported case alone?
  • The data is selective and biased. No unreported cases. No cases from persons who avoided the death penalty in certain circumstances.

There are other potential problems. We don't know how significant this survey was (the judgement does not say), but given that only 8 death sentences were passed in 2020, the number of cases considered is not likely to be large. This means that outliers, such as idiosyncratic prosecution or offender decisions, are likely to have a more significant impact on the sample and the result. This doesn’t mean that there was no discrimination — it means measuring it using statistics is difficult.

Ultimately, the number of people sentenced to death alone is probably not nuanced enough to tell us how fair or unfair a law is.

One should not take this too far though — the statistics prepared by the applicants might be based on the only information publicly available. Without easy access to complete and accurate data, it’s unfair to blame its imperfections on the applicants. However, this might also be the case where information isn’t even collected. How do we express the decisions of courts, prosecutors or the police in data and quantify bias in that?

Another point — while the data may not be perfect, proving something in law is not the same as in science. For example, in the criminal standard of proof, an accused is convicted when there is no reasonable doubt, and we accept circumstantial evidence even when we pass the death penalty for murder. I would believe that it is possible to form a winning case using statistics in combination with other evidence.

However, an advocate will need to be able to explain numbers and statistical concepts to a judge. This will not be an easy task in most contexts, and will only be reserved for the most confident of advocates.

Conclusion

This was one of many bad outings for statistics in the law. It might have been caused by a poor understanding of statistics or the limitations of using statistics in the legal sphere. I have yet to see a judgement demonstrate a sound grasp of these issues. If you do, please share!

#Singapore #SupremeCourtSingapore #DataScience #Judgements #Law


As I continued with my project of dealing with 5 million Monopoly Junior games, I had a problem. Finding a way to play hundreds of games per second was one thing. How was I going to store all my findings?

Limitations of using a CSV File

Initially, I used a CSV (Comma Separated Values) file to store the results of every game. Using a CSV file is straightforward, as Python and pandas can load them quickly. I could even load them using Excel and edit them using a simple text editor.

However, every time my Streamlit app tried to load a 5GB CSV file as a pandas dataframe, the tiny computer started to gasp and pant. If you try to load a huge file, your application might suffer performance issues or failures. Using a CSV to store a single table seems fine. However, once I attempted anything more complicated, like the progress of several games, its limitations became painfully apparent.

Let’s use SQL instead!

The main alternative to using a CSV is to store data in a database. Of all the database options out there, SQL databases are the biggest dogs of them all. Several applications, from web applications to games, use some flavour of SQL. Python, a “batteries included” programming language, even has a module for processing SQLite — basically a light SQL database you can store as a file.

SQL doesn’t require all its data to be loaded to start using the data. You can use indexes to search data. This means your searches are faster and less taxing on the computer. Doable for a little computer!

Most importantly, you use data in a SQL database by querying it. This means I can store all sorts of data in the database without worrying that it would bog me down. During data extraction, I can aim for the maximum amount of data I can find. Once I need a table from the database, I would ask the database to give me a table containing only the information I wanted. This makes preparing data quick. It also makes it possible to explore the data I have.
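With Python's built-in sqlite3 module, that querying workflow looks like this (the file, table and column names are hypothetical, though they match the models later in this post):

import sqlite3

con = sqlite3.connect('games.db')
# Ask the database for just the answer we want, instead of loading everything
rows = con.execute("SELECT winner, COUNT(*) FROM game GROUP BY winner").fetchall()
print(rows)  # e.g. [(1, 10234), (2, 9766)]
con.close()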

Why I procrastinated on learning SQL

To use a SQL database, you have to write operations in the SQL language, which looks like a sentence of gobbledygook to the untrained eye.

https://imgs.xkcd.com/comics/exploits_of_a_mom.png — I'll stick to symbols in my children's names, please, thanks.

SQL is also in the top 10 on the TIOBE index of programming languages. Higher than Typescript, at the very least.

I have heard several things about SQL — it’s similar to Excel formulas. However, I dreaded learning a new language to dig a database. The landscape of SQL was also very daunting. There are several “flavours” of SQL, and I was not sure of the difference between PostgreSQL, MySQL or MariaDB.

There are ORM (Object-relational mapping) tools for SQL for people who want to stick to their programming languages. ORMs allow you to use SQL with your favourite programming language. For Python, the main ORM is SQLAlchemy. Seeing SQL operations in my everyday programming language was comforting at first, but I found the documentation difficult to follow.

Furthermore, I found other alternative databases easier to pick up and deploy. These included MongoDB, a NoSQL database. Such databases rely on a different concept — data is stored in “documents” instead of tables — and MongoDB comes with a full-featured ORM. For many web applications, the “document” idiom applies well. It wouldn’t make sense, though, if I wanted a table.

Enter SQLModel!

A few months ago, an announcement from the author of FastAPI excited me — he was working on SQLModel. SQLModel would be a game-changer if it were anything like FastAPI. FastAPI featured excellent documentation. It was also so developer-friendly that I didn’t worry about the API aspects in my application. If SQLModel could reproduce such an experience for SQL as FastAPI did for APIs with Python, that would be great!

SQLModel — SQL databases in Python, designed for simplicity, compatibility, and robustness.

As it turned out, SQLModel was a very gentle introduction to SQL. The following code creates a connection to an SQLite database you would create in your file system.

from typing import Optional

from sqlmodel import SQLModel, create_engine, Session
from sqlalchemy.engine import Engine

engine: Optional[Engine] = None

def create_db_engine(filename: str):
    global engine
    engine = create_engine(f'sqlite:///{filename}')
    SQLModel.metadata.create_all(engine)
    return engine

Before creating a connection to a database, you may want to make some “models”. These models get translated to a table in your SQL database. That way, you will be working with familiar Python objects in the rest of your code while the SQLModel library takes care of the SQL parts.

The following code defines a model for each game played.

from typing import Optional

from sqlmodel import SQLModel, Field

class Game(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str = Field()
    parent: Optional[str] = Field()
    num_of_players: int = Field()
    rounds: Optional[int] = Field()
    turns: Optional[int] = Field()
    winner: Optional[int] = Field()

So, every time the computer finished a Monopoly Junior game, it would store the statistics as a Game. (It’s the line where the result is assigned.)

def play_basic_game(num_of_players: Literal[2, 3, 4], turns: bool) -> Tuple[Game, List[Turn]]:
    if num_of_players == 3:
        game = ThreePlayerGame()
    elif num_of_players == 4:
        game = FourPlayerGame()
    else:
        game = TwoPlayerGame()
    game_id = uuid.uuid4()
    logging.info(f'Game {game_id}: Started a new game.')
    end_turn, game_turns = play_rounds(game, game_id, turns)
    winner = decide_winner(end_turn)
    result = Game(name=str(game_id),
                  num_of_players=num_of_players,
                  rounds=end_turn.get_value(GameData.ROUNDS),
                  turns=end_turn.get_value(GameData.TURNS),
                  winner=winner.value)
    logging.debug(f'Game {game_id}: {winner} is the winner.')
    return result, game_turns

After playing a bunch of games, these games get written to the database in an SQL session.

def write_games(games: List[Game]):
    with Session(engine) as session:
        for game in games:
            session.add(game)
        session.commit()

Reading your data from the SQLite file is relatively straightforward as well. If you were writing an API with FastAPI, you could pass the model as a response model, and you can get all the great features of FastAPI directly.

I had already stored some Game data in “games.db” in the following code. I created a backend server that would read these files and return all the Turns belonging to a Game. This required me to select all the turns in the game that matched a unique id. (As you might note in the previous section, this is a UUID.)

games_engine = create_engine('sqlite:///games.db')

@app.get("/game/{game_name}", response_model=List[Turn])
def get_turns(game_name: str):
    """
    Get a list of turns from :param game_name.
    """
    with Session(games_engine) as session:
        check_game_exists(session, game_name)
        return session.exec(select(Turn).where(Turn.game == game_name)).all()

Limitations of SQLModel

Of course, being marked as version “0.0.6”, this is still early days in the development of SQLModel. The documentation is also already helpful but still a work in progress. A key feature that I would be looking out for is migrating data, most notably for different software versions. This problem can get very complex quickly, so anything would be helpful.

I also found creating the initial tables very confusing. You create models by subclassing SQLModel, make sure those model classes are imported (so that they register with SQLModel’s metadata), and then create the tables by calling SQLModel.metadata.create_all(engine). This doesn’t appear pretty pythonic to me.

Would I continue to use SQLModel?

There are use cases where SQLModel will now be applicable. My Monopoly Junior data project is one beneficiary. The library provided a quick, working solution to use a SQL database to store results and data without needing to get too deep and dirty into SQL. However, if your project is more complicated, such as a web application, you might seriously consider its current limitations.

Even if I might not have the opportunity to use SQLModel in the future, there were benefits from using it now:

  • I became more familiar with the SQL language after seeing it in action and also comparing an SQL statement with its SQLModel equivalent in the documentation. Now I can write simple SQL statements!
  • SQLModel is described as a combination of SQLAlchemy and Pydantic. Once I realised that many of the features of SQLModel are extensions of SQLAlchemy’s, I was able to read and digest SQLAlchemy’s documentation and library. If I can’t find what I want in SQLModel now, I could look into SQLAlchemy.

Conclusion

Storing my Monopoly Junior games’ game and turn data became straightforward with SQLModel. However, the most significant takeaway from playing with SQLModel is my increased experience using SQL and SQLAlchemy. If you ever had difficulty getting onto SQL, SQLModel can smoothen this road. How has your experience with SQLModel been? Feel free to share with me!

#Programming #DataScience #Monopoly #SQL #SQLModel #FastAPI #Python #Streamlit


This is the story of my lockdown during a global pandemic. [ Cue post-apocalypse 🎼]

Amid a global pandemic, schools and workplaces shut down, and families had to huddle together at home. There was no escape. Everyone was getting on each others' nerves. Running out of options, we played Monopoly Junior, which my daughter recently learned how to play. As we went over several games, we found the beginner winning every game. Was it beginner's luck? I seethed with rage. Intuitively, I knew the odds were against me. I wasn't terrible at Monopoly. I had to prove it with science.

How would I prove it with science? This sounds ridiculous at first, but you'd do it by playing 5 million Monopoly Junior games. Yes, once you play enough games, suddenly your anecdotal observations become INSIGHT. (Senior lawyers might also call it experience, but I am pretty sure that they never experienced 5 million cases.)

This is the story of how I tried different ways to play 5 million Monopoly JR games.

What is Monopoly Junior?

The cover for the Monopoly Junior board game.

Most people know what the board game called Monopoly is. You start with a bunch of money, and the goal is to bankrupt everyone else. To bankrupt everyone else, you go round the board and purchase properties, build hotels and take everyone else's money through rent-seeking behaviour (the literal kind). It's a game for eight and up. Mainly because you have to count with hundreds and thousands, and you would need the stamina to last an entire night to crush your opposition.

If you are, like my daughter, five years old, Monopoly JR is available. Many complicated features in Monopoly (for example, counting past 30, auctions and building houses and hotels) are removed for young players to join the game. You also get a smaller board and less money. Unless you receive the right chance card, you don't have a choice once you roll the dice. You buy, or you pay rent on the space you land on.

As it's a pretty deterministic game, it's not fun for adults. On the other hand, the game spares you the ignominy of mano-a-mano battles with your kids by ending the game once anyone becomes bankrupt. Count the cash, and the winner is the richest. It ends much quicker.

Hasbro Monopoly Junior Game (A69843480) : Amazon.sg: Toys — Instead of letting computers have all the fun, now you can play the game yourself! (I earn a commission when you buy from this amazon affiliate link)

Determining the Approach, Realising the Scale

Ombre Balloons. Photo by Amy Shamblen / Unsplash

There is a pretty straightforward way to write this program. Write the rules of the game in code and then play it five million times. No sweat!

However, if you wanted to prove my hypothesis that players who go first are more likely to win, you would need to do more:

  • We need data. At the minimum, you would want to know who won in the end. This way, you can find out the odds you'd win if you were the one who goes first (like my daughter) or the last player (like me).
  • As I explored the data more, interesting questions began to pop up. For example, given any position in the game, what are the odds of winning? What kind of events cause the chances of winning to be altered significantly?
  • The data needs to be stored in a manner that allows me to process and analyse efficiently. CSV?
  • It'd be nice if the program would finish as quickly as possible. I'm excited to do some analysis, and waiting for days can be burdensome and expensive! (I'm running this on a DigitalOcean Droplet.)

The last point becomes very troublesome once you realise the scale of the project. Here's an illustration: you can run many games (20,000 is a large enough sample, maybe) to get the odds of a player winning a game. Imagine you decide to do this after every turn for each game. If the average game had, say, 100 turns (a random but plausible number), you'd be playing 20,000 X 100 = 2 million additional games already! Oh, let's say you also want to play three-player and four-player games too...

It looks like I have got my work cut out for me!

Route One: Always the best way?

I decided to go for the most obvious way to program the game. In hindsight, what seemed obvious isn't obvious at all. Having started programming using object-orientated languages (like Java), I decided to create classes for all my data.

An example of how I used a class to store my data. Not rocket science.
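That snippet was an image in the original post; here is a rough sketch of the idea, with hypothetical names:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Player:
    name: str
    cash: int = 0
    position: int = 0
    properties: List[int] = field(default_factory=list)

@dataclass
class GameState:
    players: List[Player]
    rounds: int = 0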

The classes also came in handy when I wrote the program to move the game.

Object notation makes programming easy.

It was also pretty fast, too — 20,000 two-player games took less than a minute. 5 million two-player games would take about 4 hours. I used python's standard multiprocessing modules so that several CPUs can work on running games by themselves.

Yes, my computers are named after cartoon characters.
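The multiprocessing part can be as simple as a process pool mapping over game numbers — a sketch, not the original code:

from multiprocessing import Pool

def play_one_game(game_number: int) -> dict:
    ...  # play a single game and return its result

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per CPU core by default
        results = pool.map(play_one_game, range(20_000))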

After working this out, I decided to experiment with being more “Pythonic”. Instead of using classes, I would use Python dictionaries to store data. This also had the side effect of flattening the data, so you would not need to access an object within an object. With a dictionary, I could easily use pandas to save a CSV.

This snippet shows how the basic data for a whole game is created.

Instead of using object notation to find data about a player, the data is accessed by using its key in a python dictionary.

The same code to move a player, implemented for a dictionary.
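Those snippets were also images; the dictionary-based version looks roughly like this (the keys are hypothetical):

def new_game_data(num_of_players: int) -> dict:
    data = {"rounds": 0, "turns": 0}
    for i in range(1, num_of_players + 1):
        data[f"player_{i}_cash"] = 0
        data[f"player_{i}_position"] = 0
    return data

# Access is by key rather than attribute, for example:
# data["player_1_position"] = (data["player_1_position"] + roll) % num_spaces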

I didn't think it would make a difference honestly. However, I found the speed up was remarkable: almost 10 times! 20,000 two-player games now took 4 seconds. The difference between 4 seconds and less than a minute is a trip to the toilet, but for 5 million games, it was reduced from 4 hours to 16 mins. That might save me 20 cents in Droplet costs!

Colour me impressed.

Here are some lessons I learnt in the first part of my journey:

  • Your first idea might not always be the best one. It's best to iterate, learn new stuff and experiment!
  • Don't underestimate standard libraries and utilities. With less overhead, they might be able to do a job really fast.
  • Scale matters. A difference of 1 second multiplied by a million is a lot. (More than 11.5 days, FYI.)

A Common-Sense Guide to Data Structures and Algorithms, Second Edition by Jay Wengrow — Big O notation can make your code faster by orders of magnitude. Get the hands-on info you need to master data structures and algorithms for your daily work. Learn new stuff – this book was useful in helping me dive deeper into the underlying work that programmers do.

Route Three: Don't play games, send messages

After my life-changing experiment, I got really ambitious and wanted to try something even harder. I then got the idea that playing a game from start to finish isn't the only way for a computer to play Monopoly JR.

https://media.giphy.com/media/eOAYCCymR0OqSN4aiF/giphy.gif

As time progressed in lockdown, I explored the idea of using microservices (more specifically, I read this book). So instead of following the rules of the game, the program would send “messages” depicting what was going on in a game. The computer would pick up these messages, process them and hand them off. In other words, instead of a pool of workers each playing their own game, the workers would get their commands (or jobs) from a queue. It didn't matter what game they were playing. They just need to perform their commands.

A schematic of what messages or commands a worker might have to process to play a game of Monopoly JR. When the NewGameCommand is executed, it writes some information to a database and then puts a PlayRoundCommand in the queue for the next worker.

So, what I had basically done was chop the game into several independent parts. This was in response to certain drawbacks I observed in the original code: some games took longer than others, and a long game held back a worker as it furiously tried to finish before moving on to the next one. I hoped that making the parts independent would let more games finish quickly. And since they were all easy jobs, the workers would be able to handle them in rapid succession, right?

It turns out I was completely wrong.

It was so traumatically slow that I had to reduce the number of games from 20,000 to 200 just to take this screenshot.

Instead of completing hundreds or thousands of games in a second, it took almost 1 second to complete a game. At that rate, 5 million games would take 5 million seconds, or nearly 58 days. How much would I have to pay DigitalOcean for their droplets this time? (Maybe $24.)

What slowed the program?

Tortoise 🐢 Photo by Craig Pattenaude / Unsplash

I haven't been able to drill down to the cause, but this is my educated guess: the commands themselves were simple and straightforward, but operating a messaging system introduced a laborious overhead. As the jobs finished quickly, the main thread was repeatedly asked what to do next. If you are a manager, you might know why this is terrible and inefficient.

Experimenting again, I found that the program's times improved substantially once I allowed each command to play several rounds rather than a single one. I pushed this so far that one command was eventually completing one whole game. Even then, it never reached the heights of the original program using Python dictionaries. At scale, the overhead does matter.
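As a sketch, the batching change is tiny; the "batch" knob below is my own name for it. At batch=1 the design degenerates into the original one-round-per-command scheme:

```python
import queue

def handle_play_rounds(msg: dict, q: queue.Queue, batch: int = 25) -> None:
    # Play up to `batch` rounds in one command, amortising the cost of
    # going back to the queue after every single round.
    rounds = msg["rounds_played"]
    for _ in range(batch):
        rounds += 1                          # stand-in for playing one round
        if rounds >= msg["rounds_to_win"]:   # stand-in win condition
            return                           # game over; queue nothing further
    q.put({**msg, "rounds_played": rounds})  # re-queue the unfinished game
```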

Best Practices — Dask documentationThinking about multiprocessing can be counterintuitive for a beginner. I found Dask's documentation helpful in starting out.

So, is sending messages a lousy solution? The experiment shows that on a single computer, its performance is markedly poor. However, there are circumstances when sending messages is better. If several clients and servers are involved in the program and you are unsure how reliable they are, a program made of several independent parts is likely to be more robust. I also think it makes the code a bit more readable, as it focuses on the process.

I now think a better solution is to improve my original program.

Conclusion: Final lessons

Writing code, it turns out, isn't just a matter of style. Different approaches to the same problem have serious effects. Did it make the program faster? Is it easier to read? Is it more conducive to multiprocessing or asynchronous operations? There isn't a silver bullet; it depends on your current situation and resources. After all this, I wouldn't claim to be an expert, but the experience definitely gave me a better appreciation of the issues from an engineering perspective.

Now I only have to work on my previous code😭. The fun's over, isn't it?

#Programming #blog #Python #DigitalOcean #Monopoly #DataScience



October is drawing to a close, and the end of the year is almost upon us. It's hard to fathom that I have been stuck working from home for nearly 20 months now. Some countries seem to have moved on, but I doubt we'd do so in Singapore anytime soon. Nevertheless, it's time for reflection and thinking about the future.

What I am reading now

The Importance of Being AuthorisedA recent case shows that practising law as an unauthorised person can have serious effects. What does this hold for other people who may be interested in alternative legal services?Love.Law.Robots.HoufuAn in-depth analysis of a rare and recent local decision touching on this point.

CLM Simplified: Efficient Contracting for Law Departments : Bassli, Lucy Endel: Amazon.sg: BooksCLM Simplified: Efficient Contracting for Law Departments : Bassli, Lucy Endel: Amazon.sg: BooksLucy Endel BassliI earn a commission from purchases made with this link.

  • Do you need a lot of coding or technical skills to use AI? This commentator from Today Online highlights Hugging Face, Gradio and Streamlit and doesn't think so. So have we finally resolved the question of whether lawyers need to code? I still think the answer is nuanced: one person can quickly put together a graph using free tools, but making it production-ready is tough and won't be free. I agree more with the premise that we need to better empower students and others to “seek out AI services and solutions on their own”. In the legal field, this starts with having more data available for all to use.

Why you don’t need to be an expert to use AI any moreKeeping up with the latest developments in artificial intelligence is like drinking from the proverbial fire hose, as a recent 188-page overview by two tech investors Ian Hogarth and Nathan Benaich would attest.TODAYonline

Post Updates

This week saw the debut of my third feature, “It's Open. It's Free — Public Legal Information in Singapore”. I have been working on it for several months, and it's still a work in progress. I made it as part of my research into what materials to scrape, and I've hinted at the project several times recently. In due course, I want to add more obscure courts and tribunals, including the PDPC and others. You can check the page regularly, or catch mentions of it here from time to time. I welcome your comments and suggestions on what I should cover.

That's it!

Family Playing A Board Game. An Asian family (adult male and female and two adolescents, male and female) sitting around a coffee table playing a board game. Photographer Bill Branson. Photo by National Cancer Institute / Unsplash

At the start of this newsletter, I mentioned that November is a month for looking forward. 😋 Unfortunately, for the time being, I will be racing to finish articles that I have wanted to write since the pandemic started. This includes my observations from playing Monopoly Junior 5 million times. You can look at a sneak peek of the work in my Streamlit app (if it runs).

In the meantime, I will be weighing the pros and cons of using MongoDB or SQL for my scraping project. Storing text and downloads on S3 is pretty straightforward, but where should I store the metadata of the decisions? If anyone has an opinion, I could use some advice!
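For concreteness, the kind of record I'm weighing looks something like this sketch; the fields and values are my own guesses. Flat, uniform records like these sit naturally in a SQL table, while a document store like MongoDB is more forgiving if the fields vary a lot between courts and tribunals:

```python
# A sample decision-metadata record (hypothetical fields and values).
decision = {
    "source": "PDPC",
    "title": "Re XYZ Pte Ltd",                  # placeholder case name
    "decision_date": "2021-08-01",
    "url": "https://example.com/decision.pdf",  # placeholder URL
    "s3_key": "decisions/pdpc/xyz.pdf",         # where the download lives on S3
}
```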

Thanks for reading, and feel free to reach out!

#Newsletter #ArtificalIntelligence #BookReview #Contracts #DataMining #Law #DataScience #LegalTech #Programming #Singapore #Streamlit #WebScraping


This Features article links to many articles which may require a free subscription to read. Become a subscriber today and get access to all of them!

This Features article is a work in progress. If you have any feedback or suggestions, please feel free to contact me!

What's the Point of this List?

Photo by Cris Tagupa on Unsplash

Unlike other jurisdictions, Singapore does not have a legal information institute like AustLII or CanLII. Legal information institutes, as defined in the Free Access to Law Movement Declaration:

  • Publish via the internet public legal information originating from more than one public body;
  • Provide free and anonymous public access to that information;
  • Do not impede others from obtaining public legal information from its sources and publishing it; and
  • Support the objectives set out in this Declaration.

We do have an entry on CommonLII, but the resources are not always up to date. Furthermore, the differences in features and usability are worlds apart. (If you want to know what AustLII looked like over ten years ago, look at CommonLII.)

This does not mean that free legal resources are non-existent in Singapore. It's just that they are scattered around the internet, with varying levels of availability, coverage and features. There's also no guarantee they will stay around in the future.

Ready to mine free online legal materials in Singapore? Not so fast!Amendments to Copyright Act might support better access to free online legal materials in Singapore by robots. I survey government websites to find out how friendly they are to this.Love.Law.Robots.HoufuAmendments to the Copyright Act have cleared some air regarding mining, but questions remain.

This post tries to gather all the resources I have found and benchmark them. With some idea of how to extract them, you can plausibly start a project like OpenLawNZ. If you're interested in, say, data protection commission decisions and are toying with the idea of NLPing them, you know where to find the source. Even if you aren't ambitious, you can browse them and add them to your bookmarks. Maybe even archive them if you are so inclined.

Data Science with Judgement Data – My PDPC Decisions JourneyAn interesting experiment to apply what I learnt in Data Science to the area of law.Love.Law.Robots.HoufuIt might be surprising to some, but there's a wealth of material out there if you can find it!

Your comments are always welcome.

Options that aren't free or online

Photo by Iñaki del Olmo on Unsplash

The premier resource for research into Singapore law is LawNet. It offers a pay-per-use option, but it's not cheap (a minimum of $57 per use). There's one LawNet terminal at the Lee Kong Chian Reference Library if you can travel to the National Library. I haven't used LawNet since I left practice several years ago. From following news of its developments, it hasn't departed much from its core purpose, though it has added several collections that can be very useful for practitioners.

Source: https://eresources.nlb.gov.sg/main/Browse?browseBy=type&filter=10&page=2 (accessed 22 October 2021)

There are also law libraries at the Supreme Court (Level 1) and the State Courts (B1) if you're into physical things. The resources are reasonably good for their size, but if you're looking for something very specialised, you might be trying your luck here.

Supreme Court of Singapore

Photo by Vuitton Lim on Unsplash

As the apex court in Singapore, the free resources available here are top-notch. The Supreme Court's judgements cover the entire gamut, from the High Court and the Court of Appeal to the Singapore International Commercial Court and all the other courts in between.

The Supreme Court has been steadily (and stealthily) expanding its judgements section. Judgements now go back to 2000, with basic search functionality and some tagging. The section only covers written judgements, which are “generally issued for more complex cases or where they involve questions of law which are of public interest”. In other words, the High Court prepares them for possible appeals, and the Court of Appeal prepares them for stare decisis. As such, they don't cover all the work that the courts here do, and relying on them to study the courts' work (beyond the development of the law) can introduce bias. There's no API access.

Hearing lists are available for the current and following week, sorted by judge, and can be downloaded as PDFs. Besides when a hearing is fixed, you can see who the parties are and skeletal information on the purpose of the hearing. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

New homes for judgements in the UK... and Singapore?I look at envy in the UK while exploring some confusing changes in the Singapore Supreme Court website.Love.Law.Robots.HoufuThe Supreme Court may be the apex court in Singapore, but its judgements reveal that there is a real mess in here.

State Courts

A rung lower than the Supreme Court, the State Courts generally deal with more down-to-earth civil and criminal matters. They long felt neglected in an older building (though an interesting one for an architecture geek), but they changed their name (from Subordinate Courts to State Courts) and moved to a spanking new nineteen-storey building in the last few years. If you watch a lot of local television, this is the court where embarrassed respondents dash past the media scrum.

Unfortunately, judgements are harder to find at this level. The only free resource is a LawNet section that covers written judgements for the last three months.

Written judgements are prepared pretty much only when a case will be appealed to the Supreme Court, so the judgements you can see there represent a relatively small and biased microcosm of work in the State Courts. Appeals at this level are restricted by law, and the restrictions are significant barriers in civil cases, where costs are an issue. They are less pronounced in criminal cases: the Public Prosecutor appeals every case that does not meet its expectations, and accused persons appeal... well, because they might want to see the written judgment so that they can decide whether to pursue the appeal. This might explain why several more criminal cases are available than civil matters. On the other hand, the accused or litigant who wants to get the case over and done with doesn't appeal.

NUS cases show why judge analytics is needed in SingaporeThrowing anecdotes around fails to convince any side of the situation in Singapore. The real solution is more data.Love.Law.Robots.HoufuDue to the lack of public information on how judges decide cases, it's difficult to get a common understanding of what they do.

Hearing lists are available for civil trials and applications, criminal trials and tribunal matters in the coming week. It looks like an ASP.NET frontend with a basic search function. Besides when a hearing is fixed, you can see who the parties are and very skeletal information on what the hearing is about. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

The State Courts have expanded their scope with several new courts in recent years, such as the Protection from Harassment Court, the Community Disputes Resolution Tribunals and the Employment Claims Tribunals. None of these courts publishes its judgements regularly. As they rarely get appealed, you will also not find them in the free section of LawNet.

Legislation

Beautiful view from the Parliament of Singapore 🇸🇬. Photo by Steven Lasry / Unsplash

Singapore Statutes Online is the place to get legislation in Singapore. It contains historical versions of legislation, current editions, repealed versions, subsidiary legislation and bills.

When the first version was released in 2001, it was quite a pioneer. Today, many countries provide their legislation in snazzier forms. (I am a fan of the UK's version.)

While there isn't API access (and extraction won't be easy due to the extensive use of not-so-semantic HTML), you can enjoy the several RSS feeds littered around the site.
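For example, a feed can be consumed in a few lines with the feedparser library. The feed URL below is a placeholder, not a real SSO feed:

```python
import feedparser  # pip install feedparser

# Parse one of the site's RSS feeds and list the entries.
feed = feedparser.parse("https://sso.agc.gov.sg/placeholder-feed.rss")
for entry in feed.entries:
    print(entry.title, entry.link)
```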

I consider SSO to be very fast and regularly updated. However, if you need an alternative site for bills and acts, you can consider Parliament's website.

#Features #DataMining #DataScience #Decisions #Government #Judgements #Law #OpenSource #Singapore #SupremeCourtSingapore #WebScraping #StateCourtsSingapore


I have been mulling over developing an extensive online database of free legal materials in the flavour of OpenLawNZ or an LII for the longest time. Free access to such materials is one problem to solve, but I'm also hoping to compile a dataset for developing AI solutions. I have tried and demonstrated this with the PDPC's data previously, and I am itching to expand the project sustainably.

However, being a lawyer, I am concerned about the legal implications of scraping government websites. Would using these materials be a breach of copyright law? In other countries, people accept that the public should generally be allowed to use such public materials. However, I am not so sure that is the case here.

The text steps highlighted. Photo by Clayton Robbins / Unsplash

I was thus genuinely excited about the amendments to the Copyright Act in Singapore this year. According to the press release, they will be operational in November, so they will be here soon.

Copyright Bill – Singapore Statutes OnlineSingapore Statutes Online is provided by the Legislation Division of the Singapore Attorney-General’s ChambersSingapore Statutes OnlineThe Copyright Bill is expected to be operationalised in November 2021.

[ Update 21 November 2021: The bill has, for the most part, been operationalised.]

Two amendments are particularly relevant in my context:

Using publicly disclosed materials from the government is allowed

Under sections 280 to 282 of the Bill, it is now OK to copy or communicate public materials to facilitate more convenient viewing or hearing of the material. It should be noted that this is limited to copying and communicating. Presumably, this means that I can share the materials I collected on my website as a collection.

Computational data analysis is allowed.

The amendments expressly say that using a computer to extract data from a work is now permitted. This is great! At some level, the point of extracting the material is to perform some analysis or computation on it, such as searching or summarising a decision. I think some limits are reasonable, such as not communicating the material itself or using it for any other purpose.

However, one condition stands out for me — I need “lawful access” to the material in the first place. The first illustration to explain this is circumventing paywalls, which isn’t directly relevant to me. The second illustration explains that obtaining the materials through a breach of the terms of use of a database is not “lawful access”.

That's a bit iffy. As you will see in the survey of terms below, a website's terms are not always clear about whether access is lawful. The “terms of use” of a website are usually given very little thought by its developers, or implemented in a maximal way that is at once off-putting and misleading. Does trying to beat a captcha mean I did not get lawful access? Sure, it's a barrier meant to thwart robots, but what does that mean legally? If a human helps a robot, would the access still be lawful?

A recent journal article points to “fair use” as the way forward

I was amazed to find an article in the SAL Journal titled “Copying Right in Copyright Law” by Prof David Tan and Mr Thomas Lee, which focused on the issue that was bothering me. The article focuses on data mining and predictive analytics, and it substantially concerns robots and scrapers.

Singapore Academy of Law Journale-First MenuLink to the journal article on E-First at SAL Journals Online.

On the new exception for computational data analysis, the article argues that the two illustrations I mentioned earlier are “inadequate and there is significant ambiguity of what lawful access means in many situations”. Furthermore, because the illustrations are not illuminating, they might create a situation where justified uses are prohibited. With much sadness, I agree.

More interestingly, based on some mathematics and a survey, the authors argue that an open-ended general fair use defence for data mining is the best way forward. As opposed to a rule-based exception, such a defence can adapt to changes better. Stakeholders (including owners) also prefer it because it appeals to their understanding of the economic basis of data mining.

You can quibble with the survey methodology and the mathematics (which I think is very brave for a law journal article), but they serve their purpose in showing the opinions of stakeholders in the law and the cost analysis very well. I don't expect it will be cited in a court judgement soon, but hopefully, it sways someone influential.

We could use a more developer-friendly approach.

Photo by Mimi Thian / Unsplash

There was a time when web scraping was dangerous for a website. In those days, websites could be inundated with requests from automated robots, leading them to crash. Since then, web infrastructure has improved, and techniques to defeat malicious actors have been developed. “Slashdotting” a website hasn't been heard of for a while; we've mostly migrated to more resilient infrastructure, and any serious website on the internet understands the value of having it.

In any case, it is possible to scrape responsibly. Scrapy, for example, allows you to throttle and space out your requests, identify yourself as a robot or scraper, and respect robots.txt. If I agree not to degrade a website's performance, which seems quite reasonable, shouldn't I be allowed to use it?
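As a sketch, the relevant knobs sit in a Scrapy project's settings.py; the values below are illustrative, not recommendations:

```python
# settings.py: polite-scraping settings in Scrapy.
ROBOTSTXT_OBEY = True        # respect the site's robots.txt
USER_AGENT = "my-legal-scraper (+https://example.com/contact)"  # identify yourself; placeholder
DOWNLOAD_DELAY = 2           # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_ENABLED = True  # back off automatically when the server slows down
```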

Being more developer-friendly would also help government agencies find more uses for their works. For now, most legal resources appear to cater exclusively for lawyers. Lawyers will, of course, find them most valuable because it’s part of their job. However, others may also need such resources because they can’t afford lawyers or have a different perspective on how information can be helpful. It’s not easy catering to a broader or other audience. If a government agency doesn’t have the resources to make something more useful, shouldn’t someone else have a go? Everyone benefits.

Surveying the terms of use of government websites

RTK survey in quarry. Photo by Valeria Fursa / Unsplash

Since “lawful access” and, by extension, the “terms of use” of a website will be important in considering the computational data analysis exception, I decided to survey the terms of use of various government agencies. After locating their treatment of the intellectual property rights in their materials, I gauged my appetite to extract them.

In all, I identified three broad categories of terms.

Totally Progressive: Singapore Statutes Online 👍👍👍

Source: https://sso.agc.gov.sg/Help/FAQ#FAQ_8 (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “automated means”. It looks like they were prepared for robots!
  • Conditions appear reasonable. There’s a window for extraction and guidelines to help properly cite and identify the extracted materials.

Things I don’t like:

  • The Singapore Statutes Online website is painful to extract from and doesn’t feature any API.

Comments:

  • Knowing what they expect scrapers to do gives me confidence in further exploring this resource.
  • Maybe the key reason these terms of use are excellent is that they apply to a specific resource. If a resource owner wants to make things developer-friendly, they should consider their collections and specify their terms of use.

Totally Bonkers: Personal Data Protection Commission 😖😖😖

Source: https://www.pdpc.gov.sg/Terms-and-Conditions (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “robots” and “spiders”. It looks like they were prepared!

Things I don’t like:

  • It doesn't allow you to use a “manual process” to monitor its Contents. You can't even visit the website to see if there are any updates!
  • What is an automatic device? Like a feed reader? (Fun fact: the PDPC obliterated their news feed in the latest update to their website. The best way to keep track of their activities is to follow their LinkedIn.)
  • The PDPC suggests that you get written permission but doesn't tell you in what circumstances it will give you such permission.
  • I have no idea what an unreasonable or disproportionately large load is. It looks like I have to crash the server to find out! (Just kidding, I will not do that, OK.)

Comments:

  • I have no idea what happened to the PDPC such that it had to impose such unreasonable conditions on this activity (I hope I am not involved in any way 😇). Perhaps this is a case of someone with a little knowledge going a long way.
  • Around paragraph 6, there is a somewhat complex set of terms allowing a visitor to share and use the contents of the PDPC website for non-commercial purposes. This, however, does not gel with paragraph 20, and the confusion is not user- or developer-friendly, to say the least.
  • You can't contract out of fair use or the computational data analysis exception, so forget it.
  • I’m a bit miffed when I encounter such terms. Let’s hope their technical infrastructure is as well thought out as their terms of use. (I’m being ironic.)

Totally Clueless: Strata Titles Board 🎈🎈🎈

Materials, including source code, pages, documents and online graphics, audio and video in The Website are protected by law. The intellectual property rights in the materials is owned by or licensed to us. All rights reserved. (Government of Singapore © 2006).
Apart from any fair dealings for the purposes of private study, research, criticism or review, as permitted in law, no part of The Website may be reproduced or reused for any commercial purposes whatsoever without our prior written permission.

Source: https://www.stratatb.gov.sg/terms-of-use.html# (Accessed 20 October 2021)

Things I like:

  • Mentions fair dealing as permitted by law. However, they will have to update this to “fair use” or “permitted use” once the new Copyright Act is effective.

Things I don’t like:

  • Not sure why it says “Government of Singapore ©️ 2006”. Maybe they copied this terms-of-use statement in 2006 and have never updated it since?
  • You can use the information for “commercial purposes” if you get written permission, but it doesn't tell you in what circumstances they will give you such permission. (This is less upsetting than the PDPC's terms.)
  • It doesn’t mention robots, spiders or “automatic devices”.

Comments:

  • It's less upsetting than bonkers terms of use, but it doesn't give me confidence or an idea of what to expect.
  • The owner probably has no idea what data mining, predictive analytics, etc. are. They need to buy the new “Law and Technology” book.

Conclusion

One might be surprised to find that a website's terms of use, even when supposedly managed by lawyers, can be unclear, problematic, misleading, and unreasonable. As I mentioned, very little thought goes into drafting such terms most of the time. Yet they pose obstacles to others who may want to explore new uses of a website or resource. Hopefully, more owners will proactively clean up their terms once the new Copyright Act becomes effective. In the meantime, this area holds plenty of risk for a developer.

#Law #tech #Copyright #DataScience #Government #WebScraping #scrapy #Singapore #PersonalDataProtectionCommission #StrataTitlesBoard #DataMining



Everything changed with the pandemic: we now work from home, learn from home, and attend video conferences rather than phone calls or face-to-face meetings. Some of this is weird, but some of it is even weirder. One example of the latter is that I have become more used to seeing myself on the screen.

What does this mean for the blog? I will be exploring making videos alongside writing posts. This is not a change I would have expected in June when I moved to Ghost. (Certainly, Ghost's membership functions have made it easier to convince me that the effort to learn and produce will be worth it.)

I originally believed that video posts take a lot of effort to create and are also lame. (I read faster than I can watch someone.) Now I have come to believe they can be more fun and engaging. So you might hear my voice and see my face soon. I ain't a handsome fella, so please don't be turned away!

I know how to do it technically, and the equipment should not be difficult to get. I will be experimenting, though, so hang on.

What I am reading now

  • I am a big admirer of Suffolk Law School's LIT Lab. I might be biased because they pervasively use docassemble, a free and open-source LegalTech tool. During the pandemic, they managed to take docassemble further and wrote a law review article about their experience. It's an encouraging story about building community and marshalling disparate resources for access to justice (A2J). I believe open source was an important factor in its success, so hopefully, it is a blueprint for other labs.

Digital Curb Cuts: Towards an Inclusive Open Forms EcosystemIn this paper we focus on digital curb cuts created during the pandemic: improvements designed to increase accessibility that benefit people beyond the populatiSee all articles by Quinten Steenhuis

  • What in the world is Moneyball? It's a strange story in which a baseball team followed the data, hiring players based on their statistics rather than traditional indicators like reputation, and overachieved. Can this be applied to hiring and retaining law firm associates? Legal Evolution suggests it can, but that's not what's most interesting about the story. You will read about the intransigence of law firm leaders in the face of data, and you'll be convinced of the importance of leadership in innovation. This is especially the case where the results can be counterintuitive, upsetting or confusing to leaders. On the other hand, I am sure a law firm leader would be willing to employ “sabermetrics” to build the best team on the cheap.

Moneyball for law firm associates: a 15-year retrospective (257) | Legal EvolutionPretty much everything was a counterintuitive curveball. In April of 2006, more than 15 years ago, I wrote a memo to file that would go on to exert aLegal EvolutionBill Henderson

TechLaw.Fest 2021TechLaw.Fest 2021TechLaw.Fest 2021

  • I have always wondered whether I should get a Singapore Corporate Counsel Association membership. Quite frankly, the only benefits I see so far are the self-satisfaction of belonging and the somewhat discounted LawNet subscription. Here's something else to consider: the “first” technology law course in Singapore from the SCCA. It looks pretty, but I can't find the module details... oh wait, here they are. Since technology law is prevalent and not well taught in law schools (at least during my time), this will be of interest if you need to pick up some substantive knowledge.

SCCA | CoursesCoursesThe title of the course is “EXECUTIVE COURSE IN TECHNOLOGY LAW FOR IN-HOUSE COUNSEL.”

  • If you think it's ridiculous to cough up nearly $2,000 for a bunch of recorded videos (that's why I'm getting into the video business, baby!), you can wait a little longer for a book, which is coming out in October. The introduction outlining its contents is available if you surrender your personal details. Its coverage is definitely broad, so it's useful for fun reading; that's about the only reason I would get it. It's the second book in Singapore on the substantive legal issues of technology, alongside several tomes on data protection. I am exhausted. Really.

Postscript

I finally managed to do some housekeeping and wrote a featured post containing all the content I have worked on for PDPC Decisions. I know it's not easy to find the “journey” on the website, so hopefully, you will have a better experience now.

Post Updates

As mentioned above, videos are coming to this blog. I haven't decided exactly what kind of content should be in a video. However, I am sure that any tutorial or long-form video will be a full members' privilege. That takes away the vexing question of whether I have to “lock up” posts to provide value to full members and what kind of posts should be public. As I said, I am experimenting with this model, so things will change as I go.

Conclusion

I am using my laptop to obscure the mess on my table.

That's it for this newsletter. Maybe there's a chance you will see me in a video for the next one.

(For curious subscribers, there isn't a “swag” shop for this blog. However, if you would like a shiny sticker on your laptop, you can email me with your details, and I can send one for free to you.)

#Newsletter #COVID-19 #docassemble #TechLawFest #TechnologyLaw #DataScience #Singapore
