Love.Law.Robots. by Ang Hou Fu

DataMining


In 2021, I discovered something exciting — an application of machine learning that was both mind-blowing and practical.

The premise was simple. Type a description of the code you want in your editor, and GitHub Copilot will generate the code. It was terrific, and many people, including myself, were excited to use it.

🚀 I just got access to @github Copilot and it's super amazing!!! This is going to save me so much time!! Check out the short video below! #GitHubCopilot I think I'll spend more time writing function descriptions now than the code itself :D pic.twitter.com/HKXJVtGffm

— abhishek (@abhi1thakur) June 30, 2021

The idea that you can prompt a machine to generate code for you is obviously interesting for contract lawyers. I believe we are getting closer every day. I am waiting for my early access to Spellbook.

As a poorly trained and very busy programmer, I feel like I am the target audience for GitHub Copilot. The cost was also not ridiculous. (Spellbook Legal costs $89 a month compared to Copilot's $10 a month.) Even so, I haven't tried it for over a year. I wasn’t comfortable enough with the idea and I wasn’t sure how to express it.

Now I can. I recently came across a website proposing to investigate GitHub Copilot. The main author is Matthew Butterick. He’s the author of Typography for Lawyers and this site proudly uses the Equity typeface.

GitHub Copilot investigation · Joseph Saveri Law Firm & Matthew Butterick

In short, training GitHub Copilot on the open-source repositories that GitHub hosts raises questions about whether such use complies with their copyright licences. Is it fair use to use publicly accessible code for computational analysis? You might recall that Singapore recently passed an amendment to the Copyright Act providing an exception for computational data analysis. If GitHub is right that it is fair use, any code anywhere is fair game to be consumed by the learning machine.

Of course, the idea that it might be illegal hasn’t exactly stopped me from trying.

The key objection to GitHub Copilot is that it is not open source. It packages the world’s open-source code in an AI model and spits it out to its users with no context, so a user only ever interacts with GitHub Copilot. It is, in essence, a coding walled garden.

Copilot introduces what we might call a more selfish interface to open-source software: just give me what I want! With Copilot, open-source users never have to know who made their software. They never have to interact with a community. They never have to contribute.

For someone who wants to learn to code, this enticing idea is probably a double-edged sword. You could probably swim around using prompts with your AI pair programmer, but without any context, you are not learning much. If I wanted to know how something works, I would like to run it, read its code and interact with its community. I am a member of a group of people with shared goals, not someone who just wants to consume other people’s work.

Matthew Butterick might end up with enough material to sue Microsoft, and the legal issues raised will be interesting for the open-source community. For now, though, I am going to stick to programming the hard way.

#OpenSource #Programming #GitHubCopilot #DataMining #Copyright #MachineLearning #News #Newsletter #tech #TechnologyLaw

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu


I love playing with legal data. Books specialising in legal data are uncommon, especially those dealing with what’s available in the wild world of the internet today.

That’s why I snapped up Sarah Sutherland’s “Legal Data and Information in Practice”. Ms Sutherland was CEO of CanLII, one of the most admirable LIIs. CanLII is extensive, comprehensive, and packed with great features like noting up and keywords. It even comes in two languages.

Legal Data and Information in Practice: How Data and the Law Interact. Legal Data and Information in Practice provides readers with an understanding of how to facilitate the acquisition, management, and use of legal data in organizations such as libraries, courts, governments, universities, and start-ups. Presenting a synthesis of information about legal data that will… (Routledge & CRC Press, Sarah A. Sutherland)

The book’s blurb recommends that it is “essential reading for those in the law library community who are based in English-speaking countries with a common law tradition”.

Having finished the book, I find the blurb’s focus way too narrow. This is a book for anyone who loves legal data.

For one, I enjoyed the approachable language. My interaction with legal data has always been pragmatic. Either I was studying for some course, or I needed to find an answer quickly. It will be enough to appreciate the book if you’ve done any of those things. I liked that it didn’t baffle me with impossible or theoretical language. I found myself nodding at several junctures as I reflected on my experience of interacting with legal data as well.

Furthermore, it’s effectively a primer:

  • It’s short. I took a month to finish it at a leisurely pace (i.e., in between taking care of children, making sure the legal department runs smoothly, and programming). Oh, and unlike most law books, it has pictures.
  • It effectively explains a broad range of topics. It talks about the challenges of AI and the political and administrative backgrounds of how legal data is provided without overwhelming you. More impressively, I found new areas in this field that I didn’t know about before reading the book, such as the various strategies to acquire legal data and an overview of statistical and machine learning techniques on data.

So, even if you are not a librarian or a legal technologist by profession, this book is still handy for you. I would love more depth, and maybe that’s some scope for a 2nd edition. In any case, Sarah Sutherland’s “Legal Data and Information in Practice” is a great starting point for everyone. Reading it will level up your ability to discuss and evaluate what’s going on in this exciting field.

* * *

I am sorry for being a sucker: I am the kind of guy who watches movies to swoon at sweeping vistas of my home jurisdiction, Singapore. I enjoyed Crazy Rich Asians, even though it’s fake.

So, I couldn’t resist looking for references to Singapore in the book. Luckily for me, Singapore is mentioned several times. It’s described as “an interesting example of what can happen if a government is willing to invest heavily in developing capacity in legal computing and data use”. I’m not convinced that LawNet is like an LII, but the other points raised stand: the infrastructure, availability and formats are still much better here than in the rest of the common law world.

The more interesting point is that Singapore, as a small jurisdiction, will usually have a smaller dataset. That’s why experimenting with ways to make models trained on other kinds of data work well on your own is crucial. (I think the paper cited in the book is an excellent example of this.) Other questions become relevant when you have less data and fewer resources: what kinds of legal data one should focus on, and which strategies to use to acquire them.

The challenges of a smaller dataset seem to be less exciting because fewer people are staring at them. However, I would suggest that these challenges are more prevalent than you would expect — companies and organisations also have smaller datasets and fewer resources. What would work for Singapore should be of interest to many others.

There’s always something to be excited about in this field. What do you think?

#BookReview #ArtificalIntelligence #DataMining #Law #LegalTech #MachineLearning #NaturalLanguageProcessing #Singapore #TechnologyLaw



October is drawing to a close, and so the end of the year is almost upon us. It's hard to fathom that I have been stuck working from home for nearly 20 months now. Some countries seem to have moved on, but I doubt we'd do so in Singapore. Nevertheless, it's time for reflection and thinking about what to do about the future.

What I am reading now

The Importance of Being Authorised. A recent case shows that practising law as an unauthorised person can have serious effects. What does this hold for other people who may be interested in alternative legal services? (Love.Law.Robots., Houfu) An in-depth analysis of a rare and recent local decision touching on this point.

CLM Simplified: Efficient Contracting for Law Departments, by Lucy Endel Bassli (Amazon.sg). I earn a commission from purchases made with this link.

  • Do you need a lot of coding or technical skills to use AI? This commentator from Today Online highlights Hugging Face, Gradio and Streamlit and doesn't think so. So have we finally resolved the question of whether lawyers need to code? I still think the answer is very nuanced: one person can compile a graph using free tools quickly, but making it production-ready is tough and won't be free. I agree more with the premise that we need to better empower students and others to “seek out AI services and solutions on their own”. In the legal field, this starts with having more data out there available for all to use.

Why you don’t need to be an expert to use AI any more. Keeping up with the latest developments in artificial intelligence is like drinking from the proverbial fire hose, as a recent 188-page overview by two tech investors Ian Hogarth and Nathan Benaich would attest. (TODAYonline)

Post Updates

This week saw the debut of my third feature — “It's Open. It's Free — Public Legal Information in Singapore”. I have been working on it for several months, and it's still a work in progress. I made it as part of my research into what materials to scrape, and I've hinted at the project several times recently. In due course, I want to add more obscure courts and tribunals, including the PDPC and others. You can check the page regularly, or I would mention it here from time to time. I welcome your comments and suggestions on what I should cover.

That's it!

Family Playing A Board Game. An Asian family (adult male and female and two adolescents, male and female) sitting around a coffee table playing a board game. Photographer: Bill Branson. Photo by National Cancer Institute / Unsplash

At the start of this newsletter, I mentioned that November is the month to be looking forward. 😋 Unfortunately, for the time being, I will be racing to finish articles that I have wanted to write since the pandemic started. This includes my observations from playing Monopoly Junior 5 million times. You can look at a sneak peek of the work in my Streamlit app (if it runs).

In the meantime, I will be weighing the pros and cons of using MongoDB or SQL for my scraping project. Storing text and downloads on S3 is pretty straightforward, but where should I store the metadata of the decisions? If anyone has an opinion, I could use some advice!

Thanks for reading, and feel free to reach out!

#Newsletter #ArtificalIntelligence #BookReview #Contracts #DataMining #Law #DataScience #LegalTech #Programming #Singapore #Streamlit #WebScraping



This Features article is a work in progress. If you have any feedback or suggestions, please feel free to contact me!

What's the Point of this List?

Photo by Cris Tagupa on Unsplash

Unlike other jurisdictions, Singapore does not have a legal information institute like AustLII or CanLII. Legal information institutes, as defined in the Free Access to Law Movement Declaration:

  • Publish via the internet public legal information originating from more than one public body;
  • Provide free and anonymous public access to that information;
  • Do not impede others from obtaining public legal information from its sources and publishing it; and
  • Support the objectives set out in this Declaration.

We do have an entry on CommonLII, but the resources are not always up to date. Furthermore, the features and usability are worlds apart. (If you wanted to know what AustLII looked like over ten years ago, look at CommonLII.)

This does not mean that free legal resources are non-existent in Singapore. It's just that they are scattered around the internet, with varying levels of availability, coverage and features. Oh, there's also no guarantee they will be around now or in the future.

Ready to mine free online legal materials in Singapore? Not so fast! Amendments to Copyright Act might support better access to free online legal materials in Singapore by robots. I survey government websites to find out how friendly they are to this. (Love.Law.Robots., Houfu) Amendments to the Copyright Act have cleared some air regarding mining, but questions remain.

This post tries to gather all the resources I have found and benchmark them. With some idea of how to extract them, you can plausibly start a project like OpenLawNZ. If you're interested in, say, data protection commission decisions and are toying with the idea of NLPing them, you know where to find the source. Even if you aren't ambitious, you can browse them and add them to your bookmarks. Maybe even archive them if you are so inclined.

Data Science with Judgement Data – My PDPC Decisions Journey. An interesting experiment to apply what I learnt in Data Science to the area of law. (Love.Law.Robots., Houfu) It might be surprising to some, but there's a wealth of material out there if you can find it!

Your comments are always welcome.

Options that aren't free or online

Photo by Iñaki del Olmo on Unsplash

The premier resource for research into Singapore law is LawNet. It offers a pay-per-use option, but it's not cheap (a minimum of $57). There's one terminal available for LawNet at the LCK Library if you can travel to the National Library. I haven't used LawNet since I left practice several years ago. Judging from the news of its developments, it hasn't departed much from its core purpose and has added several collections that can be very useful for practitioners.

Source: https://eresources.nlb.gov.sg/main/Browse?browseBy=type&filter=10&page=2 (accessed 22 October 2021)

There are also law libraries at the Supreme Court (Level 1) and State Courts (B1) if you're into physical things. They have reasonably good resources for their size, but if you are looking for something very specialized, you might be trying your luck here.

Supreme Court of Singapore

Photo by Vuitton Lim on Unsplash

As befits the apex court in Singapore, the resources available for free here are top-notch. The Supreme Court covers the entire gamut: the High Court, the Court of Appeal, the Singapore International Commercial Court and all other courts in between.

The Supreme Court has been steadily (and stealthily) expanding its judgements section. The judgements now go back to 2000, and the section has basic search functionality and some tagging. It only covers written judgements, which are “generally issued for more complex cases or where they involve questions of law which are of public interest”. In other words, the High Court prepares them for possible appeals, and the Court of Appeal prepares them for stare decisis. As such, they don't cover all the work that the courts here do. Relying on them to study the court's work (beyond the development of law) can give a biased picture. There's no API access.

Hearing lists are available for the current week and the following week, sorted by judge. You can download them as PDFs. Besides information relating to when the hearing is fixed, you can see who the parties are and skeletal information on the purpose of the hearing. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

New homes for judgements in the UK... and Singapore? I look at envy in the UK while exploring some confusing changes in the Singapore Supreme Court website. (Love.Law.Robots., Houfu) The Supreme Court may be the apex court in Singapore, but its judgements reveal that there is a real mess in here.

State Courts

A rung lower than the Supreme Court, the State Courts generally deal with more down-to-earth civil and criminal matters. They long felt neglected in an older building (though an interesting one for an architecture geek), but they changed their name (from Subordinate Courts to State Courts) and moved to a spanking new nineteen-storey building in the last few years. If you watch a lot of local television, this is the court where embarrassed respondents dash past the media scrum.

Unfortunately, judgements are harder to find at this level. The only free resource is a LawNet section that covers written judgements for the last three months.

Written judgements are prepared pretty much only when a case will be appealed to the Supreme Court. This means that the judgements you can see there represent a relatively small and biased microcosm of work in the State Courts. Appeals at this level are restricted by law, and these restrictions represent significant barriers for civil cases where costs are an issue. Such restrictions are less pronounced in criminal cases. The Public Prosecutor appeals every case that does not meet its expectations. The accused appeals every case... well, because they might want to see the written judgment so that they can decide if they're going to appeal. This might explain why there are several more criminal cases available than civil matters. On the other hand, the accused or litigant who wants to get the case over and done with doesn't appeal.

NUS cases show why judge analytics is needed in Singapore. Throwing anecdotes around fails to convince any side of the situation in Singapore. The real solution is more data. (Love.Law.Robots., Houfu) Due to the lack of public information on how judges decide cases, it's difficult to get a common understanding of what they do.

Hearing lists are available for civil trials and applications, criminal trials and tribunal matters in the coming week. It looks like an ASP.NET frontend with a basic search function. Besides information relating to when the hearing is fixed, you can see who the parties are and very skeletal information on what the hearing is about. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

The State Courts have expanded their scope with several new courts in recent years, such as the Protection from Harassment Courts, Community Dispute Resolution Centre and Labour Claims Tribunal. None of these courts publishes its judgements on a regular basis. As their decisions rarely get appealed, you will also not find them in the free section of LawNet.

Legislation

Beautiful view from the Parliament of Singapore 🇸🇬. Photo by Steven Lasry / Unsplash

Singapore Statutes Online is the place to get legislation in Singapore. It contains historical versions of legislation, current editions, repealed versions, subsidiary legislation and bills.

When the first version was released in 2001, it was quite a pioneer. Today many countries provide their legislation in snazzier forms. (I am a fan of the UK's version.)

While there isn't API access (and extraction won't be easy due to the extensive use of not so semantic HTML), you can enjoy the several RSS feeds littered around every aspect of the site.
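To give a sense of what working with those RSS feeds might look like, here is a minimal sketch using only Python's standard library. The feed content, titles and links below are made up for illustration and are not taken from Singapore Statutes Online. A real feed may use different elements, so treat the field names as assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 document, inlined so the sketch runs offline.
# The titles and links are invented and do not reflect any real feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example legislation feed</title>
    <item>
      <title>Example Act 2021</title>
      <link>https://example.org/act/example-2021</link>
      <pubDate>Mon, 18 Oct 2021 00:00:00 +0800</pubDate>
    </item>
    <item>
      <title>Example (Amendment) Bill</title>
      <link>https://example.org/bill/example-amendment</link>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one dict per <item>, with title, link and publication date."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        items.append(
            {
                "title": item.findtext("title", default="").strip(),
                "link": item.findtext("link", default="").strip(),
                # RSS 2.0 dates follow the email date format; keep None if absent.
                "published": parsedate_to_datetime(raw_date) if raw_date else None,
            }
        )
    return items


for entry in parse_feed(SAMPLE_FEED):
    print(entry["title"], entry["link"], entry["published"])
```

In practice the XML would come from fetching a feed URL rather than from an inlined string, and the parsed entries could be compared against what you have already seen to spot new or amended legislation.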

I consider SSO to be very fast and regularly updated. However, if you need an alternative site for bills and acts, you can consider Parliament's website.

#Features #DataMining #DataScience #Decisions #Government #Judgements #Law #OpenSource #Singapore #SupremeCourtSingapore #WebScraping #StateCourtsSingapore


I have been mulling over developing an extensive online database of free legal materials in the flavour of OpenLawNZ or an LII for the longest time. Free access to such materials is one problem to solve, but I'm also hoping to compile a dataset to develop AI solutions. I have tried and demonstrated this with PDPC's data previously, and I am itching to expand the project sustainably.

However, being a lawyer, I am concerned about the legal implications of scraping government websites. Would using these materials be a breach of copyright law? In other countries, people accept that the public should generally be allowed to use such public materials. However, I am not very sure of this here.

The text steps highlighted. Photo by Clayton Robbins / Unsplash

I was thus genuinely excited about the amendments to the Copyright Act in Singapore this year. According to the press release, they will be operational in November, so they will be here soon.

Copyright Bill – Singapore Statutes Online. Singapore Statutes Online is provided by the Legislation Division of the Singapore Attorney-General’s Chambers. (Singapore Statutes Online) The Copyright Bill is expected to be operationalised in November 2021.

[Update 21 November 2021: The bill has, for the most part, been operationalised.]

Two amendments are particularly relevant in my context:

Using publicly disclosed materials from the government is allowed

Under sections 280 to 282 of the Bill, it is now OK to copy or communicate public materials to facilitate more convenient viewing or hearing of the material. Note that this is limited to copying and communicating the material. Presumably, this means that I can share the materials I collected on my website as a collection.

Computational data analysis is allowed.

The amendments expressly say that using a computer to extract data from a work is now permitted. This is great! At some level, the extraction of the material is to perform some analysis or computation on it — searching or summarising a decision etc. I think some limits are reasonable, such as not communicating the material itself or using it for any other purpose.

However, one condition stands out for me — I need “lawful access” to the material in the first place. The first illustration to explain this is circumventing paywalls, which isn’t directly relevant to me. The second illustration explains that obtaining the materials through a breach of the terms of use of a database is not “lawful access”.

That’s a bit iffy. As you will see in the section surveying terms, a website’s terms are not always clear about whether access is lawful or not. The “terms of use” of a website are usually given very little thought by its developers or implemented in a maximal way that is at once off-putting and misleading. Does trying to beat a captcha mean I did not get lawful access? Sure, it’s a barrier to thwart robots, but what does it mean? If a human helps a robot, would it still be lawful?

A recent journal article points to “fair use” as the way forward

I was amazed to find an article in the SAL Journal titled “Copying Right in Copyright Law” by Prof David Tan and Mr Thomas Lee, which focused on the issue that was bothering me. The article focuses on data mining and predictive analytics, and it substantially concerns robots and scrapers.

Singapore Academy of Law Journal · e-First Menu. Link to the journal article on E-First at SAL Journals Online.

On the new exception for computational data analysis, the article argues that the two illustrations I mentioned earlier were “inadequate and there is significant ambiguity of what lawful access means in many situations”. Furthermore, because the illustrations were not illuminating, it might create a situation where justified uses are prohibited. With much sadness, I agree.

More interestingly, based on some mathematics and a survey, the authors argue that an open-ended general fair use defence for data mining is the best way forward. As opposed to a rule-based exception, such a defence can adapt to changes better. Stakeholders (including owners) also prefer it because it appeals to their understanding of the economic basis of data mining.

You can quibble with the survey methodology and the mathematics (which I think is very brave for a law journal article). I guess it served its purpose in showing the opinion of stakeholders in the law and the cost analysis very well. I don’t expect it will be cited in a court judgement soon, but hopefully, it sways someone influential.

We could use a more developer-friendly approach.

Photo by Mimi Thian / Unsplash

There was a time when web scraping was dangerous for a website. In those times, websites could be inundated with requests from automated robots, leading them to crash. Since then, web infrastructure has improved, and techniques to defeat malicious actors have been developed. It has been a while since anyone heard of a website being “slashdotted”. We’ve mostly migrated to more resilient infrastructure, and any serious website on the internet understands the value of having such infrastructure.

In any case, it is possible to scrape responsibly. Scrapy, for example, allows you to space out your requests, identify yourself as a robot or scraper, and respect robots.txt. If I agreed not to degrade a website’s performance, which seems quite reasonable, shouldn’t I be allowed to use it?
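As a rough illustration of what scraping responsibly can mean in code, here is a minimal sketch that uses Python's standard library `urllib.robotparser` rather than Scrapy itself (Scrapy exposes similar behaviour through settings such as `ROBOTSTXT_OBEY`, `DOWNLOAD_DELAY` and `USER_AGENT`). The robots.txt content, the bot name and the example.org URLs are all hypothetical, and the two-second fallback delay is just an assumption about what counts as polite.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

# Hypothetical scraper identity; a real one should say who you are
# and how the site owner can contact you.
USER_AGENT = "example-legal-research-bot"


def build_parser(robots_txt):
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Check whether the site's rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent, default=2.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS_TXT)
for url in ("https://example.org/judgments/2021", "https://example.org/admin/secret"):
    verdict = "allowed" if may_fetch(parser, USER_AGENT, url) else "disallowed"
    print(url, verdict, "wait", polite_delay(parser, USER_AGENT), "seconds")
```

A real scraper would sleep for the returned delay between requests and send the same user agent string in its request headers, so that the site owner can see who is visiting and why.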

Being more developer-friendly would also help government agencies find more uses for their works. For now, most legal resources appear to cater exclusively for lawyers. Lawyers will, of course, find them most valuable because it’s part of their job. However, others may also need such resources because they can’t afford lawyers or have a different perspective on how information can be helpful. It’s not easy catering to a broader or other audience. If a government agency doesn’t have the resources to make something more useful, shouldn’t someone else have a go? Everyone benefits.

Surveying the terms of use of government websites

RTK survey in quarry. Photo by Valeria Fursa / Unsplash

Since “lawful access” and, by extension, the “terms of use” of a website will be important in considering the computational data analysis exception, I decided to survey the terms of use of various government agencies. After locating their treatment of the intellectual property rights in their materials, I gauged my appetite to extract them.

In all, I identified three broad categories of terms.

Totally Progressive: Singapore Statutes Online 👍👍👍

Source: https://sso.agc.gov.sg/Help/FAQ#FAQ_8 (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “automated means”. It looks like they were prepared for robots!
  • Conditions appear reasonable. There’s a window for extraction and guidelines to help properly cite and identify the extracted materials.

Things I don’t like:

  • The Singapore Statutes Online website is painful to extract from and doesn’t feature any API.

Comments:

  • Knowing what they expect scrapers to do gives me confidence in further exploring this resource.
  • Maybe the key reason these terms of use are excellent is that they apply to a specific resource. If a resource owner wants to make things developer-friendly, they should consider their collections and specify their terms of use.

Totally Bonkers: Personal Data Protection Commission 😖😖😖

Source: https://www.pdpc.gov.sg/Terms-and-Conditions (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “robots” and “spiders”. It looks like they were prepared!

Things I don’t like:

  • It doesn’t allow you to use a “manual process” to monitor its Contents. You can’t visit our website to see if we have any updates!
  • What is an automatic device? Like a feed reader? (Fun fact: The PDPC obliterated their news feed in the latest update to their website. The best way to keep track of their activities is to follow their LinkedIn.)
  • PDPC suggests that you get written permission but doesn’t tell you in what circumstances they will give you such permission.
  • I have no idea what an unreasonable or disproportionately large load is. It looks like I have to crash the server to find out! (Just kidding, I will not do that, OK.)

Comments:

  • I have no idea what happened to the PDPC, such that it had to impose such unreasonable conditions on this activity (I hope I am not involved in any way 😇). It might be possible that someone with little knowledge went a long way.
  • At around paragraph 6, there is a somewhat complex set of terms allowing a visitor to share and use the contents of the PDPC website for non-commercial purposes. This, however, still does not gel with paragraph 20, and the confusion is not user or developer-friendly, to say the least.
  • You can’t contract out fair use or the computational data analysis exception, so forget it.
  • I’m a bit miffed when I encounter such terms. Let’s hope their technical infrastructure is as well thought out as their terms of use. (I’m being ironic.)

Totally Clueless: Strata Titles Board 🎈🎈🎈

Materials, including source code, pages, documents and online graphics, audio and video in The Website are protected by law. The intellectual property rights in the materials is owned by or licensed to us. All rights reserved. (Government of Singapore © 2006).
Apart from any fair dealings for the purposes of private study, research, criticism or review, as permitted in law, no part of The Website may be reproduced or reused for any commercial purposes whatsoever without our prior written permission.

Source: https://www.stratatb.gov.sg/terms-of-use.html# (Accessed 20 October 2021)

Things I like:

  • Mentions fair dealing as permitted by law. However, they have to update to “fair use” or “permitted use” once the new Copyright Act is effective.

Things I don’t like:

  • Not sure why it says “Government of Singapore ©️ 2006”. Maybe they copied this terms of use statement in 2006 and never updated it since?
  • You can use the information for “commercial purposes” if you get written permission. It doesn’t tell you in what circumstances they will give you such permission. (This is less upsetting than PDPC’s terms.)
  • It doesn’t mention robots, spiders or “automatic devices”.

Comments:

  • It’s less upsetting than a bonkers terms of use, but it doesn’t give me confidence or an idea of what to expect.
  • The owner probably has no idea what data mining, predictive analytics etc., are. They need to buy the new “Law and Technology” book.

Conclusion

One might be surprised to find that the terms of use of a website, even when supposedly managed by lawyers, can be unclear, problematic, misleading and unreasonable. As I mentioned, very little thought goes into drafting such terms most of the time. However, they create obstacles for others who may want to explore new uses of a website or resource. Hopefully, more owners will proactively clean up their sites once the new Copyright Act becomes effective. In the meantime, this area poses lots of risks for a developer.

#Law #tech #Copyright #DataScience #Government #WebScraping #scrapy #Singapore #PersonalDataProtectionCommission #StrataTitlesBoard #DataMining
