Love.Law.Robots. by Ang Hou Fu

WebScraping


I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases like the Supreme Court, the State Court, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with PDPC first. Nevertheless, someone might find it helpful so if there is an interest, please let me know!

What has happened since then?

For one, personal data protection commission decisions are no longer interesting enough for me. Since working on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is much more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of whacked-up coding made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve settled on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A JavaScript framework like Next.js would be more effective for building the website.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.
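Here’s a minimal sketch of the “low-level” route mentioned above, using pdfminer.six to pull the text layer out of a digital PDF. The file name is made up; parsr, by contrast, would run as a separate docker container that the PDF is sent to over HTTP.

# A minimal sketch, assuming pdfminer.six is installed and that "decision.pdf"
# (a made-up file name) is a digital PDF with a text layer.
from pdfminer.high_level import extract_text

text = extract_text("decision.pdf")
print(text[:500])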

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr stops being developed, my project can stall as well). I think this risk is manageable because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from building one single package of code to thinking about the work as a process. A job breaks down into discrete tasks, and each piece of code is suited to one particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy for me to follow!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.
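To make that concrete, here’s a minimal sketch (not OpenLaw NZ’s actual code) of one such small step: a serverless function that receives an event and saves its payload into an S3 bucket. The bucket name and the event fields are made up for illustration.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # "case_id" is a hypothetical field in the incoming event
    case_id = event["case_id"]
    s3.put_object(
        Bucket="zeeker-raw-data",              # hypothetical bucket name
        Key=f"cases/{case_id}.json",
        Body=json.dumps(event).encode("utf-8"),
    )
    return {"status": "saved", "case_id": case_id}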

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides letting me use the right tool for each job, this approach is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF but don’t need to do that for a digital PDF. If the workflow is a pipeline, you can simply take that task out of the pipeline when it isn’t needed. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse over several pipelines, you save a lot of coding time.

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue


I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though I have taken my time over it.

On the other hand, my leisurely pace also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I complete thinking through the issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping



If you spent long enough coding, you would meet this term: refactoring. The Agile Alliance defines it as “improving the internal structure of an existing program’s source code, while preserving its external behavior”. To paint a picture, it's like tending your garden. Get rid of some leaves, trim the hedges, and maybe add some accessories. It's still a garden, but it's further away from ruin.

In real life, I don't have a garden, and I also hate gardening. It's not the dirt, it's the work.

Similarly, I am also averse to refactoring. The fun is bringing your idea to life and figuring out the means to get there. Improving the work? I will do that some other day.

Lately, I have had the chance to revisit some work. In my latest post, I transformed my pdpc-decisions work to scrapy. It’s something I put off for nearly a year because I was not looking forward to learning a new framework to do something I had already accomplished.

Take your web scraping to a new level: Let’s play with scrapy (Love.Law.Robots. | Houfu): “Changing my code to scrapy, a web scraping framework for Python, was challenging but reaped many dividends.” Please don’t be too put off by the cute spider picture.

In the end, the procrastination didn't make sense. I was surprised I completed the main body of code within a few days. It turned out that my previous experience writing my web scraper helped me to understand the scrapy framework better.

On the other hand, revisiting my old code made me realise how anachronistic my old programming habits were. The programmer I was in 2020 is much different from the one I am now. The code I would like to write now should get the job done and be easy to read and maintain.

I reckon in many ways, wanting the code to be so perfect that I could leave it out of my mind forever grew from my foundation as a lawyer. Filings, once submitted, can’t be recalled. Contracts, once signed, are frozen in their time.

My experience with technology made this way of thinking seem obsolete. Our products are moulded by our circumstances, by what is available at the time. As things change, the story doesn't end; it's only delineated in chapters. The truth is that there will always be another filing, and a contract can always be amended.

I reckon that lawyers shouldn't be stuck in their old ways, and we should actively consider how to improve the way we work. As time goes by, what we have worked on becomes forgotten because it's hard to read and maintain. I think we owe it to society to ensure that our skills and knowledge are not forgotten, or at least ensure that the next person doesn't need to walk the same path repeatedly.

As I look into what else in my code needs refactoring, I think: does the code need to be changed because the circumstances have changed, or because I have changed? Sometimes, I am not sure. Honestly. 🤔

Data Science with Judgement Data – My PDPC Decisions Journey (Love.Law.Robots. | Houfu): “An interesting experiment to apply what I learnt in Data Science to the area of law.” Here’s a target for more refactoring!

#Newsletter #Lawyers #scrapy #WebScraping #Programming




Key takeaways:
Web scraping is a useful and unique project that is good for beginners.
Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing reusable code and features that are useful for web scraping.
Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There’s a lot of information out there on the world wide web, but very little of it is presented in a tidy interface that lets you simply take it. By scraping information off websites, you get structured information. It’s a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

Automate Boring Stuff: Get Python and your Web Browser to download your judgements (Love.Law.Robots. | Houfu): “This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing! Update 13 June 2020: ‘At least until the PDPC breaks its website.’ How prescient… about three months after I wrote”

After spending gobs of time ploughing through books and web articles, I created a web scraper that did the job right.

GitHub – houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub | houfu). The code repository of the original web scraper is available on GitHub.

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known: multithreading, asynchronous operations, web proxies, and throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age is screaming at you when it recommends that you install it in a “virtual environment” (who installs anything in Python these days except in a virtual environment?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial

This reminds me of Angular and the ng command. (Do people still do that these days?)

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you dived straight into coding, you would not find this intuitive.
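For example, after generating the project, scrapy can also scaffold a spider for you from within the project directory (the spider name and domain here are placeholders):

scrapy genspider pdpc_decisions www.pdpc.gov.sg

The generated project keeps the moving parts in separate files: items.py for item definitions, pipelines.py for pipeline components (more on these later), settings.py for configuration, and a spiders/ directory for the spiders themselves.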

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest


class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1",
        }

        response = requests.post(CASE_LISTING_URL, data=default_form_data)

        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with the responses to those requests and yields items, the standard data format in scrapy.

def parse(self, response, **kwargs):
    from datetime import datetime
    response_json = response.json()
    for item in response_json["items"]:
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"https://www.pdpc.gov.sg{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision
        )
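For completeness, CommissionDecisionItem is not something scrapy conjures up; it has to be declared in the project’s items.py. A minimal sketch of what that declaration might look like, using the field names from the code in this post (the actual pdpc-decisions item may well declare more):

import scrapy


class CommissionDecisionItem(scrapy.Item):
    # Fields filled in by the spider's parse() method
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()
    # Fields filled in later by pipeline components
    summary = scrapy.Field()
    respondent = scrapy.Field()
    decision_url = scrapy.Field()
    file_urls = scrapy.Field()   # read by scrapy's files pipeline
    files = scrapy.Field()       # written by scrapy's files pipeline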

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because scrapy handles requests concurrently out of the box, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines handle the processing of an item after the spider has yielded it. The most common pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program implement concurrency and asynchronous operations effectively.
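A minimal sketch of that duplicate-checking component, following the pattern in scrapy’s documentation (I’m assuming here that the item’s summary_url uniquely identifies a decision):

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter["summary_url"] in self.urls_seen:
            # Dropped items never reach later pipeline components or the output file
            raise DropItem(f"Duplicate decision found: {item!r}")
        self.urls_seen.add(adapter["summary_url"])
        return item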

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"

        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also that the item has a field called file_urls. I did not create this data field; it’s the field scrapy uses to know which files to download from the web.

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. Given a priority of 300, the CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.
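For what it’s worth, reusing FilesPipeline can be as thin as a subclass plus a storage setting. A sketch of what that might look like (the storage path is an assumption; it could just as easily point at an S3 bucket):

from scrapy.pipelines.files import FilesPipeline


class PDPCDecisionDownloadFilePipeline(FilesPipeline):
    # Override file_path() here for friendlier file names;
    # otherwise the built-in behaviour is enough.
    pass

# In settings.py: where the downloaded PDFs should go (illustrative path).
FILES_STORE = "./downloads"   # or, say, "s3://my-zeeker-bucket/decisions/"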

Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their order in a settings file isn’t very intuitive if you’re not sure what’s going on. Still, I am grateful that I did not have to write my own file-download code.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting that defines the order of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, though, the sheer number of settings isn’t user-friendly. Still, it hints at how much you can do with scrapy, including randomising the delay between requests to the same site, configuring logging, and pointing storage adapters at common backends like AWS or FTP.
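A taste of the settings alluded to above, as they might appear in settings.py (the values, and the bucket name, are illustrative only):

DOWNLOAD_DELAY = 2                   # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # vary that pause between 0.5x and 1.5x
LOG_LEVEL = "INFO"                   # how chatty the crawl logs are
FEEDS = {
    # feed exports can write straight to backends like S3 or FTP
    "s3://my-zeeker-bucket/decisions.jsonl": {"format": "jsonlines"},
}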

As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.
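Once a project is deployed to a scrapyd server, for instance, scheduling a crawl is a single HTTP call to its JSON API (the project and spider names below are the ones used in this post):

curl http://localhost:6800/schedule.json -d project=pdpcSpider -d spider=PDPCCommissionDecisions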

There are lots to explore here!

Conclusion

Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate the special considerations involved in web scraping and how scrapy helps address them.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly, or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy affords power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Data Science with Judgement Data – My PDPC Decisions Journey (Love.Law.Robots. | Houfu): “An interesting experiment to apply what I learnt in Data Science to the area of law.” Read more interesting adventures.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy



October is drawing to a close, and so the end of the year is almost upon us. It’s hard to fathom that I have been stuck working from home for nearly 20 months now. Some countries seem to have moved on, but I doubt we will do so in Singapore anytime soon. Nevertheless, it’s time for reflection and thinking about what to do about the future.

What I am reading now

The Importance of Being Authorised (Love.Law.Robots. | Houfu): “A recent case shows that practising law as an unauthorised person can have serious effects. What does this hold for other people who may be interested in alternative legal services?” An in-depth analysis of a rare and recent local decision touching on this point.

CLM Simplified: Efficient Contracting for Law Departments, by Lucy Endel Bassli (Amazon.sg). I earn a commission from purchases made with this link.

  • Do you need a lot of coding or technical skills to use AI? This commentator from Today Online highlights Hugging Face, Gradio and Streamlit and doesn’t think so. So have we finally resolved the question of whether lawyers need to code? I still think the answer is very nuanced — one person can compile a graph using free tools quickly, but making it production-ready is tough and won’t be free. I agree more with the premise that we need to better empower students and others to “seek out AI services and solutions on their own”. In the legal field, this starts with having more data out there available for all to use.

Why you don’t need to be an expert to use AI any more (TODAYonline): “Keeping up with the latest developments in artificial intelligence is like drinking from the proverbial fire hose, as a recent 188-page overview by two tech investors, Ian Hogarth and Nathan Benaich, would attest.”

Post Updates

This week saw the debut of my third feature — “It’s Open. It’s Free — Public Legal Information in Singapore”. I have been working on it for several months, and it’s still a work in progress. I made it as part of my research into what materials to scrape, and I’ve hinted at the project several times recently. In due course, I want to add more obscure courts and tribunals, including the PDPC and others. You can check the page regularly, or I will mention it here from time to time. I welcome your comments and suggestions on what I should cover.

That's it!


At the start of this newsletter, I mentioned that November is the month for looking forward. 😋 Unfortunately, for the time being, I will be racing to finish articles that I have wanted to write since the pandemic started. This includes my observations from playing Monopoly Junior 5 million times. You can look at a sneak peek of the work in my Streamlit app (if it runs).

In the meantime, I will be weighing the pros and cons of using MongoDB or SQL for my scraping project. Storing text and downloads on S3 is pretty straightforward, but where should I store the metadata of the decisions? If anyone has an opinion, I could use some advice!
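If I go the SQL route, the metadata table could be as simple as this rough SQLModel sketch (the field names mirror what the scraper already collects; the SQLite URL is just for illustration):

from datetime import date
from typing import Optional

from sqlmodel import Field, SQLModel, create_engine


class Decision(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    title: str
    respondent: str
    published_date: date
    summary: str
    pdf_key: str                     # where the downloaded PDF sits in S3

# Creates the table in a local SQLite file; swap the URL for Postgres or others.
engine = create_engine("sqlite:///zeeker.db")
SQLModel.metadata.create_all(engine)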

Thanks for reading, and feel free to reach out!

#Newsletter #ArtificalIntelligence #BookReview #Contracts #DataMining #Law #DataScience #LegalTech #Programming #Singapore #Streamlit #WebScraping


This Features article links to many articles which may require a free subscription to read. Become a subscriber today and get access to all the articles!

This Features article is a work in progress. If you have any feedback or suggestions, please feel free to contact me!

What's the Point of this List?


Unlike other jurisdictions, Singapore does not have a legal information institute like AustLII or CanLII. Legal Information institutes, as defined in the Free Access to Law Movement Declaration:

  • Publish via the internet public legal information originating from more than one public body;
  • Provide free and anonymous public access to that information;
  • Do not impede others from obtaining public legal information from its sources and publishing it; and
  • Support the objectives set out in this Declaration.

We do have an entry on CommonLII, but the resources are not always up to date. Furthermore, the features and usability are worlds apart. (If you want to know what AustLII looked like over ten years ago, look at CommonLII.)

This does not mean that free legal resources are non-existent in Singapore. It’s just that they are scattered around the internet, with varying levels of availability, coverage and features. Oh, and there’s no guarantee they will still be around in the future.

Ready to mine free online legal materials in Singapore? Not so fast! (Love.Law.Robots. | Houfu): “Amendments to the Copyright Act might support better access to free online legal materials in Singapore by robots. I survey government websites to find out how friendly they are to this.” Amendments to the Copyright Act have cleared some air regarding mining, but questions remain.

This post tries to gather all the resources I have found and benchmark them. With some idea of how to extract them, you can plausibly start a project like OpenLawNZ. If you're interested in, say, data protection commission decisions and are toying with the idea of NLPing them, you know where to find the source. Even if you aren't ambitious, you can browse them and add them to your bookmarks. Maybe even archive them if you are so inclined.

Data Science with Judgement Data – My PDPC Decisions Journey (Love.Law.Robots. | Houfu): “An interesting experiment to apply what I learnt in Data Science to the area of law.” It might be surprising to some, but there’s a wealth of material out there if you can find it!

Your comments are always welcome.

Options that aren't free or online


The premier resource for research into Singapore law is LawNet. It offers a pay-per-use option, but it’s not cheap (at a minimum of $57 for pay-per-use). There’s one LawNet terminal at the LCK Library if you can travel to the National Library. I haven’t used LawNet since I left practice several years ago. From following the news of its developments, it hasn’t departed much from its core purpose, though it has added several collections that can be very useful for practitioners.

Source: https://eresources.nlb.gov.sg/main/Browse?browseBy=type&filter=10&page=2 (accessed 22 October 2021)

There are also law libraries at the Supreme Court (Level 1) and the State Courts (B1) if you’re into physical things. They have reasonably good resources for their size, but if you’re looking for something very specialised, you might be trying your luck.

Supreme Court of Singapore


As the apex court in Singapore, the Supreme Court offers top-notch free resources. They cover the entire gamut from the High Court, the Court of Appeal and the Singapore International Commercial Court to all the other courts in between.

The Supreme Court has been steadily (and stealthily) expanding its judgements section. The judgements now go back to 2000, and have basic search functionality and some tagging. They only cover written judgements, which are “generally issued for more complex cases or where they involve questions of law which are of public interest”. In other words, the High Court prepares them for possible appeals, and the Court of Appeal prepares them for stare decisis. As such, they don’t cover all the work that the courts here do. Relying on this to study the court’s work (beyond the development of law) can be biased. There’s no API access.

Hearing lists are available for the current week and the following week and then sorted by judges. You can download them in PDF. Besides information relating to when the hearing is fixed, you can see who the parties are and skeletal information on the purpose of the hearing. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

New homes for judgements in the UK... and Singapore? (Love.Law.Robots. | Houfu): “I look at envy in the UK while exploring some confusing changes in the Singapore Supreme Court website.” The Supreme Court may be the apex court in Singapore, but its judgements reveal that there is a real mess in here.

State Courts

A rung lower than the Supreme Court, the State Courts generally deal with more down-to-earth civil and criminal matters. They long felt neglected in an older building (though an interesting one for an architecture geek), but they changed their name (from Subordinate Courts to State Courts) and moved to a spanking new nineteen-storey building in the last few years. If you watch a lot of local television, this is the court where embarrassed respondents dash past the media scrum.

Unfortunately, judgements are harder to find at this level. The only free resource is a LawNet section that covers written judgements for the last three months.

Written judgements are prepared pretty much only when a case will be appealed to the Supreme Court. This means that the judgements you can see there represent a relatively small and biased microcosm of the work in the State Courts. Appeals at this level are restricted by law, and those restrictions are a significant barrier in civil cases where costs are an issue. They are less pronounced in criminal cases: the Public Prosecutor appeals every case that does not meet its expectations, and accused persons appeal... well, often just so they can see the written judgment and decide whether to pursue the appeal. This might explain why there are several more criminal cases available than civil matters. On the other hand, the accused or litigant who wants to get the case over and done with doesn’t appeal.

NUS cases show why judge analytics is needed in Singapore (Love.Law.Robots. | Houfu): “Throwing anecdotes around fails to convince any side of the situation in Singapore. The real solution is more data.” Due to the lack of public information on how judges decide cases, it’s difficult to get a common understanding of what they do.

Hearing lists are available for civil trials and applications, criminal trials and tribunal matters in the coming week. It looks like an ASP.Net frontend with a basic search function. Besides information relating to when the hearing is fixed, you can see who the parties are and very skeletal information on what the hearing is about. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

The State Courts have expanded their scope with several new courts in recent years, such as the Protection from Harassment Courts, the Community Dispute Resolution Centre and the Labour Claims Tribunal. None of these courts publishes its judgements on a regular basis. As their decisions rarely get appealed, you will also not find them in the free section of LawNet.

Legislation


Singapore Statutes Online is the place to get legislation in Singapore. It contains historical versions of legislation, current editions, repealed versions, subsidiary legislation and bills.

When the first version was released in 2001, it was quite a pioneer. Today, many countries provide their legislation in snazzier forms. (I am a fan of the UK’s version.)

While there isn't API access (and extraction won't be easy due to the extensive use of not so semantic HTML), you can enjoy the several RSS feeds littered around every aspect of the site.

I consider SSO to be very fast and regularly updated. However, if you need an alternative site for bills and acts, you can consider Parliament's website.

#Features #DataMining #DataScience #Decisions #Government #Judgements #Law #OpenSource #Singapore #SupremeCourtSingapore #WebScraping #StateCourtsSingapore


I have been mulling over developing an extensive online database of free legal materials in the flavour of OpenLawNZ or an LII for the longest time. Free access to such materials is one problem to solve, but I'm also hoping to compile a dataset to develop AI solutions. I have tried and demonstrated this with PDPC's data previously, and I am itching to expand the project sustainably.

However, being a lawyer, I am concerned about the legal implications of scraping government websites. Would using these materials be a breach of copyright law? In other countries, people accept that the public should generally be allowed to use such public materials. However, I am not very sure of this here.


I was thus genuinely excited about the amendments to the Copyright Act in Singapore this year. According to the press release, they will be operational in November, so they will be here soon.

Copyright Bill – Singapore Statutes Online: “Singapore Statutes Online is provided by the Legislation Division of the Singapore Attorney-General’s Chambers.” The Copyright Bill is expected to be operationalised in November 2021.

[ Update 21 November 2021: The bill has, for the most part, been operationalised.]

Two amendments are particularly relevant in my context:

Using publicly disclosed materials from the government is allowed

In sections 280 to 282 of the Bill, it is now OK to copy or communicate public materials to facilitate more convenient viewing or hearing of the material. It should be noted that this is limited to copying and communicating it. Presumably, this means that I can share the materials I collected on my website as a collection.

Computational data analysis is allowed.

The amendments expressly say that using a computer to extract data from a work is now permitted. This is great! At some level, the extraction of the material is to perform some analysis or computation on it — searching or summarising a decision etc. I think some limits are reasonable, such as not communicating the material itself or using it for any other purpose.

However, one condition stands out for me — I need “lawful access” to the material in the first place. The first illustration to explain this is circumventing paywalls, which isn’t directly relevant to me. The second illustration explains that obtaining the materials through a breach of the terms of use of a database is not “lawful access”.

That’s a bit iffy. As you will see in the section surveying terms, a website’s terms are not always clear about whether access is lawful or not. The “terms of use” of a website are usually given very little thought by its developers or implemented in a maximal way that is at once off-putting and misleading. Does trying to beat a captcha mean I did not get lawful access? Sure, it’s a barrier to thwart robots, but what does it mean? If a human helps a robot, would it still be lawful?

A recent journal article points to “fair use” as the way forward

I was amazed to find an article in the SAL Journal titled “Copying Right in Copyright Law” by Prof David Tan and Mr Thomas Lee, which focused on the issue that was bothering me. The article focuses on data mining and predictive analytics, and it substantially concerns robots and scrapers.

Singapore Academy of Law Journal (e-First menu): Link to the journal article on e-First at SAL Journals Online.

On the new exception for computational data analysis, the article argues that the two illustrations I mentioned earlier were “inadequate and there is significant ambiguity of what lawful access means in many situations”. Furthermore, because the illustrations were not illuminating, it might create a situation where justified uses are prohibited. With much sadness, I agree.

More interestingly, based on some mathematics and a survey, the authors argue that an open-ended general fair use defence for data mining is the best way forward. As opposed to a rule-based exception, such a defence can adapt to changes better. Stakeholders (including owners) also prefer it because it appeals to their understanding of the economic basis of data mining.

You can quibble with the survey methodology and the mathematics (which I think is very brave for a law journal article). I guess it served its purpose in showing the opinion of stakeholders in the law and the cost analysis very well. I don’t suspect it will be cited in a court judgement soon, but hopefully, it sways someone influential.

We could use a more developer-friendly approach.


There was a time when web scraping was dangerous for a website. In those days, websites could be inundated with requests from automated robots, causing them to crash. Since then, web infrastructure has improved, and techniques to defeat malicious actors have been developed. The great days of “slashdotting” a website have not been heard of for a while. We’ve mostly migrated to more resilient infrastructure, and any serious website on the internet understands the value of having such infrastructure.

In any case, it is possible to scrape responsibly. Scrapy, for example, allows you to space out your requests, identify yourself as a robot or scraper, and respect robots.txt. If I agreed not to degrade a website’s performance, which seems quite reasonable, shouldn’t I be allowed to use it?
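In scrapy, that kind of politeness mostly comes down to a handful of settings. A sketch (the values and the contact address are illustrative):

ROBOTSTXT_OBEY = True                # respect the site's robots.txt
AUTOTHROTTLE_ENABLED = True          # back off automatically based on server load
AUTOTHROTTLE_START_DELAY = 5
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # don't hammer a single site
USER_AGENT = "zeekerscraper (contact: houfu@example.com)"   # identify yourself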

Being more developer-friendly would also help government agencies find more uses for their works. For now, most legal resources appear to cater exclusively for lawyers. Lawyers will, of course, find them most valuable because it’s part of their job. However, others may also need such resources because they can’t afford lawyers or have a different perspective on how information can be helpful. It’s not easy catering to a broader or other audience. If a government agency doesn’t have the resources to make something more useful, shouldn’t someone else have a go? Everyone benefits.

Surveying the terms of use of government websites


Since “lawful access” and, by extension, the “terms of use” of a website will be important in considering the computational data analysis exceptions, I decided to survey the terms of use of various government agencies. After locating their treatment of the intellectual property rights in their materials, I gauged my appetite to extract them.

In all, I identified three broad categories of terms.

Totally Progressive: Singapore Statutes Online 👍👍👍

Source: https://sso.agc.gov.sg/Help/FAQ#FAQ_8 (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “automated means”. It looks like they were prepared for robots!
  • Conditions appear reasonable. There’s a window for extraction and guidelines to help properly cite and identify the extracted materials.

Things I don’t like:

  • The Singapore Statutes Online website is painful to extract from and doesn’t feature any API.

Comments:

  • Knowing what they expect scrapers to do gives me confidence in further exploring this resource.
  • Maybe the key reason these terms of use are excellent is that they apply to a specific resource. If a resource owner wants to make things developer-friendly, they should consider their collections and specify their terms of use.

Totally Bonkers: Personal Data Protection Commission 😖😖😖

Source: https://www.pdpc.gov.sg/Terms-and-Conditions (Accessed 20 October 2021)

Things I like:

  • They expressly mention the use of “robots” and “spiders”. It looks like they were prepared!

Things I don’t like:

  • It doesn’t allow you to use a “manual process” to monitor its Contents. You can’t even visit their website to see if they have any updates!
  • What is an automatic device? Like a feed reader? (Fun fact: The PDPC obliterated their news feed in the latest update to their website. The best way to keep track of their activities is to follow their LinkedIn)
  • PDPC suggests that you get written permission but doesn’t tell you what circumstances they will give you such permission.
  • I have no idea what an unreasonable or disproportionately large load is. It looks like I have to crash the server to find out! (Just kidding, I will not do that, OK.)

Comments:

  • I have no idea what happened to the PDPC, such that it had to impose such unreasonable conditions on this activity (I hope I am not involved in any way 😇). It might be possible that someone with little knowledge went a long way.
  • At around paragraph 6, there is a somewhat complex set of terms allowing a visitor to share and use the contents of the PDPC website for non-commercial purposes. This, however, still does not gel with paragraph 20, and the confusion is not user- or developer-friendly, to say the least.
  • You can’t contract out fair use or the computational data analysis exception, so forget it.
  • I’m a bit miffed when I encounter such terms. Let’s hope their technical infrastructure is as well thought out as their terms of use. (I’m being ironic.)

Totally Clueless: Strata Titles Board 🎈🎈🎈

Materials, including source code, pages, documents and online graphics, audio and video in The Website are protected by law. The intellectual property rights in the materials is owned by or licensed to us. All rights reserved. (Government of Singapore © 2006).
Apart from any fair dealings for the purposes of private study, research, criticism or review, as permitted in law, no part of The Website may be reproduced or reused for any commercial purposes whatsoever without our prior written permission.

Source: https://www.stratatb.gov.sg/terms-of-use.html# (Accessed 20 October 2021)

Things I like:

  • Mentions fair dealing as permitted by law. However, they have to update to “fair use” or “permitted use” once the new Copyright Act is effective.

Things I don’t like:

  • Not sure why it says “Government of Singapore ©️ 2006”. Maybe they copied this terms of use statement in 2006 and never updated it since?
  • You can use the information for “commercial purposes” if you get written permission. It doesn’t tell you in what circumstances they will give you such permission. (This is less upsetting than PDPC’s terms.)
  • It doesn’t mention robots, spiders or “automatic devices”.

Comments:

  • It’s less upsetting than a bonkers terms of use, but it doesn’t give me confidence or an idea of what to expect.
  • The owner probably has no idea what data mining, predictive analytics etc., are. They need to buy the new “Law and Technology” book.

Conclusion

One might be surprised to find that the terms of use of a website, even when supposedly managed by lawyers, can be unclear, problematic, misleading and unreasonable. As I mentioned, very little thought goes into drafting such terms most of the time. However, they pose obstacles to others who may want to explore new uses of a website or resource. Hopefully, more owners will proactively clean up their sites once the new Copyright Act becomes effective. In the meantime, this area carries a lot of risk for a developer.

#Law #tech #Copyright #DataScience #Government #WebScraping #scrapy #Singapore #PersonalDataProtectionCommission #StrataTitlesBoard #DataMining
