Love.Law.Robots. by Ang Hou Fu

OpenSource

Should I submit this as a talk to #geekcampsg? I originally reserved this for Tech.Law.Fest, but chickened out when I was not sure there was an opportunity to speak. (In hindsight, my hunch was probably right.) I am thinking about lots of things: is this an interesting topic to share? Is it a 5-10 minute or a 30-45 minute presentation? While I continue procrastinating on taking the plunge, I thought I would outline the topic here for future reference. Feel free to let me know your comments!

A lawyer is drawing thick red lines on a document.


A brief flirtation with viral success brought new attention to one of my #Python libraries and some real-world applications of the workings of #OpenSource.


Readers who have stuck around might recall my big project last year. It was a study on the readability of legislation in Singapore and how much “Plain Laws” drafting affected it. (Spoiler alert: limited, if any).

I wrote a Python library called “#redlines” while writing that post. I needed to represent changes in text like the “track changes” function in Microsoft Word, which was the most familiar method to my audience of lawyers. I couldn't find any libraries to do this in Markdown, so I created one and published it anyway.
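If you are curious, basic usage looks something like this (a sketch from memory, so check the repository for the current API):

from redlines import Redlines

comparison = Redlines(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox walks past the lazy dog.",
)
print(comparison.output_markdown)  # deletions struck through, insertions marked up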

You can read more on “how it works” in the original post on this library.

I publish most of my coding publicly on my GitHub. I do it with little expectation that anyone would use it. This post discusses my motivations for publicising almost all of my coding work.

So, something strange happened while I was suffering from an acute attack of imposter syndrome last week. My humble Python library suddenly got stars, and I even received a pull request.



I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other sources of legal cases, like the Supreme Court, the State Courts, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with the PDPC first. Nevertheless, someone might find it helpful, so if there is interest, please let me know!

What has happened since then?

For one, personal data protection commission decisions are not interesting enough for me. Since working on that project, the deluge of decisions has slowed to a trickle as the PDPC appeared to have changed its focus to compliance and other cool techy projects.

Furthermore, there is much more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of hacked-up code made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve decided on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A Javascript framework like Next.JS would be more effective for developing websites.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr is not being developed anymore, my project can slow down as well). I think this is not so bad because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from making one single package of code to thinking about the work as a process. There are discrete parts to a task, and the code is suited for that particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy to follow for me!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides using the right tool for the job, it is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF but don’t need to do that for a digital PDF. If the workflow is a pipeline, you can take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse over several pipelines, you save a lot of coding time.
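To make the idea concrete, here is a toy sketch of one such small, reusable step: fetch a page and drop the raw HTML into object storage for the next stage to pick up. The bucket name and key scheme are made up for illustration.

import boto3
import requests


def save_page(url: str, bucket: str = "zeeker-raw-pages") -> str:
    """Fetch one web page and store the raw HTML for the next pipeline stage."""
    response = requests.get(url)
    response.raise_for_status()
    # Derive a storage key from the URL (illustrative only)
    key = url.replace("https://", "").replace("/", "_") + ".html"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=response.text.encode("utf-8"))
    return key  # the next stage of the pipeline works off this key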

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue


I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even if I have taken my time over it.

On the other hand, my leisurely pace also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I complete thinking through the issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping



Oh no, Elon Musk has acquired Twitter. Bad things will happen.

I had this premonition months ago when Elon Musk toyed with the idea. I thought it was a dumb idea for him, but people do dumb things all the time and still turn out fine.

Smart people (or so I’ve heard) started exploring alternatives to Twitter. Many of them were led to Mastodon. I am way ahead of these guys.

Months ago, when I received word (on Twitter!) that Elon Musk was entertaining his silly idea to buy Twitter, I decided it was finally time to dive into Mastodon. I got my account and a client on my phone, and figured out that the star is a “like”, what a reblog is, and so on.

I was ready to drop Twitter.

How to be part of the Mastodon

To be in Mastodon, you have to sign up at a server. This first step appears to be the most difficult for new users. Which server do I join? What difference does it make? Interestingly, this first step is probably the most inconsequential. Unless you plan to move fast and break some content or moderation policies, joining a public server like https://mstdn.social/ or https://mastodon.online is good enough for trying Mastodon out.

Currently, I am not aware of a public Singaporean Mastodon server (I have long been interested in running one, but I am not sure who is interested in using it). If you want to join one, you can quickly move your existing account to that server or create a new one.

While the above instructions work if you’re using your web browser, you can also do the same by installing a mobile app on your phone. There are dozens of clients you can use. I used the official one; Tusky was quite all right too. Twidere sounded interesting as well.

Once you have downloaded your phone app and signed up on a server, you will find Mastodon to be, to a great extent, like a pretty basic Twitter. It’s basic partly because you won’t find any ads or algorithms curating content.

“A futuristic Mastodon introduction for 2021”: Focusing on things that come up frequently and that I don’t see explained that often. Here’s the lede: you can’t ever see or search everything! This post explains a lot about what Mastodon is for beginners.

The Wonders of the Fediverse

Most people get very confused with the first step because when they want to leave Twitter, they expect the alternative to be like Twitter. It raises good questions like the following:

I'm not sure I get Mastodon. The first step seems like you have to pick a tribe—which is the source of like 90% of the problems on Twitter. And then I have lots of unanswered questions. Who runs the servers? Who pays for it? Privacy? What happens if it gets big? And so on...

— David Beazley (@dabeaz) October 28, 2022

The easy answer is that each “tribe” or Mastodon instance is a tiny “Twitter” server. When you signed up to Twitter, you expected a service, not a tribe. On the other hand, because anyone can run a service, each instance has its own unique identity: whether it’s a public or private instance, whom it accepts as members and what content moderation policies it has.

The idea that you can start your own server and join the conversation with other servers is mind-blowing. This is why a big tech company or the wealthiest person in the world can’t buy the Fediverse: it’s made up of disparate parts, and you can move to any server you want, all running on the same open-source software.

Once you start thinking harder about these issues, you’re one step away from discovering the Fediverse. You’re going to find an app for almost every social activity on the internet running on the same principles:

  • Pixelfed for Instagram
  • Owncast for live streaming
  • BookWyrm for books
  • It looks possible even for WordPress

… But I still have not left Twitter

Notwithstanding my excitement about the Fediverse, I am still using Twitter.

And therein lies a pretty hard truth about social networks. Many of the networks I created by following users on Twitter can’t be ported over to Mastodon, simply because those users aren’t there. So, if I want to follow the latest insights from law Twitter and journalists, Twitter appears inescapable.

So, I would discount reports of Twitter (or Facebook) meeting its demise without a wholesale movement of people off it. Far more than features and content, the actual value of a social network appears to be the number of friends on it.

In the meantime, you can check out this post by Heidi Li Feldman collecting lawyers on the Fediverse: “With thanks to @konrad and @pedantka for the precedents, I have created a Google form to help build a list of #lawyers #lawprofs #legalacademics for a #LawFedi hashtag. Consider adding yourself to help folks find you and each other. Google form: https://forms.gle/hzzbqzvq754NQzDv7 #LawFedi”

I’ll be keeping my Mastodon app shortcut on my phone, right next to my Twitter app, for now.

#blog #Fediverse #Mastodon #Twitter #SocialNetwork #OpenSource



In 2021, I discovered something exciting — an application of machine learning that was both mind-blowing and practical.

The premise was simple. Type a description of the code you want in your editor, and GitHub Copilot will generate the code. It was terrific, and many people, including myself, were excited to use it.

🚀 I just got access to @github Copilot and it's super amazing!!! This is going to save me so much time!! Check out the short video below! #GitHubCopilot I think I'll spend more time writing function descriptions now than the code itself :D pic.twitter.com/HKXJVtGffm

— abhishek (@abhi1thakur) June 30, 2021

The idea that you can prompt a machine to generate a draft for you in the same way is obviously interesting for contract lawyers. I believe we are getting closer every day. I am waiting for my early access to Spellbook.

As a poorly trained and very busy programmer, it feels like I am a target of GitHub Copilot. The cost is also not so ridiculous (Spellbook Legal costs $89 a month compared to Copilot's $10 a month). Even so, I haven't tried it in over a year. I wasn’t comfortable enough with the idea, and I wasn’t sure how to express why.

Now I can. I recently came across a website proposing to investigate Github Copilot. The main author is Matthew Butterick. He’s the author of Typography for Lawyers and this site proudly uses the Equity typeface.

GitHub Copilot investigation · Joseph Saveri Law Firm & Matthew Butterick

In short, the training of GitHub Copilot on the open-source repositories it hosts probably raises questions about whether such use complies with their copyright licences. Is it fair use to use publicly accessible code for computational analysis? You might recall that Singapore recently passed an amendment to the Copyright Act providing an exception for computational data analysis. If GitHub Copilot is right that it is fair use, any code anywhere is fair game to be consumed by the learning machine.

Of course, the idea that it might be illegal hasn’t exactly stopped me from trying.

The key objection to GitHub Copilot is that it is not open source. By packaging the world’s open-source code in an AI model and spitting it out with no context, GitHub Copilot becomes the only thing its user interacts with. It is, in essence, a coding walled garden.

Copilot introduces what we might call a more selfish interface to open-source software: just give me what I want! With Copilot, open-source users never have to know who made their software. They never have to interact with a community. They never have to contribute.

For someone who wants to learn to code, this enticing idea is probably a double-edged sword. You could probably swim around using prompts with your AI pair programmer, but without any context, you are not learning much. If I wanted to know how something works, I would like to run it, read its code and interact with its community. I am a member of a group of people with shared goals, not someone who just wants to consume other people’s work.

Matthew Butterick might end up with enough material to sue Microsoft, and the legal issues raised will be interesting for the open-source community. For now, though, I am going to stick to programming the hard way.

#OpenSource #Programming #GitHubCopilot #DataMining #Copyright #MachineLearning #News #Newsletter #tech #TechnologyLaw



If you’re interested in technology, you will confront this question at some point: should I learn to code?

For many people, including lawyers, coding is something you can ignore without serious consequences. I don’t understand how my microwave oven works, but that will not stop me from using it. Attending that briefing and asking the product guys good questions is probably enough for most lawyers to do their work.

The truth, though, is that life is so much more. In the foreword of the book “Law and Technology in Singapore”, Chief Justice Sundaresh Menon remarked that technology today “permeates, interfaces with, and underpins all aspects of the legal system, and indeed, of society”.

I felt that myself during the pandemic when I had to rely on my familiarity with technology to get work done. Coincidentally, I also implemented my docassemble project at work, using technology to generate contracts 24/7. I needed all my coding skills to whip up the program and provide the cloud infrastructure to run it without supervision. It’s fast, easy to use and avoids many problems associated with do-it-yourself templates. I got my promotion and respect at work.

If you’re convinced that you need to code, the rest of this post contains tips on juggling coding and lawyering. They are based on my personal experiences, so I am also interested in how you’ve done it and any questions you might have.

Tip 1: Have realistic ambitions


Lawyering takes time and experience to master. Passing the bar is the first baby step to a lifetime of learning. PQE is the currency of a lawyer in the job market.

Well, guess what? Coding is very similar too!

There are many options and possibilities — programming languages, tools and methods. Unlike a law school degree, there are free options you can check out, which would give you a good foundation. (Learnpython for Python and W3Schools for the web come to mind.) I got my first break with Udemy, and if you are a Singaporean, you can make use of SkillsFuture Credits to make your online learning free.

Just as becoming a good lawyer is no mean feat, becoming a good coder needs a substantial investment of time and learning. When you are already a lawyer, you may not have enough time in your life to be as good a coder.

I believe the answer is a strong no. Lawyers need to know what is possible, not how to do it. Lawyers will never be as good as real, full-time coders. Why give them another thing to thing the are “special” at. Lawyers need to learn to collaborate with those do code. https://t.co/3EsPbnikzK

— Patrick Lamb (@ElevateLamb) September 9, 2022

So, this is my suggestion: don’t aim to conquer programming languages or produce full-blown applications to rival a LegalTech company you’ve always admired on your own. Focus instead on developing proof of concepts or pushing the tools you are already familiar with as far as you can go. In addition, look at no code or low code alternatives to get easy wins.

By limiting the scope of your ambitions, you’d be able to focus on learning the things you need to produce quick and impactful results. The reinforcement from such quick wins would improve your confidence in your coding skills and abilities.

There might be a day when your project has the makings of a killer app. When that time comes, I am sure you will decide that going solo is not only impossible but also a bad idea. Apps are pretty complex today, so I honestly think it’s unrealistic to build them all by yourself.

Tip 2: Follow what interests you


It’s related to tip 1 — you’d probably be able to learn faster and more effectively if you are doing things related to what you are already doing. For lawyers, this means doing your job, but with code. A great example of this is docassemble, which is an open-source system for guided interviews and document assembly.

When you use docassemble, you try to mimic what you do in practice: for example, crafting questions to get the information you need from a client to file a document or create a contract. However, instead of interviewing a person directly, you do this in code.

In the course of my travails looking for projects which interest me, I found the following interesting:

  • Rules as Code: I found Blawx to be the most user-friendly way to get your hands dirty on the idea that legislation, codes and regulations can be code.
  • Docassemble: I mentioned this earlier in this tip
  • Natural Language Processing: Using Artificial Intelligence to process text will lead you to many of the most exciting fields these days: summarisation, search, and question answering. Many of these solutions are fascinating when used on legal text (a short sketch follows this list).
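As a taste of the last item, here is a minimal sketch using spaCy’s small English model to pull entities out of a made-up sentence (run python -m spacy download en_core_web_sm first):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Personal Data Protection Commission fined ABC Pte Ltd $10,000 on 1 March 2021.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # organisations, money, dates and so on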

I wouldn’t suggest that law is the only subject that lawyers find interesting. I have also spent time trying to create an e-commerce website for my wife and getting a computer to play Monopoly Junior 5 million times a day.

Such “fun” projects might not have much relevance to your professional life, but I learned new things that could help me in the future. E-commerce websites are the life of the internet today, and building one let me experiment with the latest cloud technologies. Running 5 million games in a day made me think harder about code performance and how to achieve more with a single computer.

Tip 3: Develop in the open


Not many people think about this, so please hang around.

When I was a kid, I had already dreamed of playing around with code and computers. In secondary school, a bunch of guys would race to make the best apps in the class (for some strange reason, they tend to revolve around computer games). I learned a lot about coding then.

As I grew up and my focus changed towards learning and building a career in law, my coding skills deteriorated rapidly. One of the obvious reasons is that I was doing something else, and working late nights in a law firm or law school is not conducive to developing hobbies.

I also found community essential for maintaining your coding skills and interest. The most straightforward reason is that a community will help you when you encounter difficulties in your nascent journey as a coder. On the other hand, listening to and helping other people with their coding problems also improves your knowledge and skills.

The best thing about the internet is that you can find someone with similar interests as you — lawyers who code. On days when I feel exhausted with my day job, it’s good to know that someone out there is interested in the same things I am interested in, even if they live in a different world. So it would be best if you found your tribe; the only way to do that is to develop in the open.

  • Get your own GitHub account, write some code and publish it. Here's mine!
  • Interact on social media with people with the same interests as you. You’re more likely to learn what’s hot and exciting from them. I found Twitter to be the most lively place. Here's mine!
  • Join mailing lists, newsletters and meetups.

I find that it’s vital to be open since lawyers who code are rare, and you have to make a special effort to find them. They are like unicorns🦄!

Conclusion

So, do lawyers need to code? To me, you need a lot of drive to learn to code and build a career in law in the meantime. For those set on setting themselves apart this way, I hope the tips above can point the way. What other projects or opportunities do you see that can help lawyers who would like to code?

#Lawyers #Programming #LegalTech #blog #docassemble #Ideas #Law #OpenSource #RulesAsCode




Key takeaways:
Web scraping is a useful and unique project that is good for beginners.
Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing reusable code and features that are useful for web scraping.
Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There's a lot of information out there on the world wide web, but very little of it is presented in a pretty interface that lets you just take it. By scraping information off websites, you get structured information. It's a unique challenge that is doable for a beginner.

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

“Automate Boring Stuff: Get Python and your Web Browser to download your judgements” – This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing! Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote it.

After spending gobs of time ploughing through books and web articles, I created a web scraper that did the job right.

The code repository of the original web scraper is available on GitHub: houfu/pdpc-decisions (“Data Protection Enforcement Cases in Singapore”).

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age screams at you when it recommends that you install it in a “virtual environment” (who installs anything in Python outside a virtual environment these days?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial

This reminds me of Angular and the ng command. (Do people still do that these days?)

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time
    logger.info('Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you have just started coding, you might not find this intuitive.

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest


class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1",
        }

        response = requests.post(CASE_LISTING_URL, data=default_form_data)

        if response.status_code == requests.codes.ok:
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with the responses and yields items, the standard data format in scrapy.

def parse(self, response, **kwargs):
    from datetime import datetime
    response_json = response.json()
    for item in response_json["items"]:
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] \
            if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] \
            if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"https://www.pdpc.gov.sg{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision,
        )
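The CommissionDecisionItem above is a scrapy Item, which is declared as a plain class of fields. The actual definition lives in the repository; a sketch of what it might look like (including the extra fields that the pipeline components described later fill in) is:

import scrapy


class CommissionDecisionItem(scrapy.Item):
    title = scrapy.Field()
    summary_url = scrapy.Field()
    published_date = scrapy.Field()
    nature = scrapy.Field()
    decision = scrapy.Field()
    # filled in later by the pipeline components
    summary = scrapy.Field()
    respondent = scrapy.Field()
    decision_url = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()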

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because it already implements multithreading, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines deal with the processing of items after the spider has scraped them. The most common pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program implement multithreading and asynchronous operations effectively.

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"https://www.pdpc.gov.sg{decision_link['href']}"

        adapter["file_urls"] = [f"https://www.pdpc.gov.sg{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also the item has a field called file_urls. I did not create this data field. It’s a field used to tell scrapy to download a file from the web.

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. Given a priority of 300, the CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.
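If you are wondering what activating the files pipeline involves, it's very little: the stock FilesPipeline reads the file_urls field and saves whatever it downloads to a folder named in your settings. The sketch below is illustrative, not the exact code in pdpc-decisions.

# settings.py: tell scrapy where downloaded files should go
FILES_STORE = "downloads"

# pipelines.py: subclassing is optional; one common reason is to control file names
from scrapy.pipelines.files import FilesPipeline


class PDPCDecisionDownloadFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # save each PDF under its original name instead of a hashed name
        return request.url.split("/")[-1]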

Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their seniority in a settings file isn’t very intuitive if you’re not sure what’s going on. Still, I am grateful that I did not have to write my own file-downloading code.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting to define the seniority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, though, the number of settings isn’t user-friendly. Still, it hints at the number of things you can do with scrapy, including randomising the delay before downloading files from the same site, logging and settings for storage adapters to common backends like AWS or FTP.
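To give a flavour, here are a few settings I find myself reaching for; a sketch rather than my actual configuration:

# settings.py
DOWNLOAD_DELAY = 2               # pause between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # vary that pause so requests look less robotic
AUTOTHROTTLE_ENABLED = True      # let scrapy adapt its speed to the server's responses
LOG_LEVEL = "INFO"               # keep the console output manageable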

As a popular and established framework, you will also find an ecosystem. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Proxy services are also available commercially if your operations are very demanding.

There are lots to explore here!

Conclusion

Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate what special considerations are involved in web scraping and how scrapy was helping me to do that.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy affords power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Read more interesting adventures in “Data Science with Judgement Data – My PDPC Decisions Journey”, an experiment to apply what I learnt in Data Science to the area of law.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy


This Features article links to many articles which may require a free subscription to read. Become a subscriber today and get access to all the articles!

This Features article is a work in progress. If you have any feedback or suggestions, please feel free to contact me!

What's the Point of this List?


Unlike other jurisdictions, Singapore does not have a legal information institute like AustLII or CanLII. Legal Information institutes, as defined in the Free Access to Law Movement Declaration:

  • Publish via the internet public legal information originating from more than one public body;
  • Provide free and anonymous public access to that information;
  • Do not impede others from obtaining public legal information from its sources and publishing it; and
  • Support the objectives set out in this Declaration.

We do have an entry on CommonLII, but the resources are not always up to date. Furthermore, the differences in features and usability are worlds apart. (If you wanted to know what AustLII looked like over ten years ago, look at CommonLII.)

This does not mean that free legal resources are non-existent in Singapore. It's just that they are scattered around the internet, with varying levels of availability, coverage and features. Oh, there's also no guarantee they will be around now or in the future.

See also “Ready to mine free online legal materials in Singapore? Not so fast!”, where I survey government websites to find out how friendly they are to robots. Amendments to the Copyright Act have cleared some air regarding mining, but questions remain.

This post tries to gather all the resources I have found and benchmark them. With some idea of how to extract them, you can plausibly start a project like OpenLawNZ. If you're interested in, say, data protection commission decisions and are toying with the idea of NLPing them, you know where to find the source. Even if you aren't ambitious, you can browse them and add them to your bookmarks. Maybe even archive them if you are so inclined.

It might be surprising to some, but there's a wealth of material out there if you can find it! See “Data Science with Judgement Data – My PDPC Decisions Journey” for an experiment applying data science to the area of law.

Your comments are always welcome.

Options that aren't free or online


The premier resource for research into Singapore law is LawNet. It offers a pay-per-use option, but it's not cheap (at a minimum of $57 for pay-per-use). There's one terminal available for LawNet at the LCK Library if you can travel to the National Library. I haven't used LawNet since I left practice several years ago. From following the news of its developments, it hasn't departed much from its core purpose, though it has added several collections that can be very useful for practitioners.

Source: https://eresources.nlb.gov.sg/main/Browse?browseBy=type&filter=10&page=2 (accessed 22 October 2021)

There are also law libraries at the Supreme Court (Level 1) and State Courts (B1) if you're into physical things. The resources are reasonably good for their size, but if you were looking for something very specialised, you might be trying your luck here.

Supreme Court of Singapore


The Supreme Court is the apex court in Singapore, and the resources available for free here are top-notch. They cover the entire gamut from the High Court, the Court of Appeal and the Singapore International Commercial Court to all the other courts in between.

The Supreme Court has been steadily (and stealthily) expanding its judgements section. The judgements now go back to 2000, with basic search functionality and some tagging. The section only covers written judgements, which are “generally issued for more complex cases or where they involve questions of law which are of public interest”. In other words, the High Court prepares them for possible appeals, and the Court of Appeal prepares them for stare decisis. As such, they don't cover all the work that the courts here do, and relying on them to study the courts' work (beyond the development of the law) can be biased. There's no API access.

Hearing lists are available for the current week and the following week, sorted by judge. You can download them in PDF. Besides information relating to when the hearing is fixed, you can see who the parties are and skeletal information on the purpose of the hearing. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

In “New homes for judgements in the UK... and Singapore?”, I look with envy at the UK while exploring some confusing changes in the Singapore Supreme Court website. The Supreme Court may be the apex court in Singapore, but its judgements reveal that there is a real mess in here.

State Courts

A rung lower than the Supreme Court, the State Courts generally deal with more down-to-earth civil and criminal matters. They long felt neglected in an older building (though an interesting one for an architecture geek), but they changed their name (from Subordinate Courts to State Courts) and moved to a spanking new nineteen-storey building in the last few years. If you watch a lot of local television, this is the court where embarrassed respondents dash past the media scrum.

Unfortunately, judgements are harder to find at this level. The only free resource is a LawNet section that covers written judgements for the last three months.

Written judgements are prepared pretty much only when a case will be appealed to the Supreme Court. This means that the judgements you can see there represent a relatively small and biased microcosm of the work of the State Courts. Appeals at this level are restricted by law, and these restrictions are significant barriers in civil cases, where costs are an issue. They are less pronounced in criminal cases: the Public Prosecutor appeals every case that does not meet its expectations, and accused persons appeal... well, because they might want to see the written judgment so that they can decide whether to pursue the appeal. This might explain why several more criminal cases are available than civil matters. On the other hand, the accused or litigant who wants to get the case over and done with doesn't appeal.

Due to the lack of public information on how judges decide cases, it's difficult to get a common understanding of what they do. See “NUS cases show why judge analytics is needed in Singapore”: throwing anecdotes around fails to convince any side of the situation in Singapore; the real solution is more data.

Hearing lists are available for civil trials and applications, criminal trials and tribunal matters in the coming week. It looks like an ASP.Net frontend with a basic search function. Besides information relating to when the hearing is fixed, you can see who the parties are and very skeletal information on what the hearing is about. There's no API access.

Court records aren't available to the public online. Inspection of case files by the public requires permission, and fees apply.

The State Courts have expanded their scope with several new courts in recent years, such as the Protection from Harassment Courts, Community Dispute Resolution Centre and Labour Claims Tribunal. None of these courts publishes its judgements regularly. As their cases rarely get appealed, you will also not find them in the free section of LawNet.

Legislation


Singapore Statutes Online is the place to get legislation in Singapore. It contains historical versions of legislation, current editions, repealed versions, subsidiary legislation and bills.

When the first version was released in 2001, it was quite a pioneer. Today many countries provide their legislation in snazzier forms. (I am a fan of the UK's version.)

While there isn't API access (and extraction won't be easy due to the extensive use of not so semantic HTML), you can enjoy the several RSS feeds littered around every aspect of the site.
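If you want to monitor those feeds programmatically, a minimal sketch with the feedparser library looks like this (the feed URL is a placeholder; grab the actual link from the page you want to follow):

import feedparser

feed = feedparser.parse("https://sso.agc.gov.sg/<your-chosen-feed>")  # placeholder URL
for entry in feed.entries:
    print(entry.title, entry.link)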

I consider SSO to be very fast and regularly updated. However, if you need an alternative site for bills and acts, you can consider Parliament's website.

#Features #DataMining #DataScience #Decisions #Government #Judgements #Law #OpenSource #Singapore #SupremeCourtSingapore #WebScraping #StateCourtsSingapore



Speaking has always been a big part of being a lawyer. You use your voice to make submissions in the highest courts of the land. Even in client meetings, you are also using your voice to persuade. Hell, when I write my emails, I imagine saying what I am writing to make sure it is in my voice.

So, thinking about how a synthesized voice can be useful is going to be controversial. You might think that a computer's voice is soulless and not interesting enough to hold on its own against a lawyer. However, with advances led by smart assistants like Google Home and Siri, Text to Speech (TTS) is certainly worth exploring.

Why use robots?

Talking is really convenient: you just open your mouth and start talking (though some babies will disagree). However, working from home has shown how difficult it can be to record and transmit good-quality sound. Feedback and distortion are just some of the problems people regularly face when using basic equipment for online meetings. It's frustrating.

If you think this is an issue that better equipment can resolve, it can get expensive very easily. You might notice that several people are involved in producing your favourite podcast. You are going to need all sorts of equipment, like microphones and DAC mixers. Hire engineers? What does a mixer do, actually?

Furthermore, human performance is subject to various imperfections. The pitch or tone is not right here. Sometimes you lose concentration or get interrupted in the middle of your speech. All this means you may have to record something several times to get a delivery you are happy with. If you aren't confident about your English, or would like to say something in another language, getting a computer to voice it can help you overcome that.

So a synthesized voice can be cheap, fast and consistent. If the quality is good enough, you can focus on the script. For me, I am interested in improving the quality of my online training. Explaining stuff doesn't need Leonard Cohen quality delivery. It's probably far less distracting anyway.

Experiments with TTS

I will take two major Text to Speech (TTS) solutions for a spin: Google Cloud's TTS and Mozilla's TTS (open source). The Python code used for these experiments is available on my GitHub.

The repository is at houfu/TTS-experiments on GitHub.

Google Cloud

It's quite easy to try Google Cloud's TTS. A demo allows you to set the text and then convert it with a click of a button. If you want to know how it sounds, try it!

Text-to-Speech: Lifelike Speech Synthesis | Google Cloud – “Turn text into natural-sounding speech in 220+ voices across 40+ languages and variants with an API powered by Google’s machine learning technology.”

To generate audio files, you're going to need a Google Cloud account and some programming knowledge. However, it's pretty straightforward, and I mostly copied from the quickstart. You can hear the first two paragraphs of this blog post here.
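For reference, what I ran is essentially the official quickstart; a minimal sketch looks like the following (it assumes you have already set up authentication for your Google Cloud account):

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="Speaking has always been a big part of being a lawyer.")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-GB",
    ssml_gender=texttospeech.SsmlVoiceGender.MALE,
)
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)
response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)  # playable MP3 bytes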

Here's my personal view of Google Cloud's TTS:

  • It ain't free. Pricing for premium voices is free only for the first 1 million characters. After that, it's a hefty USD16 for every 1 million characters. Do note that the first two paragraphs of the blog have 629 characters. If you are converting text, it's hard to bust that limit.
  • The voices sound nice, in my opinion. However, listening to them for a long time might not be easy.
  • Developer experience is great, and as you can see, converting lines of text to speech is straightforward.

Mozilla's TTS

Using Mozilla's TTS, you get much closer to the machine-learning aspects of text to speech. This includes training your own model, that is, if you have roughly 24 hours of recordings of your voice to spare.

The project is at mozilla/TTS on GitHub: “Deep learning for Text to Speech” (discussion forum: https://discourse.mozilla.org/c/tts).

However, for this experiment, we don't need that, as we will use pre-trained models. Using Python's built-in subprocess module, we can run the command-line tool that comes with the package. This generates wave files. You can hear the first two paragraphs of this blog post here.
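The subprocess call itself is nothing fancy; something like the sketch below. The executable name and flags depend on the release you install, so treat them as placeholders rather than the exact command in my repository.

import subprocess

subprocess.run(
    [
        "tts",  # command-line entry point installed with the package (assumed)
        "--text", "Speaking has always been a big part of being a lawyer.",
        "--out_path", "output.wav",
    ],
    check=True,  # raise an error if synthesis fails
)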

Here's my personal view of Mozilla's TTS:

  • It's open-source, and I am partial to open source.
  • It also teaches you how to train a model using your own voice. So, this is a possibility.
  • It sounds terrible, but that's because the audio feels a bit more varied than Google's. So... some parts sound louder, making other parts sound softer. There is also quite a lot of noise, which may be due to the recording quality of the source data. I did normalise the loudness for this sample.
  • Leaving those two points aside, it sounds more interesting to me. The variation feels a tad more natural to me.
  • There aren't characters to choose from (male, female etc.), so this may not be practical.
  • Considering I was not doing much more than running a command line, it was OK. Notably, choosing a pre-trained model was confusing at first, and I had to experiment a lot. Also, based on what you choose, the model might take a bit of time and computing power to produce audio. It took roughly about 15 minutes, and my laptop was wheezing throughout.

Conclusion

If you thought robots would replace lawyers in court, this isn't the post to persuade you. However, thinking further, some use cases are certainly worth trying, such as online training courses. In this regard, Google Cloud is production-ready, so you can get the most presentable results. Mozilla TTS is open source and definitely far more interesting, but it needs more time to develop. Do you think there are other ways to use TTS?

#tech #NaturalLanguageProcessing #OpenSource #Programming


I run a docassemble server at work, ostensibly introducing co-workers to a different way of using templates to generate their agreements. It's been pretty useful, so much so that I use it myself for my work. However, due to the pandemic, it's not been easy to go and sell it. Maybe I am going to have better luck soon.

In the meantime, I decided to move the server from AWS to DigitalOcean.

Why move?

I liked the wide variety of features available on AWS, such as CodeCommit, Lambda and SES. DigitalOcean is not comparable in that regard. If I wanted to create a whole suite of services for my application, I would probably find something on AWS's glorious single page filled with services.

However, with great functionality comes great complexity. I had a headache trying to exploit it. I was not going to be able to make full use of their ecosystem. (I shall never scoff at AWS certification anymore.)

On the other hand, I was more familiar with DigitalOcean and liked their straightforward pricing. So, if I wanted to move my pet project somewhere, I would have liked it to be in my backyard.

Let's get moving!

Lesson 1: Respect the shutdown

The docassemble docs expressly ask you to shut down your docassemble server gracefully. This is not just the usual docker stop <container> command; it needs a timeout flag as well. It isn't fatal to forget the timeout flag in many simple use cases, so you might never actually notice it.

However, there's another way to kill your server in the cloud — flip the switch on your cloud instance on the management console. It doesn't feel like that when you click the red button, but it has the same effect. The cloud instance is sent straight to heaven, and there is nothing you can do about it.

The shutdown is important because docassemble does quite a lot of work when it shuts down. It dumps the database records in your storage. If the storage is located in the cloud (like AWS's S3 or DigitalOcean's Spaces), there is some lag when sending all the files there. If the shutdown is not respected, the server's state is not saved, and you might not be able to restore it when you start the container.

So with my AWS container gone in a cloud of dust, I found my files in my S3 storage were not updated. The last copy was over several months ago — the last time I had shut down my container normally. This meant that several months of work was gone! 😲

Lesson 2: Restore from backup

This blog could have ended on that sad note. Luckily for CloudOps newbies like me, docassemble automatically stores backups of the server state. These are stored in the backup folder of your storage and are arranged by date.

If you, like me, borked your docassemble server and set it back to August 2020, you can grab your latest backup and replace the main directory files (outside backup). The process is described in the docassemble docs here. Instead of being left with no users (as things were back in August 2020), I managed to retrieve all my users from the Postgres database stored in the backups. Phew!
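If you prefer to script the copy rather than click around the S3 console, a rough boto3 sketch would look like this (the bucket name and backup date are made up; adjust them to your own storage):

import boto3

s3 = boto3.client("s3")
bucket = "my-docassemble-bucket"   # hypothetical bucket name
prefix = "backup/2021-01-31/"      # the dated backup folder you want to restore

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        target_key = obj["Key"][len(prefix):]  # strip the backup prefix
        if not target_key:
            continue  # skip the folder marker itself
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            Key=target_key,
        )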

Lesson 3: Check your config.yml file

After this exercise, I decided to go with a DigitalOcean Droplet and AWS S3. Given that I was already on S3 and the costs of S3 are actually fairly negligible, this seems like a cost-effective combo. DigitalOcean spaces cost $5 no matter how big they are, whereas my S3 usage rarely comes up to more than a dollar.

Before giving your new docassemble server a spin, do check your config.yml file. You can specify environment variables when you start a container, but once the server is running free, it uses the config.yml file found in the storage. If the configuration file was specially set for AWS, your server might not be able to run properly on DigitalOcean. This means you have to download the config.yml file on the storage (I used the web interface of S3 to do this) and edit it manually to fit your new server.

In my setup, my original configuration file was set up for an AWS environment. This meant that my EC2 instance used security policies to access the S3. At the time, it simplified the set-up of the server. However, my Droplet cannot use these features. Generate an access key and secret key, and input these details and more in your updated config.yml file. Oh, and turn off ec2.

If you are going to use Spaces, you will transfer the files in your old S3 to Spaces (I used s4cmd) and fill in the details of your S3 in the configuration file.

Conclusion

To be honest, the migration was essentially painless. The design of the docassemble server allows it to be restored from a single source of truth: the storage method you choose. Except for the problems that come from hand-editing your old config.yml (I had to retype my SecretKey a few times 😢), you probably don't need to enter the docker container and read initialisation error logs. Given my positive experience, I will be well prepared to move back to AWS again! (Just kidding for now.)

#tech #docassemble #AWS #DigitalOcean #docker #OpenSource #tutorial #CloudComputing
