Love.Law.Robots. by Ang Hou Fu


I saw this post the other day that got me thinking about how I learnt to #code.

It was a great opportunity for me to revisit a post I wrote nearly a year ago on the subject: So you want to code? (Lawyer Edition)



I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped data from the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Now I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases, like those of the Supreme Court, the State Courts, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with the PDPC first. Nevertheless, someone might find it helpful, so if there is any interest, please let me know!

What has happened since then?

For one, PDPC enforcement decisions are not interesting enough for me anymore. Since working on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is much more interesting data out there: for example, the PDPC has created many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, there’s much hidden beneath the surface. The zeeker project shouldn’t just focus on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of whacked-up coding made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve decided on so far has made coding more pleasant.

There are also other programs I would like to try — for example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end. A JavaScript framework like Next.js would be more effective for developing websites.

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There’s, of course, an issue of technical debt (if parsr stops being developed, my project can stall as well). I think this is not so bad because all the projects I picked are open-source. I would also pick well-documented and popular projects to reduce this risk.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift from making one single package of code to thinking about the work as a process. There are discrete parts to a task, and each piece of code is suited to that particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy to follow for me!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, a project is broken into smaller parts. Each part is only intended to do a small task well. In this instance, zeekerscrapers is only a scraper. It looks at the webpage, takes the information already present on the web page, and saves or downloads the information. It doesn't bother with machine learning, displaying the results or any other complicated processing.

Besides using the right tool for the job, it is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF but don’t need to do that for a digital PDF. If the workflow is a pipeline, you can take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have a code you can reuse over several pipelines, you can save much coding time.
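To make this concrete, here is a toy sketch of a pipeline as a list of small functions (all the task names here are invented for illustration, not zeeker's actual code); swapping the OCR step out for a digital PDF is just a change to the list:

```python
# A toy pipeline: each step is a small function that takes a dict and returns a dict.
# The task names are invented for illustration.

def download(doc):
    doc["content"] = b"%PDF..."  # pretend we fetched the file
    return doc

def ocr(doc):
    doc["text"] = "text recovered by OCR"
    return doc

def extract_text(doc):
    doc["text"] = "text extracted directly from a digital PDF"
    return doc

def save(doc):
    doc["saved"] = True
    return doc

def run_pipeline(doc, steps):
    for step in steps:
        doc = step(doc)
    return doc

# A scanned PDF needs the OCR step; a digital PDF swaps it out.
scanned = run_pipeline({"kind": "scanned"}, [download, ocr, save])
digital = run_pipeline({"kind": "digital"}, [download, extract_text, save])
```

The reusable steps (download, save) appear in both pipelines with no extra coding.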

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue

Photo by Alex Kondratiev / Unsplash

I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though it took me a long time.

On the other hand, my leisurely pace also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I complete thinking through the issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding more features to existing scrapers (like extracting entities and summarisation etc.)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping

Love.Law.Robots. – A blog by Ang Hou Fu


In 2021, I discovered something exciting — an application of machine learning that was both mind-blowing and practical.

The premise was simple. Type a description of the code you want in your editor, and GitHub Copilot will generate the code. It was terrific, and many people, including myself, were excited to use it.

🚀 I just got access to @github Copilot and it's super amazing!!! This is going to save me so much time!! Check out the short video below! #GitHubCopilot I think I'll spend more time writing function descriptions now than the code itself :D

— abhishek (@abhi1thakur) June 30, 2021

The idea that you can prompt a machine to generate code for you is obviously interesting for contract lawyers. I believe we are getting closer every day. I am waiting for my early access to Spellbook.

As a poorly trained and very busy programmer, I feel like a target of GitHub Copilot. The cost is also not so ridiculous (Spellbook Legal costs $89 a month compared to Copilot's $10 a month). Even so, I haven't tried it for over a year. I wasn’t comfortable enough with the idea, and I wasn’t sure how to express it.

Now I can. I recently came across a website proposing to investigate GitHub Copilot. The main author is Matthew Butterick. He’s the author of Typography for Lawyers, and the site proudly uses the Equity typeface.

GitHub Copilot investigation · Joseph Saveri Law Firm & Matthew Butterick

In short, the training of GitHub Copilot on the open-source repositories it hosts probably raises questions about whether such use complies with their copyright licenses. Is it fair use to use publicly accessible code for computational analysis? You might recall that Singapore recently passed an amendment to the Copyright Act providing an exception for computational data analysis. If GitHub Copilot is right that it is fair use, any code anywhere is fair game to be consumed by the learning machine.

Of course, the idea that it might be illegal hasn’t exactly stopped me from trying.

The key objection to GitHub Copilot is that it is not open source. By packaging the world’s open-source code in an AI model and spitting it out to its user with no context, GitHub Copilot becomes the only thing the user interacts with. It is, in essence, a coding walled garden.

Copilot introduces what we might call a more selfish interface to open-source software: just give me what I want! With Copilot, open-source users never have to know who made their software. They never have to interact with a community. They never have to contribute.

For someone who wants to learn to code, this enticing idea is probably a double-edged sword. You could probably swim around using prompts with your AI pair programmer, but without any context, you are not learning much. If I wanted to know how something works, I would like to run it, read its code and interact with its community. I am a member of a group of people with shared goals, not someone who just wants to consume other people’s work.

Matthew Butterick might end up with enough material to sue Microsoft, and the legal issues raised will be interesting for the open-source community. For now, though, I am going to stick to programming the hard way.

#OpenSource #Programming #GitHubCopilot #DataMining #Copyright #MachineLearning #News #Newsletter #tech #TechnologyLaw



If you’re interested in technology, you will confront this question at some point: should I learn to code?

For many people, including lawyers, coding is something you can ignore without serious consequences. I don’t understand how my microwave oven works, but that will not stop me from using it. Attending that briefing and asking the product guys good questions is probably enough for most lawyers to do their work.

The truth, though, is that life is so much more. In the foreword of the book “Law and Technology in Singapore”, Chief Justice Sundaresh Menon remarked that technology today “permeates, interfaces with, and underpins all aspects of the legal system, and indeed, of society”.

I felt that myself during the pandemic when I had to rely on my familiarity with technology to get work done. Coincidentally, I also implemented my docassemble project at work, using technology to generate contracts 24/7. I needed all my coding skills to whip up the program and provide the cloud infrastructure to run it without supervision. It’s fast, easy to use and avoids many problems associated with do-it-yourself templates. I got my promotion and respect at work.

If you’re convinced that you need to code, the rest of this post contains tips on juggling coding and lawyering. They are based on my personal experiences, so I am also interested in how you’ve done it and any questions you might have.

Tip 1: Have realistic ambitions

Photo by Lucas Clara / Unsplash

Lawyering takes time and experience to master. Passing the bar is the first baby step to a lifetime of learning. PQE is the currency of a lawyer in the job market.

Well, guess what? Coding is very similar too!

There are many options and possibilities — programming languages, tools and methods. Unlike a law school degree, there are free options you can check out, which would give you a good foundation. (Learnpython for Python and W3Schools for the web come to mind.) I got my first break with Udemy, and if you are a Singaporean, you can make use of SkillsFuture Credits to make your online learning free.

Just as becoming a good lawyer is no mean feat, becoming a good coder needs a substantial investment of time and learning. When you are already a lawyer, you may not have enough time in your life to be as good a coder.

I believe the answer is a strong no. Lawyers need to know what is possible, not how to do it. Lawyers will never be as good as real, full-time coders. Why give them another thing to think they are “special” at? Lawyers need to learn to collaborate with those who do code.

— Patrick Lamb (@ElevateLamb) September 9, 2022

So, this is my suggestion: don’t aim to conquer programming languages or produce full-blown applications to rival a LegalTech company you’ve always admired on your own. Focus instead on developing proof of concepts or pushing the tools you are already familiar with as far as you can go. In addition, look at no code or low code alternatives to get easy wins.

By limiting the scope of your ambitions, you’d be able to focus on learning the things you need to produce quick and impactful results. The reinforcement from such quick wins would improve your confidence in your coding skills and abilities.

There might be a day when your project has the makings of a killer app. When that time comes, I am sure that you will decide that going solo is not only impossible but also a bad idea. Apps are pretty complex today, so I honestly think it’s unrealistic to rely on yourself alone to make them.

Tip 2: Follow what interests you

Muddy Hands – Photo by Sandie Clarke / Unsplash

It’s related to tip 1 — you’d probably be able to learn faster and more effectively if you are doing things related to what you are already doing. For lawyers, this means doing your job, but with code. A great example of this is docassemble, which is an open-source system for guided interviews and document assembly.

When you use docassemble, you try to mimic what you do in practice: for example, crafting questions to get the information you need from a client to file a document or create a contract. However, instead of interviewing a person directly, you do this in code.

In the course of my travails looking for projects which interest me, I found the following interesting:

  • Rules as Code: I found Blawx to be the most user-friendly way to get your hands dirty on the idea that legislation, codes and regulations can be code.
  • Docassemble: I mentioned this earlier in this tip
  • Natural Language Processing: Using Artificial Intelligence to process text will lead you to many of the most exciting fields these days: summarisation, search, and question answering. Many of these solutions are fascinating when used for legal text.

I wouldn’t suggest that law is the only subject that lawyers find interesting. I have also spent time trying to create an e-commerce website for my wife and getting a computer to play Monopoly Junior 5 million times a day.

Such “fun” projects might not have much relevance to your professional life, but I learned new things which could help me in the future. E-commerce websites are the life of the internet today, and building one let me experiment with the latest cloud technologies. Running 5 million games in a day made me think harder about code performance and how to achieve more with a single computer.

Tip 3: Develop in the open

Waiting for the big show... – Photo by Barry Weatherall / Unsplash

Not many people think about this, so please hang around.

When I was a kid, I had already dreamed of playing around with code and computers. In secondary school, a bunch of guys would race to make the best apps in the class (for some strange reason, they tend to revolve around computer games). I learned a lot about coding then.

As I grew up and my focus changed towards learning and building a career in law, my coding skills deteriorated rapidly. One of the obvious reasons is that I was doing something else, and working late nights in a law firm or law school is not conducive to developing hobbies.

I also found community essential for maintaining your coding skills and interest. The most straightforward reason is that a community will help you when encountering difficulties in your nascent journey as a coder. On the other hand, listening and helping other people with their coding problems also improves your knowledge and skills.

The best thing about the internet is that you can find someone with similar interests as you — lawyers who code. On days when I feel exhausted with my day job, it’s good to know that someone out there is interested in the same things I am interested in, even if they live in a different world. So it would be best if you found your tribe; the only way to do that is to develop in the open.

  • Get your own GitHub account, write some code and publish it. Here's mine!
  • Interact on social media with people with the same interests as you. You’re more likely to learn what’s hot and exciting from them. I found Twitter to be the most lively place. Here's mine!
  • Join mailing lists, newsletters and meetups.

I find that it’s vital to be open since lawyers who code are rare, and you have to make a special effort to find them. They are like unicorns🦄!


So, do lawyers need to code? To me, you need a lot of drive to learn to code and build a career in law in the meantime. For those set on setting themselves apart this way, I hope the tips above can point the way. What other projects or opportunities do you see that can help lawyers who would like to code?

#Lawyers #Programming #LegalTech #blog #docassemble #Ideas #Law #OpenSource #RulesAsCode



If you spent long enough coding, you would meet this term: refactoring. The Agile Alliance defines it as “improving the internal structure of an existing program’s source code, while preserving its external behavior”. To paint a picture, it's like tending your garden. Get rid of some leaves, trim the hedges, and maybe add some accessories. It's still a garden, but it's further away from ruin.

In real life, I don't have a garden, and I also hate gardening. It's not the dirt, it's the work.

Similarly, I am also averse to refactoring. The fun is bringing your idea to life and figuring out the means to get there. Improving the work? I will do that some other day.
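To make the idea concrete, here's a toy before-and-after (my own invented example, not code from any real project): the external behaviour is preserved while the internal structure improves.

```python
# Before: one tangled function that cleans names and joins them.
def report_before(names):
    out = ""
    for n in names:
        if n != "":
            out = out + n.strip().title() + ", "
    return out[:-2]

# After: the same behaviour, split into small, readable pieces.
def clean(name):
    return name.strip().title()

def report_after(names):
    return ", ".join(clean(n) for n in names if n != "")

# Both produce "Alice, Bob" for ["alice ", "", "bob"]: behaviour preserved.
```

The second version is the one you can still understand several months later.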

Lately, I have had the chance to revisit some work. In my latest post, I transformed my pdpc-decisions work to use scrapy. It's something I had put off for nearly a year because I was not looking forward to learning a new framework to do something I had already accomplished.

Take your web scraping to a new level: Let’s play with scrapy – Changing my code to scrapy, a web scraping framework for Python, was challenging but reaped many dividends. (Please don't be too put off by the cute spider picture.)

In the end, the procrastination didn't make sense. I was surprised I completed the main body of code within a few days. It turned out that my previous experience writing my web scraper helped me to understand the scrapy framework better.

On the other hand, revisiting my old code made me realise how anachronistic my old programming habits were. The programmer in me in 2020 was much different than I am now. The code I would like to write now should get the job done and be easy to read and maintain.

I reckon, in many ways, wanting the code to be so perfect that I could leave it out of my mind forever grew from my foundation as a lawyer. Filings, once submitted, can't be recalled. Contracts, once signed, are frozen in time.

My experience with technology made this way of thinking seem obsolete. Our products are moulded by our circumstances, by what is available at the time. As things change, the story doesn't end; it's only delineated in chapters. The truth is that there will always be another filing, and a contract can always be amended.

I reckon that lawyers shouldn't be stuck in their old ways, and we should actively consider how to improve the way we work. As time goes by, what we have worked on becomes forgotten because it's hard to read and maintain. I think we owe it to society to ensure that our skills and knowledge are not forgotten, or at least ensure that the next person doesn't need to walk the same path repeatedly.

As I look into what else in my code needs refactoring, I think: does the code need to be changed because the circumstances have changed, or because I have changed? Sometimes, I am not sure. Honestly. 🤔

Data Science with Judgement Data – My PDPC Decisions Journey: An interesting experiment to apply what I learnt in Data Science to the area of law. Here's a target for more refactoring!

#Newsletter #Lawyers #scrapy #WebScraping #Programming




Key takeaways:

  • Web scraping is a useful and unique project that is good for beginners.
  • Scrapy makes it easier to operationalise your web scraping and to implement it at scale, by providing the ability to reuse code and features that are useful for web scraping.
  • Making a transition to the scrapy framework is not always straightforward, but it will pay dividends.

Web scraping is fundamental to any data science journey. There's a lot of information out there on the world wide web, but very little of it is presented in a pretty interface which allows you to just take it. By scraping information off websites, you get structured information. It's a unique challenge that is doable for a beginner.
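As a small taste, here's what pulling structured information out of loose HTML can look like, using only Python's standard library on an inline snippet (no real website involved):

```python
from html.parser import HTMLParser

# Collect the text of every <h2> heading from a page: the kind of
# structured information a scraper extracts from loose HTML.
class HeadingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings.append(data.strip())

page = "<html><body><h2>Decision A</h2><p>...</p><h2>Decision B</h2></body></html>"
scraper = HeadingScraper()
scraper.feed(page)
# scraper.headings now holds ['Decision A', 'Decision B']
```

Real pages are messier, which is exactly why libraries and frameworks exist.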

There are thus a lot of articles which teach you how to scrape websites — here’s mine.

Automate Boring Stuff: Get Python and your Web Browser to download your judgements – This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing! (Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote it.)

After spending gobs of time poring over books and web articles, I created a web scraper that did the job right.

GitHub – houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore. The code repository of the original web scraper is available on GitHub.

I was pretty pleased with the results and started feeling ambition in my loins. I wanted to take it to a new level.

Levelling up... is not easy.

Turning your hobby project production-ready isn’t so straightforward, though. If you plan to scan websites regularly, update your data or do several operations, you run into a brick wall.

To run it continuously, you will need a server that can schedule your scrapes, store the data and report the results.

Then, in production mode, problems like being blocked and burdening the websites you are dealing with become more significant.

Finally, scrapers share many standard features. I wasn’t prepared to write the same code over and over again. Reusing the code would be very important if I wanted to scrape several websites at scale.

Enter scrapy

The solutions to these problems are well-known, such as using multithreading, asynchronous operations, web proxies or throttling or randomising your web requests. Writing all these solutions from scratch? Suddenly your hobby project has turned into a chore.
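For instance, even a bare-bones randomised delay, sketched below with the standard library, is code you'd rather not rewrite for every scraper (the fetch function here is a stand-in for whatever performs the actual request):

```python
import random
import time

def polite_get(url, fetch, base_delay=1.0, jitter=0.5):
    """Wait a randomised delay before each request to avoid hammering a site.

    `fetch` is whatever function actually performs the request, passed in
    so the sketch stays library-agnostic.
    """
    delay = base_delay + random.uniform(0, jitter)
    time.sleep(delay)
    return fetch(url)

# Example with a stand-in fetch function (tiny delays so it runs fast):
result = polite_get("https://example.com", fetch=lambda url: f"fetched {url}",
                    base_delay=0.01, jitter=0.01)
```

And this sketch handles only the delay; retries, proxies and robots.txt would each add more code.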

Enter scrapy.

The scrapy project is of some vintage. It reached 1.0 in 2015 and is currently at version 2.6.2 (as of August 2022). Scrapy’s age is screaming at you when it recommends installing it in a “virtual environment” (who doesn’t install anything in Python except in a virtual environment?). On the other hand, scrapy is stable and production-ready. It’s one of the best pieces of Python software I have encountered.

I decided to port my original web scraper to scrapy. I anticipated spending lots of time reading documentation, failing and then giving up. It turned out that I spent more time procrastinating, and the actual work was pretty straightforward.

Transitioning to scrapy

Here’s another thing you would notice about scrapy’s age. It encourages you to use a command line tool to generate code. This command creates a new project:

scrapy startproject tutorial
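If I remember correctly, the command generates a skeleton roughly like the following (the exact template can vary between scrapy versions, so check the output on your machine):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # your spiders live here
            __init__.py
```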

This reminds me of Angular and the ng command. (Do people still do that these days?)

While I found these commands convenient, it also reminded me that the learning curve of such frameworks is quite steep. Scrapy is no different. In the original web scraper, I defined the application's entry point through the command line function. This seemed the most obvious place to start for me.

@click.command()
@click.argument('action')
def pdpc_decision(csv, download, corpus, action, root, extras, extra_corpus, verbose):
    start_time = time.time()
    scrape_results = Scraper.scrape()
    if (action == 'all') or (action == 'files'):
        download_files(options, scrape_results)
    if (action == 'all') or (action == 'corpus'):
        create_corpus(options, scrape_results)
    if extras and ((action == 'all') or (action == 'csv')):
        scraper_extras(scrape_results, options)
    if (action == 'all') or (action == 'csv'):
        save_scrape_results_to_csv(options, scrape_results)
    diff = time.time() - start_time'Finished. This took {}s.'.format(diff))

The original code was shortened to highlight the process.

The organisation of a scrapy project is different. You can generate a new project with the command above. However, the spider does the web crawling, and you have to create that within your project separately. If you started coding, you would not find this intuitive.

For the spider, the starting point is a function which generates or yields requests. The code example below does a few things. First, we find out how many pages there are on the website. We then yield a request for each page by submitting data on a web form.

import requests
import scrapy
from scrapy import FormRequest


class CommissionDecisionSpider(scrapy.Spider):
    name = "PDPCCommissionDecisions"

    def start_requests(self):
        default_form_data = {
            "keyword": "",
            "industry": "all",
            "nature": "all",
            "decision": "all",
            "penalty": "all",
            "page": "1"
        }

        response =, data=default_form_data)

        if response.status_code ==
            response_json = response.json()
            total_pages = response_json["totalPages"]

            for page in range(1, total_pages + 1):
                yield FormRequest(CASE_LISTING_URL, formdata=create_form_data(page=page))

Now, you need to write another function that deals with requests and yields items, the standard data format in scrapy.

def parse(self, response, **kwargs):
    response_json = response.json()
    for item in response_json["items"]:
        from datetime import datetime
        nature = [DPObligations(nature.strip()) for nature in item["nature"].split(',')] if item["nature"] else "None"
        decision = [DecisionType(decision.strip()) for decision in item["decision"].split(',')] if item["decision"] else "None"
        yield CommissionDecisionItem(
            title=item["title"],
            summary_url=f"{item['url']}",
            published_date=datetime.strptime(item["date"], '%d %b %Y'),
            nature=nature,
            decision=decision
        )

You now have a spider! (Scrapy’s Quotesbot example is more minimal than this)

Run the spider using this command in the project directory:

scrapy crawl PDPCCommissionDecisions -o output.csv

Using its default settings, the spider scraped the PDPC website in a zippy 60 seconds. That’s because it already sends requests asynchronously, so you are not waiting for tasks to complete one at a time. The command above even gets you a file containing all the items you scraped with no additional coding.

Transitioning from a pure Python codebase to a scrapy framework takes some time. It might be odd at first to realise you did not have to code the writing of a CSV file or manage web requests. This makes scrapy an excellent framework — you can focus on the code that makes your spider unique rather than reinventing the essential parts of a web scraper, probably very poorly.

It’s all in the pipeline.

If being forced to write spiders in a particular way isn’t irritating yet, dealing with pipelines might be the last straw. Pipelines process an item after the spider has scraped it. The most usual pipeline component checks an item to see if it’s a duplicate and then drops it if that’s true.
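Scrapy's documentation illustrates pipelines with a duplicates filter; here is a framework-free sketch of the same idea, using plain dicts in place of scrapy items and returning None where a real pipeline component would raise DropItem:

```python
class DuplicatesFilter:
    """Drop any item whose 'title' has been seen before."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item):
        if item["title"] in self.seen:
            return None  # a real scrapy pipeline raises DropItem instead
        self.seen.add(item["title"])
        return item

pipeline = DuplicatesFilter()
items = [{"title": "Decision A"}, {"title": "Decision A"}, {"title": "Decision B"}]
kept = [i for i in (pipeline.process_item(item) for item in items) if i is not None]
# kept now holds one 'Decision A' and one 'Decision B'
```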

Pipelines look optional, and you can even avoid the complexity by incorporating everything into the main code. It turns out, though, that many operations can be expressed as components in a pipeline. Breaking them up into parts also helps the program implement multithreading and asynchronous operations effectively.

In pdpc-decisions, it wasn’t enough to grab the data from the filter or search page. You’d need to follow the link to the summary page, which makes additional information and a PDF download available. I wrote a pipeline component for that:

class CommissionDecisionSummaryPagePipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        soup = bs4.BeautifulSoup(requests.get(adapter["summary_url"]).text, features="html5lib")
        article = soup.find('article')

        # Gets the summary from the decision summary page
        paragraphs = article.find(class_='rte').find_all('p')
        result = ''
        for paragraph in paragraphs:
            if not paragraph.text == '':
                result += re.sub(r'\s+', ' ', paragraph.text)
                break
        adapter["summary"] = result

        # Gets the respondent in the decision
        adapter["respondent"] = re.split(r"\s+[bB]y|[Aa]gainst\s+", article.find('h2').text, re.I)[1].strip()

        # Gets the link to the file to download the PDF decision
        decision_link = article.find('a')
        adapter["decision_url"] = f"{decision_link['href']}"

        adapter["file_urls"] = [f"{decision_link['href']}"]

        return item

This component takes an item, visits the summary page and grabs the summary, respondent’s name and the link to the PDF, which contains the entire decision.

Note also the item has a field called file_urls. I did not create this data field. It’s a field used to tell scrapy to download a file from the web.

You can activate pipeline components as part of the spider’s settings.

ITEM_PIPELINES = {
    'pdpcSpider.pipelines.CommissionDecisionSummaryPagePipeline': 300,
    'pdpcSpider.pipelines.PDPCDecisionDownloadFilePipeline': 800,
}

In this example, the pipeline has two components. Given a priority of 300, the CommissionDecisionSummaryPagePipeline goes first. PDPCDecisionDownloadFilePipeline then downloads the files listed in the file_urls field we referred to earlier.

Note also that PDPCDecisionDownloadFilePipeline is an implementation of the standard FilesPipeline component provided by scrapy, so I didn’t write any code to download files on the internet. Like the CSV feature, scrapy downloads the files when its files pipeline is activated.

Once again, it’s odd not to write code to download files. Furthermore, writing components for your pipeline and deciding on their seniority in a settings file isn’t very intuitive if you’re not sure what’s going on. Once again, I am grateful that I did not have to write my own pipeline.

I would note that “pipeline” is a fancy term for describing what your program is probably already doing. It’s true — in the original pdpc-decisions, the pages are scraped, the files are downloaded and the resulting items are saved in a CSV file. That’s a pipeline!

Settings, settings everywhere

Someone new to the scrapy framework will probably find the settings file daunting. In the previous section, we introduced the setting to define the seniority of the components in a pipeline. If you’re curious what else you can do in that file, the docs list over 50 items.

I am not going to go through each of them in this article. To me, though, the number of settings isn’t user-friendly. Still, it hints at the number of things you can do with scrapy, including randomising the delay before downloading files from the same site, logging and settings for storage adapters to common backends like AWS or FTP.
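For example, a few of the settings just mentioned look like this in a project's settings.py (the values are illustrative; the setting names come from the scrapy docs):

```python
# settings.py (fragment)
DOWNLOAD_DELAY = 2                 # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the delay (0.5x to 1.5x) to look less robotic
LOG_LEVEL = "INFO"                 # how chatty the crawl log is
FILES_STORE = "downloads/"         # where the files pipeline saves downloaded files
```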

As a popular and established framework, scrapy also has an ecosystem around it. This includes scrapyd, a service you can run on your server to schedule scrapes and run your spiders. Commercial proxy services are also available if your operations are very demanding.

There’s lots to explore here!


Do I have any regrets about doing pdpc-decisions? Nope. I learned a lot about programming in Python doing it. It made me appreciate the special considerations involved in web scraping and how scrapy helps with them.

I also found that following a framework made the code more maintainable. When I revisited the original pdpc-decisions while writing this article, I realised the code didn’t make sense. I didn’t name my files or functions sensibly, or write tests which showed what the code was doing.

Once I became familiar with the scrapy framework, I knew how to look for what I wanted in the code. This extends to sharing — if everyone is familiar with the framework, it’s easier for everyone to get on the same page rather than learning everything I wrote from scratch.

Scrapy offers power and convenience specialised for web scraping. I am going to keep using it for now. Learning such a challenging framework is already starting to pay dividends.

Read more: Data Science with Judgement Data – My PDPC Decisions Journey, an interesting experiment applying what I learnt in data science to the area of law.

#Programming #Python #WebScraping #DataScience #Law #OpenSource #PDPC-Decisions #scrapy

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu


Presentations are a crucial part of your professional life. In my line of work, presentations function as introducers, reports, training, proposals, summaries and explainers. Sometimes, presentations are watched in person or through web conferencing. Other times, they appear on websites for browsing or e-learning.

To do the job, many people will start PowerPoint, plunk in many words in bullet points and let their “showmanship” do the rest.

Your old ways of doing PowerPoint are no longer enough

Photo: “Montreal Design Club” by charlesdeluvio / Unsplash

Since the pandemic and remote work changed the ways we work, the demands on presentations have also changed drastically:

  • A “presenter” may no longer be present. You might not even be visible through a Zoom window. Some presentations are watched as recorded videos. Some people may even view your presentation on “mute” or by continuously scrolling to the “interesting” points.
  • Your audience consumes content differently these days. If I wanted to read a wall of bullet points (like this one), I would rather read an article. Worse, if the content doesn’t engage, I lose the audience. You want the audience to come back to you, too.
  • Some studies claim that an audience’s attention span is about 20 minutes. Context matters: at a conference, you might be forced to sit through the entire presentation, but your attention still has to be maintained. For someone viewing from home or somewhere else, you must get to the point quickly.

As a lawyer, I seem to have endless things to say. Figuring out how to say it? That’s the tricky part. Your personality and style contribute to your best presentations. Reading a block of bullet points on a template PowerPoint presentation wouldn’t bring that out. You would need opportunity, practice and feedback—all the time.

If you want to look at new ways of doing things, I wrote this post for you.

You can express more on your slide.

I found that the easiest way to improve your presentation is to abandon bullet points. The first step towards approaching any slide is to think about its purpose:

  • Am I making a list?
  • Am I describing a process?
  • Am I highlighting a point which I want the audience to know?
  • And so on.

Once you know what you want, it becomes evident that bullet points aren’t the only way to achieve them. A process can be described with arrows, steps or even a map. A list can be made of post-its, boxes and maybe even articles (for example, a list of news items can be illustrated with “newspaper clippings”). Such design elements will help an audience, no matter how far they are from you, to get the point.

Since presentations aren’t my core area of work, I don’t find myself being able to come up with a design quickly. A quick reference could be helpful.

In the end, I found The Better Deck Deck by Nolan Haims. Essentially, it’s a deck of flashcards showing slide design ideas. It is not an essay about how to make great presentations, just a collection of examples condensing excellent ideas and how to execute them.

The Better Deck Deck: 52 Alternatives to Bullet Points (Nolan Haims Creative): a deck of cards giving you 52 proven design alternatives to the dreaded bullet point layout, with over 150 professionally designed example slides to spark your creativity and improve your next presentation.

Before I embark on any presentation project, I flip through these cards to get those creative juices flowing. If you do lots of presentations, I recommend it!

SmartArt isn’t stupid once you tear it apart.

If you believe bullet points are not the right direction, fret not; Microsoft PowerPoint agrees with you too.

In recent years, Microsoft PowerPoint has been improving a feature named “SmartArt”. It is an excellent tool for converting your bullet points into anything else. Today, there are galleries of ideas for you to choose from, including 3D.

I’m just kidding. Please don’t choose a 3D SmartArt! I beg you!

Unfortunately, it’s too much of a good thing. SmartArt, as well as the fancier “Design Ideas”, doesn’t teach you which design is best for you, leading to stock templates and bullet points in various shapes and sizes. These days, I can tell which slides were designed by PowerPoint, not you.

Furthermore, because PowerPoint tightly merges design and content, it isn’t easy to customize SmartArt in any profound way. This includes translating any ideas you have from the reference guide in the previous tip to PowerPoint.

The key to making SmartArt do what you want is to convert it to shapes. Once your SmartArt loses its form, you can arrange and align your shapes in any way you please.

This is my current workflow: sketch the contours of my design in SmartArt, then refine it by converting it to shapes and working on them.

This opens many possibilities and speeds up your presentation production by quickly creating basic shapes and designs.

Don’t be a Presenter, Be a Coder: Slidev

Everybody loves PowerPoint because it is WYSIWYG — what you see is what you get. It’s easy to understand the tools, and there’s comfort in getting what you want right from the development stage. Most alternatives to PowerPoint like Google Slides and Apple Keynote also rely heavily on WYSIWYG.

WYSIWYG is excellent for novice users, but it’s easy to outgrow it.

The first problem with PowerPoint is that it’s easy to be overwhelmed by the number of options. We saw one of them already with the ugly and unreadable 3D SmartArt. You can also wreck your presentation with mismatched fonts and sizes and unprofessional-looking work.

If you’ve always wanted to try a different approach, give Slidev a go.

Slidev employs many features familiar to coders: Markdown syntax, installation using npm and hosting on a website. You also get to write your presentation in your IDE, version control it using git and include anything a website can do, such as embeds and CSS animations.
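To give you a taste, here is a minimal sketch of what a Slidev slides.md might look like. The headmatter and slide content are just examples; the convention of separating slides with `---` is Slidev's.

```markdown
---
theme: default
---

# Why bullet points fail

A slide is just a Markdown section.

---

# Version control your deck

- `git diff` shows exactly what changed between revisions
- web embeds and CSS animations work like on any website
```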

I can’t use Slidev at work because nobody knows how to read a markdown file, so don’t throw away your PowerPoint yet.

However, having tried it, I appreciated being able to condense the essential parts of making a presentation. Suppose you aren’t a coder but are very familiar with presentations. In that case, I recommend trying Slidev to get a taste of what it is like to code and increase your familiarity with web technologies.


I hope this post helped you with some new places to explore how to make your next presentation. During the pandemic, I searched high and low for ways to improve my presentations, and these are some of the things I found. Did you find something worth sharing? Feel free to let me know!

Read more: Would I still use Excel for Contract Management? Many people would like to use Excel to manage their contract data. After two years of operating such a system, would I still recommend it?

#tech #MicrosoftOffice #MicrosoftPowerPoint #Ideas #Slidev #Presentation #Books #Programming #Training




You must have worked hard to get here! We are almost at the end now.

Our journey took us from providing the user experience, figuring out what should happen in the background, and interacting with an external service.

In this part, we ask docassemble to provide a file for the user to download.

Provisioning a File

When we left part 2, this was our result screen.

    event: final_screen
    question: |
      Download your sound file here.
    subquestion: |
      The audio on your text has been generated.
      You can preview it here too.
      <audio controls>
       <source type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

There are two places where you need an audio file.

  1. In the “question”, a link to the file is provided in “here” for download.
  2. The audio preview widget (the thing which you click to play) also needs a link to the file to function.

Luckily for us, docassemble provides a straightforward way to deal with files on the server. Simply stated, create a variable of type DAFile to hold a reference to the file, save the data to the file and then use it for those links in the results screen.

Let’s get started. Add this block to your main.yml file.

    objects:
      - generated: DAFile

This block creates an object called “generated”, which is a DAFile. Now your interview code can use “generated”.

Add the new line to the mandatory block we created in Part 3.

    mandatory: True
    code: |
      # The next line is new (the filename here is just an example)
      generated.initialize(filename="speech.mp3")
      if tts_task.ready():
        final_screen
      else:
        waiting_screen

This code initialises “generated” by getting the docassemble server to provision it. If you use “generated” before initialising it, docassemble raises an error. 👻 (You will only get this error if you use the DAFile to create a file)

Now your background action needs access to “generated”. Pass it in a keyword parameter in the background action you created in Part 3.

    code: |
      tts_task = background_action(
        'bg_task', text_to_synthesize=text_to_synthesize, voice=voice,
        speaking_rate=speaking_rate, pitch=pitch,
        file=generated)  # This is the new keyword parameter

Now that your background action has the file, use it to save the audio content. Add the new lines below to bg_task that you also created in Part 3.

    event: bg_task
    code: |
      audio = get_text_to_speech(
        action_argument('text_to_synthesize'), action_argument('voice'),
        action_argument('speaking_rate'), action_argument('pitch'))
      # The next three lines are new
      file_output = action_argument('file')
      file_output.write(audio, binary=True)
      file_output.commit()
      background_response()

We assign the file to a new variable in the background task and then use it to write the audio (make sure it is in binary format, as MP3s are not text). After that, commit the file to save it on the server or your external storage, depending on your configuration. (The above methods are from DAFile. You can read more details about what they do and other methods here.)

Now that the file is ready, we can plunk it into our results screen. We provide URLs here so that your user can download the file from their browser; file paths would not work, because they point into the server’s file system, which the user’s browser cannot reach. Modify the lines in the results screen block.

    event: final_screen
    question: |
    # Modify the next line
      Download your sound file **[here](${generated.url_for(attachment=True)}).** 
    subquestion: |
      The audio on your text has been generated.
      You can preview it here too.
      <audio controls>
    # Modify the next line
       <source src="${generated.url_for()}" type="audio/mpeg"> 
       Your browser does not support playing audio.
      </audio>
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

To get the URL for a DAFile, use the url_for method. This gives you an address that can be used for downloads or in the web browser.


Congratulations! You are now ready to run the interview. Give it a go and see if you can download the audio of a text you would like spoken. (If you are still at the Playground, you can click “Save and Run” to ensure your work is safe and test it around a bit.)

This Text to Speech docassemble interview is relatively straightforward to me. Nevertheless, its simplicity also showcases several functions which you may want to be familiar with. Hopefully, you now have an idea of dealing with external services. If you manage to hook up something interesting, please share it with the community!

Bonus: Trapping errors and alerting the users

The code so far is enough to provide users with hours of fun (hopefully not at your expense). However, there are edge cases which you should consider if you plan to make your interview more widely available.

Firstly, while it's pretty clear in this tutorial that you should have updated your Configuration so that this interview can find your service account, this doesn't always happen for others. Admins might have overlooked it.

Add this code as the first mandatory code block of main.yml (before the one we wrote in Part 3):

    mandatory: True
    code: |
      if get_config('google') is None or 'tts service account' not in get_config('google'):
        if get_config('error notification email') is not None:
          send_email(to=get_config('error notification email'),
            subject='docassemble-Google TTS raised an error',
            body='You need to set service account credentials in your google configuration.')
        else:
          log('docassemble-Google TTS raised an error -- You need to set service account credentials in your google configuration.')
        message('Error: No service account for Google TTS', 'Please contact your administrator.')

Take note that if you add more than one mandatory block, they are called in the order of their appearance in the interview file. So if you put this after the mandatory code block defining our processes, the process gets called before checking whether we should run this code in the first place.

This code block does a few things. First, it checks whether there is a “google” directive with a “tts service account” sub-directive in the Configuration. If it doesn’t find any tts service account information, it checks whether the admin has set an error notification email in the Configuration. If so, the server sends an email to that address to report the issue. If not, it prints the error in docassemble.log, one of the logs on the server. (If the admin doesn’t check their email or logs, I am not sure how else we can help.)

This mandatory check before starting the interview is helpful to catch the most obvious error – no configuration. However, you can pass this check by putting nonsense in the “tts service account”. Google is not going to process this. There may be other errors, such as Google being offline.

Narrowing down every possible error will be very challenging. Instead, we will make one crucial check: whether the code actually saved a file at the end of the process. Even if we can’t tell the user what went wrong, at least we spare the user the confusion of finding out that there was no file to download.

First, let's write the code that makes the check. Add this new code block.

    event: file_check
    code: |
      path = generated.path()
      if not os.path.exists(path):
        if get_config('error notification email') is not None:
          send_email(to=get_config('error notification email'), 
            subject='docassemble-Google TTS raised an error', 
            body='No file was saved in this interview.' )
          log('docassemble-Google TTS raised an error -- No audio file was saved in this interview.')
        message('Error: No audio file was saved', 'We are not sure why. Please try again. If the problem persists, contact your administrator.')

This code checks whether the audio file (generated, a DAFile) is an actual file or an apparition. If it doesn't exist, the admin receives a message. The user is also alerted to the failure.

We would need to add a need directive to our results screen so that the check is made before the final screen to download the file is shown.

    event: final_screen 
    need:  # Add this line
      - file_check  # Add this line
    question: |
      Download your sound file **[here](${generated.url_for(attachment=True)}).**
    subquestion: |
      The audio on your text has been generated.
      You can preview it here too.
      <audio controls>
       <source src="${generated.url_for()}" type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

We also need to import the Python os.path standard library to make the check on our system. Add this new block near the top of our main.yml file.

    imports:
      - os.path

There you have it! The interview checks before you start whether there's a service account. It also checks before showing you the final screen whether your request succeeded and if an audio file is ready to download.

👈🏻 Go to the previous part.

☝🏻Return to the overview of this tutorial.

#tutorial #Python #Programming #docassemble #Google #TTS #LegalTech




So far, all our work is on our docassemble install, which has been quite a breeze. Now we come to the most critical part of this tutorial: working with an external service, Google Text to Speech. Different considerations come into play when working with others.

In this part, we will install a client library from Google. We will then configure the setup to interact with Google’s servers and write this code in a separate module. At the end of this part, your background action will be able to call the function get_text_to_speech and get the audio file from Google.

1. A quick word about APIs

The term “API” can be used loosely nowadays. Some people use it to describe how you can use a program or a software library. In this tutorial, an API refers to a connection between computer programs. Instead of a website or a desktop program, we’re using Python and docassemble to interact with Google Text to Speech. In some cases, like Google Text to Speech, an API is the only way to work with the program.

The two most common ways to work with an API over the internet are (1) using a client library or (2) RESTful APIs. There are pros and cons to each option. In this tutorial we are going to go with a client library that Google provides. This allows us to work with the API in a programming language we are familiar with, Python. RESTful APIs can have more support and features than a client library (especially if your programming language is not popular). Still, you’d need to know your way around the requests and similar packages if you want to use them in Python.
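To show the difference, here is a hedged sketch of what the same request might look like over the REST API without a client library. The endpoint and camelCase field names follow Google's REST reference, but the helper functions, default voice name and API-key authentication are illustrative assumptions, not code from this tutorial.

```python
# Sketch of calling Google Text-to-Speech over REST (assumptions noted above).
import base64
import json
from urllib import request

TTS_ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"

def build_payload(text, voice_name="en-US-Wavenet-A",
                  speaking_rate=1.0, pitch=0.0):
    # the JSON body mirrors the options the client library takes as objects
    return {
        "input": {"text": text},
        "voice": {"languageCode": "en-US", "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3",
                        "speakingRate": speaking_rate, "pitch": pitch},
    }

def synthesize_rest(text, api_key):
    req = request.Request(
        f"{TTS_ENDPOINT}?key={api_key}",
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # network call: needs a valid key
        # REST responses wrap the MP3 bytes in a base64 string
        return base64.b64decode(json.load(resp)["audioContent"])

payload = build_payload("Hello from docassemble", speaking_rate=1.2)
```

Notice that with REST you would handle authentication, JSON shaping and base64 decoding yourself, which is exactly the plumbing the client library hides.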

2. Install the Client Library in your docassemble

Before we can start using the client library, we need to ensure that it’s there in our docassemble install. Programming in Python can be very challenging because of dependency and environment issues.

Luckily, you will not face this problem if you’re using docker for your docassemble install (which most people do). Do this:

  1. Leave the Playground and go to another page called “Package Management”. (If you don’t see this page, you need to be either an admin or a developer)
  2. Under Install or update a package, specify google-cloud-texttospeech as the package to find on PyPI
  3. Click Update, and wait for the screen to show that the install is OK. (This takes time as there are quite a few dependencies to install)
  4. Verify that the google-cloud-texttospeech package has been installed by checking out the list of packages installed.

3. Set up a Text To Speech service account in docassemble

At this point, you should have obtained your Google Cloud Platform service account so that you can access the Text to Speech API. If you haven’t done so, please follow the instructions here. Take note that we will need your key information in JSON format. You don’t need to “Set your authentication environment variable” for this tutorial.

If you have not realised it yet, the key information in JSON format is a secret. While Google’s Text to Speech platform has a generous free tier, the service is not free. So, expect to pay Google if somebody with access to your account tries to read The Lord of the Rings trilogy. In line with best practices, secrets should be kept in a private and secure place, which is not your code repository. Please don’t include your service account details in your playground files!

Luckily, you can store this information in docassemble’s Configuration, which someone can’t access without an admin role and is not generally publicly available. Let’s do that by creating a directive google with a sub-directive of tts service account. Go to your configuration page and add these directives. Then fill out the information in JSON format you received from Google when you set up the service account.

Once you are done, the google directive in your Configuration will contain the service account’s JSON key information under the tts service account sub-directive.

4. Putting it all together in the module

Now that our environment is set up, it’s time to create our get_speech_from_text function.

Head back to the Playground, look for the dropdown titled “Folders”, click it, then select “Modules”.

Look for the editor and rename the file google_tts.py. This is where you will enter the code to interact with Google Text to Speech. If you recall, in part 3 we had left out a function named get_text_to_speech. We were also supposed to feed it with the answers we collected from the interviews we wrote in part 2. Let’s enter the signature of the function now.

    def get_text_to_speech(text_to_synthesize, voice, speaking_rate, pitch):
        # Enter more code here

Since our task is to convert text to speech, we can follow the code in the example provided by Google.

A. Create the Google Text-to-Speech client

Using the Python client library, we can create a client to interact with Google’s service.

We need credentials to use the client to access the service. This is the secret you set up in step 3 above. It’s in docassemble’s configuration, under the directive google with a sub-directive of tts service account. Use docassemble’s get_config to look into your configuration and get the secret tts service account as a JSON.

With the secret to the service account, you can pass it to the class factory function and let it do the work.

    def get_text_to_speech(text_to_synthesize, voice, speaking_rate, pitch):
        from google.cloud import texttospeech
        import json
        from docassemble.base.util import get_config
        credential_info = json.loads(get_config('google').get('tts service account'), strict=False)
        client = texttospeech.TextToSpeechClient.from_service_account_info(credential_info)

Now that the client is ready with your service account details, let's get some audio.

B. Specify some options and submit the request

The primary function to request Google to turn text into speech is synthesize_speech. The function needs a bunch of stuff — the text to convert, a set of voice options, and options for your audio file. Let’s create some with the answers to the questions in part 2. Add these lines of code to your function.

The text to synthesise:

    input_text = texttospeech.SynthesisInput(text=text_to_synthesize)

The voice options:

    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", name=voice)  # the language code here is an example

The audio options:

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3, speaking_rate=speaking_rate, pitch=pitch)

Note that we did not allow all the options to be customised by the user. You can go through the documentation yourself to figure out which options you need and which would only worry the user. If you think the user should have more options, you’re free to write your own questions and modify the code.

Finally, submit the request and return the audio.

    response = client.synthesize_speech(
            request={"input": input_text, "voice": voice, "audio_config": audio_config})
    return response.audio_content

Voila! The client library could call Google using your credentials and get your personalised result.

5. Let’s go back to our interview

Now that you have written your function, it’s time to let our interview know where to find it.

Go back to the playground, and add this new block in your main.yml file.

    modules:
      - .google_tts

This block tells the interview that some of our functions (specifically, the get_text_to_speech function) are found in the google_tts module.


At the end of this part, you have written your module and included it in your main.yml. You should also know how to install your python package to docassemble and edit your configuration file.

Well, that leaves us with only one more thing to do. We’ve got our audio content; now we just need to get it to the user. How do we do that? What’s that? DAFile? Find out in the next part.

👉🏻 Go to the final part.

👈🏻 Go back to the previous part.

☝🏻 Check out the overview of this tutorial.

#tutorial #docassemble #LegalTech #Google #TTS #Programming #Python




In Part 2, we managed to write a few questions and a result screen. Even with all that eye candy, you will notice that you can’t run this interview. There is no mandatory block, so docassemble does not know what it needs to do to run the interview. In this part, I will talk about the code block required to run the interview, forming the foundation of our next steps.

1. The backbone of this interview

In ordinary docassemble interviews, your endpoint is a template or a form. For this interview, the endpoint is a sound file. So let’s start with this code block. It tells the reader how the interview will run. Since it is a pretty important block, we should put it near the top of the interview file, maybe right under the meta block. (In this tutorial, the order of blocks does not affect how the interview will run. Or at least until you get to the bonus section.)

    mandatory: True
    code: |
      tts_task
      final_screen

If you recall, the user downloads the audio file in the final screen.

So this mandatory code block asks docassemble to “do” the tts_task and then show the final screen. Now we need to define tts_task, and your interview will be ready to run.

So what should tts_task be? The most straightforward answer is that it is the result of the API call to create the sound file. You can write a code block that gets and returns a sound file and assigns it to tts_task.

Well, don’t write that code block yet.

2. Introducing the background action

If you call an API on the other side of the internet, you should know that many things can happen along the way. For example, it takes time for your request to reach Google, then for Google to process it using its fancy robots, send the result back to your server, and then for your server to send it back to the user. In my experience, Google’s and docassemble’s latency is quite good, but it is still noticeable with large requests.

A user is not supposed to notice that a program lags when the interview runs well. If the user realises that the interview is stuck on the same page for too long, the user might get worried that the interview is broken. The truth is that we are waiting for the file to come back. Get back in your chair and wait for it!

To improve user experience, you should have a waiting screen where you tell the user to hold his horses. While this happens, the interview should work in the background. In this manner, your user is assured everything is well while your interview focuses on getting the file back from Google.

docassemble already provides a mechanism for the docassemble program to carry out background actions. It’s aptly called background_action().

Check out a sample of a background action by reading the first example block (“Return a value”) under the Background processes category. Modify our mandatory code block by following the mandatory code block in the example. It should look like this.

    mandatory: True
    code: |
      if tts_task.ready():
        final_screen
      else:
        waiting_screen

So now we tell docassemble to do the tts_task, our background task. Once it is ready, show the final screen. If the task is not ready, show the waiting screen.

3. Define the background action

Now that we have defined the interview flow, it’s time to do the background action. In the spirit of docassemble, we do this by defining tts_task.

The next code block defines the task. Adapt this example into our interview file as follows.

    code: |
      tts_task = background_action(
        'bg_task', text_to_synthesize=text_to_synthesize,
        voice=voice, speaking_rate=speaking_rate, pitch=pitch)

So we have defined tts_task as a background action. The background action function has two kinds of arguments.

The first positional argument (“bg_task”) is the name of the code block that the background action should execute in the background.

The other keyword arguments are the information you need to pass to this background action like the text_to_synthesize, voice etc. These options you answered earlier during this interview will now be used for this background action. Defining your variables here in a mandatory block indirectly also ensures that docassemble will look for the answers for these variables before performing this code block.

So why do you need to define all the variables in this way? Don’t forget that the background action is a separate process from the rest of your interview so they don’t share the same variables. To enable these processes to share their information, you pass on the variables from the main interview process to the background action.
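To see why this matters, here is a toy model in plain Python. It is not the real docassemble API; it only mimics the shape of background_action and action_argument to show that the background task sees nothing except what was explicitly passed in.

```python
# Toy model of docassemble's background_action / action_argument hand-off.
# (docassemble would serialise the kwargs and launch a separate process;
# here we just stash them and run the task in-line.)

_current_args = {}

def background_action(task_name, **kwargs):
    global _current_args
    _current_args = kwargs          # the only state the task will see
    return TASKS[task_name]()

def action_argument(name):
    # the background task fetches back exactly what was passed in
    return _current_args[name]

def bg_task():
    # no access to the interview's other variables, only the kwargs
    return action_argument("text_to_synthesize").upper()

TASKS = {"bg_task": bg_task}

result = background_action("bg_task", text_to_synthesize="hello world")
```

If bg_task tried to read a variable that was never passed as a keyword argument, the lookup would fail, which is the same reason the real background action needs every answer handed over explicitly.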

4. Perform the background action

We have defined the background action. Now let’s code what happens inside the background action.

The background action is defined in an event called bg_task. Now add a new code block as follows:

    event: bg_task
    code: |
      audio = get_text_to_speech(
        action_argument('text_to_synthesize'), action_argument('voice'),
        action_argument('speaking_rate'), action_argument('pitch'))
      background_response()

So in this code block, we say that the audio is obtained by calling a function named get_text_to_speech. For get_text_to_speech to produce an audio file, it requires the answers to the questions you asked the user earlier. As a background process, it gets access to the variables you defined earlier through the keywords of the background_action function by calling action_argument.

Once get_text_to_speech is completed, we call background_response(). Calling background_response is important for a background action as it tells docassemble that this is the endpoint for the background action. Make sure you don’t leave your background action without it.

5. Provide a waiting screen

Before we leave the example block for background processes, let’s add the question block that tells the user to wait for their audio file. Find the block which defines waiting_screen, and adapt it for your interview as follows.

    event: waiting_screen
    question: |
      Hang tight.
      Google is doing its magic.
    subquestion: |
      This screen will reload every
      few seconds until the file
      is available.
    reload: True

By adding reload: True to the block, you tell docassemble to refresh the screen every 10 seconds. This helps the user to believe that they only need to be patient and some “magic” is going on somewhere.


In the next part of the tutorial, we will dive into get_text_to_speech. (What else, right?) We will need to call Google’s Text-to-Speech API to do this. If you found it easy to follow the code blocks in this part of the tutorial, we will kick this up a notch — the next file we will be working on ends with a “.py”.

👉🏻 Go ahead to the next part

👈🏻 Go to the previous part

👈🏻 Check out the overview of this tutorial.

#tutorial #docassemble #Programming #Python #Google #TTS
