Love.Law.Robots. by Ang Hou Fu

Programming


Introduction

In Part 1, we talked about what we will do and the things you need to follow in this tutorial. Let’s get our hands dirty now!

We are going to get the groundwork done by creating four pages. The first page gets the text to be turned into speech. The second page chooses the voice which Google will use to generate the audio. The third page edits some attributes in the production of the audio. The last page is the results page to download the spoken text.

If you are familiar with docassemble, nothing here is exciting, so you can skip this part. If you’re very new to docassemble, this is a gentle way to introduce you to getting started.

1. All projects begin like this

Log in to your docassemble program and go to the Playground. We will be doing most of the work here.

If you’re working from a clean install, your screen probably looks like this.

The default new project in docassemble's Playground.

  1. Let’s change the name of the interview file from test.yml to main.yml.
  2. Delete all the default blocks/text in the interview file. We are going to replace it with the blocks for this project.

You will have a clean main.yml file at the end.

2. I never Meta an Interview like you

I like to start my interview file with a meta block that tells someone about this interview and who made it.

It’s not easy to remember what a meta block looks like every time. You can use the example blocks in the playground to insert template blocks and modify them.

The example blocks also link to the relevant part of the documentation for easy reference. (It’s the blue “View Documentation” button.)

You should also use the example blocks as much as possible when you’re new to docassemble and writing YAML files. If you keep using those example blocks, you will not forget to separate your blocks with --- and you will minimise errors about indents and lists. After some practice (and lots of mistakes), you should be familiar with the syntax of a YAML file.

So, even though the example blocks section is found below the fold, you should not leave home without it.

You can write anything you like in the meta block as it’s a reference for other users. The field title, for example, is shown as the name of the interview on the “Available Interviews” page.

For this project, this is the meta block I used.

    metadata:
      title: |
        Google TTS Interview
      short title: |
        Have Google read your text
      description: |
        This interview produces a sound file based 
        on the text input by the user and other options.
      revision_date: 2022-05-01

3. Let’s write some questions

This is probably the most visual part of the tutorial, so enjoy it!

An easy way to think about question blocks is that they represent a page in your interview. As long as docassemble can find question blocks that answer all the variables it needs to finish the interview, you can organise and write your question block as you prefer.

So, for example, you can add this text box block which asks you to provide the input text. You can find the example text box block under the Fields category. (Putting no label allows the block to appear as if only one variable is set in this question.)

    question: |
      Tell me what text you would like Google to voice.
    fields:
      - no label: text_to_synthesize
        input type: area
      - note: |
          The limit is 5000 characters. (Short paragraphs should be fine)

You can also combine several questions on one page like this question for setting the audio options. Using the range slider example block under the Fields category, you can build this block.

    question: |
      Modify the way Google speaks your text.
    subquestion: |
      You can skip this if you don't need any modifications.
    fields:
      - Speaking pitch: pitch
        datatype: range
        min: -20.0
        max: 20.0
        default: 0
      - note: |
          20 means increase 20 semitones from the original pitch. 
          -20 means decrease 20 semitones from the original pitch. 
      - Speaking rate/speed: speaking_rate
        datatype: range
        min: 0.25
        max: 4.0
        default: 1.0
        step: 0.1
      - note: |
          1.0 is the normal native speed supported by the specific voice. 
          2.0 is twice as fast, and 0.5 is half as fast.

Notice that I have set constraints and defaults in this block based on the documentation of the various options. This helps the user avoid the pesky and demoralising error messages that the external API returns when it receives unacceptable values.
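As a rough illustration of the same idea in code, the limits could also be checked in Python before a request is sent. This is only a sketch: the function name `check_tts_options` is hypothetical, and the limits are simply the ones used in the question blocks above (5000 characters, pitch from -20 to 20, speaking rate from 0.25 to 4.0).

```python
def check_tts_options(text, pitch=0.0, speaking_rate=1.0):
    """Return a list of problems found; an empty list means the values look fine.

    Hypothetical helper: the limits mirror the question blocks above.
    """
    problems = []
    if not text.strip():
        problems.append("The text is empty.")
    if len(text) > 5000:
        problems.append("The text is longer than 5000 characters.")
    if not -20.0 <= pitch <= 20.0:
        problems.append("The pitch must be between -20 and 20 semitones.")
    if not 0.25 <= speaking_rate <= 4.0:
        problems.append("The speaking rate must be between 0.25 and 4.0.")
    return problems


print(check_tts_options("Hello, World!", pitch=25))
```

Setting `min`, `max` and `default` in the question block does this job for the user up front; a check like this is only a second line of defence.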

A common question for a newcomer is: how should I present a question to a user? You can use a list of choices like the one below. (Build this question using the Radio buttons example block under the Multiple Choice category.)

    question: |
      Choose the voice that Google will use.
    field: voice
    default: en-US-Wavenet-A
    choices:
      - en-US-Wavenet-A
      - en-US-Wavenet-B
      - en-US-Wavenet-C
      - en-US-Wavenet-D
      - en-US-Wavenet-E
      - en-US-Wavenet-F
      - en-US-Wavenet-G
      - en-US-Wavenet-H
      - en-US-Wavenet-I
      - en-US-Wavenet-J
    under: |
      You can preview the voices [here](<https://cloud.google.com/text-to-speech/docs/voices>).

An interesting side question: When do I use a slider or a text entry box?

It depends on the kind of information you want. If you input numbers, the field's datatype should be a number. If you’re making a choice, a list of options works better.

Honestly, it takes some experience to figure out what works best. Think about all the online forms you have experienced and what you liked or did not like. To gain experience quickly, you can experiment by trying different fields in docassemble and asking yourself whether it gets the job done.

4. The Result Screen

Now that you have asked all your questions, it’s time to give your user the answer.

The result screen is shown when Google’s API has processed the user’s request and sent over the mp3 file containing the synthesised speech. In the result screen, you will be able to download the file. It’s also helpful to allow the user to preview the sound file so that the user can go back and modify any options.

    event: final_screen
    question: |
      Download your sound file here.
    subquestion: |
      The audio on your text has been generated.
      
      You can preview it here too.
      
      <audio controls>
       <source type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

Note: This image shows the completed file with links on how to download it. The reference question block above does not contain any links.

You will notice that I used an HTML audio tag in the subquestion to provide my media previewer. Note that you can use HTML tags in your Markdown text if docassemble does not have an option that meets your needs. However, how the HTML renders may vary from browser to browser, so test as much as possible and avoid complex HTML.

Preview: Let’s do some actual coding

If you followed this tutorial carefully, your main.yml will have a meta block, 3 question blocks and one results screen.

There are a few problems now:

  • You cannot run the interview. The main reason is that there’s no “mandatory” block, so docassemble does not know what it needs to execute to finish the job.
  • The results screen does not contain a link to download or a media to preview.
  • We haven’t even asked Google to provide us with a sound file.

In the next part, we will go through the overall logic of the interview and do some actual coding. Once you are ready, head on over there!

👉🏻 Head to the next part.

👈🏻 Go back to the previous part.

☝🏻 Check out the overview of this tutorial.

#tutorial #docassemble #TTS #Google #Python #Programming


Most people associate docassemble with assembling documents using guided interviews. That’s in the name, right? The program asks a few questions and out pops a completed form, contract or document. However, the documentation makes it quite clear that docassemble can do more:

Though the name emphasizes the document assembly feature, docassemble interviews do not need to assemble a document; they might submit an application, direct the user to other resources on the internet, store user input, interact with APIs, or simply provide the user with information.

In this post, let’s demonstrate how to use docassemble to call an API, get a response and provide it to a user. You can check out the completed code on GitHub. (NB: the git branch I would recommend for following this post is blog. I am actively using this package, so I may add new features to the main branch that I don’t discuss here.)

Problem Statement

I do a lot of internal training on various legal and compliance topics. I think I am a pretty all right speaker, but I have my limitations: I can’t give presentations 24/7, and my performance varies from session to session. Wouldn’t it be nice if I could give a presentation at any time, engagingly and consistently?

I could record my voice, but I did not like the result.

I decided to use a text-to-speech program instead, like the one provided by Google Cloud Platform. I created a computerised version of my speech in the presentation. My audience welcomed this version as it was more engaging than a plain PowerPoint presentation. Staff whose first language was not (Singapore) English also found the voice clear and understandable.

The original code was terminal based. I detailed my early exploits in this blog post last year. The script was great for developing something fast. However, as more of my colleagues became interested in incorporating such speech in their presentations, I needed something more user-friendly.

I already have a docassemble installation at work, so it seemed convenient to build on that. The program would have to do the following:

  • Ask the user what text they want to transform into speech
  • Allow the user to modify some properties of the speech (speed, pitch etc.)
  • Call Google TTS API, grab the sound file and provide it to the user to download

Assumptions

To follow this tutorial, you will need the following:

  • A working docassemble install. You can start up an instance on your laptop by following these instructions.
  • A Google Cloud Platform (GCP) account with a service account enabled for Google TTS. You can follow Google’s instructions here to set one up.
  • Use the Playground provided in docassemble. If you'd like to use an IDE, you can, but I won’t be providing instructions like creating files to follow a docassemble package's directory structure.
  • Some basic knowledge about docassemble. I won’t be going through in detail how to write a block. If you can follow the Hello World example, you should have sufficient knowledge to follow this tutorial.

A Roadmap of this Tutorial

In the next part of this post, I talk about the thinking behind creating this interview and how I got the necessary information (off the web) to make it.

In Part 2, we get the groundwork done by creating four pages. This provides us with a visual idea of what happens in this interview.

In Part 3, I talk about docassemble's background action and why we should use it for this interview. Merging the visual requirements with code gives us a clearer picture of what we need to write.

In Part 4, we work with an external API by using a client library for Python. We install this client library in our docassemble's Python environment and write a Python module.

In Part 5, we finish the interview by coding the end product: an audio file in the guise of a DAFile. You can run the interview and get your text transformed into speech now! I also give some ideas of what else you might want to do in the project.

Part 1: Familiarise yourself with the requirements

To write a docassemble interview, it makes sense to develop it backwards. In a simple case, you would like docassemble to fill in a form. So you would get a form, figure out its requirements, and then write questions for each requirement.

An API call is not a contract or a form, but your process is the same.

Based on Google’s quickstart, this is the method in the Python library which synthesises speech.

    from google.cloud import texttospeech

    # Instantiate a client (it picks up your service account credentials)
    client = texttospeech.TextToSpeechClient()

    # Set the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(text="Hello, World!")

    # Build the voice request, select the language code ("en-US") and the ssml
    # voice gender ("neutral")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
    )

    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

From this example code, you need to provide the program with the input text (synthesis input), the voice, and audio configuration options to synthesise speech.

That looks pretty straightforward, so you might be tempted to dive into it immediately.

However, I would recommend going through the documents provided online.

  • docassemble provides some of the most helpful documentation, great for varying proficiency levels.
  • Google’s Text-to-Speech documentation is more typical of a product offered by a big tech company. Demos, use cases and guides help you get started quickly. You’re going to have to dig deep to find the documentation for Python, though. It receives less love than the other programming languages.

Reading the documentation, especially if you want to use a third-party service, is vital to know what’s available and how to exploit it fully. For example, going through the docs is the best way to find out what docassemble is capable of and learn about existing features, such as transforming a Python list of strings into a human-readable list complete with an “and”.
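To give a feel for that last feature, here is a minimal Python sketch of the idea. It is not docassemble's own code: `human_list` is a hypothetical stand-in, and the choice of an Oxford comma is an assumption made for illustration.

```python
def human_list(items):
    """Join strings into a readable list, e.g. 'a, b, and c'.

    Hypothetical stand-in for the docassemble feature, not its actual code.
    """
    items = [str(item) for item in items]
    if len(items) == 0:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return items[0] + " and " + items[1]
    # Three or more items: commas between items, "and" before the last one.
    return ", ".join(items[:-1]) + ", and " + items[-1]


print(human_list(["speed", "pitch", "voice"]))
```

If a feature like this already exists in the documentation, you don't need to write it yourself, which is the point of reading the docs first.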

You don’t have to follow the quickstart if it does not meet your use case. Going through the documentation, I figured out that I wanted to give the user a choice of which voice to use rather than letting Google select that for me. Furthermore, audio options such as the speaking speed will be handy, since non-native listeners may appreciate slower speech. Also, I don’t think I need the user to select a specific file format, as mp3s should be fine.
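Those choices can be sketched as a request. The example below is an illustration under stated assumptions, not the code used later in this tutorial: it builds the JSON body accepted by Google's REST endpoint for speech synthesis instead of using the Python client library, and the helper name `build_tts_request` is hypothetical. It also assumes that the language code is the first two parts of the voice name (for example, `en-US` in `en-US-Wavenet-A`).

```python
import json


def build_tts_request(text, voice_name="en-US-Wavenet-A", speaking_rate=1.0, pitch=0.0):
    """Build a request body for a text-to-speech call (hypothetical helper)."""
    # Assumption: voice names look like "en-US-Wavenet-A", so the language
    # code is the first two hyphen-separated parts.
    language_code = "-".join(voice_name.split("-")[:2])
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": "MP3",  # fixed: the user is not asked for a format
            "speakingRate": speaking_rate,
            "pitch": pitch,
        },
    }


print(json.dumps(build_tts_request("Hello, World!", speaking_rate=0.8), indent=2))
```

Each value the user can change (the text, the voice, the speaking rate and the pitch) is a variable the interview has to ask about, while the file format is hard-coded.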

Let’s move on!

This was a pretty short one. I hope I got you curious and excited about what comes next. Continue to the next part, where we get started on a project!

👉🏻 Head to the next part of this tutorial!

#tutorial #docassemble #Python #Programming #TTS #Google


While learning how to write computer programs after stepping away from practice, I discovered something odd about the legal profession. There are books for the legal profession on how to write a good submission, structure deals, and run a law firm. But there aren't any books on how to love being a lawyer.

Oddly, such books are more common for programmers. Maybe it's because programmers are more likely to share their knowledge with others. Or maybe there are far more programmers than lawyers.

This post is essentially a book review about The Passionate Programmer (2nd Edn). This is the third book about programmers I've read so far. The first two were The Developer's Code and The Agile Samurai.

The Passionate Programmer (2nd edition), by Chad Fowler

I enjoyed reading such books. I wanted to learn more about how programmers think and what software development practices should look like.

Transfer learning for your career

But wait a second? Am I seriously considering changing my career from a lawyer to a programmer? No more crappy contracts to read. Let's switch to reading crappy code?

One of the subplots in The Passionate Programmer was how the author, Chad Fowler, managed to turn his experiences as a jazz musician (see the saxophone on the cover?) into insights to advance his career as a programmer. Such insights include hanging out with people who challenge you to do better, the value of mentorship, and what you should be looking for when practising coding or music.

In a world where books on how to develop your legal career are non-existent, transferring your learning from one sector to another is your best bet. With a bit of imagination, I started to see parallels between the advice being doled out to programmers and things I should have done as a young lawyer. In those dark days when I had no idea what to do after leaving legal practice, some light in a different colour was better than none.

Most, if not all, lawyers are intelligent. I am sure many young lawyers would have figured out some, if not most, of the things in the book. Ultimately though, one lifetime is too short to figure out everything on your own. You have to enjoy your career as well!

Here are three lessons I learned from the book which surprised me:

Tip 49: Fat man in the mirror

Like Chad Fowler, I found myself getting fat lately. The experience, as lucidly explained in the book, is as follows:

I can’t tell, because I see me too often. If you’re constantly exposed to something, it’s hard to see it changing unless change happens rapidly. If you sit and watch a flower bloom, it will take a long time to notice that anything has happened. However, if you leave and come back in two days, you’ll see something very noticeably different from when you left.

Our careers are very much like this. Did I find myself leaving the profession after one lousy hearing? Or one bad experience with my bosses? Maybe this is how I would like to remember it. But it doesn't explain why I left this time when I was resilient in others.

It turns out that frustration and unhappiness build up over time. One fine day you decide to take a hard look at yourself and realise this is not how you want your life to be. It happens quickly, and you can't stop or control it. Not a good way to go.

One of my strangest experiences going into the corporate world was performance reviews. I had no such thing in my law firms. The only indicator I had that I was doing a good job in a law firm was my pay raises and title changes. Without regular performance reviews, you don't check whether you are doing any good work. You don't even check on whether you are doing yourself any good.

It's essential to check on yourself regularly, and Chad Fowler suggests that you should do something more concrete, like putting your objectives in writing. You might suddenly find a fat man in the mirror if you don't.

Tip 50: The South Indian Monkey Trap


Here's a funny story from the book.

A town in South India had a monkey problem 🐒. To cull the population, they designed a trap. Monkeys love snatching food from the townsfolk. So, they dug a hole in the ground and put some rice there, but made the entrance narrow. A monkey would stick its hand into the hole to grab the rice, making a fist. However, due to the narrowness of the opening, the fist could not come out of the hole, and the monkey was stuck. The trap works because the monkey refuses to open its fist of rice to escape—game over for the monkey.

The story tells us that there are some ideas we refuse to let go of, even though they would help us to escape our traps. Here's how the concept is described in the book:

Value rigidity is what happens when you believe in the value of something so strongly that you can no longer objectively question it. The monkeys valued the rice so highly that when forced to make the choice between the rice and captivity or death, they couldn’t see that losing the rice was the right thing to do at the time. The story makes the monkeys seem really stupid, but most of us have our own equivalents to the rice.

I started wondering what my equivalent to the rice was.

One of them is that even after I became in-house counsel, I still cling to the idea that I am and will remain a lawyer. My mid-term goal is to head a legal department and, at some time, maybe be a general counsel.

However, is this what my company needs? Am I only good at doling out legal advice, or did I develop leadership and other skills which would be beneficial in other areas? Or is this really about what my comfort zones are?

There's nothing wrong with having fixed goals, of course. However, you're probably doing yourself a disservice if they become a quagmire or restrict your imagination of what you could become.

Tip 51: Don't Plan your career like a Waterfall


Here's a sign of how much agile software development practices have taken hold nowadays: You have no idea what “waterfall” is.

The book describes “waterfall” software development as a “top-down, heavily planned, rigorous process”. It's usually used in a pejorative sense.

Most young Singaporeans have probably developed their career ideas using a “waterfall”. Requirements are designed right from the start and handed to you to implement. To be a lawyer, you must get a law degree from NUS. To be a good lawyer, you need to join a big law firm.

I have to admit that the plan sounds natural, almost a responsible way to achieve your goals. However, the best laid plans are often laid to waste.

According to The Passionate Programmer, agile software practices work because changing code is, in its nature, cheap. Agile responds to that by developing processes that react well to constant change. The focus is not on thorough documentation or grand designs but on what users want.

Some people might baulk at the idea that changing careers is as cheap as changing code. Of course, going from lawyer to actor is a huge change, but other changes are less extreme, such as moving to legal support or becoming in-house counsel. Either way, it's essential to develop processes for your career that react well to constant change.

Set big goals and make constant corrections along the way.

Such corrections were necessitated by life changes too. One of the prime reasons I had to leave the profession was to care for my young family. They needed me around more at this point. However, I also recognised the need to keep an eye on the profession. Maybe there's a day I would go back, but it's got to fit my ultimate goals.

There's another thing about planning your career like a waterfall. Very often, these grand plans are imposed by others. I learnt that the hard way. Of course, your bosses want a good associate. Of course, the legal profession wants more lawyers in practice. But those were other people's goals, not mine, and I was following their plans.

I didn't know what I wanted at the time (or even at this time), so going with the flow was a workable option. However, this is your career: being successful is great for others, but is it at your expense? Having a career plan that reacts quickly to change will be fantastic if you have no idea what you want.

Conclusion

The 3 tips I shared were the ones I found more meaningful, but there are many other ideas, big and small, that are easy to understand and act on. As a book for programmers, it is certainly not one that a lawyer can apply straight away. However, it is written in an encouraging and insightful way. So, as I wondered about my own situation, I felt that I do have options in my current station. It’s a wonderful feeling.

#BookReview #Lawyers #Programming


Introduction

In January 2022, the 2020 Revised Edition of over 500 Acts of Parliament (the primary legislation in Singapore) was released. It’s a herculean effort to update so many laws in one go. A significant part of that effort, which came out of an initiative named Plain Laws Understandable by Singaporeans (or PLUS), is to “ensure Singapore’s laws are understandable and accessible to the public”.

Keeping Singapore laws accessible to all – AGC, together with the Law Revision Committee, has completed a universal revision of Singapore’s Acts of Parliament! pic.twitter.com/76TnrNCMUq

— Attorney-General's Chambers Singapore (@agcsingapore) December 21, 2021

After reviewing the list of changes they made, such as replacing “notwithstanding” with “despite”, I frankly felt underwhelmed. An earlier draft of this article was titled “PLUS is LAME”. The revolution is not forthcoming.

I was bemused by my strong reaction to a harmless effort with noble intentions. It led me to wonder how to evaluate a claim, such as whether and how much changing a bunch of words would lead to a more readable statute. Did PLUS achieve its goals of creating plain laws that Singaporeans understand?

In this article, you will be introduced to well-known readability statistics such as Flesch Reading Ease and see them applied to laws in Singapore. If you like to code, you will also be treated to some Streamlit, Altair-viz and Python Kung Fu, and all the code involved can be found in my GitHub repository.

GitHub – houfu/plus-explorer: A streamlit app to explore changes made by PLUS. The code used in this project is accessible in this public repository.

How would we evaluate the readability of legislation?


When we say a piece of legislation is “readable”, we are saying that a certain class of people will be able to understand it when they read it. It also means that a person encountering the text will be able to read it with little pain. Thus, “Plain Laws Understandable by Singaporeans” suggests that most Singaporeans, not just lawyers, should be able to understand our laws.

In this light, I am not aware of any tool in Singapore or elsewhere which evaluates or computes how “understandable” or readable laws are. Most people, especially in the common law world, seem to believe in their gut that laws are hard and out of reach for most people except for lawyers.

In the meantime, we would have to rely on readability formulas such as Flesch Reading Ease to evaluate the text. These formulas rely on semantic and syntactic features, in practice measured by proxies such as word length and sentence length, to calculate a score or index which shows how readable a text is. Some of these formulas, like Gunning Fog and Dale-Chall, map their scores to US grade levels. Very approximately, these translate to years of formal education. A US Grade 10 student would, for example, be equivalent to a Secondary Four student in Singapore.
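To make this concrete, here is a minimal sketch of the Flesch Reading Ease formula, 206.835 − 1.015 × (words ÷ sentences) − 84.6 × (syllables ÷ words), in plain Python. The syllable counter is a crude vowel-group heuristic that I am assuming for illustration, so its scores will differ from those of a proper readability library. The two sample sentences are made up and are not taken from any statute.

```python
import re


def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("Need at least one sentence and one word.")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


# Made-up sentences for illustration only.
before = "Notwithstanding anything in this section, the Minister may decide."
after = "Despite anything in this section, the Minister may decide."
print(round(flesch_reading_ease(before), 1), round(flesch_reading_ease(after), 1))
```

Because the formula only counts words, sentences and syllables, swapping a long word for a shorter one nudges the score up without the sentence structure changing at all.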

After months of mulling about, I decided to write a pair of blog posts about readability: one that's more facts oriented: (https://t.co/xbgoDFKXXt) and one that's more personal observations (https://t.co/U4ENJO5pMs)

— brycew (@wowitisbryce) February 21, 2022

I found these articles to be a good summary and valuable evaluation of how readability scores work.

These formulas were created a long time ago and for different fields. For example, Flesch Reading Ease dates from the 1940s, and the related Flesch-Kincaid Grade Level was developed under contract to the US Navy in 1975. In particular, using a readability statistic like FRE, you can tell whether a book is suitable for your kid.

I first considered using these formulas when writing interview questions for docassemble. Sometimes, some feedback can help me avoid writing rubbish when working for too long in the playground. An interview question is entirely different from a piece of legislation, but hopefully, the scores will still act as a good proxy for readability.

Selecting the Sample


To evaluate the claim, two pieces of information regarding any particular section of legislation are needed – the section before the 2020 Edition and the section in the 2020 Edition. This would allow me to compare them and compute differences in scores when various formulas are applied.

I reckon it’s possible to scrape the entire website of statutes online, create a list of sections, select a random sample and then delve into their legislative history to pick out the sections I need to compare. However, since there is no API to access statutes in Singapore, it would be a humongous and risky task to parse HTML programmatically and hope it is created consistently throughout the website.

Mining PDFs to obtain better text from Decisions: After several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner. (Love.Law.Robots.)

In one of my favourite programming posts, I extracted information from PDFs, even though the PDPC used at least three different formats to publish their decisions. Isn’t Microsoft Word fantastic?

I decided on an alternative method which I shall now present with more majesty:

The author visited the subject website and surveyed various acts of Parliament. When a particular act was chosen by the author through his natural curiosity, he evaluated the list of sections presented for novelty, variety and fortuity. Upon recognising his desired section, the author collected the 2020 Edition of the section and compared it with the last version immediately preceding the 2020 Edition. All this was performed using a series of mouse clicks, track wheel scrolling, control-Cs and control-Vs, as well as visual evaluation and checking on a computer screen by the author. When the author grew tired, he called it a day.

I collected over 150 sections as a sample and calculated and compared the readability scores and some linguistic features for them. I organised them using a pandas data frame and saved them to a CSV file so you can download them yourself if you want to play with them too.
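If you want to play with the file, a minimal sketch of computing the change in one score per section might look like this. It uses only the standard library and reads a tiny made-up sample rather than the real file. The column names (`section`, `previous_flesch_reading_ease`, `current_flesch_reading_ease`) are assumptions, so check the header of the actual CSV before relying on them.

```python
import csv
import io

# Made-up sample rows for illustration only; these are not real results.
SAMPLE = """section,previous_flesch_reading_ease,current_flesch_reading_ease
Example Act s 1,30.0,34.5
Example Act s 2,42.0,41.0
"""


def score_changes(csv_file, metric="flesch_reading_ease"):
    """Return (section, change) pairs, largest increase first."""
    changes = []
    for row in csv.DictReader(csv_file):
        # Change = score in the 2020 Edition minus score in the previous version.
        delta = float(row["current_" + metric]) - float(row["previous_" + metric])
        changes.append((row["section"], round(delta, 2)))
    return sorted(changes, key=lambda pair: pair[1], reverse=True)


print(score_changes(io.StringIO(SAMPLE)))
```

For the downloaded file, something like `gzip.open("data.csv.gz", "rt", newline="")` could be passed in place of the `io.StringIO` sample.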

Data: data.csv.gz (76 KB), a gzipped CSV file containing the source data of 152 sections, their content in the 2020 Rev Edn, etc.

Exploring the Data with Streamlit

You can explore the data associated with each section yourself using my PLUS Explorer! If you don’t know which section to start with, you can always click the Random button a few times to survey the different changes made and how they affect the readability scores.

Screenshot of PLUS Section Explorer: https://share.streamlit.io/houfu/plus-explorer/main/explorer.py

You can use my graph explorer to get a macro view of the data. For the readability scores, you will find two graphs:

  1. A graph that shows the distribution of the value changes amongst the sample
  2. A graph that shows an ordered list of the readability scores (from most readable to least readable) and the change in score (if any) that the section underwent in the 2020 Edition.

You can even click on a data point to go directly to its page on the section explorer.

Screenshot of PLUS graph explorer: https://share.streamlit.io/houfu/plus-explorer/main/graphs.py

This project allowed me to revisit Streamlit, and I am proud to report that it’s still easy and fun to use. I still like it more than Jupyter Notebooks. I tried using ipywidgets to create the form to input data for this project, but I found it downright ugly and not user-friendly. If my organisation forced me to use Jupyter, I might reconsider it, but I wouldn’t be using it for myself.

Streamlit — works out of the box and is pretty too. Here are some features that were new to me since I last used Streamlit probably a year ago:

Pretty Metric Display

Metric display from Streamlit

My dear friends, this is why Streamlit is awesome. You might not be able to create a complicated web app or a game using Streamlit. However, Streamlit’s creators know what is essential or useful for a data scientist and provide it with a simple function.

The code to make the wall of stats (including their changes) is pretty straightforward:

    st.subheader('Readability Statistics')
    # Create three columns
    flesch, fog, ari = st.columns(3)

    # Create each column
    flesch.metric("Flesch Reading Ease",
                  dataset["current_flesch_reading_ease"][section_explorer_select],
                  dataset["current_flesch_reading_ease"][section_explorer_select]
                  - dataset["previous_flesch_reading_ease"][section_explorer_select])

    # For Fog and ARI, the lower the better, so delta colour is inverse
    fog.metric("Fog Scale",
               dataset["current_gunning_fog"][section_explorer_select],
               dataset["current_gunning_fog"][section_explorer_select]
               - dataset["previous_gunning_fog"][section_explorer_select],
               delta_color="inverse")

    ari.metric("Automated Readability Index",
               dataset["current_ari"][section_explorer_select],
               dataset["current_ari"][section_explorer_select]
               - dataset["previous_ari"][section_explorer_select],
               delta_color="inverse")

Don’t lawyers deserve their own tools?

Now Accepting Arguments

Streamlit apps are very interactive (I came close to creating a board game using Streamlit). Streamlit used to suffer from a significant limitation — except for the consumption of external data, you can’t interact with it from outside the app.

It’s still experimental, but you can access arguments in its address just like a URL-encoded HTML form. Streamlit has also made this simple, so you don’t have to bother too much about encoding your URL correctly.

I used it to communicate between the graphs and the section explorer. Each section has its address, and the section explorer gets the name of the act from the arguments to direct the visitor to the right section.

    # Get and parse HTTP request
    query_params = st.experimental_get_query_params()

    # If the keyword is in the address, use it!
    if "section" in query_params:
        section_explorer_select = query_params.get("section")[0]
    else:
        section_explorer_select = 'Civil Law Act 1909 Section 6'

You can also set the address within the Streamlit app to reduce the complexity of your app.

    # Once this callback is triggered, update the address
    def on_select():
        st.experimental_set_query_params(section=st.session_state.selectbox)

    # Select box to choose section as an alternative.
    # Note that the key keyword names the entry in st.session_state
    # where the selected value is stored.
    st.selectbox("Select a Section to explore", dataset.index,
                 on_change=on_select, key='selectbox')

So all you need is a properly formed address for the page, and you can link it using a URL on any webpage. Sweet!
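Here is a minimal sketch of what a "properly formed address" means, using only Python's standard library. It assumes the explorer reads a `section` argument as in the snippet above; the helper names `section_link` and `section_from_link` are made up for this illustration and are not part of Streamlit or of my app.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Address of the section explorer, taken from the screenshot caption above.
EXPLORER_URL = "https://share.streamlit.io/houfu/plus-explorer/main/explorer.py"
DEFAULT_SECTION = "Civil Law Act 1909 Section 6"


def section_link(section_name: str) -> str:
    """Return a link that opens the explorer on one section (hypothetical helper)."""
    # urlencode escapes spaces and other unsafe characters for us.
    return f"{EXPLORER_URL}?{urlencode({'section': section_name})}"


def section_from_link(link: str, default: str = DEFAULT_SECTION) -> str:
    """Read the 'section' argument back out of a link, much like the app does."""
    params = parse_qs(urlsplit(link).query)
    # parse_qs maps each key to a list of values, hence the [0].
    return params.get("section", [default])[0]


link = section_link("Weights and Measures Act 1975 Section 32")
print(link)
print(section_from_link(link))
```

The round trip shows why you rarely need to think about encoding: the spaces in the section name are escaped when the link is built and restored when it is parsed.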

Key Takeaways

Changes? Not so much.

From the list of changes, most of the revisions amount to swapping words for others. For word count, most sections experienced a slight increase or decrease of up to 5 words, and a significant number of sections had no change at all. The word count heatmap lays this out visually.

Unsurprisingly, this produced little to no effect on the readability of the section as computed by the formulas. For Flesch Reading Ease, a vast majority fell within a band of ten points of change, which is roughly a grade or a year of formal education. This is shown in the graph showing the distribution of changes. Many sections are centred around no change in the score, and most are bound within the band as delimited by the red horizontal rulers.

This was similar across all the readability formulas used in this survey (Automated Readability Index, Gunning FOG and Dale Chall).

On the face of it, the 2020 Revision Edition of the laws had little to no effect on the readability of the legislation, as calculated by the readability formulas.

Laws remain out of reach to most people

I was also interested in the raw readability score of each section. This would show how readable a section is.

Since the readability formulas we are considering use years of formal schooling as a gauge, we can use the same measure to locate our target audience. If we use secondary school education as the minimum level of education (in 2020, this would cover over 75% of the resident population), or US Grade 10 for simplicity, we can see which sections fall in or out of this threshold.

Most if not all of the sections in my survey are out of reach for a US Grade 10 student or a person who attained secondary school education. This, I guess, proves the gut feeling of most lawyers that our laws are not readable to the general public in Singapore, and PLUS doesn’t change this.

Take readability scores with a pinch of salt

Suppose you are going to use the Automated Readability Index. In that case, you will need nearly 120 years of formal education to understand an interpretation section of the Point-to-Point Passenger Transport Industry Act.

Section 3 of the Point-to-Point Passenger Transport Industry Act makes for ridiculous reading.

We are probably stretching the limits of a tool made for processing prose in the late 60s. It turns out that many formulas rely on the average number of words per sentence, based on the not so absurd notion that long sentences are hard to read. Unfortunately, many sections are made up of many words in one interminable sentence. This skews the scores significantly and makes the mapping to particular audiences unreliable.
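To see how much one interminable sentence matters, here is a rough sketch of the Automated Readability Index using its published formula: 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. The tokenisation is deliberately naive and is not the tooling I used for the survey, and the sample text is a made-up definition clause, not a quotation from any Act.

```python
import re


def automated_readability_index(text: str) -> float:
    """Score text with the ARI formula (a higher score means harder to read)."""
    # Naive tokenisation: sentences end at '.', '!' or '?'; words are runs
    # of letters and digits. Real tools tokenise far more carefully.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    characters = sum(len(word) for word in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


# A made-up interpretation clause written as one long sentence.
ONE_SENTENCE = (
    "In this Act, a vehicle means a car used to carry passengers for hire; "
    "a driver means a person who drives a vehicle; "
    "and an operator means a person who provides a booking service."
)
# The same words, with each semicolon turned into a full stop.
SPLIT_UP = ONE_SENTENCE.replace(";", ".")

print(round(automated_readability_index(ONE_SENTENCE), 1))
print(round(automated_readability_index(SPLIT_UP), 1))
```

The words and characters are identical in both versions, so the whole difference in score comes from the words-per-sentence term. That is exactly the term a section drafted as a single sentence blows up.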

The fact that some scores don’t make sense when applied in the context of legislation doesn’t invalidate their point that legislation is hard to read. Whatever historical reasons legislation has for being expressed the way it is, it harms the people who have to use it.

In my opinion, the scores are useful to tell whether a person with a secondary school education can understand a piece. This was, after all, what the scores were made for. However, I am more doubtful whether we can derive any meaning from a score of, for example, ARI 120 compared to a score of ARI 40.

Improving readability scores can be easy. Should it?

Singaporean students know that there is no point in studying hard; you have to study smart.

Having realised that the number of words per sentence features heavily in readability formulas, the easiest thing to do to improve a score is to break long sentences up into several sentences.

True enough, breaking up one long sentence into two seems to affect the score profoundly: see Section 32 of the Defence Science and Technology Agency Act 2000. The detailed mark changes section shows that when the final part of subsection 3 is broken off into subsection 4, the scores improve by nearly one grade.

It’s curious why more sections were not broken up this way in the 2020 Revised Edition.

However, breaking long sentences into several short ones doesn’t always improve reading. It’s important to note that such scores focus on linguistic features, not content or meaning. So in trying to game the score, you might be losing sight of what you are writing for in the first place.

Here’s another reason why readability scores should not be the ultimate goal. One of PLUS’s revisions is to remove gendered nouns — chairperson instead of chairman, his or her instead of his only. Trying to replace “his” with “his or her” harms readability by generally increasing the length of the sentence. See, for example, section 32 of the Weights and Measures Act 1975.

You can agree or disagree whether legislation should reflect our values such as a society that doesn't discriminate between genders. (It's interesting to note that in 2013, frequent legislation users were not enthusiastic about this change.) I wouldn't suggest though that readability scores should be prioritised over such goals.

Here’s another point which shouldn’t be missed. Readability scores focus on linguistic features. They don’t consider things like the layout or even graphs or pictures.

A striking example of this is the interpretation section found in legislation. They aren’t perfect, but most legislation users are okay with them. You would use the various indents to find the term you need.

Example of an interpretation section and the use of indents to assist reading.

However, they are ignored because white space, including indents, is not visible to the formula. It appears to the computer like one long sentence, and readability is computed accordingly, read: terrible. This was the provision that required 120 years of formal education to read.

I am not satisfied that readability should be ignored in this context, though. Interpretation sections, despite the creative layout, remain very difficult to read. That’s because they are still text-heavy, and even when read alone, each definition is still a very long sentence.

A design that relies more on graphics and diagrams would probably use fewer words than this. Even though the scores might be meaningless in this context, they would still show up as an improvement.

Conclusion

PLUS might have a noble aim of making laws understandable to Singaporeans, but the survey of the clauses here shows that its effect is minimal. It would be great if drafters referred to readability scores in the future to get a good sense of whether the changes they are making will impact the text. Even if such scores have limitations, they still present a sound and objective proxy of the readability of the text.

I felt that the changes were too conservative this time. An opportunity to look back and revise old legislation will not return for a while (the last time such a project was undertaken was in 1985). Given the scarcity of opportunity, I am not convinced that we should (a) try to preserve historical nuances which very few people can appreciate, or (b) avoid superficial changes in meaning given the advances in statutory interpretation in the last few decades in Singapore.

Beyond using readability scores that focus heavily on text, it would be helpful to consider more legal design — I sincerely believe pictures and diagrams will help Singaporeans understand laws more than endlessly tweaking words and sentence structures.

This study also reveals that it might be helpful to have a readability score for legal documents. You will have to create a study group comprising people with varying education levels, test them on various texts or legislation, then create a machine model that predicts what level of difficulty a piece of legislation might be. A tool like that could probably use machine models that observe several linguistic features: see this, for example.

Finally, while this represents a lost opportunity for making laws more understandable to Singaporeans, the 2020 Revised Edition includes changes that improve the quality of life for frequent legislation users. These include changing all the Acts of Parliament to have a year rather than the historic and quaint chapter numbers, and removing information that is no longer relevant today, such as provisions relating to the commencement of the legislation. As a frequent legislation user, I did look forward to these changes.

It’s just that I wouldn’t be showing them off to my mother any time soon.

#Features #DataScience #Law #Benchmarking #Government #LegalTech #NaturalLanguageProcessing #Python #Programming #Streamlit #JupyterNotebook #Visualisation #Legislation #AGC #Readability #AccesstoJustice #Singapore

Love.Law.Robots. – A blog by Ang Hou Fu


Way back in December 2021, I caught wind of the 2020 Revised Edition of the statutes in Singapore law:

The AGC highlighted that the revised legislation now uses “simpler language”. I was curious about this claim and looked over their list of changes. I was not very impressed with them.

However, I did not want to rely only on my subjective intuition to make that conclusion. I wanted to test it using data science. This meant I had to compare text, calculate the changes' readability statistics, and see what changed.



As I continued with my project of dealing with 5 million Monopoly Junior games, I had a problem. Finding a way to play hundreds of games per second was one thing. How was I going to store all my findings?

Limitations of using a CSV File

Initially, I used a CSV (Comma Separated Values) file to store the results of every game. Using a CSV file is straightforward, as Python and pandas can load them quickly. I could even load them using Excel and edit them using a simple text editor.

However, every time my Streamlit app tried to load a 5GB CSV file as a pandas dataframe, the tiny computer started to gasp and pant. If you try to load a huge file, your application might suffer performance issues or failures. Using a CSV to store a single table seems fine. However, once I attempted anything more complicated, like the progress of several games, its limitations became painfully apparent.

Let’s use SQL instead!

The main alternative to using a CSV is to store data in a database. Of all the databases out there, SQL is the biggest dog of them all. Several applications, from web applications to games, use some flavour of SQL. Python, a “batteries included” programming language, even has a module for processing SQLite — basically a light SQL database you can store as a file.

SQL doesn’t require all its data to be loaded to start using the data. You can use indexes to search data. This means your searches are faster and less taxing on the computer. Doable for a little computer!

Most importantly, you use data in a SQL database by querying it. This means I can store all sorts of data in the database without worrying that it would bog me down. During data extraction, I can aim for the maximum amount of data I can find. Once I need a table from the database, I would ask the database to give me a table containing only the information I wanted. This makes preparing data quick. It also makes it possible to explore the data I have.
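As a small illustration of indexing and querying, here is a sketch using Python's built-in sqlite3 module. The table and column names loosely follow the Game model that appears later in this post, and the rows are made up purely for the example; none of this is my actual data.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; pass a file name
# such as "games.db" to sqlite3.connect() to store it on disk instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game ("
    " id INTEGER PRIMARY KEY,"
    " num_of_players INTEGER,"
    " rounds INTEGER,"
    " winner INTEGER)"
)

# Made-up rows purely for illustration: (players, rounds, winner).
toy_games = [(2, 14, 0), (2, 9, 1), (3, 11, 0), (2, 12, 0), (4, 8, 3)]
conn.executemany(
    "INSERT INTO game (num_of_players, rounds, winner) VALUES (?, ?, ?)",
    toy_games,
)

# An index lets SQLite look up matching rows without scanning the whole table.
conn.execute("CREATE INDEX idx_game_players ON game (num_of_players)")

# Ask only for the table we need: wins per seat in two-player games.
wins_by_seat = conn.execute(
    "SELECT winner, COUNT(*) FROM game"
    " WHERE num_of_players = ?"
    " GROUP BY winner ORDER BY winner",
    (2,),
).fetchall()
print(wins_by_seat)
conn.close()
```

The query returns a tiny summary table, however many rows the game table holds, and that is the part that makes a database kinder to a little computer than loading a whole CSV.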

Why I procrastinated on learning SQL

To use a SQL database, you have to write operations in the SQL language, which looks like a sentence of gobbledygook to the untrained eye.

https://imgs.xkcd.com/comics/exploits_of_a_mom.png

I'll stick to symbols in my children's names please, thanks.

SQL is also in the top 10 on the TIOBE index of programming languages. Higher than Typescript, at the very least.

I have heard several things about SQL — it’s similar to Excel formulas. However, I dreaded learning a new language to dig a database. The landscape of SQL was also very daunting. There are several “flavours” of SQL, and I was not sure of the difference between PostgreSQL, MySQL or MariaDB.

There are ORM (Object-relational mapping) tools for SQL for people who want to stick to their programming languages. ORMs allow you to use SQL with your favourite programming language. For Python, the main ORM is SQLAlchemy. Seeing SQL operations in my everyday programming language was comforting at first, but I found the documentation difficult to follow.

Furthermore, I found other alternative databases easier to pick up and deploy. These included MongoDB, a NoSQL database. They rely on a different concept: data is stored in “documents” instead of tables. They also come with a full-featured ORM. For many web applications, the “document” idiom applied well. It wouldn’t make sense, though, if I wanted a table.

Enter SQLModel!

A few months ago, an announcement from the author of FastAPI excited me — he was working on SQLModel. SQLModel would be a game-changer if it were anything like FastAPI. FastAPI featured excellent documentation. It was also so developer-friendly that I didn’t worry about the API aspects in my application. If SQLModel could reproduce such an experience for SQL as FastAPI did for APIs with Python, that would be great!

SQLModel: SQL databases in Python, designed for simplicity, compatibility, and robustness.

As it turned out, SQLModel was a very gentle introduction to SQL. The following code creates a connection to an SQLite database you would create in your file system.

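    from typing import Optional

    from sqlmodel import SQLModel, create_engine, Session
    from sqlalchemy.engine import Engine

    # Module-level engine shared by the rest of the program
    engine: Optional[Engine] = None

    def create_DB_engine(filename: str):
        global engine
        engine = create_engine(f'sqlite:///{filename}')
        # Create a table for every model that has been defined so far
        SQLModel.metadata.create_all(engine)
        return engine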

Before creating a connection to a database, you may want to make some “models”. These models get translated to a table in your SQL database. That way, you will be working with familiar Python objects in the rest of your code while the SQLModel library takes care of the SQL parts.

The following code defines a model for each game played.

    from typing import Optional

    from sqlmodel import SQLModel, Field

    class Game(SQLModel, table=True):
        id: Optional[int] = Field(default=None, primary_key=True)
        name: str = Field()
        parent: Optional[str] = Field()
        num_of_players: int = Field()
        rounds: Optional[int] = Field()
        turns: Optional[int] = Field()
        winner: Optional[int] = Field()

So, every time the computer finished a Monopoly Junior game, it would store the statistics as a Game. (It’s the line where the result is assigned.)

    import logging
    import uuid
    from typing import List, Literal, Tuple

    def play_basic_game(num_of_players: Literal[2, 3, 4], turns: bool) -> Tuple[Game, List[Turn]]:
        if num_of_players == 3:
            game = ThreePlayerGame()
        elif num_of_players == 4:
            game = FourPlayerGame()
        else:
            game = TwoPlayerGame()
        game_id = uuid.uuid4()
        logging.info(f'Game {game_id}: Started a new game.')
        end_turn, game_turns = play_rounds(game, game_id, turns)
        winner = decide_winner(end_turn)
        # Store the statistics of the finished game as a Game
        result = Game(name=str(game_id), num_of_players=num_of_players,
                      rounds=end_turn.get_value(GameData.ROUNDS),
                      turns=end_turn.get_value(GameData.TURNS),
                      winner=winner.value)
        logging.debug(f'Game {game_id}: {winner} is the winner.')
        return result, game_turns

After playing a bunch of games, these games get written to the database in an SQL session.

    def write_games(games: List[Game]):
        with Session(engine) as session:
            for game in games:
                session.add(game)
            session.commit()

Reading your data from the SQLite file is relatively straightforward as well. If you were writing an API with FastAPI, you could pass the model as a response model, and you can get all the great features of FastAPI directly.

I had already stored some Game data in “games.db” in the following code. I created a backend server that would read these files and return all the Turns belonging to a Game. This required me to select all the turns in the game that matched a unique id. (As you might note in the previous section, this is a UUID.)

    games_engine = create_engine('sqlite:///games.db')

    @app.get("/game/{game_name}", response_model=List[Turn])
    def get_turns(game_name: str):
        """
        Get a list of turns from :param game_name.
        """
        with Session(games_engine) as session:
            check_game_exists(session, game_name)
            return session.exec(select(Turn).where(Turn.game == game_name)).all()

Limitations of SQLModel

Of course, being marked as version “0.0.6”, this is still early days in the development of SQLModel. The documentation is also already helpful but still a work in progress. A key feature that I would be looking out for is migrating data, most notably for different software versions. This problem can get very complex quickly, so anything would be helpful.

I also found creating the initial tables very confusing. You create models by implementing descendants of the SQLModel class in the library, make sure these models have been imported so that the main SQLModel class knows about them, and then create the tables by calling SQLModel.metadata.create_all(engine). This doesn’t appear very Pythonic to me.

Would I continue to use SQLModel?

There are use cases where SQLModel will now be applicable. My Monopoly Junior data project is one beneficiary. The library provided a quick, working solution to use a SQL database to store results and data without needing to get too deep and dirty into SQL. However, if your project is more complicated, such as a web application, you might seriously consider its current limitations.

Even if I might not have the opportunity to use SQLModel in the future, there were benefits from using it now:

  • I became more familiar with the SQL language after seeing it in action and also comparing an SQL statement with its SQLModel equivalent in the documentation. Now I can write simple SQL statements!
  • SQLModel is described as a combination of SQLAlchemy and Pydantic. Once I realised that many of the features of SQLModel are extensions of SQLAlchemy’s, I was able to read and digest SQLAlchemy’s documentation and library. If I can’t find what I want in SQLModel now, I could look into SQLAlchemy.

Conclusion

Storing my Monopoly Junior games’ game and turn data became straightforward with SQLModel. However, the most significant takeaway from playing with SQLModel is my increased experience using SQL and SQLAlchemy. If you have ever had difficulty getting into SQL, SQLModel can smooth the road. How has your experience with SQLModel been? Feel free to share with me!

#Programming #DataScience #Monopoly #SQL #SQLModel #FastAPI #Python #Streamlit



This is the story of my lockdown during a global pandemic. [ Cue post-apocalypse 🎼]

Amid a global pandemic, schools and workplaces shut down, and families had to huddle together at home. There was no escape. Everyone was getting on each others' nerves. Running out of options, we played Monopoly Junior, which my daughter recently learned how to play. As we went over several games, we found the beginner winning every game. Was it beginner's luck? I seethed with rage. Intuitively, I knew the odds were against me. I wasn't terrible at Monopoly. I had to prove it with science.

How would I prove it with science? This sounds ridiculous at first, but you'd do it by playing 5 million Monopoly Junior games. Yes, once you play enough games, suddenly your anecdotal observations become INSIGHT. (Senior lawyers might also call it experience, but I am pretty sure that they never experienced 5 million cases.)

This is the story of how I tried different ways to play 5 million Monopoly JR games.

What is Monopoly Junior?

The cover for the Monopoly Junior board game.

Most people know what the board game called Monopoly is. You start with a bunch of money, and the goal is to bankrupt everyone else. To bankrupt everyone else, you go round the board and purchase properties, build hotels and take everyone else's money through rent-seeking behaviour (the literal kind). It's a game for eight and up. Mainly because you have to count with hundreds and thousands, and you would need the stamina to last an entire night to crush your opposition.

If you are, like my daughter, five years old, Monopoly JR is available. Many complicated features in Monopoly (for example, counting past 30, auctions and building houses and hotels) are removed for young players to join the game. You also get a smaller board and less money. Unless you receive the right chance card, you don't have a choice once you roll the dice. You buy, or you pay rent on the space you land on.

As it's a pretty deterministic game, it's not fun for adults. On the other hand, the game spares you the ignominy of mano-a-mano battles with your kids by ending the game once anyone becomes bankrupt. Count the cash, and the winner is the richest. It ends much quicker.

Hasbro Monopoly Junior Game, A69843480 (Amazon.sg: Toys). Instead of letting computers have all the fun, now you can play the game yourself! (I earn a commission when you buy from this Amazon affiliate link.)

Determining the Approach, Realising the Scale

Ombre Balloons. Photo by Amy Shamblen / Unsplash

There is a pretty straightforward way to write this program. Write the rules of the game in code and then play it five million times. No sweat!

However, if you wanted to prove my hypothesis that players who go first are more likely to win, you would need to do more:

  • We need data. At the minimum, you would want to know who won in the end. This way, you can find out the odds you'd win if you were the one who goes first (like my daughter) or the last player (like me).
  • As I explored the data more, interesting questions began to pop up. For example, given any position in the game, what are the odds of winning? What kind of events cause the chances of winning to be altered significantly?
  • The data needs to be stored in a manner that allows me to process and analyse efficiently. CSV?
  • It'd be nice if the program would finish as quickly as possible. I'm excited to do some analysis, and waiting for days can be burdensome and expensive! (I'm running this on a DigitalOcean Droplet.)

The last point becomes very troublesome once you realise the scale of the project. Here's an illustration: you can run many games (20,000 is a large enough sample, maybe) to get the odds of a player winning a game. Imagine you decide to do this after every turn for each game. If the average game had, say, 100 turns (a random but plausible number), you'd be playing 20,000 X 100 = 2 million additional games already! Oh, let's say you also want to play three-player and four-player games too...

It looks like I have got my work cut out for me!

Route One: Always the best way?

I decided to go for the most obvious way to program the game. In hindsight, what seemed obvious isn't obvious at all. Having started programming using object-orientated languages (like Java), I decided to create classes for all my data.

An example of how I used a class to store my data. Not rocket science.

The classes also came in handy when I wrote the program to move the game.

Object notation makes programming easy.

It was pretty fast, too: 20,000 two-player games took less than a minute, so 5 million two-player games would take about 4 hours. I used Python's standard multiprocessing module so that several CPUs could each run games on their own.
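For readers who have not used it before, this is roughly what handing games to a pool of worker processes looks like. It is a sketch and not my actual program: `play_one_game` is a stand-in that picks a winner at random, and the pool size and chunk size are arbitrary choices.

```python
import random
from multiprocessing import Pool


def play_one_game(seed: int) -> int:
    """Stand-in for a real game: return the index of the winning player."""
    rng = random.Random(seed)   # a private random generator per game
    return rng.randrange(2)     # toy rule: one of two players wins at random


def play_many(n_games: int, processes: int = 4) -> list:
    """Spread n_games over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # chunksize hands each worker a batch of games at a time.
        return pool.map(play_one_game, range(n_games), chunksize=500)


if __name__ == "__main__":
    winners = play_many(20_000)
    print(f"Player 0 won {winners.count(0)} of {len(winners)} toy games")
```

Each worker plays whole games from start to finish, so the workers never need to talk to each other while a game is running.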

Yes, my computers are named after cartoon characters.

After working this out, I decided to experiment with being more “Pythonic”. Instead of using classes, I would use Python dictionaries to store data. This also had the side effect of flattening the data, so you would not need to access an object within an object. With a dictionary, I could easily use pandas to save a CSV.
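To give a flavour of the flattened approach, here is a small sketch. The key names, the starting cash and the board size are placeholders invented for this example rather than my real game data, and I have swapped pandas for the standard csv module so the snippet has no dependencies.

```python
import csv
import io


def new_game(num_of_players: int) -> dict:
    """Build one flat dictionary holding the whole game state."""
    game = {"round": 0, "current_player": 0}
    for p in range(num_of_players):
        # Keys like 'player_0_cash' replace attributes like players[0].cash.
        game[f"player_{p}_position"] = 0
        game[f"player_{p}_cash"] = 20   # placeholder starting cash
    return game


def move_player(game: dict, player: int, steps: int, board_size: int = 24) -> dict:
    """Return a copy of the game with one player moved around the board."""
    moved = dict(game)
    key = f"player_{player}_position"
    moved[key] = (game[key] + steps) % board_size   # wrap around the board
    return moved


# Every snapshot is a flat row, so writing a CSV needs no conversion step.
snapshots = [new_game(2)]
snapshots.append(move_player(snapshots[-1], player=0, steps=5))

buffer = io.StringIO()   # stands in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=list(snapshots[0]))
writer.writeheader()
writer.writerows(snapshots)
print(buffer.getvalue())
```

Because nothing is nested, each dictionary lines up with one row of a table, which is also the shape pandas expects when you hand it a list of dictionaries.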

This snippet shows how the basic data for a whole game is created.

Instead of using object notation to find data about a player, the data is accessed by using its key in a python dictionary.

The same code to move player is implemented for a dictionary.

I didn't think it would make a difference honestly. However, I found the speed up was remarkable: almost 10 times! 20,000 two-player games now took 4 seconds. The difference between 4 seconds and less than a minute is a trip to the toilet, but for 5 million games, it was reduced from 4 hours to 16 mins. That might save me 20 cents in Droplet costs!

Colour me impressed.

Here are some lessons I learnt in the first part of my journey:

  • Your first idea might not always be the best one. It's best to iterate, learn new stuff and experiment!
  • Don't underestimate standard libraries and utilities. With less overhead, they might be able to do a job really fast.
  • Scale matters. A difference of 1 second multiplied by a million is a lot. (More than 11.5 days, FYI.)
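The back-of-the-envelope sums behind these numbers are easy to reproduce. This sketch only uses figures already quoted in this post; `projected_seconds` is a throwaway helper written for the illustration.

```python
def projected_seconds(total_games: int, sample_games: int, sample_seconds: float) -> float:
    """Scale a timing measured on a sample up to the full run."""
    return total_games / sample_games * sample_seconds


# 20,000 dictionary-based games took 4 seconds, so 5 million take:
dict_run = projected_seconds(5_000_000, 20_000, 4)
print(f"{dict_run / 60:.1f} minutes")

# One extra second per game over a million games, expressed in days.
extra_days = 1_000_000 / 86_400
print(f"{extra_days:.2f} days")

# Re-estimating the odds after every turn: 20,000 games x 100 turns each.
additional_games = 20_000 * 100
print(f"{additional_games:,} additional games")
```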

A Common-Sense Guide to Data Structures and Algorithms, Second Edition, by Jay Wengrow. Big O notation can make your code faster by orders of magnitude. Get the hands-on info you need to master data structures and algorithms for your daily work.

Learn new stuff: this book was useful in helping me dive deeper into the underlying work that programmers do.

Route 3: Don't play games, send messages

After my life-changing experiment, I got really ambitious and wanted to try something even harder. I then got the idea that playing a game from start to finish isn't the only way for a computer to play Monopoly JR.

https://media.giphy.com/media/eOAYCCymR0OqSN4aiF/giphy.gif

As time progressed in lockdown, I explored the idea of using microservices (more specifically, I read this book). So instead of following the rules of the game, the program would send “messages” depicting what was going on in a game. The computer would pick up these messages, process them and hand them off. In other words, instead of a pool of workers each playing their own game, the workers would get their commands (or jobs) from a queue. It didn't matter what game they were playing. They just need to perform their commands.

A schematic of what messages or commands a worker might have to process to play a game of Monopoly JR. When the NewGameCommand is executed, it writes some information to a database and then puts a PlayRoundCommand in the queue for the next worker.
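Stripped of the database and the separate worker processes, the idea can be sketched in a single loop. This is a simplified illustration and not my actual code: the command classes only borrow their names from the schematic, the game ends by a made-up coin-flip rule, and `rounds_per_command` is a hypothetical knob for how much work a single message carries.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class NewGameCommand:
    game_id: int


@dataclass
class PlayRoundCommand:
    game_id: int
    rounds_played: int


def run_queue(n_games: int, rounds_per_command: int = 1,
              max_rounds: int = 30, seed: int = 0):
    """Process commands until the queue is empty; return (finished, messages)."""
    rng = random.Random(seed)
    queue = deque(NewGameCommand(g) for g in range(n_games))
    finished, messages = {}, 0
    while queue:
        command = queue.popleft()   # a worker picks up the next job
        messages += 1
        if isinstance(command, NewGameCommand):
            # Set the game up, then hand its first round to the queue.
            queue.append(PlayRoundCommand(command.game_id, 0))
            continue
        rounds = command.rounds_played
        for _ in range(rounds_per_command):
            rounds += 1
            # Toy ending rule: a 10% chance per round that someone goes bankrupt.
            if rng.random() < 0.1 or rounds >= max_rounds:
                finished[command.game_id] = rounds
                break
        else:
            # Nobody is bankrupt yet: queue up the next batch of rounds.
            queue.append(PlayRoundCommand(command.game_id, rounds))
    return finished, messages


finished, messages = run_queue(200)
print(f"{len(finished)} games needed {messages} messages")
```

Notice that no worker ever holds on to a game. Everything it needs travels inside the command, which is what makes each job independent, and also why every single round costs a trip through the queue.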

So, what I had basically done was chop the game into several independent parts. This was in response to certain drawbacks I observed in the original code. Some games took longer than others and this held back a worker as it furiously tried to finish it before it could move on to the next one. I hoped that it could finish more games quickly by making independent parts. Anyway, since they were all easy jobs, the workers would be able to handle them quickly in rapid succession, right?

It turns out I was completely wrong.

It was so traumatically slow that I had to reduce the number of games from 20000 to 200 just to take this screenshot.

Instead of completing hundreds or thousands of games in a second, it took almost 1 second to complete a game. You wouldn't be able to imagine how long 5 million seconds would now take. How much would I have to pay DigitalOcean again for their droplets? (Maybe $24.)

What slowed the program?

Tortoise 🐢 Photo by Craig Pattenaude / Unsplash

I haven't been able to drill down to the cause, but this is my educated guess: The original command might be simple and straightforward, but I might have introduced a laborious overhead: operating a messaging system. As the jobs finished pretty quickly, the main thread repeatedly asked what to do next. If you are a manager, you might know why this is terrible and inefficient.

Experimenting again, I found that the program improved its times substantially once I allowed it to play several rounds in one command rather than a single round. I pushed it so hard that it was literally completing 1 game in 1 command. It never reached the heights of the original program using the Python dictionaries though. At scale, the overhead does matter.

Best Practices (Dask documentation). Thinking about multiprocessing can be counterintuitive for a beginner. I found Dask's documentation helpful in starting out.

So, is sending messages a lousy solution? The experiment shows that when using one computer, its performance is markedly poor. However, there are circumstances when sending messages is better. If several clients and servers are involved in the program, and you are unsure how reliable they are, then a program with several independent parts is likely to be more robust. Anyway, I also think it makes the code a bit more readable as it focuses on the process.

I now think a better solution is to improve my original program.

Conclusion: Final lessons

Writing code, it turns out, isn't just a matter of style. There are serious effects when different methods are applied to a problem. Did it make the program faster? Is it easier to read? Would it be more conducive to multiprocessing or asynchronous operations? It turns out that there isn't a silver bullet. It depends on your current situation and resources. After all this, I don't believe I have become an expert, but it has definitely given me a better appreciation of the issues from an engineering perspective.

Now I only have to work on my previous code😭. The fun's over, isn't it?

#Programming #blog #Python #DigitalOcean #Monopoly #DataScience

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu


October is drawing to a close, and so the end of the year is almost upon us. It's hard to fathom that I have been stuck working from home for nearly 20 months now. Some countries seemed to have moved on, but I doubt we'd do so in Singapore. Nevertheless, it's time for reflection and thinking about what to do about the future.

What I am reading now

The Importance of Being Authorised: A recent case shows that practising law as an unauthorised person can have serious effects. What does this hold for other people who may be interested in alternative legal services? (Love.Law.Robots., Houfu)

An in-depth analysis of a rare and recent local decision touching on this point.

CLM Simplified: Efficient Contracting for Law Departments, by Lucy Endel Bassli (Amazon.sg)

I earn a commission from purchases made with this link.

  • Do you need a lot of coding or technical skills to use AI? This commentator from Today Online highlights Hugging Face, Gradio and Streamlit and doesn't think so. So have we finally resolved the question of whether lawyers need to code? I still think the answer is very nuanced: one person can compile a graph using free tools quickly, but making it production-ready is tough and won't be free. I agree more with the premise that we need to better empower students and others to “seek out AI services and solutions on their own”. In the legal field, this starts with having more data out there available for all to use.

Why you don’t need to be an expert to use AI any more: Keeping up with the latest developments in artificial intelligence is like drinking from the proverbial fire hose, as a recent 188-page overview by two tech investors Ian Hogarth and Nathan Benaich would attest. (TODAYonline)

Post Updates

This week saw the debut of my third feature — “It's Open. It's Free — Public Legal Information in Singapore”. I have been working on it for several months, and it's still a work in progress. I made it as part of my research into what materials to scrape, and I've hinted at the project several times recently. In due course, I want to add more obscure courts and tribunals, including the PDPC and others. You can check the page regularly, or I would mention it here from time to time. I welcome your comments and suggestions on what I should cover.

That's it!

Family Playing A Board Game. An Asian family (adult male and female and two adolescents, male and female) sitting around a coffee table playing a board game. Photographer: Bill Branson. Photo by National Cancer Institute / Unsplash

At the start of this newsletter, I mentioned that November is the month to be looking forward. 😋 Unfortunately, for the time being, I will be racing to finish articles that I have wanted to write since the pandemic started. This includes my observations from playing Monopoly Junior 5 million times. You can look at a sneak peek of the work in my Streamlit app (if it runs).

In the meantime, I will be weighing the pros and cons of using MongoDB or SQL for my scraping project. Storing text and downloads on S3 is pretty straightforward, but where should I store the metadata of the decisions? If anyone has an opinion, I could use some advice!

Thanks for reading, and feel free to reach out!

#Newsletter #ArtificalIntelligence #BookReview #Contracts #DataMining #Law #DataScience #LegalTech #Programming #Singapore #Streamlit #WebScraping


Introduction

Over the course of 2019 and 2020, I embarked on a quest to apply the new things I was learning in data science to my field of work in law.

The dataset I chose was the enforcement decisions from the Personal Data Protection Commission in Singapore. The reason I chose it was quite simple. I wanted a simple dataset that covers a limited number of issues and is pretty much independent (not affected by stare decisis or extensive references to legislation or other cases). Furthermore, during that period, the PDPC was furiously issuing several decisions.

This experiment proved to be largely successful, and I learned a lot from the experience. This post gathers all that I have written on the subject at the time. I felt more confident to move on to more complicated datasets like the Supreme Court Decisions, which feature several of the same problems faced in the PDPC dataset.

Since then, a lot has changed. For example, the website has changed, so your extraction methods would be different. I haven't really maintained the code, so it is not intended for creating your own dataset and analysis today. However, the techniques are still relevant, and I hope they still point you in a good direction.

Extracting Judgement Data

Dog & Baltic Sea. Photo by Janusz Maniak / Unsplash

The first step in any data science journey is to extract data from a source. In Singapore, one can find judgements from courts on websites for free. You can use such websites as the source of your data. API access is usually unavailable, so you have to look at the webpage to get your data.

It's still possible to download everything by clicking on it. However, you wouldn't be able to do this for an extended period of time. Automate the process by scraping it!

Automate Boring Stuff: Get Python and your Web Browser to download your judgements

I used Python and Selenium to access the website and download the data I wanted. This included the actual judgement. Metadata such as the hearing date is also conveniently available on the website, so you should try to grab it at the same time. In Automate Boring Stuff, I discussed my ideas on how to obtain such data.
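As a rough illustration of grabbing the link and the metadata in one pass, here is a minimal sketch. To keep it self-contained, it uses the standard library's `html.parser` on a page you have already fetched rather than Selenium, and the markup it expects (an `li` with class `decision` containing a link and a `span` with class `date`) is purely an assumption. The real website's structure is different and has changed since.

```python
from html.parser import HTMLParser


class DecisionListParser(HTMLParser):
    """Collects title, URL and date from a hypothetical listing page."""

    def __init__(self):
        super().__init__()
        self.decisions = []
        self._current = None  # the decision being built, if any
        self._field = None    # which field the next text belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "decision":
            self._current = {"title": "", "url": None, "date": ""}
        elif self._current is not None and tag == "a":
            self._current["url"] = attrs.get("href")
            self._field = "title"
        elif (self._current is not None and tag == "span"
              and attrs.get("class") == "date"):
            self._field = "date"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._current is not None:
            self.decisions.append(self._current)
            self._current = None


def parse_decision_list(html):
    """Return a list of dicts, one per decision found in the page."""
    parser = DecisionListParser()
    parser.feed(html)
    return parser.decisions
```

Each dictionary holds the link to the judgement together with its metadata, so the download step and the metadata step never get out of sync.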

Processing Judgement Data in PDF

Photo by Pablo Lancaster Jones / Unsplash

Many judgements which are available online are usually in #PDF format. They look great on your screen but are very difficult for robots to process. You will have to transform this data into a format that you can use for natural language processing.

I took a lot of time on this as I wanted the judgements to read like a text. The raw text that most (free) PDF tools can provide you consists of joining up various text boxes the PDF tool can find. This worked all right for the most part, but if the text ran across the page, it would get mixed up with the headers and footers. Furthermore, the extraction revealed lines of text, not paragraphs. As such, additional work was required.

Firstly, I used regular expressions. This allowed me to detect and remove unwanted data such as carriage returns, headers and footers wherever the raw text matched the search pattern.
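A minimal sketch of the regular expression approach might look like this. The two patterns are assumptions for illustration only: one for page footers such as "Page 3 of 12" and one for a running header in the form of a citation like "[2019] SGPDPC 5". The patterns I actually used depended on what the raw text looked like.

```python
import re

# Assumed footer shapes: "3", "Page 3", "3 of 12", "Page 3 of 12"
PAGE_FOOTER = re.compile(r"^\s*(?:Page\s+)?\d+(?:\s+of\s+\d+)?\s*$", re.IGNORECASE)
# Assumed running header: a bare citation on its own line
CITATION_HEADER = re.compile(r"^\s*\[\d{4}\]\s+SGPDPC\s+\d+\s*$")


def clean_lines(raw_text):
    """Drop carriage returns, blank lines, headers and footers."""
    # Normalise Windows and old Mac line endings first
    text = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for line in text.split("\n"):
        if not line.strip():
            continue
        if PAGE_FOOTER.match(line) or CITATION_HEADER.match(line):
            continue
        kept.append(line.strip())
    return kept
```

The weakness shows up quickly: a line in the body that happens to contain only a number would be thrown away too, which is part of why this method alone was not enough.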

I then decided to use machine learning to train my computer to decide whether to keep a line or reject it. This required me to create a training dataset and tag which lines should be kept as the text. This was probably the fastest machine-learning exercise I ever came up with.

However, I didn't feel I was getting significant improvements from these methods. The final solution was actually fairly obvious. Using the formatting information of how the text boxes were laid out in the PDF, I could make reasonable conclusions about which text was a header or footer, a quote or the start of a paragraph. It was great!
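To give a flavour of the layout approach, here is a minimal sketch under stated assumptions. `TextBox` is a hypothetical stand-in for whatever your PDF tool returns, coordinates are measured from the bottom-left of the page, and the margin and indent thresholds are made-up numbers that you would tune for your own documents.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    x0: float  # left edge
    y0: float  # bottom edge
    y1: float  # top edge
    text: str


def classify(box, page_height, left_margin, margin=50.0, indent=20.0):
    """Guess the role of a text box from its position on the page."""
    if box.y0 >= page_height - margin:
        return "header"           # sits entirely in the top margin
    if box.y1 <= margin:
        return "footer"           # sits entirely in the bottom margin
    if box.x0 >= left_margin + 2 * indent:
        return "quote"            # deeply indented block
    if box.x0 >= left_margin + indent:
        return "paragraph_start"  # first-line indent
    return "body"
```

Unlike the regular expressions, this never looks at what the text says, only where it sits, so a stray number in the body is no longer mistaken for a page number.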

Natural Language Processing + PDPC Decisions = 💕

Photo by Moritz Kindler / Unsplash

With a dataset ready to be processed, I decided that I could finally use some of the cutting-edge libraries I have been raring to use, such as #spaCy and #HuggingFace.

One of the first experiments was to use spaCy's rule-based Matcher to extract enforcement information from the summary provided by the authorities. As the summary was fairly formulaic, it was possible to extract whether the authorities imposed a penalty or took other enforcement actions.
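The idea of rule-based extraction can be sketched without spaCy. The example below swaps in plain regular expressions so that it runs with the standard library alone, and the phrasing it looks for ("financial penalty of $...", "warning", "directions") is an assumption about how such summaries are worded, not a record of the rules I wrote.

```python
import re

PENALTY = re.compile(r"financial penalty of \$([\d,]+)", re.IGNORECASE)
OTHER_ACTIONS = {
    "warning": re.compile(r"\bwarning\b", re.IGNORECASE),
    "directions": re.compile(r"\bdirections?\b", re.IGNORECASE),
}


def extract_enforcement(summary):
    """Return the penalty amount (if any) and other actions mentioned."""
    result = {"penalty": None, "actions": []}
    match = PENALTY.search(summary)
    if match:
        # "5,000" -> 5000
        result["penalty"] = int(match.group(1).replace(",", ""))
    for name, pattern in OTHER_ACTIONS.items():
        if pattern.search(summary):
            result["actions"].append(name)
    return result
```

A token-based matcher is sturdier than this because it works on words and their attributes rather than raw characters, but the principle of writing down the formula and letting the machine find it is the same.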

I also wanted to undertake key NLP tasks using my prepared data. This included tasks like Named Entity Recognition (does the sentence contain any special entities), summarisation (extract key points in the decision) and question answering (if you ask the machine a question, can it find the answer in the source?). To experiment, I used the default pipelines from Hugging Face to evaluate the results. There are clearly limitations, but the results are very exciting as well!

Visualisations

Photo by Annie Spratt / Unsplash

Visualisations are very often the result of the data science journey. Extracting and processing data can be very rewarding, but you would like to show others how your work is also useful.

One of my first aims in 2019 was to show how PDPC decisions have changed since they were first issued in 2016. Decisions became greater in number, more frequent, and shorter in length. There was clearly a shift and an intensifying of effort in enforcement.

I also wanted to visualise how the PDPC was referring to its own decisions. Such visualisation would allow one to see which decisions the PDPC was relying on to explain its decisions. This would definitely help to narrow down which decisions are worth reading in a deluge of information. As such, I created a network graph and visualised it. I called the result my “Star Map”.
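Before anything can be drawn, the graph itself has to be built. Here is a minimal sketch of that step, assuming each decision's text is keyed by its citation and that references to other decisions appear in the form "[2019] SGPDPC 5". The function names are hypothetical, and a plotting library would take over from the resulting dictionary.

```python
import re
from collections import Counter

CITATION = re.compile(r"\[\d{4}\]\s+SGPDPC\s+\d+")


def build_citation_graph(decisions):
    """Map each decision to the sorted list of other decisions it cites."""
    graph = {}
    for citation, text in decisions.items():
        cited = set(CITATION.findall(text))
        cited.discard(citation)  # a decision naming itself is not a reference
        graph[citation] = sorted(cited)
    return graph


def most_cited(graph):
    """Rank decisions by how many other decisions refer to them."""
    counts = Counter(
        target for targets in graph.values() for target in targets
    )
    return counts.most_common()
```

The decisions at the top of `most_cited` are the bright stars of the map: the ones the PDPC keeps coming back to.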

Data continued to be very useful in informing the conclusions I drew about the enforcement situation in Singapore. For example, how great an impact would the increase in maximum penalties in the latest amendments to the law have? Short answer: Probably not much, but they still have a symbolic effect.

What's Next?

As mentioned, I have been focusing on other priorities, so I haven't been working on PDPC-Decisions for a while. However, my next steps were:

  • I wanted to train a machine to process judgements for named entity recognition and summarization. For the second task, one probably needs to use a transformer in a pipeline and experiment with what works best.
  • Instead of using Selenium and Beautiful Soup, I wanted to use scrapy to create a sustainable solution to extract information regularly.

Feel free to let me know if you have any comments!

#Features #PDPC-Decisions #PersonalDataProtectionAct #PersonalDataProtectionCommission #Decisions #Law #NaturalLanguageProcessing #PDFMiner #Programming #Python #spaCy #tech


I really like Notion

I have a confession to make. I really like Notion. Notion, at its most basic, is a note-taking application. It allows you to create pages that contain various content, like web links, markdown, checklists, embedded content and so on. I am not alone in liking Notion either: look for the #notion hashtag on Twitter, and you'd find people rabidly professing their love. They aren't celebrities trying to sell something, but normal and authentic people who love a product.

Notion – The all-in-one workspace for your notes, tasks, wikis, and databases: A new tool that blends your everyday work apps into one. It’s the all-in-one workspace for you and your team. (Notion)

My wife, who is a bit flabbergasted that I recommend Notion for everything, told me frankly, “Isn't this like Todoist? Or some journal app? There are dozens of such programs out there on the Internet. Free or paid.”

It's true. You can find dozens of apps that can provide you with a Kanban. It's a crowded field.

But Notion is special.

To me, it's a content management system that is very user friendly, yet very powerful at the same time. Like my wife says, the kanban, the markdown, the images, whatever, all those features aren't very interesting. However, you can dive straight into the app and create any of them in Notion. There isn't much to configure or install. You don't have to code anything. To be frank, your interaction with Notion doesn't go further than typing text on a keyboard or enabling some option on a pop-up toolbar.

Is “no-code” for lawyers... or losers?

This brings me to the concept of “no-code” or “low code”. Apparently, lawyers are highly allergic to coding anything, so the idea that you don't have to do any coding is a feature. Among the front runners here is Documate. It is a document assembly service built on docassemble that turns all that “programming” into buildable boxes so you don't have to do any coding. Another non-legal example is Scratch, an MIT project that teaches children how to program using blocks.

Document Automation Software – Documate: Legal document automation software to create powerful workflows that push data into your templates and forms. (Documate)

I haven't used Documate before, so I can't tell you whether it's good. Judging by its community though, it looks great. If you are attracted to its “no-code” premise and terrified of having to learn YAML and Jinja/Python in order to use docassemble, you should definitely give it a go.

As I can code, the idea of “no-code” turns me off though. Being able to tinker with a product is the fun part to me. Telling me I can't code means I cannot fully utilise the product. Once you become familiar with the capabilities of the product, your inability to code starts to look like the true barrier to achieving something. Suddenly, it's your fault, and that feeling really sucks.

It doesn't have to be this way. The opposite of “no-code” or “low code” isn't to code everything. To put it in another way, a product that asks you to do everything yourself is terrible. Ultimately a product has to provide features that you can use to achieve your aims. A particular set of features might be so limiting that it can do only one thing (and maybe do that one thing well). However, a different set of features could be so intriguing that you can use it for everything.

Think of “no-code” or “low code” this way:

  • You hardly ever need to code in Microsoft Word. Yet you can create any kind of document you want. No code ftw!
  • It ain't obvious, but you do some coding when you input a formula in Microsoft Excel. However, this “low code” environment allows you to perform calculations and filters and then use the output to visualize data. Excel is a prototypical database, a custom program and a report generator. Oops, sorry, Excel is a spreadsheet program.

Why I would use Excel for my Contract Management System: How do I get on this legal technology wave? Where do I even start? A “contract management system” or a “document management system” (“CMS”) is a good place. Business operations are not affected, but the legal department can get their hands dirty and show results for it. If you would (Love.Law.Robots., Houfu)

Some good advice: consider using stuff you have already installed for innovation rather than reinventing the wheel.

It's important to note that these Microsoft Office programs don't advertise themselves as “no-code”. They're still easy to use and accessible to all types of users.

Using Notion to improve my wife's website

So if you've been following so far, I think Notion is a great product that is equal parts friendly and powerful. It's also improving with a killer feature — an API. So a great product is now available to be integrated with others, making it even more powerful.

Notion API: Connect Notion pages and databases to the tools you use every day, creating powerful workflows. (Notion API)

My use case shows how you can use Notion to make small, impactful improvements to your projects.

Problem Statement

My wife is diligently developing her illustrator side hustle on her website, which I developed in roughly a week using NextJS. As an illustrator, a gallery is an important showcase of her work. Even though I had no qualms about doing it, being able to manage the content on the gallery herself would be a good feature. Content gets on the website quicker, and she'd get full autonomy on how to present it.

What the end product is going to look like: The Gallery page features categories where illustrations are organised. The pages for individual galleries feature a gallery of illustrations for the selected categories. Individual images feature some metadata.

My wife is not a coder, so choosing a data format for her was going to be challenging. It has to work with the website and work in her workflow too. A full-fledged content management system like WordPress would surely be overkill. However, explaining to her the intricacies of JSON, YAML or TOML would probably turn her off as well.

To turn you off, here's what the original YAML file looked like:

---
- caption: Travel sketches
  dateupdated: '2021-03-02'
  id: 3
  location: travels
  title: Adventures and travels
  images:
  - caption: ''
    source: travels_1.jpg
    thumbnailCaption: ''
  - caption: ''
    source: travels_2.jpg
    thumbnailCaption: ''
  - caption: ''
    source: travels_3.jpg
    thumbnailCaption: ''
  - caption: ''
    source: travels_4.jpg
    thumbnailCaption: ''
    portrait: true

I'd call this “code lite” like docassemble since she only has to edit one text file. But this will become problematic very fast:

  • What does “travels_1.jpg” mean? How do I write a caption for something which I don't even know what it is?
  • This is a text file, so the actual images are missing. It turns out that she still has to send the file to me, and then I have to rename it, and errm... a bunch of other manual steps before it gets on the website.
  • It's easy for an ordinary user to make errors in YAML. For example, the indents are significant. Any typo in the source file names is sufficient to break the system as well.
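The kind of checking that the YAML route forces on you can be sketched as follows. This is only an illustration: it assumes the file has already been parsed into Python dictionaries, the list of required keys is my guess based on the example above, and `validate_gallery` is a hypothetical helper rather than something that existed in the website's code.

```python
REQUIRED_GALLERY_KEYS = {"caption", "id", "location", "title", "images"}


def validate_gallery(gallery, available_files):
    """Return a list of human-readable problems found in one gallery entry."""
    problems = []
    missing = REQUIRED_GALLERY_KEYS - gallery.keys()
    for key in sorted(missing):
        problems.append(f"missing key: {key}")
    for index, image in enumerate(gallery.get("images", [])):
        source = image.get("source")
        # A single typo in the file name is enough to break the page
        if source not in available_files:
            problems.append(f"image {index}: file not found: {source}")
    return problems
```

Every one of these checks is something the person editing the text file never sees until the website fails to build.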

Notion Everywhere!

Enter Notion. It does all of the above and it does it even better. The YAML text file is now replaced with a screen that looks like this:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b366764b-cef6-4c06-8821-93d0a26c7c7d/Untitled.png

Each box is clickable, and it leads you to the illustration's own page.

Deceptively, it looks like a gallery, but its underlying structure is a database. So I can view all my pictures in a user-friendly gallery, which allows for searching and filtering. Furthermore, the gallery packs an interface for me to upload new illustrations. It's not obvious from the screenshot, but you can even create new categories of illustrations.

Editing the metadata of an individual illustration is also straightforward:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d0737aaf-1ec4-4118-a480-26ee662b240b/Untitled.png

The gallery also allows my wife to attach her illustration to the individual item, so I am able to get everything I need to create the gallery from Notion.

Finally, using the Notion API, the NextJS website generator grabs all the information from Notion to create the gallery you now see on the website. (NB: As of writing, the Notion API does not support images, so that step is still manual at this time.)
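The website itself is written in NextJS, but the shape of the exchange is easy to sketch in Python. This is a minimal sketch under stated assumptions: the property names `Name`, `Caption` and `Category` are hypothetical and would have to match the columns of your own database, the `Notion-Version` value is simply one published version, and the functions are illustrative rather than taken from the site's code. The request is only built here, not sent.

```python
import json
import urllib.request

NOTION_VERSION = "2022-06-28"  # assumption: pin whichever version you target


def build_query_request(database_id, token):
    """Prepare (but do not send) a request for every row of a database."""
    url = f"https://api.notion.com/v1/databases/{database_id}/query"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": NOTION_VERSION,
            "Content-Type": "application/json",
        },
    )


def plain_text(rich_text_items):
    """Join the plain text of a Notion rich text array."""
    return "".join(item.get("plain_text", "") for item in rich_text_items)


def parse_gallery(response):
    """Turn a database query response into simple gallery entries."""
    entries = []
    for page in response.get("results", []):
        props = page["properties"]
        category = props["Category"]["select"]  # None when left empty
        entries.append({
            "id": page["id"],
            "title": plain_text(props["Name"]["title"]),
            "caption": plain_text(props["Caption"]["rich_text"]),
            "category": category["name"] if category else None,
        })
    return entries
```

To actually fetch the data, you would pass the request to `urllib.request.urlopen` and feed the decoded JSON to `parse_gallery`. What comes out is essentially the old YAML file, except that nobody had to type it.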

Once you have decided on the scheme of your data, you can directly translate it into a user-friendly Notion page. My wife doesn't need to touch any code, but she still gets to plan her gallery in the way she wants. (Now she just has to scan some illustrations. 😝)

Conclusion

I hope this demonstration gets you to question what we really want from “no-code” or “low code”. It's sexy to claim that lawyers don't get to touch any code, or that they can automate their workflows by dragging and dropping some boxes. However, we can be more discerning than that. What exactly are the features of this product? What can I do with it? Does it make me pull my hair out at the limitations it imposes on me? Or does it offer to sacrifice me on the altar of my crappy computer skills? In the end, designing a good product is a lot more difficult than offering a feature that doesn't need you to code. Even a product made for general use (like Word or Notion) might be more relevant than a product that is labelled “legal tech”.

And if you ask me, a good barometer of a great product is the fans who are willing to say nice things about it.

#Programming #tech #docassemble #blog #MicrosoftOffice #Notion #LegalTech
