Love.Law.Robots. by Ang Hou Fu

NaturalLanguageProcessing

Feature image

Introduction

In January 2022, the 2020 Revised Edition of over 500 Acts of Parliament (the primary legislation in Singapore) was released. It’s a herculean effort to update so many laws in one go. A significant part of that effort is to “ensure Singapore’s laws are understandable and accessible to the public” and came out of an initiative named Plain Laws Understandable by Singaporeans (or Plus).

Keeping Singapore laws accessible to all – AGC, together with the Law Revision Committee, has completed a universal revision of Singapore’s Acts of Parliament! pic.twitter.com/76TnrNCMUq

— Attorney-General's Chambers Singapore (@agcsingapore) December 21, 2021

After reviewing the list of changes they made, such as replacing “notwithstanding” with “despite”, I frankly felt underwhelmed by the changes. An earlier draft of this article was titled “PLUS is LAME”. The revolution is not forthcoming.

I was bemused by my strong reaction to a harmless effort with noble intentions. It led me to wonder how to evaluate a claim, such as whether and how much changing a bunch of words would lead to a more readable statute. Did PLUS achieve its goals of creating plain laws that Singaporeans understand?

In this article, you will be introduced to well-known readability statistics such as Flesch Reading Ease and apply them to laws in Singapore. If you like to code, you will also be treated to some Streamlit, Altair-viz and Python Kung Fu, and all the code involved can be found in my Github Repository.

GitHub – houfu/plus-explorer: a Streamlit app to explore changes made by PLUS. The code used in this project is accessible in this public repository.

How would we evaluate the readability of legislation?

Photo by Jamie Street / Unsplash

When we say a piece of legislation is “readable”, we are saying that a certain class of people will be able to understand it when they read it. It also means that a person encountering the text will be able to read it with little pain. Thus, “Plain Laws Understandable by Singaporeans” suggests that most Singaporeans, not just lawyers, should be able to understand our laws.

In this light, I am not aware of any tool in Singapore or elsewhere which evaluates or computes how “understandable” or readable laws are. Most people, especially in the common law world, seem to believe in their gut that laws are hard and out of reach for most people except for lawyers.

In the meantime, we would have to rely on readability formulas such as Flesch Reading Ease to evaluate the text. These formulas rely on semantic and syntactic features to calculate a score or index, which shows how readable a text is. Some of these formulas, like Gunning FOG and Dale-Chall, map their scores to US Grade levels. Very approximately, these translate to years of formal education. A US Grade 10 student would, for example, be equivalent to a Secondary four student in Singapore.

After months of mulling about, I decided to write a pair of blog posts about readability: one that's more facts oriented: (https://t.co/xbgoDFKXXt) and one that's more personal observations (https://t.co/U4ENJO5pMs)

— brycew (@wowitisbryce) February 21, 2022

I found these articles to be a good summary and valuable evaluation of how readability scores work.

These formulas were created a long time ago and for different fields. The Flesch-Kincaid Grade Level, for example, was developed under contract to the US Navy in 1975 to assess training materials. Using a readability statistic like FRE, you can tell, for instance, whether a book is suitable for your kid.

I first considered using these formulas when writing interview questions for docassemble. Some automated feedback can help me avoid writing rubbish when I have been working too long in the playground. An interview question is entirely different from a piece of legislation, but hopefully, the scores will still act as a good proxy for readability.
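
For readers who want to try this at home, the scores used in this article can be computed with an off-the-shelf package. Here is a minimal sketch assuming the textstat package; the two sentences are invented for illustration, and the PLUS explorer may compute its scores differently.

    # A minimal sketch, assuming the textstat package; the PLUS explorer may
    # compute its scores differently. The sentences are invented examples.
    import textstat

    before = "No action shall be brought notwithstanding anything to the contrary in any written law."
    after = "No action may be brought despite anything to the contrary in any written law."

    for label, text in [("before", before), ("after", after)]:
        print(label,
              textstat.flesch_reading_ease(text),
              textstat.gunning_fog(text),
              textstat.automated_readability_index(text),
              textstat.dale_chall_readability_score(text))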

Selecting the Sample

Browsing vinyl music at a fair. Photo by Artificial Photography / Unsplash

To evaluate the claim, two pieces of information regarding any particular section of legislation are needed – the section before the 2020 Edition and the section in the 2020 Edition. This would allow me to compare them and compute differences in scores when various formulas are applied.

I reckon it’s possible to scrape the entire website of statutes online, create a list of sections, select a random sample and then delve into their legislative history to pick out the sections I need to compare. However, since there is no API to access statutes in Singapore, it would be a humongous and risky task to parse HTML programmatically and hope it is structured consistently throughout the website.

Mining PDFs to obtain better text from Decisions: after several attempts at wrangling with PDFs, I managed to extract more text information from complicated documents using PDFMiner. In that post, one of my favourite programming posts on this blog, I extracted information from PDFs even though the PDPC used at least three different formats to publish their decisions. Isn’t Microsoft Word fantastic?

I decided on an alternative method which I shall now present with more majesty:

The author visited the subject website and surveyed various acts of Parliament. When a particular act is chosen by the author through his natural curiosity, he evaluates the list of sections presented for novelty, variety and fortuity. Upon recognising his desired section, the author collects the 2020 Edition of the section and compares it with the last version immediately preceding the 2020 Edition. All this is performed using a series of mouse clicks, track wheel scrolling, control-Cs and control-Vs, as well as visual evaluation and checking on a computer screen by the author. When the author grew tired, he called it a day.

I collected over 150 sections as a sample and calculated and compared the readability scores and some linguistic features for them. I organised them using a pandas data frame and saved them to a CSV file so you can download them yourself if you want to play with them too.
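
A sketch of how such a table can be put together and compressed with pandas follows; the column names and scores here are illustrative, not the actual contents of the file.

    # A sketch of organising the sample with pandas; column names and values
    # here are illustrative, not the actual contents of data.csv.gz.
    import pandas as pd

    rows = {
        "Civil Law Act 1909 Section 6": {
            "previous_flesch_reading_ease": 35.2,  # hypothetical scores
            "current_flesch_reading_ease": 38.7,
        },
    }

    dataset = pd.DataFrame.from_dict(rows, orient="index")
    dataset.to_csv("data.csv.gz", compression="gzip")  # writes a gzipped CSV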

Data: a gzipped CSV file (data.csv.gz, 76 KB) containing the source data of the 152 sections, their content in the 2020 Rev Edn and more.

Exploring the Data with Streamlit

You can explore the data associated with each section yourself using my PLUS Explorer! If you don’t know which section to start with, you can always click the Random button a few times to survey the different changes made and how they affect the readability scores.

Screenshot of PLUS Section Explorer: https://share.streamlit.io/houfu/plus-explorer/main/explorer.py

You can use my graph explorer to get a macro view of the data. For the readability scores, you will find two graphs:

  1. A graph that shows the distribution of the value changes amongst the sample
  2. A graph that shows an ordered list of the readability scores (from most readable to least readable) and the change in score (if any) that the section underwent in the 2020 Edition.

You can even click on a data point to go directly to its page on the section explorer.

Screenshot of PLUS graph explorer: https://share.streamlit.io/houfu/plus-explorer/main/graphs.py

This project allowed me to revisit Streamlit, and I am proud to report that it’s still easy and fun to use. I still like it more than Jupyter Notebooks. I tried using ipywidgets to create the form to input data for this project, but I found it downright ugly and not user-friendly. If my organisation forced me to use Jupyter, I might reconsider it, but I wouldn’t be using it for myself.

Streamlit — works out of the box and is pretty too. Here are some features that were new to me since I last used Streamlit probably a year ago:

Pretty Metric Display

Metric display from Streamlit

My dear friends, this is why Streamlit is awesome. You might not be able to create a complicated web app or a game using Streamlit. However, Streamlit’s creators know what is essential or useful for a data scientist and provide it with a simple function.

The code to make the wall of stats (including their changes) is pretty straightforward:

    st.subheader('Readability Statistics')

    # Create three columns
    flesch, fog, ari = st.columns(3)

    # Create each column
    flesch.metric("Flesch Reading Ease",
                  dataset["current_flesch_reading_ease"][section_explorer_select],
                  dataset["current_flesch_reading_ease"][section_explorer_select] -
                  dataset["previous_flesch_reading_ease"][section_explorer_select])

    # For Fog and ARI, the lower the better, so delta colour is inverse
    fog.metric("Fog Scale",
               dataset["current_gunning_fog"][section_explorer_select],
               dataset["current_gunning_fog"][section_explorer_select] -
               dataset["previous_gunning_fog"][section_explorer_select],
               delta_color="inverse")

    ari.metric("Automated Readability Index",
               dataset["current_ari"][section_explorer_select],
               dataset["current_ari"][section_explorer_select] -
               dataset["previous_ari"][section_explorer_select],
               delta_color="inverse")


Now Accepting Arguments

Streamlit apps are very interactive (I came close to creating a board game using Streamlit). However, Streamlit used to suffer from a significant limitation: apart from consuming external data, you couldn’t interact with it from outside the app.

It’s still experimental, but you can now pass arguments in the app’s address, just like an HTML form. Streamlit has also made this simple, so you don’t have to bother too much about URL encoding.

I used it to communicate between the graphs and the section explorer. Each section has its own address, and the section explorer reads the section name from the arguments to direct the visitor to the right section.

    # Get and parse the HTTP request
    query_params = st.experimental_get_query_params()

    # If the keyword is in the address, use it!
    if "section" in query_params:
        section_explorer_select = query_params.get("section")[0]
    else:
        section_explorer_select = 'Civil Law Act 1909 Section 6'

You can also set the address within the Streamlit app to reduce the complexity of your app.

    # Once this callback is triggered, update the address
    def on_select():
        st.experimental_set_query_params(section=st.session_state.selectbox)

    # Select box to choose a section as an alternative.
    # Note that the key keyword specifies where the
    # current selection is stored in session state.
    st.selectbox("Select a Section to explore",
                 dataset.index,
                 on_change=on_select,
                 key='selectbox')

So all you need is a properly formed address for the page, and you can link it using a URL on any webpage. Sweet!

Key Takeaways

Changes? Not so much.

From the list of changes, most of the revisions amount to swapping words for others. For word count, most sections experienced a slight increase or decrease of up to 5 words, and a significant number of sections had no change at all. The word count heatmap lays this out visually.

Unsurprisingly, this produced little to no effect on the readability of the section as computed by the formulas. For Flesch Reading Ease, a vast majority fell within a band of ten points of change, which is roughly a grade or a year of formal education. This is shown in the graph showing the distribution of changes. Many sections are centred around no change in the score, and most are bound within the band as delimited by the red horizontal rulers.

This was similar across all the readability formulas used in this survey (Automated Readability Index, Gunning FOG and Dale-Chall).

On the face of it, the 2020 Revised Edition of the laws had little to no effect on the readability of the legislation, as calculated by the readability formulas.

Laws remain out of reach to most people

I was also interested in the raw readability score of each section. This would show how readable a section is.

Since the readability formulas we are considering use years of formal schooling as a gauge, we can use the same measure to locate our target audience. If we use secondary school education as the minimum level (in 2020, this would cover over 75% of the resident population), or US Grade 10 for simplicity, we can see which sections fall in or out of this threshold.
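
As a rough sketch of that check, assuming the data.csv.gz file above and a grade-level column named like the ones in the explorer code:

    # A rough sketch: count sections within reach of a US Grade 10 reader.
    # The column name "current_ari" is an assumption based on the explorer code.
    import pandas as pd

    dataset = pd.read_csv("data.csv.gz", index_col=0)
    readable = dataset[dataset["current_ari"] <= 10]
    print(f"{len(readable)} of {len(dataset)} sections are at or below US Grade 10")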

Most if not all of the sections in my survey are out of reach for a US Grade 10 student or a person who attained secondary school education. This, I guess, proves the gut feeling of most lawyers that our laws are not readable to the general public in Singapore, and PLUS doesn’t change this.

Take readability scores with a pinch of salt

If you go by the Automated Readability Index, you will need nearly 120 years of formal education to understand an interpretation section of the Point-to-Point Passenger Transport Industry Act.

Section 3 of the Point-to-Point Passenger Transport Industry Act makes for ridiculous reading.

We are probably stretching the limits of tools made for processing prose decades ago. It turns out that many formulas lean heavily on the average number of words per sentence, based on the not-so-absurd notion that long sentences are hard to read. Unfortunately, many sections cram a large number of words into one interminable sentence. This skews the scores significantly and makes the mapping to particular audiences unreliable.
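
For reference, the classic Flesch Reading Ease formula is:

FRE = 206.835 - 1.015 × (total words ÷ total sentences) - 84.6 × (total syllables ÷ total words)

A single interminable sentence blows up the words-per-sentence term, which by itself is enough to drag the score far below zero, regardless of how simple the individual words are.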

The fact that some scores don’t make sense when applied in the context of legislation doesn’t invalidate the point that legislation is hard to read. Whatever historical reasons legislation has for being expressed the way it is, the status quo harms the people who have to use it.

In my opinion, the scores are useful for telling whether a person with a secondary school education can understand a piece of text. This was, after all, what the scores were made for. However, I am more doubtful whether we can derive any meaning from a score of, for example, ARI 120 compared to a score of ARI 40.

Improving readability scores can be easy. Should it?

Singaporean students know that there is no point in studying hard; you have to study smart.

Having realised that the number of words per sentence features heavily in readability formulas, the easiest thing to do to improve a score is to break long sentences up into several sentences.

True enough, breaking up one long sentence into two seems to affect the score profoundly: see Section 32 of the Defence Science and Technology Agency Act 2000. The detailed mark changes section shows that when the final part of subsection 3 is broken off into subsection 4, the scores improve by nearly one grade.

It’s curious why more sections were not broken up this way in the 2020 Revised Edition.

However, breaking long sentences into several short ones doesn’t always improve reading. It’s important to note that such scores focus on linguistic features, not content or meaning. So in trying to game the score, you might be losing sight of what you are writing for in the first place.

Here’s another reason why readability scores should not be the ultimate goal. One of PLUS’s revisions is to remove gendered nouns — chairperson instead of chairman, his or her instead of his only. Trying to replace “his” with “his or her” harms readability by generally increasing the length of the sentence. See, for example, section 32 of the Weights and Measures Act 1975.

You can agree or disagree on whether legislation should reflect our values, such as a society that doesn't discriminate between genders. (It's interesting to note that in 2013, frequent legislation users were not enthusiastic about this change.) I wouldn't suggest, though, that readability scores should be prioritised over such goals.

Here’s another point which shouldn’t be missed. Readability scores focus on linguistic features. They don’t consider things like the layout or even graphs or pictures.

A striking example of this is the interpretation section found in legislation. Interpretation sections aren’t perfect, but most legislation users are okay with them: you would use the various indents to find the term you need.

Example of an interpretation section and the use of indents to assist reading.

However, the indents are ignored because white space is not visible to the formulas. To the computer, the section appears as one long sentence, and readability is computed accordingly, read: terrible. This was the provision that required 120 years of formal education to read.

I am not satisfied that readability should be ignored in this context, though. Interpretation sections, despite the creative layout, remain very difficult to read. They are still text-heavy, and even when read alone, each definition is still a very long sentence.

A design that relies more on graphics and diagrams would probably use fewer words than this. Even though the scores might be meaningless in this context, they would still show up as an improvement.

Conclusion

PLUS might have the noble aim of making laws understandable to Singaporeans, but the survey of the clauses here shows that its effect is minimal. It would be great if drafters referred to readability scores in the future to get a good sense of whether the changes they are making will impact the text. Even if such scores have limitations, they still present a sound and objective proxy for the readability of the text.

I felt that the changes were too conservative this time. An opportunity to look back and revise old legislation will not return for a while (the last time such a project was undertaken was in 1985). Given the scarcity of opportunity, I am not convinced that we should (a) try to preserve historical nuances which very few people can appreciate, or (b) avoid changes that only superficially affect meaning, given the advances in statutory interpretation in Singapore over the last few decades.

Beyond using readability scores that focus heavily on text, it would be helpful to consider more legal design — I sincerely believe pictures and diagrams will help Singaporeans understand laws more than endlessly tweaking words and sentence structures.

This study also suggests that it might be helpful to have a readability score designed for legal documents. You would have to create a study group comprising people with varying education levels, test them on various texts or pieces of legislation, then create a machine model that predicts what level of difficulty a piece of legislation might pose. A tool like that could probably use machine models that observe several linguistic features: see this, for example.

Finally, while this represents a lost opportunity for making laws more understandable to Singaporeans, the 2020 Revised Edition includes changes that improve the quality of life for frequent legislation users. These include citing all Acts of Parliament by year rather than by their historic and quaint chapter numbers, and removing information that is no longer relevant today, such as provisions relating to the commencement of the legislation. As a frequent legislation user, I did look forward to these changes.

It’s just that I wouldn’t be showing them off to my mother any time soon.

#Features #DataScience #Law #Benchmarking #Government #LegalTech #NaturalLanguageProcessing #Python #Programming #Streamlit #JupyterNotebook #Visualisation #Legislation #AGC #Readability #AccesstoJustice #Singapore

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu

Feature image

I love playing with legal data. For me, books specialising in legal data are uncommon, especially those dealing with what’s available on the wild world of the internet today.

That’s why I snapped up Sarah Sutherland’s “Legal Data and Information in Practice”. Ms Sutherland was CEO of CanLII, one of the most admirable LIIs. CanLII is extensive, comprehensive, and packed with great features like noting up and keywords. It even comes in two languages.

Legal Data and Information in Practice: How Data and the Law Interact, by Sarah A. Sutherland (Routledge & CRC Press). From the blurb: “Legal Data and Information in Practice provides readers with an understanding of how to facilitate the acquisition, management, and use of legal data in organizations such as libraries, courts, governments, universities, and start-ups. Presenting a synthesis of information about legal data that will…”

The book’s blurb recommends that it is “essential reading for those in the law library community who are based in English-speaking countries with a common law tradition”.

Since finishing the book, I found the blurb’s focus way too narrow. This is a book for anyone who loves legal data.

For one, I enjoyed the approachable language. My interaction with legal data has always been pragmatic. Either I was studying for some course, or I needed to find an answer quickly. It will be enough to appreciate the book if you’ve done any of those things. I liked that it didn’t baffle me with impossible or theoretical language. I found myself nodding at several junctures as I reflected on my experience of interacting with legal data as well.

Furthermore, it’s effectively a primer:

  • It’s short. I took a month to finish it at a leisurely pace (i.e., in between taking care of children, making sure the legal department runs smoothly, and programming). Oh, and unlike most law books, it has pictures.
  • It effectively explains a broad range of topics. It talks about the challenges of AI and the political and administrative backgrounds of how legal data is provided without overwhelming you. More impressively, I found new areas in this field that I didn’t know about before reading the book, such as the various strategies to acquire legal data and an overview of statistical and machine learning techniques on data.

So, even if you are not a librarian or a legal technologist by profession, this book is still handy for you. I would love more depth, and maybe that’s some scope for a 2nd edition. In any case, Sarah Sutherland’s “Legal Data and Information in Practice” is a great starting point for everyone. Reading it will level up your ability to discuss and evaluate what’s going on in this exciting field.

* * *

I am sorry for being a sucker — I am the kind of guy who watches movies to swoon at sweeping visages of my home jurisdiction, Singapore. I enjoyed Crazy Rich Asians, even though it’s fake.

So, I couldn’t resist looking for references to Singapore in the book. Luckily for me, Singapore is mentioned several times. It’s described as “an interesting example of what can happen if a government is willing to invest heavily in developing capacity in legal computing and data use”. I’m not convinced that LawNet is like an LII, but on the other points raised, such as infrastructure, availability and formats, we are still much better off here than in the rest of the common law world.

The more interesting point is that Singapore, as a small jurisdiction, will usually find its datasets smaller. That’s why it is crucial to experiment with making models trained on other kinds of data effective on yours. (I think the paper cited in the book is an excellent example of this.) Other facets become relevant when you have less data and fewer resources: what kinds of legal data should one focus on, and what strategies should be used to acquire them.

The challenges of a smaller dataset seem to be less exciting because fewer people are staring at them. However, I would suggest that these challenges are more prevalent than you would expect — companies and organisations also have smaller datasets and fewer resources. What would work for Singapore should be of interest to many others.

There’s always something to be excited about in this field. What do you think?

#BookReview #ArtificalIntelligence #DataMining #Law #LegalTech #MachineLearning #NaturalLanguageProcessing #Singapore #TechnologyLaw

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu

Feature image

Introduction

Over the course of 2019 and 2020, I embarked on a quest to apply the new things I was learning in data science to my field of work in law.

The dataset I chose was the enforcement decisions from the Personal Data Protection Commission in Singapore. The reason I chose it was quite simple. I wanted a simple dataset that covers a limited number of issues and is pretty much self-contained (not affected by stare decisis or extensive references to legislation or other cases). Furthermore, during that period, the PDPC was issuing decisions at a furious pace.

This experiment proved to be largely successful, and I learned a lot from the experience. This post gathers everything I wrote on the subject at the time. I now feel more confident about moving on to more complicated datasets, like Supreme Court decisions, which feature several of the same problems found in the PDPC dataset.

Since then, the dataset has changed a lot; the website, for one, has changed, so your extraction methods would be different. I haven't really maintained the code, so it isn't intended for creating your own dataset and analysis today. However, the techniques are still relevant, and I hope they point you in a good direction.

Extracting Judgement Data

Dog & Baltic Sea. Photo by Janusz Maniak / Unsplash

The first step in any data science journey is to extract data from a source. In Singapore, one can find judgements from courts on websites for free. You can use such websites as the source of your data. API access is usually unavailable, so you have to look at the webpage to get your data.

It's possible to download everything by clicking through the website manually. However, you wouldn't want to do this for an extended period of time. Automate the process by scraping it!

Automate Boring Stuff: Get Python and your Web Browser to download your judgements

I used Python and Selenium to access the website and download the data I want. This included the actual judgement. Metadata, such as the hearing date etc., are also available conveniently from the website, so you should try and grab them simultaneously. In Automate Boring Stuff, I discussed my ideas on how to obtain such data.
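
To give a flavour of the approach, here is a minimal sketch; the URL and the CSS selector are placeholders, not the actual structure of the website.

    # A minimal sketch of scraping a judgment listing with Selenium.
    # The URL and selector below are placeholders, not the actual site structure.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/judgments")  # hypothetical listing page

    # Collect the link and title of each judgment listed on the page
    for link in driver.find_elements(By.CSS_SELECTOR, "a.judgment-link"):
        print(link.text, link.get_attribute("href"))

    driver.quit()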

Processing Judgement Data in PDF

Photo by Pablo Lancaster Jones / Unsplash

Many judgements available online are in #PDF format. They look great on your screen but are very difficult for robots to process. You will have to transform this data into a format that you can use for natural language processing.

I took a lot of time on this as I wanted the judgements to read like a text. The raw text that most (free) PDF tools can provide you consists of joining up various text boxes the PDF tool can find. This worked all right for the most part, but if the text ran across the page, it would get mixed up with the headers and footers. Furthermore, the extraction revealed lines of text, not paragraphs. As such, additional work was required.

Firstly, I used regular expressions. This allowed me to detect and remove unwanted data, such as carriage returns, headers and footers, by matching patterns in the raw text.
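
A simplified sketch of that kind of clean-up follows; the patterns here are illustrative, and the actual expressions in the project were messier.

    # Illustrative regular expressions for cleaning raw PDF text;
    # the actual patterns used in the project were more involved.
    import re
    from typing import Optional

    def clean_line(line: str) -> Optional[str]:
        line = line.replace("\r", "").strip()
        if re.match(r"^Page \d+ of \d+$", line):  # page footers
            return None
        if re.match(r"^\[\d{4}\] SGPDPC \d+$", line):  # repeated citation headers
            return None
        return line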

I then decided to use machine learning to train my computer to decide whether to keep a line or reject it. This required me to create a training dataset and tag which lines should be kept as the text. This was probably the fastest machine-learning exercise I ever came up with.

However, I didn't believe I was getting significant improvements from these methods. The final solution was actually fairly obvious. Using the formatting information of how the text boxes were laid out in the PDF, I could make reasonable conclusions about which text was a header or footer, a quote or a start of a paragraph. It was great!

Natural Language Processing + PDPC Decisions = 💕

Photo by Moritz Kindler / Unsplash

With a dataset ready to be processed, I decided that I could finally use some of the cutting-edge libraries I have been raring to use, such as #spaCy and #HuggingFace.

One of the first experiments was to use spaCy's rule-based Matcher to extract enforcement information from the summary provided by the authorities. As the summary was fairly formulaic, it was possible to extract whether the authorities imposed a financial penalty or took other enforcement action.
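
Rule-based matching in spaCy works roughly as sketched below; the pattern is a simplified illustration rather than the actual rules used on the PDPC summaries.

    # A simplified sketch of spaCy's rule-based Matcher; the pattern below is
    # illustrative, not the actual rules used on the PDPC summaries.
    # Requires the small English model: python -m spacy download en_core_web_sm
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # Matches phrases like "financial penalty of $5,000"
    pattern = [{"LOWER": "financial"}, {"LOWER": "penalty"}, {"LOWER": "of"},
               {"IS_CURRENCY": True}, {"LIKE_NUM": True}]
    matcher.add("PENALTY", [pattern])

    doc = nlp("The Organisation was directed to pay a financial penalty of $5,000.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)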

I also wanted to undertake key NLP tasks using my prepared data. This included tasks like Named Entity Recognition (does the sentence contain any special entities), summarisation (extract key points in the decision) and question answering (if you ask the machine a question, can it find the answer in the source?). To experiment, I used the default pipelines from Hugging Face to evaluate the results. There are clearly limitations, but very exciting as well!

Visualisations

Photo by Annie Spratt / Unsplash

Visualisations are very often the result of the data science journey. Extracting and processing data can be very rewarding, but you would like to show others how your work is also useful.

One of my first aims in 2019 was to show how PDPC decisions have changed since they were first issued in 2016. Decisions became greater in number, more frequent, and shorter in length. There was clearly a shift and an intensifying of effort in enforcement.

I also wanted to visualise how the PDPC was referring to its own decisions. Such visualisation would allow one to see which decisions the PDPC was relying on to explain its decisions. This would definitely help to narrow down which decisions are worth reading in a deluge of information. As such, I created a network graph and visualised it. I called the result my “Star Map”.
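
Building such a graph does not take much; here is a sketch with networkx, using made-up decision names and citation pairs.

    # A sketch of a citation network with networkx; the decision names and
    # citation pairs are made up for illustration.
    import networkx as nx

    citations = [
        ("Decision A", "Decision C"),
        ("Decision B", "Decision C"),
    ]

    G = nx.DiGraph()
    G.add_edges_from(citations)

    # Decisions cited most often are good candidates to read first
    print(sorted(G.in_degree, key=lambda pair: pair[1], reverse=True))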

Data continued to be very useful in supporting the conclusions I made about the enforcement situation in Singapore. For example, how great an impact would the increase in maximum penalties in the latest amendments to the law have? Short answer: probably not much, but they still have a symbolic effect.

What's Next?

As mentioned, I have been focusing on other priorities, so I haven't been working on PDPC-Decisions for a while. However, my next steps were:

  • I wanted to train a machine to process judgements for named entity recognition and summarization. For the second task, one probably needs to use a transformer in a pipeline and experiment with what works best.
  • Instead of using Selenium and Beautiful Soup, I wanted to use scrapy to create a sustainable solution to extract information regularly; a rough sketch of what that might look like follows this list.
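
Such a scrapy spider might look like the sketch below; the URLs and selectors are placeholders rather than the actual website structure.

    # A minimal scrapy sketch; URLs and selectors are placeholders,
    # not the actual structure of the PDPC website.
    import scrapy


    class DecisionsSpider(scrapy.Spider):
        name = "pdpc_decisions"
        start_urls = ["https://example.com/decisions"]  # hypothetical listing page

        def parse(self, response):
            for link in response.css("a.decision::attr(href)").getall():
                yield {"url": response.urljoin(link)}
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)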

Feel free to let me know if you have any comments!

#Features #PDPC-Decisions #PersonalDataProtectionAct #PersonalDataProtectionCommission #Decisions #Law #NaturalLanguageProcessing #PDFMiner #Programming #Python #spaCy #tech

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu

Feature image

Speaking has always been a big part of being a lawyer. You use your voice to make submissions in the highest courts of the land. Even in client meetings, you are also using your voice to persuade. Hell, when I write my emails, I imagine saying what I am writing to make sure it is in my voice.

So, thinking about how a synthesized voice can be useful is going to be controversial. You might think that a computer's voice is soulless and not interesting enough to hold on its own against a lawyer. However, with advances led by smart assistants like Google Home and Siri, Text to Speech (TTS) is certainly worth exploring.

Why use robots?

Talking is really convenient: you just open your mouth and start talking (though some babies will disagree). However, working from home shows how difficult it can be to record and transmit good quality sound. Feedback and distortions are just some of the problems people regularly face using basic equipment for online meetings. It's frustrating.

If you think this is an issue that can be resolved with better equipment, it gets expensive very quickly. You might notice that several people are involved in producing your favourite podcast. You are going to need all sorts of equipment, like microphones and DAC mixers. Hire engineers? What does a mixer do, actually?

Furthermore, human performance is subject to various imperfections. The pitch or tone is not right here. Sometimes you lose concentration or get interrupted in the middle of your speech. All this means you may have to record something several times to get a delivery you are happy with. If you aren't confident about your English or would like to say something in another language, getting a computer to voice it will help overcome that.

So a synthesized voice can be cheap, fast and consistent. If the quality is good enough, you can focus on the script. For me, I am interested in improving the quality of my online training. Explaining stuff doesn't need Leonard Cohen quality delivery. It's probably far less distracting anyway.

Experiments with TTS

I will take two major Text to Speech (TTS) solutions for a spin — Google Cloud and Mozilla's TTS (open source). The Python code used for these experiments is contained in my GitHub.

GitHub – houfu/TTS-experiments: the repository containing the code for these experiments.

Google Cloud

It's quite easy to try Google Cloud's TTS. A demo allows you to set the text and then convert it with a click of a button. If you want to know how it sounds, try it!

Text-to-Speech: Lifelike Speech Synthesis | Google Cloud: “Turn text into natural-sounding speech in 220+ voices across 40+ languages and variants with an API powered by Google’s machine learning technology.”

To generate audio files, you're going to need a Google Cloud account and some programming knowledge. However, it's pretty straightforward, and I mostly copied from the quickstart. You can hear the first two paragraphs of this blog post here.
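
The quickstart boils down to something like this sketch; you will need a Google Cloud project with the Text-to-Speech API enabled and credentials configured.

    # A sketch closely following the Google Cloud Text-to-Speech quickstart.
    # Requires a Google Cloud project with the API enabled and credentials set up.
    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.SynthesisInput(text="Speaking has always been a big part of being a lawyer.")
    voice = texttospeech.VoiceSelectionParams(language_code="en-GB",
                                              ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL)
    audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

    response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)

    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)  # a playable MP3 of the text above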

Here's my personal view of Google Cloud's TTS:

  • It ain't free. Premium voices are free only for the first 1 million characters; after that, it's a hefty USD16 for every 1 million characters. Do note that the first two paragraphs of this blog post have 629 characters, so if you are only converting short passages of text, it's hard to bust that limit.
  • The voices sound nice, in my opinion. However, listening to them for a long time might not be easy.
  • Developer experience is great, and as you can see, converting lines of text to speech is straightforward.

Mozilla's TTS

Using Mozilla's TTS, you get much closer to the machine-training aspects of the text to speech. This includes training your own model, that is, if you have roughly 24 hours of recordings of your voice to spare.

mozilla/TTS: Deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts), on GitHub.

However, for this experiment, we don't need that as we will use pre-trained models. Using python's built-in subprocess module, we can run the command line command that comes with the package. This generates wave files. You can hear the first two paragraphs of this blog post here.
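
In code, the call looks something like the sketch below; the tts command-line entry point and the model name are assumptions based on more recent releases of the package, and older versions expose a synthesize script instead.

    # A sketch of driving the TTS command-line tool from Python's subprocess.
    # The "tts" entry point and model name are assumptions based on recent
    # releases; older versions of the package ship a synthesize script instead.
    import subprocess

    subprocess.run(
        ["tts",
         "--text", "Speaking has always been a big part of being a lawyer.",
         "--model_name", "tts_models/en/ljspeech/tacotron2-DDC",
         "--out_path", "paragraph.wav"],
        check=True,
    )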

Here's my personal view of Mozilla's TTS:

  • It's open-source, and I am partial to open source.
  • It also teaches you how to train a machine using your own voice. So, this is a possibility.
  • It sounds terrible, but that's because the audio feels a bit more varied than Google's. So... some parts sound louder, making other parts sound softer. There is also quite a lot of noise, which may be due to the recording quality of the source data. I did normalise the loudness for this sample.
  • Leaving those two points aside, it sounds more interesting to me. The variation feels a tad more natural to me.
  • There aren't different voices to choose from (male, female, etc.), so this may not be practical.
  • Considering I was not doing much more than running a command line, it was OK. Notably, choosing a pre-trained model was confusing at first, and I had to experiment a lot. Also, based on what you choose, the model might take a bit of time and computing power to produce audio. It took roughly about 15 minutes, and my laptop was wheezing throughout.

Conclusion

If you thought robots would replace lawyers in court, this isn't the post to persuade you. However, thinking further, I think some usage cases are certainly worth trying, such as online training courses. In this regard, Google Cloud is production-ready so that you can get the most presentable solutions. Mozilla TTS is open source and definitely far more interesting but needs more time to develop. Do you think there are other ways to use TTS?

#tech #NaturalLanguageProcessing #OpenSource #Programming

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu

Feature image

For most natural language processing projects, the steps are quite straightforward. First, get a data set. Next, train a model for the task you want and fine-tune it. Once you’re happy, deploy and demo the product! In recent years, more and more powerful transformers like BERT and GPT-3 might make these steps more manageable, but you’re still doing the same things.

However, many libraries have default settings which allow you to play with their capabilities quickly. I did this once with spaCy, and while it was not perfect, it was still fascinating.

Today’s experiment is with Hugging Face’s transformers, the most exciting place on the internet for natural language processing. It’s a great place to experience cutting-edge NLP models like BERT, RoBERTa and GPT-2. It’s also open source! (Read: free)

I will play with my own data and hopefully also provide an introduction to various NLP tasks available.

A quick installation and a run

We would like quick settings so that we can focus on experimenting and having fun. If we had to follow a 50-step process to get a job going, most people would be terrified. Thankfully, the folks at Hugging Face were able to condense the work into a pipeline, so a few lines of work is all you really need.

The main resource to refer to is their notebook on pipelines. For readers who won’t read it, the steps are outlined generally below, and pulled together in a short sketch after the list.

  1. Install Hugging Face transformers: pip install transformers
  2. Import the pipeline into your code or notebook: from transformers import pipeline
  3. Start the pipeline: (example) nlp_token_class = pipeline('ner')
  4. Call the pipeline on the text you want processed: nlp_token_class('Here is some example text')
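
Put together, the experiments below boil down to a few lines like this sketch; the exact calls in the original notebook may differ slightly.

    # A short sketch of the pipelines used in the experiments below;
    # the exact calls in the original notebook may differ slightly.
    from transformers import pipeline

    text = ("In the circumstances, the Deputy Commissioner for Personal Data Protection "
            "found the Organisation in breach of sections 11(3), 12 and 24 of the PDPA.")

    ner = pipeline("ner")
    print(ner(text))

    qa = pipeline("question-answering")
    print(qa(question="Which sections of the PDPA did the Organisation breach?", context=text))

    summarizer = pipeline("summarization")
    print(summarizer(text, max_length=50))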

Sample texts

We are going to run three sample texts through the transformers.

First up is the conclusion of Henry Park Primary School Parents’ Association, a summary decision from 11 February 2020. I have also used the full text of the summary in some cases.

In the circumstances, the Deputy Commissioner for Personal Data Protection found the Organisation in breach of sections 11(3), 12 and 24 of the PDPA. In determining the directions, the Deputy Commissioner took into consideration that the Organisation was a volunteer organisation made up primarily of parents. The Organisation is directed to, within 60 days, (i) appoint one or more individuals to be responsible for ensuring that it complies with the PDPA, (ii) develop and implement internal data protection and training policies, and (iii) to put all volunteers handling personal data through data protection training.

Next, a passage from Society of Tourist Guides, a decision from 9 January 2020, which refers to an earlier decision.

The Commission’s investigations also revealed that security testing had never been conducted since the launch of the Website in October 2018. In this regard, the Organisation admitted that it failed to take into consideration the security arrangements of the Website due to its lack of experience. As observed in WTS Automotive Services Pte Ltd [2018] SGPDPC 26 at [24], while an organisation may not have the requisite level of technical expertise, a responsible organisation would have made genuine attempts to give proper instructions to its service providers. The gravamen in the present case was the Organisation’s failure to do so.

Our final paragraph comes from Horizon Fast Ferry on 2 August 2019. It has some technical details of a breach.

Unbeknownst to the Organisation and contrary to its intention, the Contractor replicated the auto-retrieval and auto-population feature (which was only meant to be used in the internal CCIS) in the Booking Site as part of the website revamp. Consequently, whenever a user entered a passport number which matched a Returning Passenger’s passport number in the Database, the system would automatically retrieve and populate the remaining fields in the Booking Form with the Personal Data Set associated with the Returning Passenger’s passport number. As the Organisation failed to conduct proper user acceptance tests before launching the revamped Booking Site, the Organisation was not aware of this function until it was notified of the Incident.

Named Entity Recognition (NER)

A common task is to pick up special entities being referred to in the text. The normal examples are places and people, like Obama or Singapore. Picking up such entities can give us some extra context to the sentence, which is great for picking up more relationships. For example, “The organisation has to comply with section 24 of the PDPA”, would allow us to say that the organisation has to comply with a law called “section 24 of the PDPA”.

In Henry Park, the default settings failed to pick up entities correctly. All the words it flagged were identified as organisations, and a cursory look at the text does not agree. The provisions of the PDPA were also not identified as a law.

[{'word': 'Personal', 'score': 0.533231258392334, 'entity': 'I-ORG', 'index': 9}, {'word': 'Data', 'score': 0.7411562204360962, 'entity': 'I-ORG', 'index': 10}, {'word': 'Protection', 'score': 0.9116974472999573, 'entity': 'I-ORG', 'index': 11}, {'word': 'Organisation', 'score': 0.77486252784729, 'entity': 'I-ORG', 'index': 14}, {'word': 'PD', 'score': 0.6492624878883362, 'entity': 'I-ORG', 'index': 29}, {'word': 'Organisation', 'score': 0.6983053684234619, 'entity': 'I-ORG', 'index': 45}, {'word': 'Organisation', 'score': 0.7912665605545044, 'entity': 'I-ORG', 'index': 57}, {'word': 'PD', 'score': 0.4049902558326721, 'entity': 'I-ORG', 'index': 86}]

Things don’t improve with Society of Tourist Guides. It got the “Organisation” as an organisation (duh), but it appeared to be confused with the citation.

[{'word': 'Commission', 'score': 0.9984033703804016, 'entity': 'I-ORG', 'index': 2}, {'word': 'Organisation', 'score': 0.8942610025405884, 'entity': 'I-ORG', 'index': 31}, {'word': 'Web', 'score': 0.6423472762107849, 'entity': 'I-MISC', 'index': 45}, {'word': 'W', 'score': 0.9225642681121826, 'entity': 'I-ORG', 'index': 57}, {'word': '##TS', 'score': 0.9971287250518799, 'entity': 'I-ORG', 'index': 58}, {'word': 'Auto', 'score': 0.9983740448951721, 'entity': 'I-ORG', 'index': 59}, {'word': '##mot', 'score': 0.9857855439186096, 'entity': 'I-ORG', 'index': 60}, {'word': '##ive', 'score': 0.9949991106987, 'entity': 'I-ORG', 'index': 61}, {'word': 'Services', 'score': 0.9982504844665527, 'entity': 'I-ORG', 'index': 62}, {'word': 'P', 'score': 0.9859911799430847, 'entity': 'I-ORG', 'index': 63}, {'word': '##te', 'score': 0.988381564617157, 'entity': 'I-ORG', 'index': 64}, {'word': 'Ltd', 'score': 0.9895501136779785, 'entity': 'I-ORG', 'index': 65}, {'word': 'S', 'score': 0.9822579622268677, 'entity': 'I-ORG', 'index': 69}, {'word': '##GP', 'score': 0.9837730526924133, 'entity': 'I-ORG', 'index': 70}, {'word': '##DP', 'score': 0.9856312274932861, 'entity': 'I-ORG', 'index': 71}, {'word': '##C', 'score': 0.8455315828323364, 'entity': 'I-ORG', 'index': 72}, {'word': 'Organisation', 'score': 0.682020902633667, 'entity': 'I-ORG', 'index': 121}]

For Horizon Fast Ferry , it picked up the Organisation as an Organisation but fared poorly otherwise.

[{'word': 'Organisation', 'score': 0.43128490447998047, 'entity': 'I-ORG', 'index': 8}, {'word': 'CC', 'score': 0.7553273439407349, 'entity': 'I-ORG', 'index': 43}, {'word': '##IS', 'score': 0.510469913482666, 'entity': 'I-MISC', 'index': 44}, {'word': 'Site', 'score': 0.5614328384399414, 'entity': 'I-MISC', 'index': 50}, {'word': 'Database', 'score': 0.8064141869544983, 'entity': 'I-MISC', 'index': 80}, {'word': 'Set', 'score': 0.7381612658500671, 'entity': 'I-MISC', 'index': 102}, {'word': 'Organisation', 'score': 0.5941202044487, 'entity': 'I-ORG', 'index': 115}, {'word': 'Site', 'score': 0.38454076647758484, 'entity': 'I-MISC', 'index': 132}, {'word': 'Organisation', 'score': 0.5979025959968567, 'entity': 'I-ORG', 'index': 135}]

Conclusion: it’s rubbish, and you probably have to train your own model with its own domain-specific information.

Question Answering

Question Answering is more fun. The model considers a text (the context) and generates text based on a question. Quite like filling in a blank but a tad less conscious.

I’m more hopeful for this one because the answer should be findable in the text. The model thus has to infer the answer by comparing the question and the context. Anyway, Transformers warns us again that we probably have to train our own model.

Easy tasks

For Henry Park, let’s ask an easy question: Which sections of the PDPA did the Organisation breach?

{'score': 0.8479166372224196, 'start': 122, 'end': 138, 'answer': '11(3), 12 and 24'}

The output gave the right answer! It was also about 84% sure it was right. Haha! Note that the model extracted the answer from the text.

For Society of Tourist Guides, let’s ask something a tad more difficult. Why did the Organisation not conduct security testing?

{'score': 0.43685166522863683, 'start': 283, 'end': 303, 'answer': 'lack of experience.'}

I agree with the output! The computer was not so sure though (43%).

For Horizon Fast Ferry, here’s a question everyone is dying to know: What happens if a user enters a passport number?

{'score': 0.010287796461319285, 'start': 393, 'end': 470, 'answer': 'automatically retrieve and populate the remaining fields in the Booking Form'}

Errm, the model was looking at the right place, but failed to include the second part of the clause. The answer may be misleading to someone with no access to the main text. However, the model was only 1% sure of its answer. For most Singaporean students, 1% confidence means you are going to fail your examinations.

Tweaking the model to give a longer answer allowed it to give the right answer with 48% confidence.
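
One way to allow a longer answer is the question-answering pipeline's max_answer_len argument (the default is 15 tokens); this may not be the exact tweak used, but the idea is roughly as sketched.

    # A sketch of allowing a longer answer span with max_answer_len;
    # this may not be the exact tweak used in the original experiment.
    from transformers import pipeline

    horizon_text = "..."  # stands in for the Horizon Fast Ferry paragraph quoted above

    qa = pipeline("question-answering")
    result = qa(question="What happens if a user enters a passport number?",
                context=horizon_text,
                max_answer_len=30)  # default is 15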

Now for something more difficult

I also experimented with answering a question from a long text. As the default pipeline has a 512-word/token limit, I downloaded the Longformer model (all 2 GB of it), a model built for longer texts.

I ran the whole Henry Park decision text (which is still comparatively shorter than most decisions) through the model with the same question. However, the model failed to give the right answer. It gave its (wrong) response a probability score of “2.60319e-05”. You wouldn’t call that confident.

Theoretically, this isn’t surprising because a longer text means there are more possible answers. More possible answers means the probability of finding a wrong answer is much higher. This probably needs a creative workaround in order to be useful.

Conclusion: for easy tasks, even the default pipeline can give surprisingly good results. However, the default pipeline can only accept a maximum of 512 tokens/words, which means it can only work on small paragraphs, and most legal texts are far longer than that.

Summarisation

We can also ask a computer to summarise another text. It’s a bit of a mystery to me how a computer can summarise a text. How does a computer decide what to take out and what to leave behind? Let’s take it for a whirl now anyway.

Henry Park excerpt (max length=50):

'the Deputy Commissioner for Personal Data Protection found the organisation in breach of sections 11(3), 12 and 24 of the PDPA . the organisation is directed to, within 60 days, appoint one or more individuals to be responsible'

Society of Tourist Guides (max length=50):

'security testing had never been conducted since launch of the Website in October 2018 . the organisation admitted it failed to take into consideration security arrangements due to its lack of experience . a responsible organisation would have made genuine attempts to give proper instructions to'

Horizon Fast Ferry:

'the Contractor replicated the auto-retrieval and auto-population feature in the Booking Site as part of the website revamp . when a user entered a passport number which matched a Returning Passenger’'

Henry Park full summary:

'the organisation had a website at https://hppa.org.sg (the “Website”) where members could view their own account particulars upon logging in using their assigned user ID and password . on 15 March 2019, the Personal Data Protection Commission received a complaint . the organisation failed to conduct adequate security testing before launching the Website .'

Conclusion: I think the summaries are OK, but they wouldn’t pass the Turing test. Summarisation appears to work better on longer texts, as long as they don’t exceed 512 words. That is not long enough for almost all PDPC decisions. Would training help? That’s something to explore.

Conclusion

Using the default pipelines from Hugging Face provided a quick taste of natural language processing tasks using state-of-the-art transformers. It’s really satisfying to see the model extract good answers from a text. I liked it too because it introduced me to the library.

However, our simple experiments also reveal that much work needs to be done to improve its performance. Training a model for the task is one such undertaking. Furthermore, given the limitations of some models, such as the length of the text the model can process, alternative workarounds may be needed to enjoy the models. I’ll take this for now though!

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

#NaturalLanguageProcessing #Programming

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu

Feature image

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

This post is the latest one dealing with creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. If I don’t turn out to be an ardent perfectionist, I might have a working corpus.

Problem Statement

I must have written this a dozen times here already. Law reports in Singapore are unfortunately generally available in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. You have to work with PDFs to create a solution that is probably not burdened with copyrights.

However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer. Thus a PDF comprises a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor merely reads the text boxes, the results will not look pretty. See an example below:

    The operator, in the mistaken belief that the three statements belonged
     to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
     moved it to the main bin. Further, the operator completed the QC form
     in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
     Page 3 of 7
     
     B.
     (i)
     (ii)
     9.
     (iii)
     envelopes tallied with the expected total from the run. As the envelope
     was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
     were by-passed, and the envelope was sent out without anyone realising
     that it contained two extra statements. The manual completion of the QC
     form by the operator to show that the number of successful and rejected
     envelopes tallied allowed this to go undetected. 

Those lines are broken up with new lines by the way. So the computer reads the line in a way that showed that the number of “successful” and rejected, and then wonders “rejected” what?! The rest of the story continues about seven lines away. None of this makes sense to a human, let alone a computer.

Previous workarounds were... unimpressive

Most beginner data science books advise programmers to use regular expressions to clean the text. This allowed me to choose what to keep and what to reject in the text. I then joined up the lines to form a paragraph. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.

As mentioned in that post, I was very bothered by removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack-a-mole, figuring out which regular expression to use to remove a line. The results were not good enough for me, as several errors continued to persist in the text.

Not wanting to play whack-a-mole, I decided to train a model to read lines and make a guess as to what lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.

However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting results I would have obtained if I had worked harder at regular expressions. This method was still not satisfactory for me.

I needed a new approach.

A breakthrough — focus on the layout rather than the content

I was browsing Github issues on PDFMiner when it hit me like a brick. The issue author had asked how to get the font data of a text on PDFMiner.

I then realised that I had another option other than removing text more effectively or efficiently.

Readers don’t read documents by reading lines and deciding whether to ignore them or not. The layout — the way the text is arranged and the way it looks visually — informs the reader about the structure of the document.

Therefore, if I knew how the document was laid out, I could determine its structure based on observing its rules.

Useful layout information to analyse the structure of a PDF.

Rather than relying only on the content of the text to decide whether to keep or remove text, you now also have access to information about what it looks like to the reader. The document’s structure is now available to you!

Information on the structure also allows you to improve the meaning of the text. For example, I replaced a footnote marker with the footnote's actual text, so that the footnote sits close to where it was supposed to be read rather than at the bottom of the page. This makes the text more meaningful to a computer.

Using PDFMiner to access layout information

To access the layout information of the text in a PDF, unfortunately, you need to understand the inner workings of a PDF document. Fortunately, PDFMiner simplifies this and provides it in a Python-friendly manner.

Your first port of call is to extract the page of the PDF as an LTPage. You can use the high level function extract_pages for this.

Once you extract the page, you will be able to access the text objects as a list of containers. That’s right — using list comprehension and generators will allow you to access the containers themselves.

Once you have access to each container in PDFMiner, the metadata can be found in its properties.
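
A small sketch of what that looks like in practice; the file name is a placeholder.

    # A small sketch: print the bounding box and text of each text container on
    # the first page. "decision.pdf" is a placeholder file name.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages("decision.pdf", page_numbers=[0]):
        for container in page:
            if isinstance(container, LTTextContainer):
                print(round(container.x0), round(container.y0), round(container.y1),
                      container.get_text().strip())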

The properties of a LTPage reveal its layout information.

It is not apparent from the documentation, so studying the source code itself can be very useful.

Code Highlights

Here are some highlights and tips from my experience so far implementing this method using PDFMiner.

Consider adjusting LAParams first

Before trying to rummage through your text boxes in the PDF, pay attention to the layout parameters which PDFMiner uses to extract text. Parameters like line, word, and char margin determine how PDFMiner groups texts together. Effectively setting these parameters can help you to just extract the text.

Notwithstanding this, I did not use LAParams much for this project. This is because the PDFs in my collection can be very arbitrary in terms of layout. Asking PDFMiner to generalise in this situation did not lead to satisfactory results. For example, PDFMiner would try to join lines together in some instances and fail to in others. As such, processing the text boxes line by line was safer.

Retrieving text margin information as a list

As mentioned in the diagram above, margin information can be used to separate the paragraph numbers from their text. The original method of using a regular expression to detect paragraph numbers had the risk of raising false positives.

The x0 property of an LTTextContainer represents its left co-ordinate, which is its left margin. Assuming you had a list of LTTextContainers (perhaps extracted from a page), a simple list comprehension will get you all the left margins of the text.

    from collections import Counter
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams
    
    limit = 1  # margins that occur no more than this many times are ignored
    # "pdf" is the path to the PDF file; extract_pages yields LTPage objects,
    # so take the first (and only) page requested.
    first_page = next(extract_pages(pdf, page_numbers=[0], laparams=LAParams(line_margin=0.1, char_margin=3.5)))
    containers = [container for container in first_page if isinstance(container, LTTextContainer)]
    text_margins = [round(container.x0) for container in containers]
    c = Counter(text_margins)
    sorted_text_margins = sorted([margin for margin, count in c.most_common() if count > limit])

Counting the number of times a text margin occurs is also useful to eliminate elements that do not belong to the main text, such as titles and sometimes headers and footers.

You also get a hierarchy by sorting the margins from left to right. This is useful for separating first-level text from the second-level, and so on.
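
For example, a small helper (hypothetical, not taken from my repository) could map a container back to its level in that hierarchy:

    def margin_level(container, sorted_text_margins, tolerance=1):
        # Return the indentation level of a text container (0 = leftmost margin),
        # or None if its left edge does not match any of the common margins.
        for level, margin in enumerate(sorted_text_margins):
            if abs(round(container.x0) - margin) <= tolerance:
                return level
        return None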

Using the gaps between lines to determine if there is a paragraph break

As mentioned in the diagram above, the gap between lines can be used to determine if the next line is a new paragraph.

In PDFMiner, as in PDF, the x/y coordinate system of a page starts from its lower-left corner (0,0), so y values increase as you move up the page. The gap between the current container and the one immediately before it (the line above) is the distance between the previous container’s y0 (its base) and the current container’s y1 (its top). Conversely, the gap between the current container and the one immediately after (the line below) is the distance between the current container’s y0 (its base) and the next container’s y1 (its top).

Therefore, if we have a list of LTTextContainers, we can write a function to determine if there is a bigger gap.

    from typing import List
    from pdfminer.layout import LTTextContainer
    
    def check_gap_before_after_container(containers: List[LTTextContainer], index: int) -> bool:
        # Containers are assumed to be in reading order (top of the page first).
        # PDF y-coordinates increase upwards, so the gap between two containers is
        # the bottom (y0) of the upper one minus the top (y1) of the lower one.
        gap_before = round(containers[index - 1].y0 - containers[index].y1)
        gap_after = round(containers[index].y0 - containers[index + 1].y1)
        return gap_after >= gap_before

Therefore, if the function returns True, we know that the current container is the last line of a paragraph. I would then save the paragraph and start a new one, keeping the paragraph content as close to the original as possible.
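
Building on that function, here is a sketch of how lines might be folded into paragraphs, assuming containers holds the text containers of a page in reading order:

    # A sketch only: the first and last containers on a page are handled crudely.
    paragraphs, current_lines = [], []
    for index, container in enumerate(containers):
        current_lines.append(container.get_text().strip())
        at_page_end = index == len(containers) - 1
        if at_page_end or (index > 0 and check_gap_before_after_container(containers, index)):
            paragraphs.append(" ".join(current_lines))
            current_lines = []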

Retrieve footnotes with the help of the footnote separator

As mentioned in the diagram, footnotes are generally smaller than the main text, so you could get footnotes by comparing the height of the main text with that of the footnotes. However, this method is not so reliable because some small text may not be footnotes (figure captions?).

For this document, the footnotes are contained below a separator, which looks like a line. As such, our first task is to locate this line on the page. In my PDF file, another layout object, LTRect, describes lines. An LTRect with a height of less than 1 point appears not as a rectangle, but as a line!

    from pdfminer.layout import LTRect
    
    # "page" is the LTPage and "text_margins" the list of margins from the earlier snippet.
    if footnote_line_container := [container for container in page if all([
        isinstance(container, LTRect),           # lines are described by LTRect objects
        container.height < 1,                    # a rectangle this thin looks like a line
        container.y0 < 350,                      # only look at the bottom half of the page
        container.width < 150,                   # the separator is a short line
        round(container.x0) == text_margins[0]   # and starts at the leftmost text margin
    ])]:
        footnote_line_container = sorted(footnote_line_container, key=lambda container: container.y0)
        footnote_line = footnote_line_container[0].y0  # the lowest matching line on the page

Once we have obtained the y0 coordinate of the LTRect, we know that any text box under this line is a footnote. This accords with the intention of the maker of the document, which is sometimes Microsoft Word!

You might notice that I have placed several other conditions to determine whether a line is indeed a footnote marker. It turns out that LTRects are also used to mark underlined text. The extra conditions I added are the length of the line (container.width < 150), whether the line is in the bottom half of the page (container.y0 < 350), and that it is in line with the leftmost margin of the text (round(container.x0) == text_margins[0]).

For Python programming, I also found the built-in all() and any() useful for keeping the code readable when several conditions are involved.

I also liked using a feature introduced in Python 3.8: the walrus operator! The code above might not work for you if you are on an older version of Python.
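
Putting the pieces together, once the separator’s y0 is known, picking out the footnotes is straightforward. This sketch assumes the containers and footnote_line variables from the snippets above:

    # Any text container whose top edge sits below the separator is a footnote;
    # the rest form the main text.
    footnotes = [container for container in containers if container.y1 < footnote_line]
    main_text = [container for container in containers if container.y1 >= footnote_line]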

Conclusion

Note that we are reverse-engineering the PDF document. This means that getting perfect results is very difficult, and you would probably need to try several times to deal with the side effects of your rules and conditions. The code which I developed for pdpc-decisions can look quite complicated and it is still under development!

Given a choice, I would prefer using documents where the document’s structure is apparent (like web pages). However, such an option may not be available depending on your sources. For complicated materials like court judgements, studying the structure of a PDF can pay dividends. Hopefully, some of the ideas in this post will get you thinking about how to exploit such information more effectively.

#PDPC-Decisions #NaturalLanguageProcessing #PDFMiner #Python #Programming

Love.Law.Robots. – A blog by Ang Hou Fu


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Avid followers of Love Law Robots will know that I have been hard at work creating a corpus of Personal Data Protection Commission decisions. Downloading and pre-processing them has taken a lot of work! However, it has helped me create interesting charts that show insights at a macro level. How many decisions are released in a year, and how long are they? Which decisions refer to each other in a network?

Unfortunately, what I would really like to do is natural language processing. A robot should analyse text and draw conclusions from it. This is much closer to the bread and butter of what lawyers do. I have been poking around spaCy, but using its regular expression function doesn’t really cut it.

This is not going to be the post where I say I trained a model to tell me what the ratio decidendi of a decision is. Part of the difficulty is finding a problem that is solvable given my current learning. So I have picked something that is useful and can be implemented fast.

The Problem

The biggest problem I have is that the decisions, like many other judgements produced by Singapore courts, are in PDF. This looks great on paper but is gibberish to a computer. I explained this problem in an earlier post about pre-processing.

Get rid of the muff: pre-processing PDPC Decisions – Love.Law.Robots. (Houfu)

Having seen how the PDF extraction tool does its work, you can figure out which lines you want or don’t want. You don’t want empty lines. You don’t want lines with just numbers on them (these are usually page numbers). Citations? One-word indexes? The commissioner’s name? You can’t exactly think of all the various permutations and then brainstorm regular expression rules to cover all of them.

It becomes a game of whack-a-mole.

Training a Model for the win

It was during one of those rage-filled “how many more things do I have to do to improve this” nights when it hit me.

“I know what lines I do not want to keep. Why don’t I just tell the computer what they are instead of abstracting the problem with regular expressions?!”

Then I suddenly remembered machine learning. Statistically speaking, the robot, after learning which lines I would keep or remove, could make a guess. If the robot could guess right most of the time, I would no longer have to enumerate every case with regular expressions.

So, I got off my chair, selected dozens of PDFs and converted them into text. Then I split the text into lines in a CSV file and started classifying them.

Classification of lines for training

I managed to compile a list of over five thousand lines for my training and test data. After that, I lifted the training code from spaCy’s documentation to train the model. My MacBook Pro’s fans got noisy, but it was done in a matter of minutes.
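
For a flavour of what this involves, here is a minimal text classification training loop in the style of the spaCy 2.x documentation (the API current at the time); the examples below are placeholders, not my actual training data:

    import random
    import spacy
    from spacy.util import minibatch

    # Placeholder training data: (line, annotations) pairs built from the CSV of
    # classified lines. The real data set had over five thousand lines.
    TRAIN_DATA = [
        ("Hello.", {"cats": {"keep": 1.0, "remove": 0.0}}),
        ("[2019] SGPDPC 18", {"cats": {"keep": 0.0, "remove": 1.0}}),
    ]

    nlp = spacy.blank("en")
    textcat = nlp.create_pipe("textcat")
    textcat.add_label("keep")
    textcat.add_label("remove")
    nlp.add_pipe(textcat)

    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)

    # The trained model scores each new line against both labels.
    print(nlp("YEONG ZEE KIN").cats)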

Asking the model to classify sentences gave me the following results:

Text | Remove or Keep
Hello. | Keep
Regulation 10(2) provides that a contract referred to in regulation 10(1) must: | Keep
YEONG ZEE KIN | Remove
[2019] SGPDPC 18 | Remove
transferred under the contract”. | Keep
There were various internal frameworks, policies and standards which apply to | Keep
(vi) | Remove

By applying it to text extracted from the PDF, we can get a resulting document which can be used in the corpus. You can check out the code used for this in the Github Repository under the branch “line_categoriser”.
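
In rough outline (this is a sketch, not the actual line_categoriser code), applying the model to filter extracted lines could look like this:

    def keep_lines(nlp, lines):
        # Keep only the lines the model scores higher for "keep" than "remove".
        kept = []
        for line in lines:
            cats = nlp(line).cats
            if cats["keep"] > cats["remove"]:
                kept.append(line)
        return kept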

houfu/pdpc-decisions – Data Protection Enforcement Cases in Singapore (GitHub)

Conclusion

Will I use this for production purposes? Nope. When I ran some decisions through this process, the effectiveness was unfortunately about the same as using regular expressions. The model, which weighs nearly 19MB, also took noticeably longer to process a series of strings.

My current thinking on this specific problem is to take a different approach. It would involve studying PDF internals and observing things like font size and styles to determine whether a line is a header or a footnote. It would also make it easier to join lines of the same style into a paragraph. Unfortunately, that is homework for another day.

Was it a wasted adventure? I do not think so. Ultimately, I wanted to experiment, and embarking on a task I could complete in a week of nights gave me real insight into whether I could do it, and into the limitations of machine learning for certain tasks.

So, hold on to your horses, I will be getting there much sooner now.

#PDPC-Decisions #spaCy #NaturalLanguageProcessing

Love.Law.Robots. – A blog by Ang Hou Fu


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

As mentioned in my previous post, I have not been able to spend time writing as much code as I wanted. I had to rewrite a lot of code due to the layout change in the PDPC Website. That was not the post I wanted to write. I have finally been able to write about my newest forays for this project.

Enforcement information

I had noticed that the summaries provided by the Personal Data Protection Commission were an easy place to cull basic information from. So, I have added enforcement information: decisions now tell you whether a financial penalty or a warning was meted out.

Information is extracted from the summaries using spaCy’s rule-based Matcher. It isn’t perfect; some text does not really fit the mould. However, due to the way the summaries are written, the information is mostly extracted accurately.

Visualising the parts of speech in a typical sentence can allow you to write rules to extract information.
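
As an illustration only, a toy rule in the spaCy 2.x Matcher API (not the actual patterns used in pdpc-decisions) might pick out a financial penalty like this:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # A made-up rule: "financial penalty of $<number>"
    pattern = [
        {"LOWER": "financial"},
        {"LOWER": "penalty"},
        {"LOWER": "of"},
        {"ORTH": "$"},
        {"LIKE_NUM": True},
    ]
    matcher.add("FINANCIAL_PENALTY", None, pattern)

    doc = nlp("A financial penalty of $5,000 was imposed on the organisation.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # financial penalty of $5,000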

This is the first time I have used spaCy or any natural language processing for this purpose. Remarkably, it has been fast. Culling this information (as well as the other extra features) only added about two hundred seconds to building a database from scratch. I would like to find more avenues to use these newfound techniques!

spaCy · Industrial-strength Natural Language Processing in Python – a free open-source library for NLP in Python, featuring NER, POS tagging, dependency parsing, word vectors and more.

References

Court decisions are special in that they often require references to leading cases. This is because they are either binding (stare decisis) or persuasive to the decision maker. Of course, previous PDPC Decisions are not binding on the PDPC. Lately, respondents have been referring to the body of cases to argue they should be treated alike. I have not read a decision where this argument has worked.

Nevertheless, the network of cases referring to and referred by other cases offers remarkably interesting insights. In effect, we are looking at a social network of cases. To establish a point, the Personal Data Protection Commission does refer to earlier cases, and all things being equal, a case with more references is more influential.

pdpc-decisions now reads the text of each decision to create a list of the decisions it refers to (“referring to”). From that list, we can also create, for each decision, a list of the decisions which make references to it (“referred by”). Because of the haphazard way the PDPC has been writing its decisions and citations, this is not perfect either, but it is still reasonably accurate.
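
As a rough sketch (with placeholder decision names, not the actual pdpc-decisions code), the “referred by” list is simply the reverse of the “referring to” mapping:

    from collections import defaultdict

    # Placeholder data: each decision mapped to the decisions it refers to.
    referring_to = {
        "Decision B": ["Decision A"],
        "Decision C": ["Decision A", "Decision B"],
    }

    referred_by = defaultdict(list)
    for decision, references in referring_to.items():
        for cited in references:
            referred_by[cited].append(decision)

    print(referred_by["Decision A"])  # ['Decision B', 'Decision C']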

As I mentioned, compiling a network of decisions can offer some interesting insights. So here it is — the social network of PDPC decisions.

I guess this is the real pdpc decisions in one chart

Update (24/4/2020): The chart was lumping together the Aviva case in 2018 with the Aviva case in 2017. The graph has been updated. Not much has changed in the big picture though.

Of course, a more advanced visualisation tool would allow you to drill down to see which cases are more influential. However, a big diagram like the above shows you which are the big boys in this social network.

Before I leave this section, here’s a fun fact to take home. Based on the computer’s analysis, over 68% of PDPC decisions refer to one another. That’s a lot of chatter!

Moving On

I keep thinking I have finished my work here, but there always seem to be new things coming up. Here is some interesting information I would like to find out:

  • In a breach, how many data subjects were affected? How exactly does this affect the penalty given by the PDPC?
  • What kinds of information are most often involved in a data breach? How does this affect the penalty given by the PDPC?
  • How long does it take for the PDPC to complete investigations? Can we create a timeline from the information in a decision?

You would just have to keep watching this space! What kind of information would be interesting to you?

#PDPC-Decisions #NaturalLanguageProcessing #spaCy

Love.Law.Robots. – A blog by Ang Hou Fu