Love.Law.Robots. by Ang Hou Fu

Programming


Speaking has always been a big part of being a lawyer. You use your voice to make submissions in the highest courts of the land. Even in client meetings, you are also using your voice to persuade. Hell, when I write my emails, I imagine saying what I am writing to make sure it is in my voice.

So, thinking about how a synthesized voice can be useful is going to be controversial. You might think that a computer's voice is soulless and not interesting enough to hold its own against a lawyer's. However, with advances led by smart assistants like Google Home and Siri, Text to Speech (TTS) is certainly worth exploring.

Why use robots?

Talking is really convenient: you just open your mouth and start talking (though some babies will disagree). However, working from home shows how difficult it can be to record and transmit good-quality sound. Feedback and distortions are just some of the problems people regularly face using basic equipment for online meetings. It's frustrating.

If you think better equipment resolves this issue, note that it can get expensive very quickly. You might notice that several people are involved in producing your favourite podcast. You are going to need all sorts of equipment, like microphones and DAC mixers. Hire engineers? What does a mixer do, actually?

Furthermore, human performance is subject to various imperfections. The pitch or tone is not right here. Sometimes you lose concentration or get interrupted in the middle of your speech. All this means you may have to record something several times before you get a delivery you are happy with. If you aren't confident about your English or would like to say something in another language, getting a computer to voice it can help you overcome that.

So a synthesized voice can be cheap, fast and consistent. If the quality is good enough, you can focus on the script. For me, I am interested in improving the quality of my online training. Explaining stuff doesn't need Leonard Cohen quality delivery. It's probably far less distracting anyway.

Experiments with TTS

I will take two major Text to Speech (TTS) solutions for a spin: Google Cloud's TTS and Mozilla's TTS (open source). The Python code for these experiments is in my GitHub repository.

GitHub: houfu/TTS-experiments

Google Cloud

It's quite easy to try Google Cloud's TTS. A demo allows you to set the text and then convert it with a click of a button. If you want to know how it sounds, try it!

Text-to-Speech: Lifelike Speech Synthesis | Google Cloud: turn text into natural-sounding speech in 220+ voices across 40+ languages and variants with an API powered by Google's machine learning technology.

To generate audio files, you're going to need a Google Cloud account and some programming knowledge. However, it's pretty straightforward, and I mostly copied from the quickstart. You can hear the first two paragraphs of this blog post here.
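For the record, the conversion itself is only a few lines of code. Here is a minimal sketch based on the quickstart; it assumes you have the google-cloud-texttospeech client library installed and your credentials set up, and the voice name is just one I picked for illustration:

    from google.cloud import texttospeech

    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.SynthesisInput(text="Speaking has always been a big part of being a lawyer.")
    voice = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Wavenet-D")
    audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

    # The API returns the audio as bytes, which we write straight to a file
    response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)
    with open("output.mp3", "wb") as out:
        out.write(response.audio_content)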

Here's my personal view of Google Cloud's TTS:

  • It ain't free. Premium voices are free only for the first 1 million characters. After that, it's a hefty USD16 for every 1 million characters. For context, the first two paragraphs of this blog post have 629 characters, so unless you are converting a lot of text, it's hard to bust that limit.
  • The voices sound nice, in my opinion. However, listening to them for a long time can be tiring.
  • Developer experience is great, and as you can see, converting lines of text to speech is straightforward.

Mozilla's TTS

Using Mozilla's TTS, you get much closer to the machine-learning side of text to speech. This includes training your own model, that is, if you have roughly 24 hours of recordings of your voice to spare.

GitHub: mozilla/TTS, deep learning for Text to Speech (discussion forum: https://discourse.mozilla.org/c/tts)

However, for this experiment, we don't need that, as we will use pre-trained models. Using Python's built-in subprocess module, we can run the command-line tool that comes with the package, which generates wave files. You can hear the first two paragraphs of this blog post here.
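The subprocess call looks roughly like this. The exact command name and flags depend on the version of Mozilla's TTS and the pre-trained model you pick, so treat these arguments as placeholders rather than the precise invocation I used:

    import subprocess

    text = "Speaking has always been a big part of being a lawyer."

    subprocess.run(
        [
            "tts",                       # the command-line tool that ships with the package (name may differ by version)
            "--text", text,              # the text to synthesize
            "--out_path", "output.wav",  # where to write the wave file
        ],
        check=True,  # raise an error if the command fails
    )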

Here's my personal view of Mozilla's TTS:

  • It's open-source, and I am partial to open source.
  • It also teaches you how to train a model using your own voice. So, this is a possibility.
  • It sounds terrible, but that's mostly because the audio is more varied than Google's: some parts sound louder, which makes other parts sound softer. There is also quite a lot of noise, which may be due to the recording quality of the source data. I did normalise the loudness for this sample.
  • Leaving those two points aside, it sounds more interesting to me. The variation feels a tad more natural to me.
  • There aren't characters to choose from (male, female etc.), so this may not be practical.
  • Considering I was not doing much more than running a command line, it was OK. Notably, choosing a pre-trained model was confusing at first, and I had to experiment a lot. Also, depending on what you choose, the model might take a bit of time and computing power to produce audio. It took roughly 15 minutes, and my laptop was wheezing throughout.

Conclusion

If you thought robots would replace lawyers in court, this isn't the post to persuade you. However, thinking further, some use cases are certainly worth trying, such as online training courses. In this regard, Google Cloud is production-ready, so you can get the most presentable results. Mozilla TTS is open source and definitely far more interesting, but it needs more time to develop. Do you think there are other ways to use TTS?

#tech #NaturalLanguageProcessing #OpenSource #Programming


For most natural language processing projects, the steps are quite straightforward. First, get a data set. Next, train a model for the task you want and fine-tune it. Once you’re happy, deploy and demo the product! In recent years, more and more powerful transformers like BERT and GPT-3 might make these steps more manageable, but you’re still doing the same things.

However, many libraries have default settings which allow you to play with their capabilities quickly. I did this once with spaCy, and while it was not perfect, it was still fascinating.

Today’s experiment is with Hugging Face’s transformers, the most exciting place on the internet for natural language processing. It’s a great place to experience cutting-edge NLP models like BERT, RoBERTa and GPT-2. It’s also open source! (Read: free)

I will play with my own data and hopefully also provide an introduction to various NLP tasks available.

A quick installation and a run

We would like quick settings so that we can focus on experimenting and having fun. If we had to follow a 50-step process to get a job going, most people would be terrified. Thankfully, the folks at Hugging Face were able to condense the work into a pipeline, so a few lines of work is all you really need.

The main resource to refer to is their notebook on pipelines. For readers who would rather not click through, the steps are outlined below.

  1. Install Hugging Face transformers: pip install transformers
  2. Import the pipeline into your code or notebook: from transformers import pipeline
  3. Start the pipeline: (example) nlp_token_class = pipeline('ner')
  4. Call the pipeline on the text you want processed: nlp_token_class('Here is some example text')
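Putting the four steps together, a minimal sketch of the named entity recognition pipeline (which we use below) looks like this. The first call downloads a pre-trained model, so it takes a while:

    from transformers import pipeline

    nlp_token_class = pipeline('ner')  # start a named entity recognition pipeline
    result = nlp_token_class('Here is some example text')
    print(result)  # a list of dicts with 'word', 'score', 'entity' and 'index' keys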

Sample texts

We are going to run three sample texts through the transformers.

First up is the conclusion of Henry Park Primary School Parents’ Association, a summary decision from 11 February 2020. I have also used the full text of the summary in some cases.

In the circumstances, the Deputy Commissioner for Personal Data Protection found the Organisation in breach of sections 11(3), 12 and 24 of the PDPA. In determining the directions, the Deputy Commissioner took into consideration that the Organisation was a volunteer organisation made up primarily of parents. The Organisation is directed to, within 60 days, (i) appoint one or more individuals to be responsible for ensuring that it complies with the PDPA, (ii) develop and implement internal data protection and training policies, and (iii) to put all volunteers handling personal data through data protection training.

Next, a paragraph from Society of Tourist Guides, a decision from 9 January 2020, which refers to an earlier decision.

The Commission’s investigations also revealed that security testing had never been conducted since the launch of the Website in October 2018. In this regard, the Organisation admitted that it failed to take into consideration the security arrangements of the Website due to its lack of experience. As observed in WTS Automotive Services Pte Ltd [2018] SGPDPC 26 at [24], while an organisation may not have the requisite level of technical expertise, a responsible organisation would have made genuine attempts to give proper instructions to its service providers. The gravamen in the present case was the Organisation’s failure to do so.

Our final paragraph comes from Horizon Fast Ferry, a decision from 2 August 2019. It has some technical details of a breach.

Unbeknownst to the Organisation and contrary to its intention, the Contractor replicated the auto-retrieval and auto-population feature (which was only meant to be used in the internal CCIS) in the Booking Site as part of the website revamp. Consequently, whenever a user entered a passport number which matched a Returning Passenger’s passport number in the Database, the system would automatically retrieve and populate the remaining fields in the Booking Form with the Personal Data Set associated with the Returning Passenger’s passport number. As the Organisation failed to conduct proper user acceptance tests before launching the revamped Booking Site, the Organisation was not aware of this function until it was notified of the Incident.

Named Entity Recognition (NER)

A common task is to pick up special entities referred to in the text. Typical examples are places and people, like Obama or Singapore. Picking up such entities gives us extra context for a sentence, which is great for extracting relationships. For example, from “The organisation has to comply with section 24 of the PDPA”, we could say that the organisation has to comply with a law called “section 24 of the PDPA”.

In Henry Park, the default settings failed to pick up any entities correctly. All the words it found were identified as organisations, and a cursory look at the text does not agree. The provisions of the PDPA were also not identified as a law.

[{'word': 'Personal', 'score': 0.533231258392334, 'entity': 'I-ORG', 'index': 9}, {'word': 'Data', 'score': 0.7411562204360962, 'entity': 'I-ORG', 'index': 10}, {'word': 'Protection', 'score': 0.9116974472999573, 'entity': 'I-ORG', 'index': 11}, {'word': 'Organisation', 'score': 0.77486252784729, 'entity': 'I-ORG', 'index': 14}, {'word': 'PD', 'score': 0.6492624878883362, 'entity': 'I-ORG', 'index': 29}, {'word': 'Organisation', 'score': 0.6983053684234619, 'entity': 'I-ORG', 'index': 45}, {'word': 'Organisation', 'score': 0.7912665605545044, 'entity': 'I-ORG', 'index': 57}, {'word': 'PD', 'score': 0.4049902558326721, 'entity': 'I-ORG', 'index': 86}]

Things don’t improve with Society of Tourist Guides. It got the “Organisation” as an organisation (duh), but it appeared to be confused by the citation.

[{'word': 'Commission', 'score': 0.9984033703804016, 'entity': 'I-ORG', 'index': 2}, {'word': 'Organisation', 'score': 0.8942610025405884, 'entity': 'I-ORG', 'index': 31}, {'word': 'Web', 'score': 0.6423472762107849, 'entity': 'I-MISC', 'index': 45}, {'word': 'W', 'score': 0.9225642681121826, 'entity': 'I-ORG', 'index': 57}, {'word': '##TS', 'score': 0.9971287250518799, 'entity': 'I-ORG', 'index': 58}, {'word': 'Auto', 'score': 0.9983740448951721, 'entity': 'I-ORG', 'index': 59}, {'word': '##mot', 'score': 0.9857855439186096, 'entity': 'I-ORG', 'index': 60}, {'word': '##ive', 'score': 0.9949991106987, 'entity': 'I-ORG', 'index': 61}, {'word': 'Services', 'score': 0.9982504844665527, 'entity': 'I-ORG', 'index': 62}, {'word': 'P', 'score': 0.9859911799430847, 'entity': 'I-ORG', 'index': 63}, {'word': '##te', 'score': 0.988381564617157, 'entity': 'I-ORG', 'index': 64}, {'word': 'Ltd', 'score': 0.9895501136779785, 'entity': 'I-ORG', 'index': 65}, {'word': 'S', 'score': 0.9822579622268677, 'entity': 'I-ORG', 'index': 69}, {'word': '##GP', 'score': 0.9837730526924133, 'entity': 'I-ORG', 'index': 70}, {'word': '##DP', 'score': 0.9856312274932861, 'entity': 'I-ORG', 'index': 71}, {'word': '##C', 'score': 0.8455315828323364, 'entity': 'I-ORG', 'index': 72}, {'word': 'Organisation', 'score': 0.682020902633667, 'entity': 'I-ORG', 'index': 121}]

For Horizon Fast Ferry, it picked up the Organisation as an organisation but fared poorly otherwise.

[{'word': 'Organisation', 'score': 0.43128490447998047, 'entity': 'I-ORG', 'index': 8}, {'word': 'CC', 'score': 0.7553273439407349, 'entity': 'I-ORG', 'index': 43}, {'word': '##IS', 'score': 0.510469913482666, 'entity': 'I-MISC', 'index': 44}, {'word': 'Site', 'score': 0.5614328384399414, 'entity': 'I-MISC', 'index': 50}, {'word': 'Database', 'score': 0.8064141869544983, 'entity': 'I-MISC', 'index': 80}, {'word': 'Set', 'score': 0.7381612658500671, 'entity': 'I-MISC', 'index': 102}, {'word': 'Organisation', 'score': 0.5941202044487, 'entity': 'I-ORG', 'index': 115}, {'word': 'Site', 'score': 0.38454076647758484, 'entity': 'I-MISC', 'index': 132}, {'word': 'Organisation', 'score': 0.5979025959968567, 'entity': 'I-ORG', 'index': 135}]

Conclusion: it’s rubbish, and you probably have to train your own model with domain-specific information.

Question Answering

Question Answering is more fun. The model considers a text (the context) and produces an answer based on a question. Quite like filling in a blank, but a tad less conscious.

I’m more hopeful for this one because the answer should be findable in the text. The model thus has to infer the answer by comparing the question and the context. Anyway, the Transformers documentation warns us again that we probably have to train our own model.
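Calling the question-answering pipeline is just as simple. A minimal sketch, using the Henry Park excerpt quoted above as the context (the variable name is mine):

    from transformers import pipeline

    nlp_qa = pipeline('question-answering')

    henry_park_text = "In the circumstances, the Deputy Commissioner for Personal Data Protection found ..."  # the excerpt quoted above

    answer = nlp_qa(
        question='Which sections of the PDPA did the Organisation breach?',
        context=henry_park_text,
    )
    print(answer)  # a dict with 'score', 'start', 'end' and 'answer' keys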

Easy tasks

For Henry Park, let’s ask an easy question: Which sections of the PDPA did the Organisation breach?

{'score': 0.8479166372224196, 'start': 122, 'end': 138, 'answer': '11(3), 12 and 24'}

The output gave the right answer! It was also about 84% sure it was right. Haha! Note that the model extracted the answer from the text.

For Society of Tourist Guides, let’s ask something a tad more difficult. Why did the Organisation not conduct security testing?

{'score': 0.43685166522863683, 'start': 283, 'end': 303, 'answer': 'lack of experience.'}

I agree with the output! The computer was not so sure though (43%).

For Horizon Fast Ferry, here’s a question everyone is dying to know: What happens if a user enters a passport number?

{'score': 0.010287796461319285, 'start': 393, 'end': 470, 'answer': 'automatically retrieve and populate the remaining fields in the Booking Form'}

Errm, the model was looking at the right place, but failed to include the second part of the clause. The answer may be misleading to someone with no access to the main text. However, the model was only 1% sure of its answer. For most Singaporean students, 1% confidence means you are going to fail your examinations.

Tweaking the model to give a longer answer allowed it to give the right answer with 48% confidence.
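One way to do that tweak with the default pipeline is its max_answer_len argument, which caps how many tokens the extracted answer may span. The post doesn't say exactly which knob was used, so treat this as one possibility (variable names are mine):

    from transformers import pipeline

    nlp_qa = pipeline('question-answering')
    horizon_fast_ferry_text = "Unbeknownst to the Organisation and contrary to its intention, ..."  # the excerpt quoted above

    # max_answer_len caps how many tokens the extracted answer may span (the default is quite short)
    answer = nlp_qa(
        question='What happens if a user enters a passport number?',
        context=horizon_fast_ferry_text,
        max_answer_len=30,
    )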

Now for something more difficult

I also experimented with answering a question from a long text. As the default pipeline has a 512-token limit, I downloaded the Longformer model (all 2 GB of it), a model built for longer texts.

I ran the whole Henry Park decision text (which is still comparatively shorter than most decisions) through the model with the same question. However, the model failed to give the right answer. It gave its (wrong) response a probability score of “2.60319e-05”. You wouldn’t call that confident.

Theoretically, this isn’t surprising because a longer text means there are more possible answers. More possible answers means the probability of finding a wrong answer is much higher. This probably needs a creative workaround in order to be useful.

Conclusion: For easy tasks, even the default pipeline can give surprisingly good results. However, the default pipeline can only accept a maximum of 512 tokens, which means it can only work on small paragraphs, and most legal texts are far longer than that.

Summarisation

We can also ask a computer to summarise another text. It’s a bit of a mystery to me how a computer can summarise a text. How does a computer decide what text to take out and what to leave behind? Let’s take it for a whirl now anyway.
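The summarisation pipeline follows the same pattern. A minimal sketch, reusing the Henry Park excerpt as input; max_length and min_length bound the length of the generated summary in tokens:

    from transformers import pipeline

    summarizer = pipeline('summarization')
    henry_park_text = "In the circumstances, the Deputy Commissioner for Personal Data Protection found ..."  # the excerpt quoted earlier

    summary = summarizer(henry_park_text, max_length=50, min_length=10)
    print(summary[0]['summary_text'])  # the pipeline returns a list of dicts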

Henry Park excerpt (max length=50):

'the Deputy Commissioner for Personal Data Protection found the organisation in breach of sections 11(3), 12 and 24 of the PDPA . the organisation is directed to, within 60 days, appoint one or more individuals to be responsible'

Society of Tourist Guides (max length=50):

'security testing had never been conducted since launch of the Website in October 2018 . the organisation admitted it failed to take into consideration security arrangements due to its lack of experience . a responsible organisation would have made genuine attempts to give proper instructions to'

Horizon Fast Ferry:

'the Contractor replicated the auto-retrieval and auto-population feature in the Booking Site as part of the website revamp . when a user entered a passport number which matched a Returning Passenger’'

Henry Park full summary:

'the organisation had a website at https://hppa.org.sg (the “Website”) where members could view their own account particulars upon logging in using their assigned user ID and password . on 15 March 2019, the Personal Data Protection Commission received a complaint . the organisation failed to conduct adequate security testing before launching the Website .'

Conclusion: I think the summaries are OK, but they wouldn’t pass the Turing test. Summarisation appears to work better on longer texts, as long as they are not longer than 512 tokens. That is not long enough for almost all PDPC decisions. Would training help? That’s something to explore.

Conclusion

Using the default pipelines from Hugging Face provided a quick taste of natural language processing tasks using state-of-the-art transformers. It’s really satisfying to see the model extract good answers from a text. I liked it too because it introduced me to the library.

However, our simple experiments also reveal that much work needs to be done to improve performance. Training a model for the task is one such undertaking. Furthermore, given the limitations of some models, such as the length of text they can process, alternative workarounds may be needed to make full use of them. I’ll take this for now, though!

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

#NaturalLanguageProcessing #Programming


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

This post is the latest one dealing with creating a corpus out of the decisions of the Personal Data Protection Commission of Singapore. However, this time I believe I got it right. If I don’t turn out to be an ardent perfectionist, I might have a working corpus.

Problem Statement

I must have written this a dozen times here already. Law reports in Singapore are unfortunately generally available only in PDF format. There is a web format accessible from LawNet, but it is neither free as in beer nor free as in libre. You have to work with PDFs to create a solution that is probably not burdened with copyright issues.

However, text extracted from PDFs can look like crap. It’s not the PDF’s fault. PDFs are designed to look the same on any computer. Thus, a PDF comprises a bunch of text boxes and containers rather than paragraphs and headers. If the text extractor is merely reading the text boxes, the results will not look pretty. See an example below:

    The operator, in the mistaken belief that the three statements belonged
     to  the  same  individual,  removed  the  envelope  from  the  reject  bin  and
     moved it to the main bin. Further, the operator completed the QC form
     in  a  way  that  showed  that  the  number  of  “successful”  and  rejected
     Page 3 of 7
     
     B.
     (i)
     (ii)
     9.
     (iii)
     envelopes tallied with the expected total from the run. As the envelope
     was  no  longer  in  the  reject bin, the  second and  third  layers  of  checks
     were by-passed, and the envelope was sent out without anyone realising
     that it contained two extra statements. The manual completion of the QC
     form by the operator to show that the number of successful and rejected
     envelopes tallied allowed this to go undetected. 

Those lines are broken up with new lines, by the way. So the computer reads the line “in a way that showed that the number of ‘successful’ and rejected”, and then wonders: rejected what?! The rest of the sentence continues about seven lines away. None of this makes sense to a human, let alone a computer.

Previous workarounds were... unimpressive

Most beginning data science books advise programmers to use regular expressions as a way to clean the text. This allowed me to choose what to keep and what to reject in the text. I then joined up the lines to form paragraphs. This was the subject of Get rid of the muff: pre-processing PDPC Decisions.

As mentioned in that post, I was very bothered by removing useful content such as paragraph numbers. Furthermore, it was very tiring playing whack-a-mole, figuring out which regular expression to use to remove a line. The results were not good enough for me, as several errors persisted in the text.

Not wanting to play whack-a-mole, I decided to train a model to read lines and make a guess as to what lines to keep or remove. This was the subject of First forays into natural language processing — get rid of a line or keep it? The model was surprisingly accurate and managed to capture most of what I thought should be removed.

However, the model was quite big, and the processing was also slow. While using natural language processing was certainly less manual, I was just getting results I would have obtained if I had worked harder at regular expressions. This method was still not satisfactory for me.

I needed a new approach.

A breakthrough — focus on the layout rather than the content

I was browsing Github issues on PDFMiner when it hit me like a brick. The issue author had asked how to get the font data of a text on PDFMiner.

I then realised that I had another option other than removing text more effectively or efficiently.

Readers don’t read documents by reading lines and deciding whether to ignore them or not. The layout — the way the text is arranged and the way it looks visually — informs the reader about the structure of the document.

Therefore, if I knew how the document was laid out, I could determine its structure based on observing its rules.

Useful layout information to analyse the structure of a PDF.

Rather than relying only on the content of the text to decide whether to keep or remove text, you now also have access to information about what it looks like to the reader. The document’s structure is now available to you!

Information on the structure also allows you to improve the meaning of the text. For example, I replaced a footnote marker with the footnote’s actual text, so that the footnote sits closer to where it was supposed to be read rather than at the bottom of the page. This makes the text more meaningful to a computer.

Using PDFMiner to access layout information

To access the layout information of the text in a PDF, unfortunately, you need to understand the inner workings of a PDF document. Fortunately, PDFMiner simplifies this and provides it in a Python-friendly manner.

Your first port of call is to extract the page of the PDF as an LTPage. You can use the high level function extract_pages for this.

Once you extract the page, you will be able to access the text objects as a list of containers. That’s right — using list comprehension and generators will allow you to access the containers themselves.

Once you have access to each container in PDFMiner, the metadata can be found in its properties.

The properties of a LTPage reveal its layout information.

Much of this is not apparent from the documentation, so studying the source code itself can be very useful.
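As a starting point, a short sketch like this prints the position, size and text of every text container on the first page, which is a quick way to see what layout information you have to work with (the file name is a placeholder):

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    first_page = next(extract_pages("decision.pdf", page_numbers=[0]))
    for element in first_page:
        if isinstance(element, LTTextContainer):
            # x0/y0 is the lower-left corner of the box; x1/y1 is the upper-right corner
            print(round(element.x0), round(element.y0), round(element.height), repr(element.get_text()[:40]))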

Code Highlights

Here are some highlights and tips from my experience so far implementing this method using PDFMiner.

Consider adjusting LAParams first

Before trying to rummage through your text boxes in the PDF, pay attention to the layout parameters which PDFMiner uses to extract text. Parameters like line, word, and char margin determine how PDFMiner groups texts together. Effectively setting these parameters can help you to just extract the text.

Notwithstanding, I did not use LAParams as much for this project. This is because the PDFs in my collection can be very arbitrary in terms of layout. Asking PDFMiner to generalise in this situation did not lead to satisfactory results. For example, PDFMiner would try to join lines together in some instances and not be able to in others. As such, processing the text boxes line by line was safer.

Retrieving text margin information as a list

As mentioned in the diagram above, margin information can be used to separate the paragraph numbers from their text. The original method of using a regular expression to detect paragraph numbers had the risk of raising false positives.

The x0 property of an LTTextContainer represents its left coordinate, which is its left margin. Assuming you had a list of LTTextContainers (perhaps extracted from a page), a simple list comprehension will get you all the left margins of the text.

    from collections import Counter
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams

    limit = 1  # Set a limit if you want to remove uncommon margins
    # extract_pages returns an iterator of LTPage objects; take the first page
    first_page = next(extract_pages(pdf, page_numbers=[0], laparams=LAParams(line_margin=0.1, char_margin=3.5)))
    containers = [container for container in first_page if isinstance(container, LTTextContainer)]
    text_margins = [round(container.x0) for container in containers]
    c = Counter(text_margins)
    sorted_text_margins = sorted([margin for margin, count in c.most_common() if count > limit])

Counting the number of times a text margin occurs is also useful to eliminate elements that do not belong to the main text, such as titles and sometimes headers and footers.

You also get a hierarchy by sorting the margins from left to right. This is useful for separating first-level text from the second-level, and so on.
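For example, a rough way to classify a container's indent level is to match its left margin against the sorted margins found earlier. This is a sketch of the idea, not the exact code in pdpc-decisions:

    def indent_level(container, sorted_text_margins):
        # Return the index of the matching margin (0 = leftmost), or None if
        # the container is not aligned with any common margin (e.g. headers or footers)
        left = round(container.x0)
        for level, margin in enumerate(sorted_text_margins):
            if abs(left - margin) <= 1:  # allow a small tolerance
                return level
        return None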

Using the gaps between lines to determine if there is a paragraph break

As mentioned in the diagram above, the gap between lines can be used to determine if the next line is a new paragraph.

In PDFMiner as well as PDF, the x/y coordinate system of a page starts from its lower-left corner (0,0). The gap between the current container and the one immediately before it is between the current container’s y1 (the current container’s top) and the previous container’s y0 (the previous container’s base). Conversely, the gap between the current container and the one immediately after is the current container’s y0 (the current container’s base) and the next container’s y1 (the next container’s top)

Therefore, if we have a list of LTTextBoxContainers, we can write a function to determine if there is a bigger gap.

    from typing import List
    from pdfminer.layout import LTTextContainer

    def check_gap_before_after_container(containers: List[LTTextContainer], index: int) -> bool:
        index_before = index - 1
        index_after = index + 1
        # Gap to the line above: the previous container's base minus this container's top
        gap_before = round(containers[index_before].y0 - containers[index].y1)
        # Gap to the line below: this container's base minus the next container's top
        gap_after = round(containers[index].y0 - containers[index_after].y1)
        return gap_after >= gap_before

Therefore, if the function returns true, we know that the current container is the last line of the paragraph. I would then save the paragraph and start a new one, keeping the paragraph content as close to the original as possible.

Retrieve footnotes with the help of the footnote separator

As mentioned in the diagram, footnotes are generally smaller than the main text, so you could get footnotes by comparing the height of the main text with that of the footnotes. However, this method is not so reliable because some small text may not be footnotes (figure captions?).

For this document, the footnotes are contained below a separator, which looks like a line. As such, our first task is to locate this line on the page. In my PDF file, another layout object, LTRect, describes lines. An LTRect with a height of less than 1 point appears not as a rectangle, but as a line!

    if footnote_line_container := [container for container in page if all([ 
        isinstance(container, LTRect),
        container.height < 1,
        container.y0 < 350,
        container.width < 150,
        round(container.x0) == text_margins[0]
    ])]:
        footnote_line_container = sorted(footnote_line_container, key=lambda container: container.y0)
        footnote_line = footnote_line_container[0].y0

Once we have obtained the y0 coordinate of the LTRect, we know that any text box under this line is a footnote. This accords with the intention of the maker of the document, which is sometimes Microsoft Word!

You might notice that I have placed several other conditions to determine whether a line is indeed a footnote marker. It turns out that LTRects are also used to mark underlined text. The extra conditions I added are the length of the line (container.width < 150), whether the line is at the top or bottom half of the page (container.y0 < 350), and that it is in line with the leftmost margin of the text (round(container.x0) == text_margins[0]).

For Python programming, I also found the built-in all() and any() useful in improving the readability of the code when there are several conditions.

I also liked using a new feature in the latest 3.8 version of Python: the walrus operator! The code above might not work for you if you are on a different version.

Conclusion

Note that we are reverse-engineering the PDF document. This means that getting perfect results is very difficult, and you would probably need to try several times to deal with the side effects of your rules and conditions. The code which I developed for pdpc-decisions can look quite complicated and it is still under development!

Given a choice, I would prefer using documents where the document’s structure is apparent (like web pages). However, such an option may not be available depending on your sources. For complicated materials like court judgements, studying the structure of a PDF can pay dividends. Hopefully, some of the ideas in this post will get you thinking about how to exploit such information more effectively.

#PDPC-Decisions #NaturalLanguageProcessing #PDFMiner #Python #Programming


I’m always trying my best to be a good husband. Unfortunately, compared to knitting, cooking and painting, computer programming does not look essential. You sit in front of a computer, furiously punching on your keyboard. Something works, but most people can’t see it. Sure, your phone works, but how did your sitting around contribute to that? Geez. It’s time to contribute around the house by automating with Python!

The goal is to create a script that will download a sudoku puzzle from daily sudoku and print it at home. My wife likes doing these puzzles, so I was sure that she would appreciate having one waiting for her to conquer in the printer tray.

You can check out the code in its repository and the docker hub for the image. An outline of the solution is provided below:

  1. Write a Python script that does the following things: (1) sets up a schedule to get the daily sudoku and download it, and (2) sends an email to my HP printer's email address.
  2. HP ePrint prints the puzzle.
  3. Dockerize the container so that it can be run on my home server.
  4. Wait patiently around 9:25 am every day at the printer for my puzzle.

Coding highlights

This code is pretty short since I cobbled it together in about 1 night. You can read it on your own, but here are some highlights.

Download a file by constructing its path

As I mentioned before, you have to study your quarry carefully in order to use it. For the daily sudoku website, I found a few possibilities to automate the process of getting your daily sudoku.

  • Visit the main web page and “click” on the puzzle using an automated web browser.
  • Parse the RSS feed and get the puzzle you would like.
  • “Construct” the URL to the PDF of the puzzle of the day

I went with the last option because it was the easiest to implement without needing to download additional packages or pursue extra steps.

    from datetime import datetime
    import requests

    now = datetime.now()
    r = requests.get(
        f"http://www.dailysudoku.com/sudoku//pdf/{now.year}/"
        f"{now.strftime('%m')}/{now.strftime('%Y-%m-%d')}S1N1.pdf",
        timeout=5,
    )

Notice that the Python code uses f-strings and strftime, which let you fill in the URL from a date-based template.

This method ain’t exactly foolproof. If the structure of the website is changed, the whole code is useless.

Network printing — a real PITA

My original idea was to send a PDF directly to a network printer. However, it was far more complicated than I expected. You could send a file through the Internet Printing Protocol, the Line Printer Daemon or even HP's apparently proprietary port 9100. First, though, you might need to convert the PDF file to a Postscript file. Then locate or open a socket… You could install CUPS in your container…

Errm never mind.

Sending an email to print

Luckily for me, HP can print PDF attachments sent by email. It turns out that sending a simple email using Python is quite straightforward.

    from email.message import EmailMessage
    from smtplib import SMTP

    msg = EmailMessage()
    msg['To'] = print_email
    msg['From'] = smtp_user
    msg['Subject'] = 'Daily sudoku'
    msg.add_attachment(r.content, maintype='application', subtype='pdf', filename='sudoku.pdf')

    with SMTP(smtp_server, 587) as s:
        s.starttls()
        s.login(smtp_user, smtp_password)
        s.send_message(msg)

However, HP's requirements for sending a valid email to their HP ePrint servers are kind of quirky. For example, your attachment will not print if:

  • There is no subject in the email.
  • The attachment has no filename (stating the MIME type of the file is not enough)
  • The person who emails the message must be a permitted user. You have to go to the HP Connected website to set these allowed senders manually.

Setting the local timezone for the docker container

The schedule package does not deal with time zones. To be fair, if you are not really serving an international audience, that’s not too bad. However, for a time-sensitive application like this, there’s a big difference between waiting for your puzzle at 9:30 am local time and 9:30 am UTC (that’s nearly time to knock off work in Singapore!).

Setting your time zone in a docker container depends on the base image of the Operating System you used. I used Debian for this container, so the code is as follows.

RUN ln -sf /usr/share/zoneinfo/Asia/Singapore /etc/localtime

Note that the script sleeps for 3 hours before executing pending jobs. This means that while the job is scheduled at 9:30 am, it may be quite some time later before it is executed.
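For reference, the scheduling loop looks roughly like this. The job function and the exact timing are placeholders based on the description above, not the precise code in the repository:

    import time
    import schedule

    def fetch_and_print_sudoku():
        ...  # download the day's puzzle and email it to the printer, as shown above

    schedule.every().day.at("09:25").do(fetch_and_print_sudoku)

    while True:
        schedule.run_pending()
        time.sleep(3 * 60 * 60)  # check every 3 hours, so a job may run a few hours late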

Environment variables

The code does not make it very obvious, but you must set environment variables in order to use the script. (Oh read the README for crying out loud) This can be done in a cinch with a docker-compose file.

    sudoku:
      image: "houfu/dailysudoku:latest"
      hostname: sudoku
      container_name: dailysudoku
      environment:
        - PRINT_EMAIL=[email protected]
        - SMTP_SERVER=smtp.gmail.com
        - SMTP_USER=[email protected]
        - SMTP_PWD=blah blah blah

Update 17/6/2020: There was a typo in the address of Gmail’s SMTP server and this is now rectified.

Conclusion

I hastily put this together. It's more important to have a puzzle in your hand than configuration variables and plausibly smaller containers. Since the project is personal, I don't think I will be updating it much unless it stops working for me. I'll be happy to hear if you're using it or if you have some suggestions.

In the meantime, do visit www.dailysudoku.com for more puzzles, solutions, hints, books and other resources!

#Programming #docker #Python


I have always liked Jupyter Notebooks. They were my first contact with python and made the language fun and easy. It is even better when you want to explore a data set. You can just get the data and then plot it into a graph quickly. It’s even good enough to present them to other people.

However, if you want to run the same Jupyter Notebook server in different environments, it can be very challenging. Virtual environments can be useful but are very difficult to sync. I own a Mac, a Windows PC and a Linux server. Getting these parts to play with each other nicely has not been a happy experience. Repositories are inconsistent, and I have spent time figuring out quirks in each operating environment.

Solution: Dockerize your Jupyter

Since being exposed to docker after setting up a docassemble server, I have been discovering all the wonders of virtual computers. The primary bugbear solved by docker is that it lets you stop caring about your computer's background environment; instead, you build "blocks" to create your application.

For this application, I wanted the following pieces of software to work together:

  • A python environment for data analysis
  • spaCy for natural language processing, given its prebuilt models and documentation
  • Jupyter notebooks to provide a web-based editor I can use anywhere
  • A MongoDB driver since the zeeker database is currently hosted in a MongoDB Atlas service.

Once all this is done, instead of typing jupyter notebook in a console and praying that I am in the right environment, I will run docker run houfu/zeeker-notebooks and get a server from a build in the cloud.

The Magic: The Dockerfile

Shocking! All you need to create a docker image is a simple text file! To put it simply, it is like a recipe book that contains all the steps to build your docker image. Therefore, by working out all the ingredients in your recipe, you can create your very own docker image. The “ingredients” in this case, would be all the pieces of software I listed in the previous section.

Of course, there are several ways to slice an apple, and creating your own docker image is no different. Docker has been around for some time and has enjoyed huge support, so you can find a docker image for almost every type and kind of open-source software. For this project, I first tried docker-compose with a small Linux base image, installing Python and everything else myself.

However quite frankly, I didn’t need to reinvent the wheel. The jupyter project already has several builds for different flavours and types of projects. Since I knew that some of my notebooks needed matplotlib and pandas, I chose the scipy-notebook base image. This is the first line in my “recipe”:

FROM jupyter/scipy-notebook

The other RUN lines allow me to install dependencies and other software like spaCy and the MongoDB driver. These are the familiar instructions you would normally use on your own computer. I even managed to download the spaCy models into the docker image!

RUN conda install --quiet --yes -c conda-forge spacy && \
    python -m spacy download en_core_web_sm && \
    python -m spacy download en_core_web_md && \
    conda clean --all -f -y

RUN pip install --no-cache-dir pymongo dnspython

Once docker has all these instructions, it can build the image. Furthermore, since the base image already contains the functionality of the jupyter notebook, there's no need to include any further instructions like CMD or ENTRYPOINT.

Persistence: Get your work saved!

The Dockerfile was only 12 lines of code. At this point, you are saying — this is too good to be true! Maybe, you are right. Because what goes on in docker, stays in docker. Docker containers are not meant to be true computers. Just so that they are lightweight and secure, they can easily be destroyed. If you created a great piece of work in the docker container, you have to ensure that it gets saved somewhere.

The easiest way to deal with this is to bind a part of your computer’s filesystem to a part of your docker container’s filesystem. This way your jupyter server in the docker container “sees” your files. Then your computer gets to keep those files no matter what happens to the container. Furthermore, if you run this from a git repository, your changes can be reflected and merged in the origin server.

This is the command you call in your shell terminal:

$ docker run -p 8888:8888 --mount type=bind,source="$(pwd)",target=/home/jovyan/work houfu/zeeker-notebooks

Since the user account in the base image is named jovyan (another way of saying Jupiter), its work directory is where we bind our files. The $(pwd) is shell substitution that fills in the present working directory as the source, no matter where or how you saved the repository.

Screenshot of jupyter notebook page showing file directory.

There you have it! Let’s get working!

Bonus: Automate your docker image builds

It is great that you have got docker to create your standard environment for you. Now let's take it one step further! Wouldn't it be great if docker updated the image on the repository whenever there are changes to the code? That way, the "latest" tag actually means what it says.

You can do that in a cinch by setting up auto-build if you are publishing on Docker hub. Link your source repository on GitHub to your docker hub repository and configure automated builds. This way, every time there is a push to your repository, an image is automatically built, providing the latest image to all using the repository.

Webpage in Docker Hub showing automated builds.

Conclusion

It’s fun to explore new technologies, but you must consider whether they will help your current workflow or not. In this case, docker will help you to take your environment with you to share with anyone and on any computer. You will save time figuring out how to make your tools work, focusing on how to make your tools work for you. Are you seeing any other uses for docker in your work?

#Programming #docker #Python


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

The life of a budding data science enthusiast.

You need data to work on, so you look all around you for something that has meaning to you. Anyone who reads the latest 10 posts of this blog knows I have a keen interest in data protection in Singapore. So naturally, I wanted to process the PDPC's enforcement cases. Fortunately for me, the listing of published cases is complete, and they are not exactly hindered by things like stare decisis. We begin by scraping the website.

The Problem

Using the scraping method I devised, you will now get a directory filled with PDFs of the cases. Half the battle done, right? If you thought so, then you have not looked hard enough at the product.

It’s right there. They are PDFs. Notwithstanding the name, PDFs do not actually represent text well. They represent pages or documents, which means you get what you see and not what you read.

I used PDFMiner to extract the text from the PDF, and this is a sample of the output I get:

    The operator, in the mistaken belief that the three statements belonged
    to the same individual, removed the envelope from the reject bin and
    moved it to the main bin. Further, the operator completed the QC form
    in a way that showed that the number of “successful” and rejected
    Page 3 of 7
    B.
    (i)
    (ii)
    9.
    (iii)
    envelopes tallied with the expected total from the run. As the envelope
    was no longer in the reject bin, the second and third layers of checks
    were by-passed, and the envelope was sent out without anyone realising
    that it contained two extra statements. The manual completion of the QC
    form by the operator to show that the number of successful and rejected
    envelopes tallied allowed this to go undetected.

Notice the following:

  • There are line breaks in the middle of sentences. This was where the sentence broke apart for new lines in the document. The computer would read “The operator, in the mistaken belief that the three statements belonged” and then go “What? What happened?”
  • Page footers and headers appear in the middle of the text. They make sense when you are viewing a document, but are meaningless in plain text.
  • Orphan bullets and paragraph numbers. They used to belong to some text, but now nobody knows which. Table contents are also seriously borked.

If you fed this to your computer for training, you are going to get rubbish. The next step, which is very common in data science but particularly troublesome in natural language processing, is preprocessing. We have to fix the errors ourselves before letting the computer do more with it.

I used to think that I could manually edit the files and fix the issues one by one, but it turned out to be very time consuming. (Duh!) My computer had to do the work for me. Somehow!

The Preprocessing Solution

Although I decided I would let the computer do the correcting for me, I still had to do some work on my own. Instead of looking for errors, this time I was looking for patterns. This involved comparing the output of the PDF-to-text converter with the actual document. Once I figured out how these errors came about, I could get down to fixing them.

Deciding what to take out, what to leave behind

Not so fast. Unfortunately, correcting errors is not the only decision I had to make. Like many legal reports, PDPC decisions have paragraph numbers. These numbers are important. They are used for cross-referencing. In the raw output, the paragraph numbers are present as plain numbers in the text. They may be useful to a human reader who knows they are meta-information in the text, but to a computer it probably is just noise.

I decided to take it out. I don’t doubt that one day we can engineer a solution that makes it useful, but for now, they are just distractions.

Put Regular Expressions to work!

As mentioned earlier, we look for patterns in the text so that the computer can find and correct them. I found regular expressions to be a very powerful way to express such patterns in a form the computer can search for. A regular expression is sort of like a language for defining a search pattern.

For example, the code below looks for form feeds in the text (a form feed, '\f', marks a page break; its cousin the carriage return is what happens when you press 'Enter' on your keyboard, and is much cooler on a typewriter):

    import re

    def remove_feed_carriage(source):
        # Drop any line that contains a form feed ('\f'), i.e. a page break
        return [x for x in source if not re.search(r'\f', x)]

This Python code tells the computer not to include any line which contains a form feed. This eliminates the multiple blank lines created by the PDF converter (usually blank space in the PDF).

A well-crafted regular expression can find a lot of things. For example, the expression below looks for citations ("[2016] SGPDPC 15" and the like) and removes them.

    def remove_citations(source):
        # Drop lines that consist only of a citation, e.g. "[2016] SGPDPC 15"
        return [x for x in source
                if not re.search(r'^\[\d{4}\]\s+(?:\d\s+)?[A-Z|()]+\s+\d+[\s.]?$', x)]

Figuring out the "language" of regular expressions does take some time, but it pays dividends. To help the journey, I test my expressions using freely available websites that let you test regular expressions and provide a reference. For Python, this is one of the sites I used.

Getting some results!

Besides removing citations, form feeds and paragraph numbers, I also tried to join broken sentences together, roughly as sketched below. In all, the code manages to remove around 90% of the extra line breaks. Most of the paragraphs in the text actually read like sentences, and I feel much more confident training a model based on this text.
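For the joining step, the idea is roughly the following: keep appending lines until one ends with sentence-final punctuation, then treat that as a paragraph. This is an illustrative sketch, not the exact code in the repository:

    def join_broken_lines(lines):
        paragraphs, current = [], ""
        for line in lines:
            current = (current + " " + line.strip()).strip()
            # A line ending with sentence-final punctuation closes the paragraph
            if current.endswith(('.', '?', '!', ':')):
                paragraphs.append(current)
                current = ""
        if current:
            paragraphs.append(current)
        return paragraphs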

It ain’t perfect of course. The text gets really mushed up once a table is involved, and the headers and footers are not exactly removed all the time. But as Voltaire said, “Perfect is the enemy of the good”. For now, this will do.

Concluding remarks

Hopefully, this pretty quick rundown of how I pre-processed the pdpc-decisions will give you some idea as to what to do in your own projects. Now that I have the text, I have got to find something to use it for! :O Are there any other ways to improve the code to catch even more errors? Feel free to comment to let me know!

GitHub: houfu/pdpc-decisions, Data Protection Enforcement Cases in Singapore.

#PDPC-Decisions #PDFMiner #Python #Programming


This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote this post, the structure of the PDPC’s website was drastically altered. The concepts and the ideas in this post haven’t changed, but the examples are outdated. This gives me a chance to rewrite this post. If I ever get round to it, I’ll provide a link.

Regular readers would already know that I run a GitHub repository which tries to compile all personal data protection decisions in Singapore. Decisions are useful resources teeming with lots of information. They have statistics, give insights into what factors are relevant in decision making and show that data protection is effective in Singapore. Even basic statistics about decisions make newspaper stories here locally. It would be great if there was a way to mine all that information!

GitHub: houfu/pdpc-decisions, Data Protection Enforcement Cases in Singapore.

Unfortunately, using the Personal Data Protection Commission in Singapore’s website to download judgements can be painful.

This is our target web page today (note: the website has since been transformed).

As you can see, you are only able to view 5 decisions at one time. As the first decision dates back to 2016, you will have to go through several pages to grab everything! Actually, just 23. I am sure you can do all that in one night, right? Right?

If you are not inclined to do it, then get your computer to do it. Using selenium, I wrote a python script to automate the whole process of finding all the decisions available on the website. What could have been a tedious night’s work was accomplished in 34 seconds.

Check out the script here.

What follows here is a step by step write up of how I did it. So hang on tight!

Section 1: Observe your quarry

Before setting your computer loose on a web page, it pays to understand the structure and inner workings of your web page. Open it up using your favourite browser. For Chrome, this is Developer Tools, and in Firefox, this is Web Developer. You will be looking for a tab called Sources, which shows you the HTML code of the web page.

Play with the structure of the web page by hovering over various elements of the web page with your mouse. You can then look for the exact elements you need to perform your task:

  • In order to see a new page, you will have to click on the page number in the pagination. This is under a section (a CSS class) called group__pages. Each page-number is under a section (another CSS class) called page-number.
  • Each decision has its own section (a CSS class) named press-item. The link to the download, which is either to a text file or a PDF file, is located in a link in each press-item.
  • Notice too that each press-item also has other metadata regarding the decision. For now, we are curious about the date of the decision and the respondent.

Section 2: Decide on a strategy

Having figured out the website, you can decide on how to achieve your goal. In this case, it would be pretty similar to what you would have done manually.

  1. Start on a page
  2. Click on a link to download
  3. Go to the next link until there are no more links
  4. Move on to the next page
  5. Keep repeating steps 1 to 4 until there are no more pages
  6. Profit!

Since we did notice the metadata, let's use it. If you don't use what is already in front of you, you will have to read the decision to extract such information. In fact, we are going to use the metadata to name our decisions.

Section 3: Get your selenium on it!

Selenium drives a web browser. It mimics user interactions in the web browser, so our strategy in Section 2 is straightforward to implement. Instead of moving our mouse like we ordinarily would, we tell the web driver what to do instead.

WebDriver :: Documentation for Selenium

Let’s translate our strategy to actual code.

Step 1: Start on a page

We are going to need to start our web driver and get it to run on our web page.

    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options

    PDPC_decisions_site = "https://www.pdpc.gov.sg/Commissions-Decisions/Data-Protection-Enforcement-Cases"

    # Set up the web driver
    options = Options()
    # Uncomment the next three lines for a headless chrome
    # options.add_argument('--headless')
    # options.add_argument('--disable-gpu')
    # options.add_argument('--window-size=1920,1080')
    driver = Chrome(options=options)
    driver.get(PDPC_decisions_site)

Step 2: Download the file

Now that you have prepared your page, let’s drill down to the individual decisions itself. As we figured out earlier, each decision is found in a section named press-item. Get selenium to collect all the decisions on the page.

    judgements = driver.find_elements_by_class_name('press-item')

Recall that we were not just going to download the file; we will also be using the date of the decision and the respondent to name the file. For the date function, I found out that under each press-item there is a press_date element which gives us the text of the decision date; we can easily convert this to a Python datetime so we can format it any way we like.

    from datetime import datetime
    from selenium.webdriver.remote.webelement import WebElement

    def get_date(item: WebElement):
        item_date = datetime.strptime(item.find_element_by_class_name('press_date').text, "%d %b %Y")
        return item_date.strftime("%Y-%m-%d")

For the respondent, the heading (which is written in a fixed format and also happens to be the link to the download – score!) already gives you the information. Use a regular expression on the text of the link to suss it out. (One of the decisions does not follow the format of "Breach … by respondent", so the alternative is also catered for.)

    import re

    def get_respondent(item):
        text = item.text
        # Split on "by" or "against" and take what follows as the respondent's name
        return re.split(r"\s+[bB]y|[Aa]gainst\s+", text, flags=re.I)[1]

You are now ready to download a file! Using the metadata and the link you just found, you can come up with meaningful names to download your files. Naming your own files will also help you avoid the idiosyncratic ways the PDPC names its own downloads.

Note that some of the files are not PDF downloads but instead are short texts in web pages. Using the earlier strategies, you can figure out what information you need. This time, I used BeautifulSoup to get the information. I did not want to use selenium to do any unnecessary navigation. Treat PDFs and web pages differently.

    def download_file(item, file_date, file_respondent):
        url = item.get_property('href')
        print("Downloading a File: ", url)
        print("Date of Decision: ", file_date)
        print("Respondent: ", file_respondent)
        if url[-3:] == 'pdf':
            dest = SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.pdf'
            wget.download(url, out=dest)
        else:
            with open(SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.txt', "w") as f:
                from bs4 import BeautifulSoup
                from urllib.request import urlopen
                soup = BeautifulSoup(urlopen(url), 'html5lib')
                text = soup.find('div', class_='rte').getText()
                lines = re.split(r"\n\s+", text)
                f.writelines([line + '\n' for line in lines if line != ""])

Steps 3 to 5: Download every item on every page

The next steps follow a simple idiom — for every page and for every item on each page, download a file.

    for page_count in range(len(pages)):
        pages[page_count].click()
        print("Now at Page ", page_count)
        pages = refresh_pages(driver)
        judgements = driver.find_elements_by_class_name('press-item')
        for judgement in judgements:
            date = get_date(judgement)
            link = judgement.find_element_by_tag_name('a')
            respondent = get_respondent(link)
            download_file(link, date, respondent)

Unfortunately, once selenium changes a page, it needs to be refreshed. We are going to need a new group__pages and page-number in order to continue accessing the page. I wrote a function to “refresh” the variables I am using to access these sections.

    def refresh_pages(webdriver: Chrome):
        group_pages = webdriver.find_element_by_class_name('group__pages')
        return group_pages.find_elements_by_class_name('page-number')

    ...

    pages = refresh_pages(driver)

Conclusion

Once you got your web driver to be thorough, you are done! In my last pass, 115 decisions were downloaded in 34 seconds. The best part is that you can repeat this any time there are new decisions. Data acquisition made easy! At least until the PDPC breaks its website.

Postscript: Is this… Illegal?

I’m listening…

Web scraping has always been quite controversial, and the stakes can be quite high. Copyright infringement, the Computer Misuse Act and trespass, to name a few. Funnily enough, manually downloading may be less illegal than using a computer. The PDPC's own terms of use are not exactly on point about this.

(Update 15 Mar 2021: OK, I think I am being fairly obtuse about this. There is a paragraph that states you can't use robots or spiders to monitor their website. That might have made sense in the past when data transfers were expensive, but I don't think that this kind of activity at my scale can crash a server.)

Personally, I feel this particular activity is fairly harmless — this is considered “own personal, non-commercial use” to me. I would likely continue with this for as long as I would like my own collection of decisions. Or until they provide better programmatic access to their resources.

Follow-up post: Ready to mine free online legal materials in Singapore? Not so fast! In 2021, the Copyright Act in Singapore was amended to support data analysis, like web scraping. I survey government websites to find out how friendly they are to this.

#PDPC-Decisions #Programming #Python #tutorial #Updated
