Toying around with Hugging Face transformers and PDPC decisions


For most natural language processing projects, the steps are quite straightforward. First, get a data set. Next, train a model for the task you want and fine-tune it. Once you’re happy, deploy and demo the product! In recent years, increasingly powerful transformers like BERT and GPT-3 have made these steps more manageable, but you’re still doing the same things.

However, many libraries have default settings which allow you to play with their capabilities quickly. I did this once with spaCy, and while it was not perfect, it was still fascinating.

Today’s experiment is with Hugging Face’s transformers library, the most exciting place on the internet for natural language processing. It’s a great place to experience cutting-edge NLP models like BERT, RoBERTa and GPT-2. It’s also open source! (Read: free)

I will play with my own data and hopefully also provide an introduction to the various NLP tasks available.

A quick installation and a run

We would like quick settings so that we can focus on experimenting and having fun. If you had to follow a 50-step process to get a job going, you would be terrified. Thankfully, the folks at Hugging Face have condensed the work into a pipeline, so a few lines of code are all you really need.

The main resource to refer to is their notebook on pipelines. For readers who would rather not read it, the general steps are outlined below, followed by a short sketch that puts them together.

  1. Install Hugging Face transformers: pip install transformers
  2. Import the pipeline into your code or notebook: from transformers import pipeline
  3. Start the pipeline: (example) nlp_token_class = pipeline('ner')
  4. Call the pipeline on the text you want processed: nlp_token_class('Here is some example text')
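Putting those steps together, here is a minimal sketch (the example text is a stand-in; any string will do):

```python
# Step 1 (shell): pip install transformers
from transformers import pipeline  # Step 2: import the pipeline

# Step 3: start a pipeline, here for named entity recognition
nlp_token_class = pipeline('ner')

# Step 4: call the pipeline on the text you want processed
results = nlp_token_class('Here is some example text')
print(results)  # a list of dicts with 'word', 'score', 'entity' and 'index' keys
```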

Sample texts

We are going to run three sample texts through the transformers.

First up is the conclusion of Henry Park Primary School Parents’ Association, a summary decision from 11 February 2020. I have also used the full text of the summary in some cases.

In the circumstances, the Deputy Commissioner for Personal Data Protection found the Organisation in breach of sections 11(3), 12 and 24 of the PDPA. In determining the directions, the Deputy Commissioner took into consideration that the Organisation was a volunteer organisation made up primarily of parents. The Organisation is directed to, within 60 days, (i) appoint one or more individuals to be responsible for ensuring that it complies with the PDPA, (ii) develop and implement internal data protection and training policies, and (iii) to put all volunteers handling personal data through data protection training.

Next, a passage from Society of Tourist Guides, a decision from 9 January 2020, which refers to an earlier decision.

The Commission’s investigations also revealed that security testing had never been conducted since the launch of the Website in October 2018. In this regard, the Organisation admitted that it failed to take into consideration the security arrangements of the Website due to its lack of experience. As observed in WTS Automotive Services Pte Ltd [2018] SGPDPC 26 at [24], while an organisation may not have the requisite level of technical expertise, a responsible organisation would have made genuine attempts to give proper instructions to its service providers. The gravamen in the present case was the Organisation’s failure to do so.

Our final paragraph comes from Horizon Fast Ferry, a decision from 2 August 2019. It has some technical details of a breach.

Unbeknownst to the Organisation and contrary to its intention, the Contractor replicated the auto-retrieval and auto-population feature (which was only meant to be used in the internal CCIS) in the Booking Site as part of the website revamp. Consequently, whenever a user entered a passport number which matched a Returning Passenger’s passport number in the Database, the system would automatically retrieve and populate the remaining fields in the Booking Form with the Personal Data Set associated with the Returning Passenger’s passport number. As the Organisation failed to conduct proper user acceptance tests before launching the revamped Booking Site, the Organisation was not aware of this function until it was notified of the Incident.

Named Entity Recognition (NER)

A common task is to pick up the special entities referred to in a text. Typical examples are places and people, like Obama or Singapore. Identifying such entities gives us extra context for a sentence, which is great for extracting relationships. For example, “The organisation has to comply with section 24 of the PDPA” would allow us to say that the organisation has to comply with a law called “section 24 of the PDPA”.
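Each excerpt was passed to the NER pipeline as a plain string. A sketch of the call, where henry_park_text is a hypothetical variable holding the Henry Park excerpt quoted above (truncated here):

```python
from transformers import pipeline

nlp_token_class = pipeline('ner')

# Hypothetical variable: the Henry Park excerpt quoted earlier (truncated)
henry_park_text = (
    "In the circumstances, the Deputy Commissioner for Personal Data "
    "Protection found the Organisation in breach of sections 11(3), 12 "
    "and 24 of the PDPA. ..."
)
print(nlp_token_class(henry_park_text))
```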

In Henry Park, the default settings failed to pick up the entities correctly. All the flagged words were identified as organisations, and a cursory look at the text does not agree. The provisions of the PDPA were also not identified as a law.

[{'word': 'Personal', 'score': 0.533231258392334, 'entity': 'I-ORG', 'index': 9}, {'word': 'Data', 'score': 0.7411562204360962, 'entity': 'I-ORG', 'index': 10}, {'word': 'Protection', 'score': 0.9116974472999573, 'entity': 'I-ORG', 'index': 11}, {'word': 'Organisation', 'score': 0.77486252784729, 'entity': 'I-ORG', 'index': 14}, {'word': 'PD', 'score': 0.6492624878883362, 'entity': 'I-ORG', 'index': 29}, {'word': 'Organisation', 'score': 0.6983053684234619, 'entity': 'I-ORG', 'index': 45}, {'word': 'Organisation', 'score': 0.7912665605545044, 'entity': 'I-ORG', 'index': 57}, {'word': 'PD', 'score': 0.4049902558326721, 'entity': 'I-ORG', 'index': 86}]

Things don’t improve with Society of Tourist Guides. It got the “Organisation” as an organisation (duh), but it appeared to be confused by the citation.

[{'word': 'Commission', 'score': 0.9984033703804016, 'entity': 'I-ORG', 'index': 2}, {'word': 'Organisation', 'score': 0.8942610025405884, 'entity': 'I-ORG', 'index': 31}, {'word': 'Web', 'score': 0.6423472762107849, 'entity': 'I-MISC', 'index': 45}, {'word': 'W', 'score': 0.9225642681121826, 'entity': 'I-ORG', 'index': 57}, {'word': '##TS', 'score': 0.9971287250518799, 'entity': 'I-ORG', 'index': 58}, {'word': 'Auto', 'score': 0.9983740448951721, 'entity': 'I-ORG', 'index': 59}, {'word': '##mot', 'score': 0.9857855439186096, 'entity': 'I-ORG', 'index': 60}, {'word': '##ive', 'score': 0.9949991106987, 'entity': 'I-ORG', 'index': 61}, {'word': 'Services', 'score': 0.9982504844665527, 'entity': 'I-ORG', 'index': 62}, {'word': 'P', 'score': 0.9859911799430847, 'entity': 'I-ORG', 'index': 63}, {'word': '##te', 'score': 0.988381564617157, 'entity': 'I-ORG', 'index': 64}, {'word': 'Ltd', 'score': 0.9895501136779785, 'entity': 'I-ORG', 'index': 65}, {'word': 'S', 'score': 0.9822579622268677, 'entity': 'I-ORG', 'index': 69}, {'word': '##GP', 'score': 0.9837730526924133, 'entity': 'I-ORG', 'index': 70}, {'word': '##DP', 'score': 0.9856312274932861, 'entity': 'I-ORG', 'index': 71}, {'word': '##C', 'score': 0.8455315828323364, 'entity': 'I-ORG', 'index': 72}, {'word': 'Organisation', 'score': 0.682020902633667, 'entity': 'I-ORG', 'index': 121}]

For Horizon Fast Ferry, the model picked up the Organisation as an organisation but fared poorly otherwise.

[{'word': 'Organisation', 'score': 0.43128490447998047, 'entity': 'I-ORG', 'index': 8}, {'word': 'CC', 'score': 0.7553273439407349, 'entity': 'I-ORG', 'index': 43}, {'word': '##IS', 'score': 0.510469913482666, 'entity': 'I-MISC', 'index': 44}, {'word': 'Site', 'score': 0.5614328384399414, 'entity': 'I-MISC', 'index': 50}, {'word': 'Database', 'score': 0.8064141869544983, 'entity': 'I-MISC', 'index': 80}, {'word': 'Set', 'score': 0.7381612658500671, 'entity': 'I-MISC', 'index': 102}, {'word': 'Organisation', 'score': 0.5941202044487, 'entity': 'I-ORG', 'index': 115}, {'word': 'Site', 'score': 0.38454076647758484, 'entity': 'I-MISC', 'index': 132}, {'word': 'Organisation', 'score': 0.5979025959968567, 'entity': 'I-ORG', 'index': 135}]

Conclusion: the default output is rubbish, and you probably have to train your own model with your own domain-specific information.

Question Answering

Question Answering is more fun. The model considers a text (the context) and extracts an answer to a question from it. Quite like filling in a blank, but a tad less conscious.

I’m more hopeful for this one because the answer should be findable in the text. The model thus has to infer the answer by comparing the question and the context. Anyway, the transformers documentation warns us again that we probably have to train our own model.
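For reference, the question-answering pipeline takes the question and the context as keyword arguments. A sketch of the calls below, reusing the henry_park_text variable from the earlier sketch:

```python
from transformers import pipeline

nlp_qa = pipeline('question-answering')

answer = nlp_qa(
    question='Which sections of the PDPA did the Organisation breach?',
    context=henry_park_text,  # the excerpt string from the earlier sketch
)
print(answer)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```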

Easy tasks

For Henry Park, let’s ask an easy question: Which sections of the PDPA did the Organisation breach?

{'score': 0.8479166372224196, 'start': 122, 'end': 138, 'answer': '11(3), 12 and 24'}

The output gave the right answer! It was also about 84% sure it was right. Haha! Note that the model extracted the answer from the text.

For Society of Tourist Guides, let’s ask something a tad more difficult: Why did the Organisation not conduct security testing?

{'score': 0.43685166522863683, 'start': 283, 'end': 303, 'answer': 'lack of experience.'}

I agree with the output! The computer was not so sure though (43%).

For Horizon Fast Ferry, here’s a question everyone is dying to ask: What happens if a user enters a passport number?

{'score': 0.010287796461319285, 'start': 393, 'end': 470, 'answer': 'automatically retrieve and populate the remaining fields in the Booking Form'}

Errm, the model was looking at the right place, but failed to include the second part of the clause. The answer may be misleading to someone with no access to the main text. However, the model was only 1% sure of its answer. For most Singaporean students, 1% confidence means you are going to fail your examinations.

Tweaking the pipeline to allow a longer answer span let it give the right answer with 48% confidence.
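The tweak in question raises the cap on the extracted span’s length. The pipeline exposes this as the max_answer_len argument (the default is 15 tokens); the value below is illustrative, not the exact one used:

```python
# horizon_text: hypothetical variable holding the Horizon Fast Ferry excerpt
answer = nlp_qa(
    question='What happens if a user enters a passport number?',
    context=horizon_text,
    max_answer_len=30,  # illustrative; the default of 15 cuts the clause short
)
```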

Now for something more difficult

I also experimented with answering a question from a long text. As the default pipeline has a 512-token limit, I downloaded the Longformer model (all 2 GB of it), a model built for longer texts.
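Swapping in a different model is a matter of naming it when you start the pipeline. A sketch, assuming the TriviaQA-fine-tuned Longformer checkpoint from the model hub (a plausible choice, though not necessarily the exact one used here):

```python
from transformers import pipeline

# First use downloads the checkpoint (roughly 2 GB)
nlp_qa_long = pipeline(
    'question-answering',
    model='allenai/longformer-large-4096-finetuned-triviaqa',
    tokenizer='allenai/longformer-large-4096-finetuned-triviaqa',
)

answer = nlp_qa_long(
    question='Which sections of the PDPA did the Organisation breach?',
    context=full_decision_text,  # hypothetical variable with the whole decision
)
```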

I ran the whole Henry Park decision text (which is still shorter than most decisions) through the model with the same question. However, the model failed to give the right answer. It gave its (wrong) response a probability score of “2.60319e-05”. You wouldn’t call that confident.

Theoretically, this isn’t surprising: a longer text means there are more possible answer spans, and more possible answers mean a higher probability of landing on a wrong one. This probably needs a creative workaround in order to be useful.
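One possible workaround, sketched below on the assumption that an answer does not straddle a chunk boundary: split the long text into overlapping chunks, put the question to each chunk, and keep the highest-scoring answer.

```python
def answer_from_long_text(nlp_qa, question, text, chunk_size=1500, overlap=200):
    """Run extractive QA over overlapping character chunks; keep the best span."""
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    candidates = [nlp_qa(question=question, context=chunk) for chunk in chunks]
    return max(candidates, key=lambda c: c['score'])
```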

Conclusion: For easy tasks, even the default pipeline can give surprisingly good results. However, the default pipeline can only accept a maximum of 512 tokens, which means it can only work on small paragraphs, and most legal texts are far longer than that.

Summarisation

We can also ask a computer to summarise a text. It’s a bit of a mystery to me how a computer does this. How does it decide what text to take out and what to leave behind? Let’s take it for a whirl anyway.
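The call follows the same pipeline pattern as before; a sketch with the max length setting used for the excerpts below:

```python
from transformers import pipeline

summarizer = pipeline('summarization')

# text: hypothetical variable holding one of the excerpts quoted earlier
summary = summarizer(text, max_length=50)
print(summary[0]['summary_text'])
```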

Henry Park excerpt (max length=50):

'the Deputy Commissioner for Personal Data Protection found the organisation in breach of sections 11(3), 12 and 24 of the PDPA . the organisation is directed to, within 60 days, appoint one or more individuals to be responsible'

Society of Tourist Guides (max length=50):

'security testing had never been conducted since launch of the Website in October 2018 . the organisation admitted it failed to take into consideration security arrangements due to its lack of experience . a responsible organisation would have made genuine attempts to give proper instructions to'

Horizon Fast Ferry:

'the Contractor replicated the auto-retrieval and auto-population feature in the Booking Site as part of the website revamp . when a user entered a passport number which matched a Returning Passenger’'

Henry Park, full text of the summary:

'the organisation had a website at https://hppa.org.sg (the “Website”) where members could view their own account particulars upon logging in using their assigned user ID and password . on 15 March 2019, the Personal Data Protection Commission received a complaint . the organisation failed to conduct adequate security testing before launching the Website .'

Conclusion: I think the summaries are OK, but they wouldn’t pass the Turing test. Summarisation appears to work better on longer texts, as long as they don’t exceed 512 tokens. That isn’t long enough for almost any PDPC decision. Would training help? That’s something to explore.

Conclusion

Using the default pipelines from Hugging Face provided a quick taste of natural language processing tasks using state-of-the-art transformers. It’s really satisfying to see a model extract good answers from a text. I liked it too because it introduced me to the library.

However, our simple experiments also reveal that much work needs to be done to improve performance. Training a model for the task is one such undertaking. Furthermore, given the limitations of some models, such as the length of text they can process, workarounds may be needed to get the most out of them. I’ll take this for now though!

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

#NaturalLanguageProcessing #Programming

Love.Law.Robots – A blog by Ang Hou Fu