Love.Law.Robots. by Ang Hou Fu

tutorial


Since I was receiving support requests and development ideas for redlines from all over the place, I decided to try running a #Matrix room to centralise discussion and provide an informal venue for weird and “stupid” questions. Since I was in the mood for experimentation, I also wanted to run robots in my room. For my redlines room, I wanted a robot to track activity on the GitHub repo and to welcome new users by posting the purpose of the room.

I was surprised to find very little documentation on how to run a Matrix bot. I decided to go with maubot because I like the idea of plugins. While the interface and the docs leave something to be desired, it’s actually relatively straightforward. Here’s a short write-up/#tutorial of how I did it, in case it helps someone out there.


Introduction

You must have worked hard to get here! We are almost at the end now.

Our journey has taken us through designing the user experience, figuring out what should happen in the background, and interacting with an external service.

In this part, we ask docassemble to provide a file for the user to download.

Provisioning a File

When we left Part 2, this was our result screen.

    event: final_screen
    question: |
      Download your sound file here.
    subquestion: |
      The audio on your text has been generated.
      
      You can preview it here too.
      
      <audio controls>
       <source type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

There are two places where you need an audio file.

  1. In the question, the word “here” should provide a link to download the file.
  2. The audio preview widget (the thing which you click to play) also needs a link to the file to function.

Luckily for us, docassemble provides a straightforward way to deal with files on the server. Simply stated, create a variable of type DAFile to hold a reference to the file, save the data to the file and then use it for those links in the results screen.

Let’s get started. Add this block to your main.yml file.

    ---
    objects:
      - generated: DAFile
    ---

This block creates an object called “generated”, which is a DAFile. Now your interview code can use “generated”.

Add the new line to the mandatory code block we created in Part 3.

    mandatory: True
    code: |
      # The next line is new
      generated.initialize(filename="output.mp3")
      tts_task
      if tts_task.ready():
        final_screen
      else:
        waiting_screen

This code initialises “generated” by getting the docassemble server to provision it. If you use “generated” to create a file before initialising it, docassemble raises an error. 👻

Now your background action needs access to “generated”. Pass it as a keyword parameter to the background action you created in Part 3.

    code: |
      tts_task = background_action(
        'bg_task', 
        text_to_synthesize=text_to_synthesize, 
        voice=voice, 
        speaking_rate=speaking_rate, 
        pitch=pitch,
        # This is the new keyword parameter
        file=generated
      )

Now that your background action has the file, use it to save the audio content. Add the new lines below to the bg_task block, which you also created in Part 3.

    event: bg_task
    code: |
      audio = get_text_to_speech(
        action_argument('text_to_synthesize'),
        action_argument('voice'),
        action_argument('speaking_rate'),
        action_argument('pitch'),
      )
      # The next three lines are new
      file_output = action_argument('file')
      file_output.write(audio, binary=True)
      file_output.commit()
      background_response()

We assign the file to a new variable in the background task and then use it to write the audio (make sure it is in binary format, as MP3s are not text). After that, commit the file to save it on the server or your external storage, depending on your configuration. (The above methods are from DAFile. You can read more about what they do, and about other methods, here.)

Now that the file is ready, we can plunk it into our result screen. We provide URLs here so that your user can download the file from the browser. File paths would not work, because they point to the server's file system, which the user's browser cannot reach. Modify the lines in the result screen block.

    event: final_screen
    question: |
      # Modify the next line
      Download your sound file **[here](${generated.url_for(attachment=True)}).**
    subquestion: |
      The audio on your text has been generated.
      
      You can preview it here too.
      
      <audio controls>
       # Modify the next line
       <source src="${generated.url_for()}" type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

To get the URL for a DAFile, use the url_for method. This gives you an address you can use for downloading the file or for the browser to load it.
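To recap, here is the whole DAFile lifecycle used in this part, condensed into one sketch in plain Python. The method names are the ones used in the blocks above; the variable audio stands in for the bytes returned by get_text_to_speech.

    from docassemble.base.util import DAFile
    
    generated = DAFile('generated')              # in the interview, the objects block does this
    generated.initialize(filename="output.mp3")  # ask the server to provision an empty file
    generated.write(audio, binary=True)          # write the MP3 bytes (binary, not text)
    generated.commit()                           # persist to the server or external storage
    generated.url_for(attachment=True)           # a URL that downloads the file
    generated.url_for()                          # a URL to embed, e.g. in the <audio> tag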

Conclusion

Congratulations! You are now ready to run the interview. Give it a go and see if you can download the audio of a text you would like spoken. (If you are still in the Playground, you can click “Save and Run” to make sure your work is saved and test it a bit.)

This text-to-speech docassemble interview is relatively straightforward. Nevertheless, its simplicity showcases several functions you may want to be familiar with. Hopefully, you now have an idea of how to deal with external services. If you manage to hook up something interesting, please share it with the community!

Bonus: Trapping errors and alerting the users

The code so far is enough to provide users with hours of fun (hopefully not at your expense). However, there are edge cases which you should consider if you plan to make your interview more widely available.

Firstly, while this tutorial makes it pretty clear that you should update your Configuration so that the interview can find your service account, this doesn't always happen. Admins might have overlooked it.

Add this code as the first mandatory code block of main.yml (before the one we wrote in Part 3):

    mandatory: True
    code: |
      if get_config('google') is None or 'tts service account' not in get_config('google'):   
        if get_config('error notification email') is not None:
          send_email(to=get_config('error notification email'), 
            subject='docassemble-Google TTS raised an error', 
            body='You need to set service account credentials in your google configuration.' )
        else:
          log('docassemble-Google TTS raised an error -- You need to set service account credentials in your google configuration.')
          
        message('Error: No service account for Google TTS', 'Please contact your administrator.')

Take note that if you add more than one mandatory block, they are called in the order in which they appear in the interview file. So if you put this block after the mandatory code block defining our processes, the process would get called before we check whether it should run in the first place.

This code block does a few things. First, it checks whether there is a “google” directive in the Configuration and whether it contains a “tts service account” sub-directive. If it doesn't find any service account information, it checks whether the admin has set an error notification email in the Configuration. If so, the server emails the admin to report the issue. If not, it prints the error to docassemble.log, one of the logs on the server. (If the admin doesn't check their email or logs, I am not sure how else we can help them.)

This mandatory check before the interview starts helps catch the most obvious error: no configuration at all. However, the check can still be passed by putting nonsense in the “tts service account” directive, which Google will refuse to process. There may also be other errors, such as Google being offline.
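One way to catch such nonsense early is to check that the secret at least parses as JSON and carries the fields a Google service account key file normally has. Here is a minimal sketch; the helper function is hypothetical, and the field names are those found in Google's key files.

    import json
    
    def looks_like_service_account(raw):
        # Reject values that do not even parse as JSON
        try:
            info = json.loads(raw, strict=False)
        except (TypeError, ValueError):
            return False
        # Google's service account key files include these fields
        return all(key in info for key in ('type', 'private_key', 'client_email'))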

Anticipating every possible error would be very challenging. Instead, we will make one crucial check: whether the code actually saved a file at the end of the process. Even if we can't tell the user what went wrong, at least we spare them the confusion of discovering that there is no file to download.

First, let's write the code that makes the check. Add this new code block.

    event: file_check
    code: |
      path = generated.path()
      if not os.path.exists(path):
        if get_config('error notification email') is not None:
          send_email(to=get_config('error notification email'), 
            subject='docassemble-Google TTS raised an error', 
            body='No file was saved in this interview.' )
        else:
          log('docassemble-Google TTS raised an error -- No audio file was saved in this interview.')
        message('Error: No audio file was saved', 'We are not sure why. Please try again. If the problem persists, contact your administrator.')

This code checks whether the audio file (generated, a DAFile) is an actual file or an apparition. If it doesn't exist, the admin receives a message, and the user is alerted to the failure.

We also need to add a need directive to our result screen so that the check is made before the final screen for downloading the file is shown.

    event: final_screen 
    need:  # Add this line
      - file_check  # Add this line
    question: |
      Download your sound file **[here](${generated.url_for(attachment=True)}).**
    subquestion: |
      The audio on your text has been generated.
      
      You can preview it here too.
      
      <audio controls>
       <source src="${generated.url_for()}" type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

We also need to import os.path from Python's standard library to make the check on the file system. Add this new block near the top of your main.yml file.

    imports:
      - os.path

There you have it! The interview checks, before you start, whether there's a service account. It also checks, before showing you the final screen, whether your request succeeded and an audio file is ready to download.

👈🏻 Go to the previous part.

☝🏻Return to the overview of this tutorial.

#tutorial #Python #Programming #docassemble #Google #TTS #LegalTech


Introduction

So far, all our work has been on our docassemble install, which has been quite a breeze. Now we come to the most critical part of this tutorial: working with an external service, Google Text to Speech. Different considerations come into play when working with others.

In this part, we will install a client library from Google. We will then configure the setup to interact with Google’s servers and write this code in a separate module, google_tts.py. At the end of this part, your background action will be able to call the function get_text_to_speech and get the audio file from Google.

1. A quick word about APIs

The term “API” is used loosely nowadays. Some people use it to describe how you can use a program or a software library. In this tutorial, an API refers to a connection between computer programs: instead of a website or a desktop program, we use Python and docassemble to interact with Google Text to Speech. In some cases, like Google Text to Speech, an API is the only way to work with the program.

The two most common ways to work with an API over the internet are (1) a client library and (2) a RESTful API. There are pros and cons to each. In this tutorial, we are going with the client library that Google provides. This lets us work with the API in a programming language we are familiar with: Python. A RESTful API can have more support and features than a client library (especially when the programming language is not popular), but you’d need to know your way around the requests package and its kin to use one in Python.
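For a flavour of the alternative, here is a rough sketch of the same request made RESTfully with the requests package. The endpoint and payload shape follow Google's public REST reference for Text-to-Speech; how you obtain access_token (for example, through OAuth or the gcloud tool) is out of scope here, so treat it as a placeholder.

    import base64
    import requests
    
    def synthesize_via_rest(text, access_token):
        # Hand-assemble the same parameters the client library would send for us
        response = requests.post(
            "https://texttospeech.googleapis.com/v1/text:synthesize",
            headers={"Authorization": "Bearer " + access_token},
            json={
                "input": {"text": text},
                "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-A"},
                "audioConfig": {"audioEncoding": "MP3"},
            },
        )
        response.raise_for_status()
        # The REST API returns the audio as base64-encoded text
        return base64.b64decode(response.json()["audioContent"])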

2. Install the Client Library in your docassemble

Before we can start using the client library, we need to ensure that it’s there in our docassemble install. Programming in Python can be very challenging because of issues like this:

source: https://imgs.xkcd.com/comics/python_environment.png

Luckily, you will not face this problem if you’re using docker for your docassemble install (which most people do). Do this instead:

  1. Leave the Playground and go to another page called “Package Management”. (If you don’t see this page, you need to be either an admin or a developer)
  2. Under Install or update a package, specify google-cloud-texttospeech as the package to find on PyPI
  3. Click Update, and wait for the screen to show that the install is OK. (This takes time as there are quite a few dependencies to install)
  4. Verify that the google-cloud-texttospeech package has been installed by checking out the list of packages installed.

3. Set up a Text To Speech service account in docassemble

At this point, you should have obtained your Google Cloud Platform service account so that you can access the Text to Speech API. If you haven’t done so, please follow the instructions here. Take note that we will need your key information in JSON format. You don’t need to “Set your authentication environment variable” for this tutorial.

If you have not realised it yet, the key information in JSON format is a secret. While Google’s Text to Speech platform has a generous free tier, the service is not free. So, expect to pay Google if somebody with access to your account tries to read The Lord of the Rings trilogy. In line with best practices, secrets should be kept in a private and secure place, which is not your code repository. Please don’t include your service account details in your playground files!

Luckily, you can store this information in docassemble’s Configuration, which someone can’t access without an admin role and is not generally publicly available. Let’s do that by creating a directive google with a sub-directive of tts service account. Go to your configuration page and add these directives. Then fill out the information in JSON format you received from Google when you set up the service account.

In this example, the lines you will add to the Configuration should look like lines 118 to 131.

4. Putting it all together in the google_tts.py module

Now that our environment is set up, it’s time to create our get_text_to_speech function.

Head back to the Playground, look for the dropdown titled “Folders”, click it, then select “Modules”.

Look for the editor and rename the file to google_tts.py. This is where you will enter the code to interact with Google Text to Speech. If you recall, in Part 3 we left out a function named get_text_to_speech, which we were supposed to feed with the answers collected from the questions we wrote in Part 2. Let’s enter the signature of the function now.

    def get_text_to_speech(text_to_synthesize, voice, speaking_rate, pitch):
        # Enter more code here
        return

Since our task is to convert text to speech, we can follow the code in the example provided by Google.

A. Create the Google Text-to-Speech client

Using the Python client library, we can create a client to interact with Google’s service.

We need credentials to use the client to access the service. This is the secret you set up in step 3 above. It’s in docassemble’s Configuration, under the directive google with a sub-directive of tts service account. Use docassemble’s get_config to look into your Configuration and get the secret tts service account as JSON.

With the secret to the service account, you can pass it to the class factory function and let it do the work.

    def get_text_to_speech(text_to_synthesize, voice, speaking_rate, pitch):
        from google.cloud import texttospeech
        import json
        from docassemble.base.util import get_config
    
        credential_info = json.loads(get_config('google').get('tts service account'), strict=False)
    
        client = texttospeech.TextToSpeechClient.from_service_account_info(credential_info)

Now that the client is ready with your service account details, let's get some audio.

B. Specify some options and submit the request

The primary function to request Google to turn text into speech is synthesize_speech. The function needs a bunch of stuff — the text to convert, a set of voice options, and options for your audio file. Let’s create some with the answers to the questions in part 2. Add these lines of code to your function.

The text to synthesise:

    input_text = texttospeech.SynthesisInput(text=text_to_synthesize)

The voice options:

    voice = texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name=voice,
        )

The audio options:

    audio_config = texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3,
            speaking_rate=speaking_rate,
            pitch=pitch,
        )

Note that we did not allow every option to be customised by the user. You can go through the documentation yourself to figure out which options you need and which ones you shouldn't trouble the user with. If you think the user should have more options, you’re free to write your own questions and modify the code.
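For example, if you later decide to surface more knobs, AudioConfig accepts other fields besides the ones we used. A short sketch; the field names come from Google's client library, and the values here are purely illustrative.

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=speaking_rate,
        pitch=pitch,
        volume_gain_db=0.0,       # loudness adjustment, if you ask the user for one
        sample_rate_hertz=24000,  # a specific output sample rate, if you need one
    )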

Finally, submit the request and return the audio.

    response = client.synthesize_speech(
            request={"input": input_text, "voice": voice, "audio_config": audio_config}
        )
    
    return response.audio_content

Voila! The client library calls Google using your credentials and gets your personalised result.
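For reference, this is the complete get_text_to_speech function in google_tts.py, with the snippets above joined up.

    def get_text_to_speech(text_to_synthesize, voice, speaking_rate, pitch):
        from google.cloud import texttospeech
        import json
        from docassemble.base.util import get_config
    
        # Build the client from the service account secret in the Configuration
        credential_info = json.loads(get_config('google').get('tts service account'), strict=False)
        client = texttospeech.TextToSpeechClient.from_service_account_info(credential_info)
    
        # Options assembled from the user's answers
        input_text = texttospeech.SynthesisInput(text=text_to_synthesize)
        voice = texttospeech.VoiceSelectionParams(
            language_code="en-US",
            name=voice,
        )
        audio_config = texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3,
            speaking_rate=speaking_rate,
            pitch=pitch,
        )
    
        # Submit the request and return the raw MP3 bytes
        response = client.synthesize_speech(
            request={"input": input_text, "voice": voice, "audio_config": audio_config}
        )
        return response.audio_content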

5. Let’s go back to our interview

Now that you have written your function, it’s time to let our interview know where to find it.

Go back to the playground, and add this new block in your main.yml file.

    ---
    modules:
      - .google_tts
    ---

This block tells the interview that some of our functions (specifically, the get_text_to_speech function) are found in the google_tts module.

Conclusion

At the end of this part, you have written your google_tts.py module and included it in your main.yml. You should also know how to install a Python package in docassemble and edit your configuration file.

Well, that leaves us with only one more thing to do. We’ve got our audio content; now we just need to get it to the user. How do we do that? What’s that? DAFile? Find out in the next part.

👉🏻 Go to the final part.

👈🏻 Go back to the previous part.

☝🏻 Check out the overview of this tutorial.

#tutorial #docassemble #LegalTech #Google #TTS #Programming #Python


Introduction

In Part 2, we managed to write a few questions and a result screen. Even with all that eye candy, you will notice that you can’t run this interview: there is no mandatory block, so docassemble does not know what it needs to do to run the interview. In this part, I will talk about the code block required to run the interview, which forms the foundation of our next steps.

1. The backbone of this interview

In ordinary docassemble interviews, your endpoint is a template or a form. For this interview, the endpoint is a sound file. So let’s start with this code block, which tells the reader how the interview will run. Since it is a pretty important block, we should put it near the top of the interview file, maybe right under the meta block. (In this tutorial, the order of blocks does not affect how the interview runs; at least until you get to the bonus section.)

    mandatory: True
    code: |
      tts_task
      final_screen

If you recall, the user downloads the audio file in the final screen.

So this mandatory code block asks docassemble to “do” the tts_task and then show the final screen. Now we need to define tts_task, and your interview will be ready to run.

So what should tts_task be? The most straightforward answer is that it is the result of the API call to create the sound file. You can write a code block that gets and returns a sound file and assigns it to tts_task.

Well, don’t write that code block yet.

2. Introducing the background action

If you call an API on the other side of the internet, you should know that many things can happen along the way. For example, it takes time for your request to reach Google, for Google to process it using its fancy robots, for the result to come back to your server, and for your server to send it on to the user. In my experience, Google’s and docassemble’s latency is quite good, but it is still noticeable with large requests.

When an interview runs well, the user should not notice any lag. If the interview appears stuck on the same page for too long, the user might worry that it is broken, when the truth is that we are simply waiting for the file to come back. Get back in your chair and wait for it!

To improve the user experience, you should have a waiting screen that tells the user to hold their horses. While this is shown, the interview works in the background. In this manner, your user is assured that everything is well while your interview focuses on getting the file back from Google.

docassemble already provides a mechanism for the docassemble program to carry out background actions. It’s aptly called background_action().

Check out a sample background action by reading the first example block (“Return a value”) under the Background processes category. Modify our mandatory code block by following the mandatory code block in the example. It should look like this.

    mandatory: True
    code: |
      tts_task
      if tts_task.ready():
        final_screen
      else:
        waiting_screen

So now we tell docassemble to do the tts_task, our background task. Once it is ready, show the final screen. If the task is not ready, show the waiting screen.

3. Define the background action

Now that we have defined the interview flow, it’s time to do the background action. In the spirit of docassemble, we do this by defining tts_task.

The next code block defines the task. Adapt this example into our interview file as follows.

    code: |
      tts_task = background_action(
        'bg_task', 
        text_to_synthesize=text_to_synthesize, 
        voice=voice, 
        speaking_rate=speaking_rate, 
        pitch=pitch,
      )

So we have defined tts_task as a background action. The background action function has two kinds of arguments.

The first positional argument (“bg_task”) is the name of the event (a code block) that should be executed in the background.

The other keyword arguments are the information you need to pass to this background action, such as text_to_synthesize, voice and so on. The options the user answered earlier in the interview will now be used in this background action. Referring to these variables here also indirectly ensures that docassemble will look for their answers before performing this code block.

So why do you need to pass all the variables in this way? Don’t forget that the background action runs as a separate process from the rest of your interview, so the two don’t share the same variables. To let these processes share information, you pass the variables from the main interview process to the background action.

4. Perform the background action

We have defined the background action. Now let’s code what happens inside the background action.

The background action is defined in an event called bg_task. Now add a new code block as follows:

    event: bg_task
    code: |
      audio = get_text_to_speech(
        action_argument('text_to_synthesize'),
        action_argument('voice'),
        action_argument('speaking_rate'),
        action_argument('pitch'),
      )
      background_response()

So in this code block, we say that the audio is obtained by calling a function named get_text_to_speech. For get_text_to_speech to produce an audio file, it needs the answers to the questions you asked the user earlier. As a background process, it gets access to the variables passed through the keywords of the background_action function by calling action_argument.

Once get_text_to_speech has completed, we call background_response(). Calling background_response is important for a background action, as it tells docassemble that this is the endpoint of the background action. Make sure you don’t leave your background action without it.

5. Provide a waiting screen

Before we leave the example block for background processes, let’s add the question block that tells the user to wait for their audio file. Find the block which defines waiting_screen, and adapt it for your interview as follows.

    event: waiting_screen
    question: |
      Hang tight.
      Google is doing its magic.
    subquestion: |
      This screen will reload every
      few seconds until the file
      is available.
    reload: True

By adding reload: True to the block, you tell docassemble to refresh the screen every 10 seconds. This reassures the user that they only need to be patient and that some “magic” is going on somewhere.

Conclusion

In the next part of the tutorial, we will dive into get_text_to_speech. (What else, right?) To do that, we will need to call Google’s Text-to-Speech API. If you found it easy to follow the code blocks in this part, we will kick things up a notch: the next file we will be working on ends with a “.py”.

👉🏻 Go ahead to the next part

👈🏻 Go to the previous part

👈🏻 Check out the overview of this tutorial.

#tutorial #docassemble #Programming #Python #Google #TTS


Introduction

In Part 1, we talked about what we will do and what you need to follow this tutorial. Let’s get our hands dirty now!

We are going to get the groundwork done by creating four pages. The first page gets the text to be turned into speech. The second page chooses the voice Google will use to generate the audio. The third page edits some attributes of the audio production. The last page is the result page, where the spoken text can be downloaded.

If you are familiar with docassemble, nothing here is exciting, so you can skip this part. If you’re very new to docassemble, this is a gentle way to introduce you to getting started.

1. All projects begin like this

Log in to your docassemble program and go to the Playground. We will be doing most of the work here.

If you’re working from a clean install, your screen probably looks like this.

The default new project in docassemble's Playground.

  1. Let’s change the name of the interview file from test.yml to main.yml.
  2. Delete all the default blocks/text in the interview file. We are going to replace it with the blocks for this project.

You will have a clean main.yml file at the end.

2. I never Meta an Interview like you

I like to start my interview file with a meta block that tells someone about this interview and who made it.

It’s not easy to remember what a meta block looks like every time. You can use the example blocks in the playground to insert template blocks and modify them.

The example blocks also link to the relevant part of the documentation for easy reference. (It’s the blue “View Documentation” button.)

You should also use the example blocks as much as possible when you’re new to docassemble and writing YAML files. If you keep using those example blocks, you will not forget to separate your blocks with --- and you will minimise errors about indents and lists. After some practice (and lots of mistakes), you should be familiar with the syntax of a YAML file.

So, even though the example blocks section is found below the fold, you should not leave home without it.

You can write anything you like in the meta block as it’s a reference for other users. The field title, for example, is shown as the name of the interview on the “Available Interviews” page.

For this project, this is the meta block I used.

    metadata:
      title: |
        Google TTS Interview
      short title: |
        Have Google read your text
      description: |
        This interview produces a sound file based 
        on the text input by the user and other options.
      revision_date: 2022-05-01

3. Let’s write some questions

This is probably the most visual part of the tutorial, so enjoy it!

An easy way to think about question blocks is that they represent a page in your interview. As long as docassemble can find question blocks that answer all the variables it needs to finish the interview, you can organise and write your question block as you prefer.

So, for example, you can add this text box block, which asks the user to provide the input text. You can find the example text box block under the Fields category. (Using “no label” makes the page appear as if only one variable is set in this question.)

    question: |
      Tell me what text you would like Google to voice.
    fields:
      - no label: text_to_synthesize
        input type: area
      - note: |
          The limit is 5000 characters. (Short paragraphs should be fine)

You can also combine several questions on one page like this question for setting the audio options. Using the range slider example block under the Fields category, you can build this block.

    question: |
      Modify the way Google speaks your text.
    subquestion: |
      You can skip this if you don't need any modifications.
    fields:
      - Speaking pitch: pitch
        datatype: range
        min: -20.0
        max: 20.0
        default: 0
      - note: |
          20 means increase 20 semitones from the original pitch. 
          -20 means decrease 20 semitones from the original pitch. 
      - Speaking rate/speed: speaking_rate
        datatype: range
        min: 0.25
        max: 4.0
        default: 1.0
        step: 0.1
      - note: |
          1.0 is the normal native speed supported by the specific voice. 
          2.0 is twice as fast, and 0.5 is half as fast.

Notice that I have set constraints and defaults in this block based on the documentation of the various options. This helps the user avoid entering unacceptable values and receiving pesky, demoralising error messages from the external API.

A common question from a newcomer is: how should I present a question to the user? You can use a list of choices like the one below. (Build this question using the Radio buttons example block under the Multiple Choice category.)

    question: |
      Choose the voice that Google will use.
    field: voice
    default: en-US-Wavenet-A
    choices:
      - en-US-Wavenet-A
      - en-US-Wavenet-B
      - en-US-Wavenet-C
      - en-US-Wavenet-D
      - en-US-Wavenet-E
      - en-US-Wavenet-F
      - en-US-Wavenet-G
      - en-US-Wavenet-H
      - en-US-Wavenet-I
      - en-US-Wavenet-J
    under: |
      You can preview the voices [here](<https://cloud.google.com/text-to-speech/docs/voices>).

An interesting side question: When do I use a slider or a text entry box?

It depends on the kind of information you want. If you are asking for a number, the field's datatype should be numeric. If you are choosing from a fixed set, a list of options works better.

Honestly, it takes some experience to figure out what works best. Think about all the online forms you have experienced and what you liked or did not like. To gain experience quickly, you can experiment by trying different fields in docassemble and asking yourself whether it gets the job done.

4. The Result Screen

Now that you have asked all your questions, it’s time to give your user the answer.

The result screen is shown when Google’s API has processed the user’s request and sent over the mp3 file containing the synthesised speech. In the result screen, you will be able to download the file. It’s also helpful to allow the user to preview the sound file so that the user can go back and modify any options.

    event: final_screen
    question: |
      Download your sound file here.
    subquestion: |
      The audio on your text has been generated.
      
      You can preview it here too.
      
      <audio controls>
       <source type="audio/mpeg">
       Your browser does not support playing audio.
      </audio>
      
      Press `Back` above if you want to modify the settings and generate a new file,
      or click `Restart` below to begin a new request.
    buttons:
      - Exit: exit
      - Restart: restart

Note: This image shows the completed file with links on how to download it. The reference question block above does not contain any links.

You will notice that I used an audio HTML tag in the subquestion to provide my media previewer. Take note that you can use HTML tags in your markdown text if docassemble does not have an option that meets your needs. However, your mileage with HTML hacks may vary across browsers, so test as much as possible and avoid complex HTML.

Preview: Let’s do some actual coding

If you followed this tutorial carefully, your main.yml will have a meta block, 3 question blocks and one results screen.

There are a few problems now:

  • You cannot run the interview. The main reason is that there’s no “mandatory” block, so docassemble does not know what it needs to execute to finish the job.
  • The results screen does not contain a link to download or a media to preview.
  • We haven’t even asked Google to provide us with a sound file.

In the next part, we will go through the overall logic of the interview and do some actual coding. Once you are ready, head on over there!

👉🏻 Head to the next part.

👈🏻 Go back to the previous part.

☝🏻 Check out the overview of this tutorial.

#tutorial #docassemble #TTS #Google #Python #Programming


Most people associate docassemble with assembling documents using guided interviews. That’s in the name, right? The program asks a few questions and out pops a completed form, contract or document. However, the documentation makes it quite clear that docassemble can do more:

Though the name emphasizes the document assembly feature, docassemble interviews do not need to assemble a document; they might submit an application, direct the user to other resources on the internet, store user input, interact with APIs, or simply provide the user with information.

In this post, let’s demonstrate how to use docassemble to call an API, get a response and provide it to a user. You can check out the completed code on Github (NB: the git branch I would recommend for following this post is blog. I am actively using this package, so I may add new features to the main branch that I don’t discuss here.)

Problem Statement

I do a lot of internal training on various legal and compliance topics. I think I am a pretty all-right speaker, but I have my limitations: I can’t give presentations 24/7, and my performance varies from session to session. Wouldn’t it be nice if I could give a presentation at any time, engagingly and consistently?

I could record my voice, but I did not like the result.

I decided to use a text-to-speech program instead, like the one provided by Google Cloud Platform. I created a computerised version of my speech in the presentation. My audience welcomed this version as it was more engaging than a plain PowerPoint presentation. Staff whose first language was not (Singapore) English also found the voice clear and understandable.

The original code was terminal-based. I detailed my early exploits in this blog post last year. The script was great for developing something fast. However, as more of my colleagues became interested in incorporating such speech in their presentations, I needed something more user-friendly.

I already had a docassemble installation at work, so it seemed convenient to build on that. The program would have to do the following:

  • Ask the user what text it wants to transform into speech
  • Allow the user to modify some properties of the speech (speed, pitch etc.)
  • Call Google TTS API, grab the sound file and provide it to the user to download

Assumptions

To follow this tutorial, you will need the following:

  • A working docassemble install. You can start up an instance on your laptop by following these instructions.
  • A Google Cloud Platform (GCP) account with a service account enabled for Google TTS. You can follow Google’s instructions here to set one up.
  • Use the Playground provided in docassemble. You can use an IDE if you prefer, but I won’t be providing instructions for things like laying out files to follow a docassemble package's directory structure.
  • Some basic knowledge about docassemble. I wouldn’t be going through in detail how to write a block. If you can follow the Hello World example, you should have sufficient knowledge to follow this tutorial.

A Roadmap of this Tutorial

In the next part of this post, I talk about the thinking behind creating this interview and how I got the necessary information (off the web) to make it.

In Part 2, we get the groundwork done by creating four pages. This provides us with a visual idea of what happens in this interview.

In Part 3, I talk about docassemble's background action and why we should use it for this interview. Merging the visual requirements with code gives us a clearer picture of what we need to write.

In Part 4, we work with an external API by using a client library for Python. We install this client library in our docassemble's python environment and write a python module.

In Part 5, we finish the interview by coding the end product: an audio file in the guise of a DAFile. You can run the interview and get your text transformed into speech now! I also give some ideas of what else you might want to do in the project.

Part 1: Familiarise yourself with the requirements

To write a docassemble interview, it makes sense to develop it backwards. In a simple case, you would like docassemble to fill in a form. So you would get a form, figure out its requirements, and then write questions for each requirement.

An API call is not a contract or a form, but your process is the same.

Based on Google’s quickstart, this is the method in the Python library which synthesises speech.

    # Set the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(text="Hello, World!")
    
    # Build the voice request, select the language code ("en-US") and the ssml
    # voice gender ("neutral")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL
    )
    
    # Select the type of audio file you want returned
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )
    
    # Perform the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

From this example code, you can see that you need to provide the program with the input text (synthesis input), the voice, and the audio configuration options to synthesise speech.

That looks pretty straightforward, so you might be tempted to dive into it immediately.

However, I would recommend going through the documents provided online.

  • docassemble provides some of the most helpful documentation, great for varying proficiency levels.
  • Google’s Text To Speech’s documentation is more typical of a product offered by a big tech company. Demos, use cases and guides help you get started quickly. You’re going to have to dig deep to find the one for Python. It receives less love than the other programming languages.

Reading the documentation, especially if you want to use a third-party service, is vital to know what’s available and how to exploit it fully. For example, going through the docs is the best way to find out what docassemble is capable of and learn about existing features — such as transforming a python list of strings into a human-readable list complete with an “and”.
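To illustrate with that very example, the feature is a one-liner once you know where it lives (a quick sketch using comma_and_list from docassemble's utility functions):

    from docassemble.base.util import comma_and_list
    
    comma_and_list(['a form', 'a contract', 'a sound file'])
    # => 'a form, a contract, and a sound file'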

You don’t have to follow the quickstart if it does not meet your use case. Going through the documentation, I figured out that I wanted to give the user a choice of which voice to use rather than letting Google select that for me. Furthermore, audio options like how fast a speaker is will be handy since non-native listeners may appreciate slower speaking. Also, I don’t think I need the user to select a specific file format as mp3s should be fine.

Let’s move on!

This was a pretty short one. I hope I got you curious and excited about what comes next. Continue to the next part, where we get started on a project!

👉🏻 Head to the next part of this tutorial!

#tutorial #docassemble #Python #Programming #TTS #Google


Way back in December 2021, I caught wind of the 2020 Revised Edition of the statutes in Singapore law:

The AGC highlighted that the revised legislation now uses “simpler language”. I was curious about this claim and looked over their list of changes. I was not very impressed with them.

However, I did not want to rely only on my subjective intuition to make that conclusion. I wanted to test it using data science. This meant I had to compare text, calculate the changes' readability statistics, and see what changed.


I run a docassemble server at work, ostensibly to introduce co-workers to a different way of using templates to generate their agreements. It's been pretty useful, so much so that I use it myself for my work. However, due to the pandemic, it hasn't been easy to go out and sell it. Maybe I will have better luck soon.

In the meantime, I decided to move the server from AWS to DigitalOcean.

Why move?

I liked the wide variety of features available on AWS, such as CodeCommit, Lambda and SES. DigitalOcean is not comparable in that regard. If I wanted to build a whole suite of services for my application, I would probably find something on AWS's glorious one-page list of services.

However, with great functionality comes great complexity. I had a headache trying to exploit it, and I was never going to make full use of the ecosystem. (I shall never scoff at AWS certification again.)

On the other hand, I was more familiar with DigitalOcean and liked their straightforward pricing. So, if I wanted to move my pet project somewhere, I would have liked it to be in my backyard.

Let's get moving!

Lesson 1: Respect the shutdown

The docassemble docs expressly ask you to shut down your docassemble server gracefully. This is the usual docker stop <container> command, but with a generous timeout flag (for example, docker stop -t 600 <container>). In many simple use cases, forgetting the timeout flag isn't fatal, so you might never notice the difference.

However, there's another way to kill your server in the cloud — flip the switch on your cloud instance on the management console. It doesn't feel like that when you click the red button, but it has the same effect. The cloud instance is sent straight to heaven, and there is nothing you can do about it.

The shutdown is important because docassemble does quite a lot of work when it shuts down. It dumps its database records into your storage. If the storage is in the cloud (like AWS's S3 or DigitalOcean's Spaces), there is some lag while the files are sent there. If the shutdown is not respected, the server's state is not saved, and you might not be able to restore it when you start the container again.

So, with my AWS container gone in a cloud of dust, I found that the files in my S3 storage had not been updated. The last copy was from several months earlier, the last time I had shut down my container normally. This meant that several months of work was gone! 😲

Lesson 2: Restore from backup

This blog could have ended on that sad note. Luckily for CloudOps newbies like me, docassemble automatically stores backups of the server state. These are stored in the backup folder of your storage and are arranged by date.

If you, like me, borked your docassemble server and set it back to August 2020, you can grab your latest backup and replace the files in the main directory (outside backup). The process is described in the docassemble docs here. Instead of a server with no users, frozen in August 2020, I managed to retrieve all my users from the Postgres database stored in the backups. Phew!

Lesson 3: Check your config.yml file

After this exercise, I decided to go with a DigitalOcean Droplet and AWS S3. Given that I was already on S3 and the costs of S3 are fairly negligible, this seemed like a cost-effective combo. A DigitalOcean Space costs $5 no matter how big it is, whereas my S3 usage rarely comes to more than a dollar.

Before giving your new docassemble server a spin, do check your config.yml file. You can specify environment variables when you start a container, but once the server is up and running, it uses the config.yml file found in storage. If the configuration file was set up specifically for AWS, your server might not run properly on DigitalOcean. This means you have to download the config.yml file from storage (I used the S3 web interface to do this) and edit it manually to fit your new server.

In my setup, the original configuration file was made for an AWS environment, so my EC2 instance used security policies to access S3. At the time, that simplified the server setup, but my Droplet cannot use these features. So generate an access key and secret key, input these details and more in your updated config.yml file, and turn off ec2.

If you are going to use Spaces, transfer the files in your old S3 bucket to Spaces (I used s4cmd) and fill in the details of your Space in the configuration file.

Conclusion

To be honest, the migration was essentially painless. The design of the docassemble server allows it to be restored from a single source of truth: the storage method you choose. Except for the problems that came from hand-editing my old config.yml (I had to retype my SecretKey a few times 😢), you probably won't need to enter the docker container and read initialize error logs. Given my positive experience, I will be well prepared to move back to AWS again! (Just kidding, for now.)

#tech #docassemble #AWS #DigitalOcean #docker #OpenSource #tutorial #CloudComputing


Things you can only do during a lockdown: install a new server. As I mentioned previously, I got round to installing the latest Ubuntu LTS on my home server. Instead of using apt get for new software, I decided to run my server services through Docker. First up, I got Pi-Hole working and blocking ads. It's been sweet.

Link: “Let’s Play with: Pi-Hole” – I try to install Pi-Hole Server to block all ads and tracking websites at home. (Love.Law.Robots)

My conviction to use containers started with docassemble. You can use docassemble to generate contracts by answering questions. It's relevant to my work, and I am trying to get more of my (non-legal) colleagues to try it. Unlike other software I have tried, docassemble recommends just using docker. With one command, docker run -d jhpyle/docassemble, I would get a fully featured server. My mind was blown.

Link: Docassemble – A free, open-source expert system for guided interviews and document assembly, based on Python, YAML, and Markdown.

However, as I became more familiar with getting docker to do what I want, the limitations of that simple command began to restrict me. Docassemble uses several ports, and many other applications want the same ones, especially the web server ports 80 and 443. If docker and docassemble took these ports, no one else was going to get them. I wasn't sure I wanted my home server to be just a docassemble server.

Furthermore, using secure ports (HTTPS) became a serious problem. I wanted to use my home server's docassemble installation as a development base, so it had to be accessible to the outside world. For some reason, docassemble wouldn't accept my wildcard certs. If I planned to use it for anything serious, an unsecured website was out of the question.

It got so frustrating that I gave up.

Enter the Reverse-Proxy: Traefik

The short answer to my problems was to use a reverse proxy. A reverse proxy is a kind of server that gets information from another server for a client. In this case, a traefik server receives a request and figures out which docker container it should go to. A traefik server can also do other things, such as providing end-to-end security for your communications by obtaining free SSL certificates from Let's Encrypt.

Link: Traefik Documentation

I was convinced to go for this because it claimed that it would make “publishing your services a fun and easy experience”. When I read that, I let a tear go. Is it actually possible for this program to automatically detect the right configuration for your services? Even for something as big as docassemble?

I'll let you be the judge of that at the end of this article.

Step 1: Set up Traefik

Of course, you would need to have docker set up and good to go.

There are a bunch of ways to get Traefik going, but I would be using a docker-compose.yml file like in the QuickStart.

The documentation for docassemble does not mention anything about docker compose. That is a shame, because I found it a more user-friendly tool than the docker command-line interface. Instead of writing a bash script just to shorten my long docker run command, I can write out the blueprint of my setup in docker-compose.yml. After that, I run docker-compose up -d, and the services in the file start all together.

This is very important in my setup, because my home server also runs other services, like plex and grocy (another lockdown project). For convenience, I decided to include all these projects in the same docker-compose.yml file. This is the blueprint of my home server!

Back to Traefik, this is the section of my docker-compose.yml file setting out the reverse proxy server:

    services:
      reverse-proxy:
        # The official v2 Traefik docker image
        image: traefik:v2.2
        container_name: traefik
        # Enables the web UI and tells Traefik to listen to docker
        command: --api.insecure=true --providers.docker
        ports:
          # The HTTP/HTTPS ports
          - "80:80"
          - "443:443"
          # The Web UI (enabled by --api.insecure=true)
          - "8080:8080"
        volumes:
          # So that Traefik can listen to the Docker events
          - /var/run/docker.sock:/var/run/docker.sock
          - /home/houfu/traefik/:/etc/traefik/
        environment:
          DO_AUTH_TOKEN: XXX
        restart: unless-stopped

Just a few notes:

  • The line /home/houfu/traefik/:/etc/traefik/ under volumes gives me access to the configuration file used by traefik.
  • The line DO_AUTH_TOKEN: XXX under environment is for generating SSL certificates using my personal domain, which is managed by DigitalOcean.

Step 2: Prepare Traefik to generate SSL Certificates

Instead of having docassemble obtain the SSL certificates to use HTTPS, I decided to get Traefik to do it instead. Reverse proxies do this job much better, and I wouldn't need to “enter” the docassemble container to hunt down why SSL is not working.

Besides, my other services on my home server were already getting their certificates through Traefik, so getting docassemble to do the same would be straightforward right?

For this step, you would need to define a certificate resolver for Traefik to use. Please read the documentation as it is quite informative. For my set-up, I decided to use DigitalOcean as I was already using it for my DNS.

In the configuration file (traefik.toml), add a section to define the certificate resolver.

    [certificatesResolvers.docassembleResolver.acme]
    email = "[email protected]"
    storage = "acme.json"
    
    [certificatesResolvers.docassembleResolver.acme.dnsChallenge]
    # used during the challenge
    provider = "digitalocean"

The final step, especially if you have chosen DigitalOcean as a provider, is to get an API key and provide it to Traefik so that the process of getting a certificate can be automated. This was the DO_AUTH_TOKEN in the docker-compose.yml file referred to in the first step.

Step 3: Provide a blueprint for the Docassemble service

Once we have the reverse proxy set up, it’s time to get docassemble to run. This is the final form of the docker-compose.yml file for the docassemble service.

    docassemble:
      image: "jhpyle/docassemble:latest"
      hostname: docassemble
      container_name: docassemble
      stop_grace_period: 1m30s
      environment:
        - CONTAINERROLE=all
        - DBPREFIX=postgresql+psycopg2://
        - DBNAME=docassemble
        - DBUSER=docassemble
        - DBPASSWORD=abc123
        - DBHOST=localhost
        - USEHTTPS=false
        - DAHOSTNAME=docassemble.example.com
        - USELETSENCRYPT=false
        - S3ENABLE=true
        - S3ACCESSKEY=ABCDEFGH
        - S3SECRETACCESSKEY=1234567
        - S3BUCKET=docassemble
        - S3ENDPOINTURL=https://xxxx.sgp1.digitaloceanspaces.com
        - TIMEZONE=Asia/Singapore
        - DAPYTHONVERSION=3
      labels:
        - traefik.backend=docassemble
        - traefik.http.routers.docassemble.rule=Host(`docassemble.example.com`)
        - traefik.http.services.docassemble.loadbalancer.server.port=80
        - traefik.http.routers.docassemble.tls=true
        - traefik.http.routers.docassemble.tls.certresolver=docassembleResolver
        - traefik.http.middlewares.docassemble-redirect.redirectscheme.scheme=https
        - traefik.http.middlewares.docassemble-redirect.redirectscheme.permanent=true
        - traefik.http.routers.docassemble.middlewares=docassemble-redirect

One of the most important aspects of setting up your own docassemble server is figuring out the environment variables. The docassemble documentation recommends using an env.list file or passing a list of configuration values to the docker run command. In our docker-compose file, we pass them as a list under the environment section of the service blueprint. Feel free to add or modify these options as you need. For example, you can see that I am using DigitalOcean Spaces as my S3-compatible storage.

So where does the magic of Traefik’s automatic configuration come in? Innocuously, under the labels section of the blueprint. Let’s split it up for easy explanation.

    labels:
      - traefik.backend=docassemble
      - traefik.http.routers.docassemble.rule=Host(`docassemble.example.com`)
      - traefik.http.services.docassemble.loadbalancer.server.port=80

In the first block of labels, we define the name and the host of the docassemble server. Traefik now knows what to call this server and to direct queries for “docassemble.example.com” to it. As docassemble exposes several ports, we also prod traefik towards the correct port for accessing the server.

    labels:
      - traefik.http.routers.docassemble.tls=true
      - traefik.http.routers.docassemble.tls.certresolver=docassembleResolver

In this block of labels, we tell Traefik to use HTTPS and to use the certificate provider we defined earlier to get these certificates.

    labels:
      - traefik.http.middlewares.docassemble-redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.docassemble-redirect.redirectscheme.permanent=true
      - traefik.http.routers.docassemble.middlewares=docassemble-redirect

Finally, we tell traefik to use a middleware here: a redirect. The redirect middleware ensures that users will use HTTPS to communicate with the server.

Note that in our environment variables for the docassemble server, we tell docassemble not to use https (“USEHTTPS=false”). This is because traefik is already taking care of it. We don’t need docassemble to bother with it.

It works!

Docassemble servers take a bit of time to set up. But once you get it up, you will see my favourite screen in the entire application.

The docassemble server is working.

Notice the grey padlock in the address bar of my Firefox browser? That’s right, HTTPS baby!!

Final Thoughts

I am glad I learnt a lot about docker from docassemble, and its documentation is top-notch for what it is. However, running one is not easy. Using docker-compose helped iron out some of the pain. In any case, I am glad I got over this. It’s time to get developing! What should I work on next?

#blog #docassemble #docker #tutorial #tech #Traefik #HTTPS


Feature image

This post is part of a series on my Data Science journey with PDPC Decisions. Check it out for more posts on visualisations, natural language processing, data extraction and processing!

Update 13 June 2020: “At least until the PDPC breaks its website.” How prescient… about three months after I wrote this post, the structure of the PDPC’s website was drastically altered. The concepts and the ideas in this post haven’t changed, but the examples are outdated. This gives me a chance to rewrite this post. If I ever get round to it, I’ll provide a link.

Regular readers would already know that I maintain a GitHub repository which compiles all the personal data protection decisions in Singapore. Decisions are useful resources teeming with lots of information. They have statistics, offer insights into what factors are relevant in decision making, and show that data protection is effective in Singapore. Even basic statistics about decisions make local newspaper stories. It would be great if there was a way to mine all that information!

houfu/pdpc-decisions: Data Protection Enforcement Cases in Singapore (GitHub)

Unfortunately, using the Personal Data Protection Commission in Singapore’s website to download judgements can be painful.

This is our target webpage today. (Note: the website has since been transformed.)

As you can see, you can view no more than 5 decisions at a time. As the first decision dates back to 2016, you will have to go through several pages to grab everything! Actually, just 23. I am sure you can do all that in 1 night, right? Right?

If you are not inclined to do it, then get your computer to do it. Using selenium, I wrote a python script to automate the whole process of finding all the decisions available on the website. What could have been a tedious night’s work was accomplished in 34 seconds.

Check out the script here.

What follows here is a step by step write up of how I did it. So hang on tight!

Section 1: Observe your quarry

Before setting your computer loose on a web page, it pays to understand its structure and inner workings. Open the developer tools in your favourite browser: in Chrome, this is called Developer Tools, and in Firefox, it’s Web Developer. You will be looking for the panel that shows you the HTML code of the web page (Elements in Chrome, Inspector in Firefox).

Play with the structure of the web page by hovering over its various elements with your mouse. You can then look for the exact elements you need to perform your task:

  • In order to see a new page, you will have to click on the page number in the pagination. This is under a section (a CSS class) called group__pages. Each page-number is under a section (another CSS class) called page-number.
  • Each decision has its own section (a CSS class) named press-item. The link to the download, which is either to a text file or a PDF file, is located in a link in each press-item.
  • Notice too that each press-item also has other metadata regarding the decision. For now, we are interested in the date of the decision and the respondent. (A short sketch of how these classes translate into selenium locators follows this list.)
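To make this concrete, here is a minimal sketch of how the observed classes map onto selenium locators. It assumes a web driver called `driver` has already been started (we do that in Section 3):

    # A sketch only: these locators assume `driver` was started as in Section 3.
    pagination = driver.find_element_by_class_name('group__pages')        # the pagination section
    page_numbers = pagination.find_elements_by_class_name('page-number')  # each page number
    decisions = driver.find_elements_by_class_name('press-item')          # one per decision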

Section 2: Decide on a strategy

Having figured out the website, you can decide on how to achieve your goal. In this case, it is pretty similar to what you would have done manually (a code sketch of this plan follows the list below).

  1. Start on a page
  2. Click on a link to download
  3. Go to the next link until there are no more links
  4. Move on to the next page
  5. Keep repeating steps 1 to 4 until there are no more pages
  6. Profit!
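Expressed as code, the plan is just two nested loops. This is only a sketch of the shape we are aiming for; `decisions_on` and `download` are placeholder names, and the real functions are written in Section 3:

    # A sketch of the strategy; `decisions_on` and `download` are placeholders.
    for page in pages:                       # steps 4 and 5: every page
        page.click()                         # step 1: start on a page
        for decision in decisions_on(page):  # steps 2 and 3: every link on the page
            download(decision)               # download the file
    # step 6: profit!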

Since we did notice the metadata, let’s use it. If you don’t use what is already in front of you, you will have to read the decision itself to extract such information. In fact, we are going to use the metadata to name our downloaded decisions.
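For illustration, here is the kind of file name we are aiming for. The date and respondent below are made up:

    # Hypothetical example: a meaningful file name built from the metadata.
    file_date = '2019-08-02'          # made-up decision date
    file_respondent = 'Acme Pte Ltd'  # made-up respondent
    print(file_date + ' ' + file_respondent + '.pdf')  # 2019-08-02 Acme Pte Ltd.pdf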

Section 3: Get your selenium on it!

Selenium drives a web browser. It mimics user interactions in the browser, so our strategy from Section 2 is straightforward to implement. Instead of moving our mouse like we ordinarily would, we tell the web driver what to do instead.

WebDriver :: Documentation for Selenium

Let’s translate our strategy to actual code.

Step 1: Start on a page

We are going to need to start our web driver and get it to run on our web page.

    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options
    
    PDPC_decisions_site = "https://www.pdpc.gov.sg/Commissions-Decisions/Data-Protection-Enforcement-Cases"
    
    # Set up the web driver
    options = Options()
    # Uncomment the next three lines for a headless Chrome
    # options.add_argument('--headless')
    # options.add_argument('--disable-gpu')
    # options.add_argument('--window-size=1920,1080')
    driver = Chrome(options=options)
    driver.get(PDPC_decisions_site)

Step 2: Download the file

Now that you have prepared your page, let’s drill down to the individual decisions themselves. As we figured out earlier, each decision is found in a section named press-item. Get selenium to collect all the decisions on the page.

    judgements = driver.find_elements_by_class_name('press-item')

Recall that we are not just going to download the file; we will also use the date of the decision and the respondent to name it. For the date, I found that under each press-item there is a press_date element which gives us the text of the decision date. We can easily convert this to a Python datetime, so we can format it any way we like.

    from datetime import datetime
    from selenium.webdriver.remote.webelement import WebElement
    
    def get_date(item: WebElement):
        item_date = datetime.strptime(item.find_element_by_class_name('press_date').text, "%d %b %Y")
        return item_date.strftime("%Y-%m-%d")
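As a quick check, this is the conversion get_date performs on a typical date string:

    # What get_date does to a typical press_date string:
    from datetime import datetime
    print(datetime.strptime('2 Aug 2019', '%d %b %Y').strftime('%Y-%m-%d'))  # 2019-08-02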

For the respondent, the heading (which is written in a fixed format and also happens to be the link to the download; score!) already gives you the information. Use a regular expression on the text of the link to suss it out. (One of the decisions does not follow the usual format of “Breach … by respondent”, so an alternative pattern is also provided.)

    import re
    
    def get_respondent(item):
        text = item.text
        # The [bB] and [Aa] classes already make the pattern case-insensitive,
        # so no flags are needed. (re.split's third positional argument is
        # maxsplit, not flags, so passing re.I there would be a bug.)
        return re.split(r"\s+[bB]y|[Aa]gainst\s+", text)[1]
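To see the regular expression at work, here it is on made-up titles of both formats (note that the first split keeps a leading space, which is harmless, if untidy, in a file name):

    # The regex at work on made-up decision titles of both formats.
    import re
    pattern = r"\s+[bB]y|[Aa]gainst\s+"
    print(re.split(pattern, 'Breach of the Protection Obligation by Acme Pte Ltd')[1])
    # -> ' Acme Pte Ltd'
    print(re.split(pattern, 'Enforcement action against Acme Pte Ltd')[1])
    # -> 'Acme Pte Ltd'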

You are now ready to download a file! Using the metadata and the link you just found, you can come up with meaningful names for your downloaded files. Naming your own files also helps you avoid the idiosyncratic ways the PDPC names its own downloads.

Note that some of the decisions are not PDF downloads but short texts on web pages. Using the earlier strategies, you can figure out what information you need. This time, I used BeautifulSoup to extract the information, as I did not want selenium to perform any unnecessary navigation. The download function below therefore treats PDFs and web pages differently.

    import re
    import wget
    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    
    # SOURCE_FILE_PATH is a constant defined elsewhere in the script.
    def download_file(item, file_date, file_respondent):
        url = item.get_property('href')
        print("Downloading a File: ", url)
        print("Date of Decision: ", file_date)
        print("Respondent: ", file_respondent)
        if url[-3:] == 'pdf':
            dest = SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.pdf'
            wget.download(url, out=dest)
        else:
            # Short decisions are web pages, not PDFs; scrape the text instead.
            with open(SOURCE_FILE_PATH + file_date + ' ' + file_respondent + '.txt', "w") as f:
                soup = BeautifulSoup(urlopen(url), 'html5lib')
                text = soup.find('div', class_='rte').getText()
                lines = re.split(r"\n\s+", text)
                f.writelines([line + '\n' for line in lines if line != ""])

Steps 3 to 5: Download every item on every page

The next steps follow a simple idiom — for every page and for every item on each page, download a file.

    # `pages` comes from refresh_pages(driver), which is defined below.
    for page_count in range(len(pages)):
        pages[page_count].click()
        print("Now at Page ", page_count)
        pages = refresh_pages(driver)
        judgements = driver.find_elements_by_class_name('press-item')
        for judgement in judgements:
            date = get_date(judgement)
            link = judgement.find_element_by_tag_name('a')
            respondent = get_respondent(link)
            download_file(link, date, respondent)

Unfortunately, once selenium clicks through to a new page, the old element references go stale. We need fresh references to the group__pages and page-number elements in order to continue accessing the page. I wrote a function to “refresh” the variables I am using to access these sections.

    def refresh_pages(webdriver: Chrome):
        group_pages = webdriver.find_element_by_class_name('group__pages')
        return group_pages.find_elements_by_class_name('page-number')
    
    # ...
    
    pages = refresh_pages(driver)

Conclusion

Once your web driver has worked through every page, you are done! In my last pass, 115 decisions were downloaded in 34 seconds. The best part is that you can repeat this any time there are new decisions. Data acquisition made easy! At least until the PDPC breaks its website.

Postscript: Is this… Illegal?

I’m listening…

Web scraping has always been quite controversial, and the stakes can be quite high: copyright infringement, the Computer Misuse Act and trespass, to name a few. Funnily enough, manually downloading may be less illegal than using a computer. The PDPC’s own terms of use are not on point on this.

(Update 15 Mar 2021: OK, I think I am being fairly obtuse about this. There is a paragraph that states you can’t use robots or spiders to monitor their website. That might have made sense in the past, when data transfers were expensive, but I don’t think this kind of activity at my scale can crash a server.)

Personally, I feel this particular activity is fairly harmless; it counts as “own personal, non-commercial use” to me. I will likely continue with this for as long as I want my own collection of decisions, or until they provide better programmatic access to their resources.

In 2021, the Copyright Act in Singapore was amended to support data analysis, like web scraping. I wrote a follow-up post surveying how friendly government websites are to robots: “Ready to mine free online legal materials in Singapore? Not so fast!”

#PDPC-Decisions #Programming #Python #tutorial #Updated
