[Part 1] Do more with docassemble: Google Text to Speech

You can use docassemble to call an API and provide its results. I discuss why I did this with Google Text to Speech and the thinking and research behind it.

Most people associate docassemble with assembling documents using guided interviews. That’s in the name, right? The program asks a few questions and out pops a completed form, contract or document. However, the documentation makes it quite clear that docassemble can do more:

Though the name emphasizes the document assembly feature, docassemble interviews do not need to assemble a document; they might submit an application, direct the user to other resources on the internet, store user input, interact with APIs, or simply provide the user with information.

In this post, let’s demonstrate how to use docassemble to call an API, get a response and provide it to a user. You can check out the completed code on GitHub. (NB: the git branch I recommend for following this post is blog. I am actively using this package, so I may add new features to the main branch that I don’t discuss here.)

GitHub - houfu/docassemble-googleTTS at blog
A docassemble interview that performs text to speech with Google Cloud

Problem Statement

I do a lot of internal training on various legal and compliance topics. I think I am a pretty all right speaker, but I have my limitations: I can’t give presentations 24/7, and my performance varies from session to session. Wouldn’t it be nice if I could give a presentation at any time, engagingly and consistently?

I could record my voice, but I did not like the result.

I decided to use a text to speech program instead, like the one provided by Google Cloud Platform. I created a computerised version of my speech in the presentation. My audience welcomed this version as it was more engaging than a plain PowerPoint presentation. Staff whose first language was not (Singapore) English also found the voice clear and understandable.

The original code was a terminal-based script, which was great for developing something fast. However, as more of my colleagues became interested in incorporating such speech in their presentations, I needed something more user-friendly.

Let the Robots do the talking — Exploring TTS
Speaking has always been a big part of being a lawyer. You use your voice to make submissions in the highest courts of the land. Even in client meetings, you are also using your voice to persuade. Hell, when I write my emails, I imagine saying what I am writing
I detailed my early exploits in this blog post last year.

I already have a docassemble installation at work, so it was convenient to build on that. The program would have to do the following:

  • Ask the user what text they want to transform into speech
  • Allow the user to modify some properties of the speech (speed, pitch etc.)
  • Call the Google TTS API, grab the sound file and provide it to the user to download

Assumptions

To follow this tutorial, you will need the following:

  • A working docassemble install. You can start up an instance on your laptop by following these instructions.
  • A Google Cloud Platform (GCP) account with a service account enabled for Google TTS. You can follow Google’s instructions here to set one up.
  • The Playground provided in docassemble. If you’d like to use an IDE, you can, but I won’t be providing instructions on things like creating files that follow the directory structure of a docassemble package.
  • Some basic knowledge of docassemble. I won’t be going through in detail how to write a block. If you can follow the Hello World example, you should have sufficient knowledge to follow this tutorial.

A Roadmap of this Tutorial

In the next part of this post, I talk about the thinking behind creating this interview and how I got the necessary information (off the web) to make it.

In Part 2, we get the groundwork done by creating four pages. This provides us with a visual idea of what happens in this interview.

[Part 2] Do more with docassemble: Start a project and write a few questions
For this part of the docassemble, Google Text to Speech project, we get the groundwork done by creating four pages. This provides us with a visual idea of what happens in this interview.

In Part 3, I talk about docassemble's background action and why we should use it for this interview. Merging the visual requirements with code, we now have a clearer picture of what we need to write.

[Part 3] Do more with docassemble: Getting work done in a background action
In this part of the docassemble, Google TTS tutorial, I talk about docassemble’s background action and why we should use it for this interview.

In Part 4, we work with an external API by using a client library for Python. We install this client library in our docassemble’s Python environment and write a Python module.

[Part 4] Do more with docassemble: Calling Google Text To Speech 🎺
So far, all the work we have done is on our docassemble install, which frankly has been quite a breeze. Now we come to the most critical part of this tutorial: working with an external service, Google Text to Speech. Different considerations come into play when working with others.

In Part 5, we finish the interview by coding the end product: an audio file in the guise of a DAFile. You can run the interview and get your text transformed into speech now! I also give some ideas of what else you might want to do in the project.

[Part 5] Do more with docassemble: Provide an audio file for your user to download 💾
In the final part of this docassemble, Google TTS tutorial, we provide an audio file for the user to download.

Part 1: Familiarise yourself with the requirements

To write a docassemble interview, it makes sense to develop it backwards. In a simple case, you would like docassemble to fill in a form. So you would get a form, figure out its requirements, and then write questions for each requirement.

An API call is not a contract or a form, but your process is the same.

Based on Google’s quickstart, this is how the Python library synthesises speech:

from google.cloud import texttospeech

# Create the client (it picks up your service account credentials)
client = texttospeech.TextToSpeechClient()

# Set the text input to be synthesized
synthesis_input = texttospeech.SynthesisInput(text="Hello, World!")

# Build the voice request, select the language code ("en-US") and the ssml
# voice gender ("neutral")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)

# Select the type of audio file you want returned
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

From this example code, you need to provide the program with the input text (synthesis input), the voice, and audio configuration options to synthesise speech.
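What comes back matters too. In Google’s quickstart, the audio arrives as raw bytes in response.audio_content and is simply written to a file. Here is a minimal sketch of that last step in plain Python. The save_audio name and the default file name are my own choices for illustration, not part of the client library, and later parts of this tutorial hand the audio to docassemble rather than writing to disk like this.

```python
from pathlib import Path


def save_audio(audio_content: bytes, destination: str = "output.mp3") -> Path:
    """Write raw audio bytes to a file and return the path written.

    `save_audio` is a hypothetical helper for illustration only.
    """
    if not audio_content:
        # An empty payload usually means the request did not produce audio.
        raise ValueError("No audio content to write")
    path = Path(destination)
    path.write_bytes(audio_content)
    return path


# With the quickstart code above you would call it like this:
# save_audio(response.audio_content, "hello.mp3")
```

Keeping this step separate from the API call makes it easy to swap the destination later, for example when the file should end up somewhere a user can download it.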

That looks pretty straightforward, so you might be tempted to dive into it immediately.

However, I would recommend going through the documents provided online.

  • docassemble provides some of the most helpful documentation, great for varying proficiency levels.
  • Google’s Text to Speech documentation is more typical of a product offered by a big tech company. Demos, use cases and guides help you get started quickly. You’re going to have to dig deep to find the reference for Python, though. It receives less love than the other programming languages.

Reading the documentation is vital, especially when you want to use a third-party service: it tells you what’s available and how to make full use of it. For example, going through the docs is the best way to find out what docassemble is capable of and learn about existing features, such as transforming a Python list of strings into a human-readable list complete with an “and”.
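To make that last example concrete, here is roughly what such a helper does. docassemble ships its own function for this (comma_and_list, if I recall the name correctly), so you would not write this yourself in an interview. The human_readable_list function below is a plain-Python stand-in of my own, not docassemble’s implementation.

```python
def human_readable_list(items, conjunction="and"):
    """Join items into a phrase such as 'a, b, and c'.

    Hypothetical stand-in to illustrate the idea; in a docassemble
    interview you would use the built-in helper instead.
    """
    words = [str(item) for item in items]
    if not words:
        return ""
    if len(words) == 1:
        return words[0]
    if len(words) == 2:
        return f"{words[0]} {conjunction} {words[1]}"
    # Three or more items: commas between them, conjunction before the last.
    return ", ".join(words[:-1]) + f", {conjunction} {words[-1]}"


print(human_readable_list(["speed", "pitch", "voice"]))
```

Finding out that a helper like this already exists is exactly the kind of thing you only learn by reading the docs first.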

You don’t have to follow the quickstart if it does not meet your use case. Going through the documentation, I figured out that I wanted to give the user a choice of which voice to use rather than letting Google select one for me. Furthermore, audio options such as speaking speed will be handy, since non-native listeners may appreciate slower speech. Finally, I don’t think the user needs to select a specific file format, as MP3 should be fine.
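Working backwards from those decisions, the interview has to collect the text, a voice, a speaking rate and a pitch, while the file format stays fixed. The sketch below captures that in plain Python. SpeechOptions and its method names are hypothetical, the voice name is only an example (the list of voices changes, so check what is currently offered), and the numeric limits are what I understand Google’s AudioConfig documentation to state, so verify them before relying on them.

```python
from dataclasses import dataclass

# Assumed limits, taken from my reading of Google's AudioConfig docs.
SPEAKING_RATE_RANGE = (0.25, 4.0)  # 1.0 is normal speed
PITCH_RANGE = (-20.0, 20.0)        # semitones relative to the default


@dataclass(frozen=True)
class SpeechOptions:
    """What the interview asks the user for (hypothetical container)."""

    text: str
    voice_name: str
    language_code: str = "en-US"
    speaking_rate: float = 1.0
    pitch: float = 0.0

    def __post_init__(self):
        if not self.text.strip():
            raise ValueError("There is no text to synthesise")
        low, high = SPEAKING_RATE_RANGE
        if not low <= self.speaking_rate <= high:
            raise ValueError(f"speaking_rate must be between {low} and {high}")
        low, high = PITCH_RANGE
        if not low <= self.pitch <= high:
            raise ValueError(f"pitch must be between {low} and {high}")

    def voice_kwargs(self) -> dict:
        """Keyword arguments intended for texttospeech.VoiceSelectionParams."""
        return {"language_code": self.language_code, "name": self.voice_name}

    def audio_kwargs(self) -> dict:
        """Keyword arguments intended for texttospeech.AudioConfig.

        The encoding is left out on purpose: it is fixed to MP3 at the call site.
        """
        return {"speaking_rate": self.speaking_rate, "pitch": self.pitch}


# A slightly slower voice for listeners who prefer it (example voice name).
options = SpeechOptions(
    text="Hello, World!", voice_name="en-US-Wavenet-D", speaking_rate=0.85
)
print(options.voice_kwargs(), options.audio_kwargs())
```

Each field that has no sensible fixed value becomes a question in the interview, and validating the numbers before calling the API means the user gets a friendly error instead of a failed request.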

Let’s move on!

This was a pretty short one. I hope I got you curious and excited about what comes next. Continue to the next part, where we get started on a project!