Love.Law.Robots. by Ang Hou Fu



I’ve wanted to pen down my thoughts on the next stage of the evolution of my projects for some time. Here I go!

What’s next after pdpc-decisions?

I had a lot of fun writing pdpc-decisions. It scraped the Personal Data Protection Commission’s enforcement decisions web page and produced a table, downloads and text. Suddenly, I had my own copy of the database! From there, I made visualisations, analyses and fun graphs.

All for free.

The “free” includes the training I got coding in Python and trying out various stages of software development, from writing tests to distributing a package as a module and a docker container.

In the lofty “what’s next” section of the post, I wrote:

The ultimate goal of this code, however, leads to my slow-going super-project, which I called zeeker. It’s a database of personal data protection resources in the cloud, and I hope to expand on the source material here to create an even richer database. So this will not be my last post on this topic.

I also believe that this is a code framework which can be used to scrape other types of legal cases, like those of the Supreme Court, the State Courts, or even the Strata Titles Board. However, given my interest in using enforcement decisions as a dataset, I started with the PDPC first. Nevertheless, someone might find it helpful, so if there is interest, please let me know!

What has happened since then?

For one, personal data protection commission decisions are no longer interesting enough for me. Since I worked on that project, the deluge of decisions has slowed to a trickle, as the PDPC appears to have shifted its focus to compliance and other cool techy projects.

Furthermore, there is much more interesting data out there: for example, the PDPC has produced many valuable guidelines which are currently unsearchable. As Singapore’s rules and regulations grow in complexity, much is hidden beneath the surface. The zeeker project shouldn’t focus only on a narrow area of law or on judgements and decisions.

In terms of system architecture, I made two other decisions.

Use more open-source libraries, and code less.

I grew more confident in my coding skills doing pdpc-decisions, but I used only a few basic libraries and hacked my way through the data. When I look back at my code, it is unmaintainable. Any change can break the library, and the bog of hacked-together code made it hard for me to understand what I was doing several months later. Tests, comments and other documentation help, but only if you’re a disciplined person. I’m not that kind of guy.

Besides writing code (which takes time and lots of motivation), I could also “piggyback” on the efforts of others to create a better stack. The stack I’ve settled on so far has made coding more pleasant.

There are also other programs I would like to try. For example, I plan to deliver the data through an API, so I don’t need to use Python to code the front end; a JavaScript framework like Next.js would be more effective for developing websites.
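To make that concrete, here is a minimal sketch (my own illustration, not zeeker’s actual code) of what delivering the data through an API could look like with FastAPI and SQLModel; the Decision model and the database file are hypothetical.

```python
# A minimal sketch of serving scraped decisions as JSON so that any front
# end, Next.js included, can consume the data. Model fields and the
# database file are hypothetical.
from typing import Optional

from fastapi import FastAPI
from sqlmodel import Field, Session, SQLModel, create_engine, select


class Decision(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    title: str
    url: str


engine = create_engine("sqlite:///zeeker.db")  # hypothetical database file
SQLModel.metadata.create_all(engine)
app = FastAPI()


@app.get("/decisions")
def list_decisions() -> list[Decision]:
    # The front end only ever sees JSON; Python stays behind the API.
    with Session(engine) as session:
        return session.exec(select(Decision)).all()
```

A Next.js front end would then simply fetch /decisions and render the JSON.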

Decoupling the project from the programming language also expands the palette of tools I can use. For example, instead of using a low-level Python library like pdfminer to “plumb” a PDF, I could use a self-hosted docker container like parsr to OCR or analyse the PDF and then convert it to text.
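As a rough illustration of the trade-off, here are both routes side by side. The pdfminer.six call is straightforward Python; the parsr call is a deliberately simplified assumption about its REST API (the real service queues jobs asynchronously), so treat it as a sketch and check the parsr documentation.

```python
# Two ways of getting text out of a PDF. The pdfminer.six call is real;
# the parsr call is a simplified assumption about its HTTP API, kept here
# only to show the shape of the approach.
import requests
from pdfminer.high_level import extract_text  # pip install pdfminer.six


def text_with_pdfminer(path: str) -> str:
    # "Plumbing" the PDF directly with a low-level Python library.
    return extract_text(path)


def text_with_parsr(path: str, base_url: str = "http://localhost:3001") -> str:
    # Hand the PDF to a self-hosted parsr container instead.
    # Endpoint name and response handling are assumptions for illustration.
    with open(path, "rb") as f:
        response = requests.post(f"{base_url}/api/v1/document", files={"file": f})
    response.raise_for_status()
    return response.text
```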

It’s about finding the best tool for the job, not depending only on my (mediocre) programming skills to bring results.

There is, of course, the issue of technical debt (if parsr stops being developed, my project could stall too). I think the risk is manageable because all the projects I picked are open source, and I choose well-documented and popular projects to reduce it further.

It’s all a pipeline, baby.

The only way the above is possible is a paradigm shift: from making one single package of code to thinking about the work as a process. The work breaks into discrete tasks, and each piece of code is suited to its particular task.

I was inspired to change the way I thought about zeeker when I saw the flow chart for OpenLaw NZ’s Data Pipeline.

OpenLaw NZ’s data pipeline structure looks complicated, but it’s easy for me to follow!

It’s made of several AWS components and services (with some Azure). The steps are small, like receiving an event, sending it to a serverless function, putting the data in an S3 bucket, and then running another serverless function.

The key insight is to avoid building a monolith. I am not committed to building a single program or website. Instead, the project is broken into smaller parts, each intended to do one small task well. In this instance, zeekerscrapers is only a scraper: it looks at a web page, takes the information already present there, and saves or downloads it. It doesn’t bother with machine learning, displaying the results or any other complicated processing.
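Since scrapy is already part of the stack, a zeekerscrapers-style spider might look something like the sketch below; the URL, selectors and field names are made up for illustration, not taken from the real code.

```python
# A minimal scrapy spider in the spirit of "a scraper only scrapes".
# The start URL, CSS selectors and field names are illustrative.
import scrapy


class DecisionsSpider(scrapy.Spider):
    name = "decisions"
    start_urls = ["https://www.pdpc.gov.sg/All-Commissions-Decisions"]  # assumed

    def parse(self, response):
        # Take only what is already on the page and yield it for storage.
        for link in response.css("div.decision-listing a"):  # assumed selector
            yield {
                "title": link.css("::text").get(),
                "url": response.urljoin(link.attrib.get("href", "")),
            }
        # No machine learning, no rendering, no analysis: downstream tasks
        # in the pipeline handle those.
```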

Besides letting me use the right tool for each job, this approach is also easier to maintain.

The modularity also makes it simple to chop and change for different types of data. For example, you need to OCR a scanned PDF but not a digital one; if the workflow is a pipeline, you can simply take that task out of the pipeline. Furthermore, some tasks, such as downloading a file, are standard fare. If you have code you can reuse over several pipelines, you save a lot of coding time.
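In code terms, this is just composing small functions into a list of steps. The sketch below is my own toy illustration (in the cloud version, each step could just as well be a serverless function): the OCR step is swapped in only when the document needs it, and the download step is reused everywhere.

```python
# A toy pipeline: each step is a small function that does one job, and the
# list of steps can be chopped and changed per document type. All the
# function bodies here are placeholders.
from typing import Callable, Dict, List

Step = Callable[[Dict], Dict]


def download(doc: Dict) -> Dict:
    # Standard fare, reusable across pipelines.
    doc["local_path"] = f"/tmp/{doc['id']}.pdf"  # placeholder
    return doc


def ocr(doc: Dict) -> Dict:
    doc["text"] = "(text recovered by OCR)"  # placeholder for e.g. a parsr call
    return doc


def extract_digital_text(doc: Dict) -> Dict:
    doc["text"] = "(text read straight from a digital PDF)"  # placeholder
    return doc


def build_pipeline(scanned: bool) -> List[Step]:
    # Swap the OCR step in or out depending on the kind of PDF.
    return [download, ocr if scanned else extract_digital_text]


def run(doc: Dict, steps: List[Step]) -> Dict:
    for step in steps:
        doc = step(doc)
    return doc


result = run({"id": "2021-01"}, build_pipeline(scanned=True))
```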

On the other hand, I would be relying heavily on cloud infrastructure to accomplish this, which is by no means cheap or straightforward.

Experiments continue


I have been quite busy lately, so I have yet to develop this at the pace I would like. For now, I have been converting pdpc-decisions to zeeker. It’s been a breeze, even though it has taken me quite some time.

On the other hand, my leisurely pace has also allowed me to think about more significant issues, like what I can generalise and whether I will get bad vibes from this code in the future. Hopefully, the other scrapers can develop at breakneck speed once I finish thinking through these issues.

I have also felt more and more troubled by what to prioritise. Should I write more scrapers? Scrape what? Should I focus on adding features to the existing scrapers (like entity extraction and summarisation)? When should I start writing the front end? When should I start advertising this project?

It’d be great to hear your comments. Meanwhile, keep watching this space!

#zeeker #Programming #PDPC-Decisions #Ideas #CloudComputing #LegalTech #OpenSource #scrapy #SQLModel #spaCy #WebScraping


I run a docassemble server at work, ostensibly to introduce co-workers to a different way of using templates to generate their agreements. It’s been pretty useful, so much so that I use it myself for my own work. However, due to the pandemic, it hasn’t been easy to go around and sell it. Maybe I will have better luck soon.

In the meantime, I decided to move the server from AWS to DigitalOcean.

Why move?

I liked the wide variety of features available on AWS, such as CodeCommit, Lambda and SES. DigitalOcean is not comparable in that regard. If I wanted to build a whole suite of services around my application, I would probably find what I needed somewhere on AWS’s glorious one-page list of services.

However, with great functionality comes great complexity. I got a headache trying to exploit it, and I was never going to make full use of the ecosystem. (I shall never scoff at AWS certification again.)

On the other hand, I was more familiar with DigitalOcean and liked their straightforward pricing. So, if I wanted to move my pet project somewhere, I would like it to be in my own backyard.

Let's get moving!

Lesson 1: Respect the shutdown

The docassemble docs expressly ask you to shut down your docassemble server gracefully. This is not the plain docker stop <container> command, but docker stop with a generous timeout flag. Forgetting the timeout flag isn’t fatal in many simple use cases, so you might never notice the difference.
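For reference, the graceful version looks something like the command below; the container name and timeout value are placeholders, so check the docassemble documentation for the recommended figure.

```shell
# Graceful shutdown: give docassemble time to dump its state to storage.
# Container name and timeout value are placeholders; consult the
# docassemble docs for the recommended timeout.
docker stop -t 600 docassemble
```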

However, there's another way to kill your server in the cloud — flip the switch on your cloud instance on the management console. It doesn't feel like that when you click the red button, but it has the same effect. The cloud instance is sent straight to heaven, and there is nothing you can do about it.

The shutdown matters because docassemble does quite a lot of work when it shuts down: it dumps the database records into your storage. If the storage is in the cloud (like AWS’s S3 or DigitalOcean’s Spaces), there is some lag while all the files are sent there. If the shutdown is not respected, the server’s state is not saved, and you might not be able to restore it when you next start the container.

So, with my AWS container gone in a cloud of dust, I found that the files in my S3 storage had not been updated. The last copy was from several months ago, the last time I had shut down my container properly. That meant several months of work was gone! 😲

Lesson 2: Restore from backup

This blog post could have ended on that sad note. Luckily for CloudOps newbies like me, docassemble automatically stores backups of the server state. These are kept in the backup folder of your storage, arranged by date.

If you, like me, borked your docassemble server and sent it back to August 2020, you can grab your latest backup and replace the main directory files (everything outside backup) with it. The process is described in the docassemble docs. Instead of having no users beyond August 2020, I managed to retrieve all my users from the Postgres database stored in the backups. Phew!
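If your storage is an S3 bucket, the recovery is essentially a copy within the bucket. The sketch below uses the standard AWS CLI, with an illustrative bucket name and backup date; the exact layout of the backup folder is described in the docassemble docs.

```shell
# Illustrative only: copy a dated backup over the main directory of the
# same bucket. Bucket name and date are placeholders.
aws s3 sync s3://my-docassemble-bucket/backup/2021-06-01/ s3://my-docassemble-bucket/
```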

Lesson 3: Check your config.yml file

After this exercise, I decided to go with a DigitalOcean Droplet and AWS S3. Given that I was already on S3 and that S3’s costs are fairly negligible, this seemed like a cost-effective combo. A DigitalOcean Space costs US$5 a month no matter how big it gets, whereas my S3 usage rarely comes to more than a dollar.

Before giving your new docassemble server a spin, check your config.yml file. You can specify environment variables when you start a container, but once the server is up and running, it uses the config.yml file found in the storage. If that configuration file was set up specifically for AWS, your server might not run properly on DigitalOcean. This means you have to download the config.yml file from the storage (I used the S3 web console) and edit it manually to fit your new server.

In my setup, the original configuration file was written for an AWS environment: my EC2 instance used IAM security policies to access S3, which simplified the initial set-up of the server. My Droplet cannot use those features, however, so I had to generate an access key and secret key and put these details (and more) into the updated config.yml file. Oh, and turn off ec2.
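For what it’s worth, the storage-related part of my updated config.yml ended up looking roughly like the sketch below; the key names follow docassemble’s s3 configuration, and the values are obviously placeholders.

```yaml
# A rough sketch of the storage section of config.yml after the move.
# Key names follow docassemble's s3 configuration; all values are placeholders.
s3:
  enable: True
  access key id: YOURACCESSKEYID
  secret access key: YOURSECRETACCESSKEY
  bucket: my-docassemble-bucket
  region: ap-southeast-1
# A Droplet cannot use EC2 instance roles, so turn this off:
ec2: False
```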

If you are going to use Spaces instead, transfer the files from your old S3 bucket to Spaces (I used s4cmd) and fill in the corresponding storage details in the configuration file.

Conclusion

To be honest, the migration was essentially painless. The design of the docassemble server allows it to be restored from a single source of truth: the storage method you choose. Except for the problems that come from hand-editing your old config.yml (I had to retype my SecretKey a few times 😢), you probably won’t need to enter the docker container and read the initialize error logs. Given my positive experience, I would be well prepared to move back to AWS again! (Just kidding, for now.)

#tech #docassemble #AWS #DigitalOcean #docker #OpenSource #tutorial #CloudComputing
