Love.Law.Robots. by Ang Hou Fu

docker


After receiving support requests and development ideas for redlines from all over the place, I decided to try running a #Matrix room to centralise the discussion and provide an informal venue for weird and “stupid” questions. Since I was in the mood for experimentation, I also wanted to run robots in my room. For my redlines room, I wanted a robot to track the activity on the GitHub repo and welcome new users by posting the purpose of the room, etc.

I was surprised to find very little documentation, and very few bots, on how to run a Matrix bot. I decided to go with maubot because I like the idea of plugins. While the interface and the docs leave something to be desired, it’s actually relatively straightforward. Here’s a short write-up/#tutorial of how I did it in case it helps someone out there.


I run a docassemble server at work, ostensibly introducing co-workers to a different way of using templates to generate their agreements. It's been pretty useful, so much so that I use it myself for my work. However, due to the pandemic, it's not been easy to go and sell it. Maybe I am going to have better luck soon.

In the meantime, I decided to move the server from AWS to DigitalOcean.

Why move?

I liked the wide variety of features available on AWS, such as CodeCommit, Lambda and SES. DigitalOcean is not comparable in that regard. If I wanted to create a whole suite of services for my application, I would probably find something on AWS's glorious one-page filled with services.

However, with great functions come great complexity. I had a headache trying to exploit them. I was not going to be able to make full use of their ecosystem. (I shall never scoff at AWS certification anymore.)

On the other hand, I was more familiar with DigitalOcean and liked their straightforward pricing. So, if I wanted to move my pet project somewhere, I would have liked it to be in my backyard.

Let's get moving!

Lesson 1: Respect the shutdown

The docassemble docs expressly ask you to shut down your docassemble server gracefully. This means running not the plain docker stop <container> command, but one with the timeout flag (-t), so that the container has time to finish its work. Forgetting the timeout flag isn't fatal in many simple use cases, so you might never notice the problem.

However, there's another way to kill your server in the cloud — flip the switch on your cloud instance on the management console. It doesn't feel like that when you click the red button, but it has the same effect. The cloud instance is sent straight to heaven, and there is nothing you can do about it.

The shutdown is important because docassemble does quite a lot of work when it shuts down. It dumps the database records in your storage. If the storage is located in the cloud (like AWS's S3 or DigitalOcean's Spaces), there is some lag when sending all the files there. If the shutdown is not respected, the server's state is not saved, and you might not be able to restore it when you start the container.
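A graceful shutdown can be wrapped in a small helper so that the timeout is never forgotten. This is a minimal sketch, not something from the docassemble project: the function names are made up for illustration, and the 600-second default is an assumption that you should check against the docassemble docs for your own setup.

```python
import subprocess


def graceful_stop_command(container: str, timeout_seconds: int = 600) -> list[str]:
    """Build a `docker stop` command that waits before killing the container."""
    # -t sets how long Docker waits after SIGTERM before it sends SIGKILL.
    # The 600-second default is an assumption, not a measured value.
    return ["docker", "stop", "-t", str(timeout_seconds), container]


def graceful_stop(container: str, timeout_seconds: int = 600) -> None:
    """Stop the container and raise if Docker reports a failure."""
    subprocess.run(graceful_stop_command(container, timeout_seconds), check=True)
```

Calling graceful_stop("docassemble") would then give the container up to ten minutes to dump its state before Docker gives up on it.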

So with my AWS container gone in a cloud of dust, I found that the files in my S3 storage had not been updated. The last copy was several months old, dating from the last time I had shut down my container normally. This meant that several months of work was gone! 😲

Lesson 2: Restore from backup

This blog could have ended on that sad note. Luckily for CloudOps newbies like me, docassemble automatically stores backups of the server state. These are stored in the backup folder of your storage and are arranged by date.

If you, like me, borked your docassemble server and set it back to August 2020, you can grab your latest backup and replace the main directory files (outside backup). The process is described in the docassemble docs here. Instead of having no users back in August 2020, I managed to retrieve all my users in the Postgres database stored in the backups. Phew!
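Picking the latest backup by eye is easy to get wrong, so here is a minimal sketch of doing it in code. It assumes, purely for illustration, that the backup folders are named as ISO dates (YYYY-MM-DD); the names in your own storage may look different, so check them before relying on this.

```python
from datetime import date


def latest_backup(folder_names: list[str]) -> str:
    """Return the most recent folder, assuming names are ISO dates (YYYY-MM-DD)."""
    dated = []
    for name in folder_names:
        try:
            dated.append((date.fromisoformat(name), name))
        except ValueError:
            continue  # skip anything that is not a date-named folder
    if not dated:
        raise ValueError("no date-named backup folders found")
    # Tuples compare by date first, so max() picks the newest backup.
    return max(dated)[1]
```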

Lesson 3: Check your config.yml file

After this exercise, I decided to go with a DigitalOcean Droplet and AWS S3. Given that I was already on S3 and the costs of S3 are actually fairly negligible, this seemed like a cost-effective combo. DigitalOcean Spaces cost $5 no matter how big they are, whereas my S3 usage rarely comes up to more than a dollar.

Before giving your new docassemble server a spin, do check your config.yml file. You can specify environment variables when you start a container, but once the server is running free, it uses the config.yml file found in the storage. If the configuration file was specially set for AWS, your server might not be able to run properly on DigitalOcean. This means you have to download the config.yml file on the storage (I used the web interface of S3 to do this) and edit it manually to fit your new server.

In my setup, my original configuration file was set up for an AWS environment. This meant that my EC2 instance used security policies to access S3. At the time, it simplified the set-up of the server. However, my Droplet cannot use these features. Instead, generate an access key and secret key, and enter these details (and more) in your updated config.yml file. Oh, and turn off ec2.

If you are going to use Spaces, you will need to transfer the files from your old S3 bucket to Spaces (I used s4cmd) and fill in the details of your Space in the S3 settings of the configuration file.
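The changes to the configuration described in this lesson can be sketched as a small function working on the parsed config. This is only an illustration under assumptions: the key names ("ec2", "s3", "enable", "access key id", "secret access key", "bucket") are what I would expect in a docassemble config.yml, so verify them against the docassemble documentation, and the function name is made up.

```python
import copy


def retarget_config(config: dict, access_key: str, secret_key: str, bucket: str) -> dict:
    """Return a copy of the config that uses explicit S3 keys instead of EC2 roles."""
    updated = copy.deepcopy(config)  # leave the original untouched
    updated["ec2"] = False  # the Droplet is not an EC2 instance
    s3 = updated.setdefault("s3", {})
    # Key names below are assumptions about the config format.
    s3.update({
        "enable": True,
        "access key id": access_key,
        "secret access key": secret_key,
        "bucket": bucket,
    })
    return updated
```

Loading and saving the YAML file itself is left out here; any YAML library would do for that step.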

Conclusion

To be honest, the migration was essentially painless. The design of the docassemble server allows it to be restored from a single source of truth: the storage method you choose. Except for the problems that come from hand-editing your old config.yml (I had to type my SecretKey a few times 😢), you probably don't need to enter the container and read initialisation error logs. Given my positive experience, I will be well prepared to move back to AWS again! (Just kidding for now.)

#tech #docassemble #AWS #DigitalOcean #docker #OpenSource #tutorial #CloudComputing

Author Portrait Love.Law.Robots. – A blog by Ang Hou Fu


I’m always trying my best to be a good husband. Unfortunately, compared to knitting, cooking and painting, computer programming does not look essential. You sit in front of a computer, furiously punching on your keyboard. Something works, but most people can’t see it. Sure, your phone works, but how did your sitting around contribute to that? Geez. It’s time to contribute around the house by automating with Python!

The goal is to create a script that will download a sudoku puzzle from daily sudoku and print it at home. My wife likes doing these puzzles, so I was sure that she would appreciate having one waiting for her to conquer in the printer tray.

You can check out the code in its repository and the docker hub for the image. An outline of the solution is provided below:

  1. Write a Python script that does the following things: (1) sets up a schedule to get the daily sudoku and downloads it, and (2) sends an email to my HP Printer email address.
  2. HP ePrint prints the puzzle.
  3. Dockerize the script so that it can be run on my home server.
  4. Wait patiently around 9:25 am every day at the printer for my puzzle.

Coding highlights

This code is pretty short since I cobbled it together in about 1 night. You can read it on your own, but here are some highlights.

Download a file by constructing its path

As I mentioned before, you have to study your quarry carefully in order to use it. For the daily sudoku website, I found a few possibilities to automate the process of getting your daily sudoku.

  • Visit the main web page and “click” on the link using an automated web browser.
  • Parse the RSS feed and get the puzzle you would like.
  • “Construct” the URL to the PDF of the puzzle of the day.

I went with the last option because it was the easiest to implement without needing to download additional packages or pursue extra steps.

from datetime import datetime
import requests

now = datetime.now()
r = requests.get(
    f"http://www.dailysudoku.com/sudoku//pdf/{now.year}/"
    f"{now.strftime('%m')}/{now.strftime('%Y-%m-%d')}_S1_N1.pdf",
    timeout=5)

Notice that the Python code uses f-strings and strftime, which provide a text template for you to fill in your URL.

This method ain’t exactly foolproof. If the structure of the website is changed, the whole code is useless.
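One cheap guard against a changed website is to check that what came back is really a PDF before sending it to the printer. This is a minimal sketch and not part of the original script; the function name is made up for illustration.

```python
def looks_like_pdf(status_code: int, content: bytes) -> bool:
    """Return True only if the download succeeded and the body starts like a PDF."""
    # PDF files begin with the magic bytes "%PDF-".
    return status_code == 200 and content.startswith(b"%PDF-")
```

With a requests response r, the check would be looks_like_pdf(r.status_code, r.content), and the script could skip the email when it returns False.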

Network printing — a real PITA

My original idea was to send a PDF directly to a network printer. However, it was far more complicated than I expected. You could send a file through the Internet Print Protocol, Line Printer Daemon or even HP’s apparently proprietary port 9100. First, though, you might need to convert the PDF file to a Postscript file. Then locate or open a socket… You could install CUPS in your container…

Errm never mind.

Sending an email to print

Luckily for me, HP can print PDF attachments sent by email. It turns out that sending a simple email using Python is quite straightforward.

from email.message import EmailMessage
from smtplib import SMTP

msg = EmailMessage()
msg['To'] = printemail
msg['From'] = smtpuser
msg['Subject'] = 'Daily sudoku'
# HP ePrint needs a filename on the attachment, not just the MIME type
msg.add_attachment(r.content, maintype='application', subtype='pdf', filename='sudoku.pdf')
with SMTP(smtpserver, 587) as s:
    s.starttls()
    s.login(smtpuser, smtppassword)
    s.send_message(msg)

However, HP’s requirements for sending a valid email to their HP ePrint servers are kind of quirky. For example, your attachment will not print if:

  • There is no subject in the email.
  • The attachment has no filename (stating the MIME type of the file is not enough)
  • The sender is not a permitted user. You have to go to the HP Connected website to set these allowed senders manually.
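The first two requirements in the list above can be checked locally before the email leaves your machine. Here is a minimal sketch using only the standard library; the function name is hypothetical, and the allowed-sender rule cannot be checked this way because it lives on HP's side.

```python
from email.message import EmailMessage


def eprint_problems(msg: EmailMessage) -> list[str]:
    """List the locally checkable reasons the print job might be rejected."""
    problems = []
    if not msg["Subject"]:
        problems.append("missing subject")
    attachments = list(msg.iter_attachments())
    if not attachments:
        problems.append("no attachment")
    for part in attachments:
        # Stating the MIME type is not enough; the part needs a filename too.
        if not part.get_filename():
            problems.append("attachment without a filename")
    return problems
```

An empty list means the message passes these two checks; anything else is worth fixing before calling send_message.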

Setting the local timezone for the docker container

The schedule package does not deal with time zones. To be fair, if you are not really serving an international audience, that’s not too bad. However, for a time-sensitive application like this, there’s a big difference between waiting for your puzzle at 9:30 am local time and 9:30 am UTC (that’s nearly time to knock off work in Singapore!).

Setting your time zone in a docker container depends on the base image of the Operating System you used. I used Debian for this container, so the code is as follows.

RUN ln -sf /usr/share/zoneinfo/Asia/Singapore /etc/localtime

Note that the script sleeps for 3 hours before executing pending jobs. This means that while the job is submitted at 9:30 am, it may be quite some time later before it is executed.
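The gap between UTC and local time discussed in this section is easy to check in Python. This sketch uses a fixed UTC+8 offset for Singapore, which works because Singapore does not observe daylight saving time; for other places a proper time zone database would be the safer choice. The function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Singapore is UTC+8 all year round (no daylight saving time).
SGT = timezone(timedelta(hours=8), name="SGT")


def utc_to_singapore(moment: datetime) -> datetime:
    """Convert a timezone-aware UTC datetime to Singapore time."""
    return moment.astimezone(SGT)
```

A job that fires at 9:30 am UTC therefore lands at 5:30 pm in Singapore, which is why the container's time zone matters.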

Environment variables

The code does not make it very obvious, but you must set environment variables in order to use the script. (Oh, read the README, for crying out loud.) This can be done in a cinch with a docker-compose file.

sudoku:
  image: "houfu/dailysudoku:latest"
  hostname: sudoku
  container_name: dailysudoku
  environment:
    - PRINT[email protected]
    - SMTPSERVER=smtp.gmail.com
    - SMTP[email protected]
    - SMTP-PWD=blah blah blah

Update 17/6/2020: There was a typo in the address of Gmail’s SMTP server and this is now rectified.
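A script that depends on environment variables is friendlier when it fails early and names what is missing. This is a minimal sketch, not the code in the repository; the function name and the variable names in the usage note are hypothetical, so use the ones listed in the README.

```python
import os


def require_env(names: list[str], environ=os.environ) -> dict[str, str]:
    """Return the requested variables, or raise naming every one that is missing."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

At start-up the script could call something like require_env(["SMTP_SERVER", "SMTP_USER"]) and stop immediately if the docker-compose file left anything out.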

Conclusion

I hastily put this together. It’s more important to have a puzzle in your hand than configuration variables and plausibly smaller containers. Since the project is personal, I don’t think I will be updating it much unless it stops working for me. I’ll be happy to hear if you’re using it and you may have some suggestions.

In the meantime, do visit www.dailysudoku.com for more puzzles, solutions, hints, books and other resources!

#Programming #docker #Python


Things you can only do during a lockdown: install a new server. As I mentioned previously, I got round to installing the latest Ubuntu LTS on my home server. Instead of using apt-get for new software, I would run my server services through Docker. First up, I got Pi-Hole working and blocking ads. It's been sweet.

Let’s Play with: Pi-Hole. I try to install Pi-Hole Server to block all ads and tracking websites at home. (Love.Law.Robots.)

My conviction to use containers started with docassemble. You can use docassemble to generate contracts from answering questions. It's relevant to my work and I am trying to get more of my (non-legal) colleagues to try it. Unlike other software I have tried, docassemble recommends just using docker. With one command, docker run -d jhpyle/docassemble, I would get a fully-featured server. My mind was blown.

Docassemble: A free, open-source expert system for guided interviews and document assembly, based on Python, YAML, and Markdown.

However, as I became more familiar with how to get docker to do what I want, the limitations of that simple command began to restrict me. Docassemble uses several ports. Many other applications share the same port, especially for a web server: 80 and 443. If docker and docassemble took these ports, no one else was going to get them. I wasn't sure if I wanted my home server to be just a docassemble server.

Furthermore, using secure ports (HTTPS) became a serious problem. I wanted to use my home server's docassemble installation as a development base, so it should be accessible to the outside world. For some reason, docassemble wouldn't accept my wildcard certs. If I planned to use it for anything serious, having an unsecured website was impossible.

It got so frustrating that I gave up.

Enter the Reverse-Proxy: Traefik

The short answer to my problems was to use a reverse proxy. A reverse proxy is a kind of server that gets information from another server for a client. Or in this case, a traefik server receives a request and figures out which docker container it should go to. A traefik server can also do other things, such as providing end to end security for your communications by obtaining free SSL certificates from Let's Encrypt.
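The core idea of a reverse proxy, matching the requested host name to the right backend, can be sketched in a few lines of Python. This is a toy illustration and not how Traefik is implemented; the function name and the backend addresses in the usage note are made up.

```python
from typing import Optional


def pick_backend(host_header: str, routes: dict[str, str]) -> Optional[str]:
    """Return the backend address registered for the requested host, if any."""
    # Host headers may carry a port ("example.com:443") and vary in case.
    hostname = host_header.split(":")[0].lower()
    return routes.get(hostname)
```

With routes such as {"docassemble.example.com": "http://docassemble:80"}, a request for docassemble.example.com is handed to the docassemble container, while an unknown host gets None and would be refused.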

Traefik: Traefik Documentation

I was convinced to go for this because it claimed that it would make “publishing your services a fun and easy experience”. When I read that, I let a tear go. Is it actually possible for this program to automatically detect the right configuration for your services? Even for something as big as docassemble?

I'll let you be the judge of that at the end of this article.

Step 1: Set up Traefik

Of course, you would need to have docker set up and good to go.

There are a bunch of ways to get Traefik going, but I would be using a docker-compose.yml file like in the QuickStart.

The documentation for docassemble does not mention anything about docker compose. It is a shame because I found it to be a more user-friendly tool than the docker command-line interface. So instead of writing a bash script just to shorten my long docker run command, I would write out the blueprint of my setup in the docker-compose.yml. After that, I can run docker-compose up -d and the services in the file will start all together.

This is very important in my setup, because there are other services in my home server like plex or grocy (another lockdown project) too. For the sake of convenience, I decided to include all these projects in the same docker-compose.yml file. This is the blueprint of my home server!

Back to Traefik, this is the section of my docker-compose.yml file setting out the reverse proxy server:

services:
  reverse-proxy:
    # The official v2 Traefik docker image
    image: traefik:v2.2
    container_name: traefik
    # Enables the web UI and tells Traefik to listen to docker
    command: --api.insecure=true --providers.docker
    ports:
      # The HTTP/HTTPS port
      - "80:80"
      - "443:443"
      # The Web UI (enabled by --api.insecure=true)
      - "8080:8080"
    volumes:
      # So that Traefik can listen to the Docker events
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/houfu/traefik/:/etc/traefik/
    environment:
      DO_AUTH_TOKEN: XXX
    restart: unless-stopped

Just a few notes:

  • This line /home/houfu/traefik/:/etc/traefik/ under volumes allows me to have access to the configuration file used by traefik.
  • This line DO_AUTH_TOKEN: XXX under environment is to generate SSL certificates using my personal domain, which is managed by DigitalOcean.

Step 2: Prepare Traefik to generate SSL Certificates

Instead of having docassemble obtain the SSL certificates to use HTTPS, I decided to get Traefik to do it instead. Reverse proxies do this job much better, and I wouldn't need to “enter” the docassemble container to hunt down why SSL is not working.

Besides, my other services on my home server were already getting their certificates through Traefik, so getting docassemble to do the same would be straightforward, right?

For this step, you would need to define a certificate resolver for Traefik to use. Please read the documentation as it is quite informative. For my set-up, I decided to use DigitalOcean as I was already using it for my DNS.

In the configuration file (traefik.toml), add a section to define the certificate resolver.

[certificatesResolvers.docassembleResolver.acme]
  email = "[email protected]"
  storage = "acme.json"

  [certificatesResolvers.docassembleResolver.acme.dnsChallenge]
    # used during the challenge
    provider = "digitalocean"

The final step, especially if you have chosen DigitalOcean as a provider, is to get an API key and provide it to Traefik so that the process of getting a certificate can be automated. This was the DO_AUTH_TOKEN in the docker-compose.yml file referred to in the first step.

Step 3: Provide a blueprint for the Docassemble service

Once we have the reverse proxy set up, it’s time to get docassemble to run. This is the final form of the docker-compose.yml file for the docassemble service.

docassemble:
  image: "jhpyle/docassemble:latest"
  hostname: docassemble
  container_name: docassemble
  stop_grace_period: 1m30s
  environment:
    - CONTAINERROLE=all
    - DBPREFIX=postgresql+psycopg2://
    - DBNAME=docassemble
    - DBUSER=docassemble
    - DBPASSWORD=abc123
    - DBHOST=localhost
    - USEHTTPS=false
    - DAHOSTNAME=docassemble.example.com
    - USELETSENCRYPT=false
    - S3ENABLE=true
    - S3ACCESSKEY=ABCDEFGH
    - S3SECRETACCESSKEY=1234567
    - S3BUCKET=docassemble
    - S3ENDPOINTURL=https://xxxx.sgp1.digitaloceanspaces.com
    - TIMEZONE=Asia/Singapore
    - DAPYTHONVERSION=3
  labels:
    - traefik.backend=docassemble
    - traefik.http.routers.docassemble.rule=Host(`docassemble.example.com`)
    - traefik.http.services.docassemble.loadbalancer.server.port=80
    - traefik.http.routers.docassemble.tls=true
    - traefik.http.routers.docassemble.tls.certresolver=docassembleResolver
    - traefik.http.middlewares.docassemble-redirect.redirectscheme.scheme=https
    - traefik.http.middlewares.docassemble-redirect.redirectscheme.permanent=true
    - traefik.http.routers.docassemble.middlewares=docassemble-redirect

One of the most important aspects of setting up your own docassemble server is figuring out the environment variables. The docassemble documentation recommends that we use an env.list file or pass a list of configuration values to the docker run command. For our docker-compose file, we pass them as a list under the environment section of the service blueprint. Feel free to add or modify these options as you need. For example, you can see that I am using DigitalOcean Spaces as my S3 compatible storage.

So where does the magic of Traefik’s automatic configuration come in? Innocuously, under the labels section of the blueprint. Let’s split this up for easy explanation.

labels:
  - traefik.backend=docassemble
  - traefik.http.routers.docassemble.rule=Host(`docassemble.example.com`)
  - traefik.http.services.docassemble.loadbalancer.server.port=80

In the first block of labels, we define the name and the host of the docassemble server. Traefik now knows what to call this server, and to direct queries from “docassemble.example.com” to this server. As docassemble exposes several ports, we also help prod traefik to use the correct port to access the server.

labels:
  - traefik.http.routers.docassemble.tls=true
  - traefik.http.routers.docassemble.tls.certresolver=docassembleResolver

In this block of labels, we tell Traefik to use HTTPS and to use the certificate provider we defined earlier to get these certificates.

labels:
  - traefik.http.middlewares.docassemble-redirect.redirectscheme.scheme=https
  - traefik.http.middlewares.docassemble-redirect.redirectscheme.permanent=true
  - traefik.http.routers.docassemble.middlewares=docassemble-redirect

Finally, we tell traefik to use a middleware here: a redirect. The redirect middleware ensures that users will use HTTPS to communicate with the server.

Note that in our environment variables for the docassemble server, we tell docassemble not to use https (“USEHTTPS=false”). This is because traefik is already taking care of it. We don’t need docassemble to bother with it.
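Since the labels follow a fixed pattern, they can be generated for any service you put behind the proxy. The following is a minimal sketch with a made-up function name; it only reproduces the router, service, TLS and redirect labels discussed in this step, so treat it as a starting point rather than a complete Traefik configuration.

```python
def traefik_labels(name: str, host: str, port: int, resolver: str) -> list[str]:
    """Build Traefik v2 labels for an HTTPS-only service behind the proxy."""
    router = f"traefik.http.routers.{name}"
    redirect = f"traefik.http.middlewares.{name}-redirect.redirectscheme"
    return [
        # Route requests for this host name to the service
        f"{router}.rule=Host(`{host}`)",
        # Tell Traefik which container port to talk to
        f"traefik.http.services.{name}.loadbalancer.server.port={port}",
        # Serve over TLS with certificates from the named resolver
        f"{router}.tls=true",
        f"{router}.tls.certresolver={resolver}",
        # Permanently redirect to HTTPS
        f"{redirect}.scheme=https",
        f"{redirect}.permanent=true",
        f"{router}.middlewares={name}-redirect",
    ]
```

Calling traefik_labels("docassemble", "docassemble.example.com", 80, "docassembleResolver") yields the same seven routing labels used for the docassemble service, ready to paste under labels.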

It works!

Docassemble servers take a bit of time to set up. But once you get it up, you will see my favourite screen in the entire application.

docassemble server is working. I would like to thank my...

Notice the grey padlock in the address bar of my Firefox browser? That’s right, HTTPS baby!!

Final Thoughts

I am glad I learnt a lot about docker from docassemble, and its documentation is top-notch for what it is. However, running one is not easy. Using docker-compose helped iron out some of the pain. In any case, I am glad I got over this. It’s time to get developing! What should I work on next?

#blog #docassemble #docker #tutorial #tech #Traefik #HTTPS


Update 11 May 2020 : A few days after I wrote this post, Pi-Hole released version 5.0. Some of the new features impact the content here. Since it’s only been days, I have updated the content accordingly.

It was a long weekend, so it was time to play. Ubuntu 20.04 LTS just came out. This is important because of the “LTS” at the back of its name. I took the opportunity to upgrade “Ursula”, my home server. I have not been installing OSes like changing my clothes since High School, but I had big plans for this one.

Ad Blocking on a Network Level

Securing your internet is tough. I have “fond” memories of earlier days of the internet when browsing the internet exposed you to porn. How about flash movies that install software on your computer? It now seems quaint that people are surprised that they can be tricked over the internet with phishing and social engineering.

I value my privacy and I would like to control what goes on about me and my computers. I don’t like ads or tracking technologies. More people seem to be on my side on this one: with every browser claiming that they will block ads or trackers.

Browsers are important because they are the main window for ads or trackers. However, other devices also generate such risks, such as handphones, smart gadgets, and other internet-connected devices.

If you are accessing the internet outside of your browser, your browser won’t protect you. The more comprehensive solution is to protect on a network level.

To protect yourself on a network level, you will adjust your internet router settings and how your internet traffic is processed so that all requests are caught. A blacklist of trackers and suspicious websites is usually maintained. If a query matches the blacklist, it is not processed.
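The blacklist check itself is simple enough to sketch in Python. This is a toy illustration of the idea and not how Pi-Hole works internally; blocking every subdomain of a listed domain is a design choice made here for the example, and the function name is made up.

```python
def is_blocked(domain: str, blocklist: set[str]) -> bool:
    """Return True if the domain or any parent domain is on the blocklist."""
    labels = domain.lower().rstrip(".").split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", ...
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))
```

A DNS server using a check like this would simply decline to answer blocked queries, so the ad or tracker never loads on any device on the network.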

As you might expect, fidgeting with your internet router settings, finding out what your ISP’s upstream servers are, or even niggling around config files is very daunting for most users.

Enter the Pi-Hole

I first learned about Pi-Hole through the DigitalOcean Marketplace. It was great that it was designed for containers from the start, because I wanted “Ursula” to serve services using containers instead of the complexity of figuring out Ubuntu Linux’s oddities.

Pi-hole documentation: You can run Pi-hole in a container, or deploy it directly to a supported operating system via our automated installer.

Previously I implemented my internet blacklist using response policy zones in a bind9 server. I am not entirely sure how I did it… which would be a disaster if my server gets wiped out.

The best thing about Docker is that you write the configuration in one file (like a docker-compose.yml for me) and it’s there. Once you have reviewed the configuration, you just call docker-compose up and the program starts up for you.

Once you have the server running, you can ogle at its work with pi-hole’s gorgeous dashboard:

So many queries, so many blocked. ( Update 11/5/20 : Screenshot updated to show the new version 5.0 interface. So many bars now!)

I could make a few conclusions from the work of my Pi-Hole server so far:

  • Several queries were blocked from my handphone. This shows that phones are a hotbed for ad trackers. Since most of us use our phones for web browsing, advertising on the internet has not taken a hit even though more browsers feature some form of adblocking.
  • The second chart (labelled “Clients (over time)”) roughly corresponds to the computers used during the day. During this circuit breaker period, you can see your work computers dialling “home”. At night, more home computers are sending queries.

Installation Headaches

Using Pi-Hole as a local LAN DNS server

My previous LAN DNS server was meant to serve DNS queries for my home network. My home server and Network Attached Storage device were its main customers. I also exposed some of the services (like my Plex) to the outside world. If my LAN server was not around, I would have to remember many octets (read IP addresses).

Update 11/5/2020 : In the original post, I complained about setting local LAN hostnames being hidden. Version 5.0 now allows you to set hostnames through the admin dashboard. This is one feature that I would be using! Turns out, it was quick and easy!

The dashboard used to add local DNS domains. New in version 5.0.

Installing Pi-Hole Behind a Traefik Server/Reverse Proxy

I didn’t wreck my Ubuntu 18.04 LTS server just so that I could install Pi-Hole. I wanted to be able to serve several services through my home server without being limited by one set of 80 (HTTP) and 443 (HTTPS) ports. Pi-Hole uses both of those ports, so on its own it would leave no room for any other web servers.

A reverse proxy routes a request to the correct server. My forays with Nginx and the traffic server had not been successful. Traefik got me curious because it claimed it could figure out configurations automatically. If I could get Traefik to work, it could sort out how to have several applications on one host!

Traefik, The Cloud Native Application Proxy | Traefik Labs: Traefik is the leading open-source reverse proxy and load balancer for HTTP and TCP-based applications that is easy, dynamic and full-featured.

So getting Traefik to work was a priority, but I also really wanted to set up Pi-Hole first. Curiously, there are some resources on getting both to work together correctly. Since this was the first time I was using both Traefik and Pi-Hole, I needed to experiment a lot. In the end, I went with this configuration in my docker-compose file:

version: '3'

services:
  reverse-proxy:
    # The official v2 Traefik docker image
    image: traefik:v2.2
    container_name: traefik
    # Enables the web UI and tells Traefik to listen to docker
    command: --api.insecure=true --providers.docker
    ports:
      # The HTTP/HTTPS port
      - "80:80"
      - "443:443"
      # The Web UI (enabled by --api.insecure=true)
      - "8080:8080"
    volumes:
      # So that Traefik can listen to the Docker events
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/houfu/traefik/:/etc/traefik/
    environment:
      DO_AUTH_TOKEN: [... Token provided by Digital Ocean for SSL certificate generation]
    restart: unless-stopped

  ### pi-hole

  pihole:
    container_name: pihole
    domainname: xxx.home
    hostname: pihole
    image: pihole/pihole:latest
    dns:
      - 127.0.0.1
      - 1.1.1.1
    ports:
      - '0.0.0.0:53:53/tcp'
      - '0.0.0.0:53:53/udp'
      #- '0.0.0.0:67:67/udp'
      - '0.0.0.0:8052:80/tcp'
      - "0.0.0.0:8443:443/tcp"
    volumes:
      - ./etc-pihole/:/etc/pihole/
      - ./etc-dnsmasqd/:/etc/dnsmasq.d/
      # run touch ./pihole.log first unless you like errors
      # - ./pihole.log:/var/log/pihole.log
    environment:
      ServerIP: 192.168.2.xxx
      PROXY_LOCATION: pihole
      VIRTUAL_HOST: pihole.xxx
      VIRTUAL_PORT: 80
      TZ: 'Asia/Singapore'
      WEBPASSWORD: PASSWORD
      DNS1: [VQ Server 1]
      DNS2: [VQ Server 2]
    restart: unless-stopped
    labels:
      # required when using --docker.exposedbydefault=false
      - "traefik.enable=true"
      # https://www.techjunktrunk.com/docker/2017/11/03/traefik-default-server-catch-all/
      - "traefik.frontend.rule=HostRegexp:pihole.xxx,{catchall:.*}"
      - "traefik.frontend.priority=1"
      - "traefik.backend=pihole"
      - "traefik.port=80"
      - "traefik.port=443"

(Some private information, like the names of my private servers and the IP of my ISP’s DNS servers, has been anonymised.)

Conclusion

I could not have done this without the copious time at home created by the circuit breaker. For now, though, I hope I can run this and many experiments on this server and report it on this blog. Is there something I should try next? Let me know in the comments!

#blog #tech #docker #DigitalOcean #Updated #OpenSource


I have always liked Jupyter Notebooks. They were my first contact with python and made the language fun and easy. It is even better when you want to explore a data set. You can just get the data and then plot it into a graph quickly. It’s even good enough to present them to other people.

However, if you want to run the same Jupyter Notebook server in different environments, it can be very challenging. Virtual environments can be useful but are very difficult to sync. I own a Mac, a Windows PC and a Linux server. Getting these parts to play with each other nicely has not been a happy experience. Repositories are inconsistent and I have spent time figuring out quirks in each operating environment.

Solution: Dockerize your Jupyter

Since being exposed to docker after setting up a docassemble server, I have been discovering all the wonders of virtual computers. The primary bugbear solved by docker is your computer’s background environment: you no longer have to care about it, and instead use building “blocks” to create your application.

For this application, I wanted the following pieces of software to work together:

  • A python environment for data analysis
  • spaCy for natural language processing, given its prebuilt models and documentation
  • Jupyter notebooks to provide a web-based editor I can use anywhere
  • A MongoDB driver since the zeeker database is currently hosted in a MongoDB Atlas service.

Once all this is done, instead of writing jupyter notebook on a console and praying that I am in the right place, I will run docker run houfu/zeeker-notebooks and create a server from a build in the cloud.

The Magic: The Dockerfile

Shocking! All you need to create a docker image is a simple text file! To put it simply, it is like a recipe book that contains all the steps to build your docker image. Therefore, by working out all the ingredients in your recipe, you can create your very own docker image. The “ingredients” in this case, would be all the pieces of software I listed in the previous section.

Of course, there are several ways to slice an apple, and creating your own docker image is no different. Docker has been around for some time and has enjoyed huge support, so you can find a docker image for almost every type and kind of open-source software. For this project, I first tried docker-compose with a small Linux server and installing python and Linux.

However quite frankly, I didn’t need to reinvent the wheel. The jupyter project already has several builds for different flavours and types of projects. Since I knew that some of my notebooks needed matplotlib and pandas, I chose the scipy-notebook base image. This is the first line in my “recipe”:

FROM jupyter/scipy-notebook

The other RUN lines allow me to install dependencies and other software like spaCy and MongoDB. These are the familiar instructions you would normally use on your computers. I even managed to download the spaCy models into the docker image!

RUN conda install --quiet --yes -c conda-forge spacy && \
    python -m spacy download en_core_web_sm && \
    python -m spacy download en_core_web_md && \
    conda clean --all -f -y

RUN pip install --no-cache-dir pymongo dnspython

Once docker has all these instructions, it can build the image. Furthermore, since the base image already contains the functionality of the jupyter notebook, there’s no need to include any further instructions such as CMD or ENTRYPOINT.

Persistence: Get your work saved!

The Dockerfile was only 12 lines of code. At this point, you are saying: this is too good to be true! Maybe you are right, because what goes on in docker stays in docker. Docker containers are not meant to be true computers. So that they are lightweight and secure, they can easily be destroyed. If you created a great piece of work in the docker container, you have to ensure that it gets saved somewhere.

The easiest way to deal with this is to bind a part of your computer’s filesystem to a part of your docker container’s filesystem. This way your jupyter server in the docker container “sees” your files. Then your computer gets to keep those files no matter what happens to the container. Furthermore, if you run this from a git repository, your changes can be reflected and merged in the origin server.

This is the command you call in your shell terminal:

$ docker run -p 8888:8888 --mount type=bind,source="$(pwd)",target=/home/jovyan/work houfu/zeeker-notebooks

Since the user account in the base image is named jovyan (a play on “Jovian”, an inhabitant of Jupiter), this is where we bind the files. The $(pwd) is a command substitution that fills in the present working directory as the source, no matter where or how you saved the repository.
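If you would rather start the notebook server from Python than from the shell, the same command can be assembled with the current directory filled in for you. This is a minimal sketch with a made-up function name; it only builds the argument list, and running it still requires docker to be installed.

```python
from pathlib import Path


def notebook_run_command(image: str, workdir: Path, port: int = 8888) -> list[str]:
    """Build a `docker run` command that bind-mounts workdir into the container."""
    # resolve() gives an absolute path, which bind mounts require.
    mount = f"type=bind,source={workdir.resolve()},target=/home/jovyan/work"
    return ["docker", "run", "-p", f"{port}:{port}", "--mount", mount, image]
```

Passing the result of notebook_run_command("houfu/zeeker-notebooks", Path.cwd()) to subprocess.run would start the server with the current folder mounted as the work directory.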

Screenshot of jupyter notebook page showing file directory.

There you have it! Let’s get working!

Bonus: Automate your docker image builds

It is great that you have got docker to create your standard environment for you. Now let’s take it one step further! Wouldn’t it be great if docker would update the image on the repository whenever there are changes to the code? That way the “latest” tag actually means what it says.

You can do that in a cinch by setting up auto-build if you are publishing on Docker hub. Link your source repository on GitHub to your docker hub repository and configure automated builds. This way, every time there is a push to your repository, an image is automatically built, providing the latest image to all using the repository.

Webpage in Docker showing automated builds.

Conclusion

It’s fun to explore new technologies, but you must consider whether they will help your current workflow or not. In this case, docker will help you to take your environment with you to share with anyone and on any computer. You will save time figuring out how to make your tools work, focusing on how to make your tools work for you. Are you seeing any other uses for docker in your work?

#Programming #docker #Python
