I have always liked Jupyter Notebooks. They were my first contact with Python and made the language fun and easy to learn. They are even better when you want to explore a data set: you can grab the data and plot it in a graph quickly, and the result is good enough to present to other people.
However, if you want to run the same Jupyter Notebook server in different environments, it can be very challenging. Virtual environments can be useful but are very difficult to keep in sync. I work across a macOS machine, a Windows PC and a Linux server, and getting these parts to play nicely with each other has not been a happy experience. Repositories end up inconsistent, and I have spent too much time figuring out the quirks of each operating environment.
Solution: Dockerize your Jupyter
Since being exposed to Docker while setting up a docassemble server, I have been discovering the wonders of containers. The primary bugbear Docker solves is that you no longer have to care about your computer's underlying environment; instead, you assemble "blocks" to create your application.
For this application, I wanted the following pieces of software to work together:
- A python environment for data analysis
- spaCy for natural language processing, given its prebuilt models and documentation
- Jupyter notebooks to provide a web-based editor I can use anywhere
- A MongoDB driver since the zeeker database is currently hosted in a MongoDB Atlas service.
Once all this is done, instead of typing jupyter notebook in a console and praying that I am in the right place, I can run docker run houfu/zeeker-notebooks and get a server from an image built in the cloud.
The Magic: The Dockerfile
Shocking! All you need to create a Docker image is a simple text file! Put simply, it is like a recipe that contains all the steps to build your image. Work out the ingredients in your recipe and you can create your very own Docker image. The "ingredients" in this case are the pieces of software I listed in the previous section.
Of course, there are several ways to slice an apple, and creating your own Docker image is no different. Docker has been around for some time and enjoys huge support, so you can find an image for almost every kind of open-source software. For this project, I first tried rolling my own with docker-compose: a small Linux base image, with Python and everything else installed on top.
However, quite frankly, I didn't need to reinvent the wheel. The Jupyter project already publishes several builds for different flavours and types of projects. Since I knew that some of my notebooks needed matplotlib and pandas, I chose the scipy-notebook base image. This is the first line in my "recipe":
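A sketch of that first line, assuming the jupyter/scipy-notebook image from the Jupyter Docker Stacks (pin a specific tag instead of the default if you need reproducible builds):

```dockerfile
# Base image from the Jupyter Docker Stacks: Jupyter Notebook plus the
# scientific Python stack (matplotlib, pandas, numpy and friends)
FROM jupyter/scipy-notebook
```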
The other RUN lines let me install dependencies and other software such as spaCy and the MongoDB driver. These are the familiar instructions you would normally run on your own computer. I even managed to download the spaCy models into the Docker image!
RUN conda install --quiet --yes -c conda-forge spacy && \
    python -m spacy download en_core_web_sm && \
    python -m spacy download en_core_web_md && \
    conda clean --all -f -y

RUN pip install --no-cache-dir pymongo dnspython
Once Docker has all these instructions, it can build the image. Furthermore, since the base image already starts the Jupyter notebook server, there is no need for any further CMD or ENTRYPOINT instructions.
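Building the image is then a single command, run from the directory containing the Dockerfile (the tag here is the repository name from this article; substitute your own Docker Hub namespace):

```shell
# Build the image from the Dockerfile in the current directory,
# tagging it so it can be pushed to Docker Hub later
docker build -t houfu/zeeker-notebooks .
```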
Persistence: Get your work saved!
The Dockerfile was only 12 lines of code. At this point, you are saying: this is too good to be true! Maybe you are right, because what goes on in docker, stays in docker. Docker containers are not meant to be true computers; to keep them lightweight and secure, they are designed to be easily destroyed and recreated. If you create a great piece of work inside a container, you have to make sure it gets saved somewhere outside it.
The easiest way to deal with this is to bind a part of your computer's filesystem to a part of the container's filesystem. This way, the Jupyter server in the container "sees" your files, and your computer keeps those files no matter what happens to the container. Furthermore, if you run this from a git repository, your changes can be committed and pushed back to the origin server.
This is the command you call in your shell terminal:
$ docker run -p 8888:8888 --mount type=bind,source="$(pwd)",target=/home/jovyan/work houfu/zeeker-notebooks
The target is /home/jovyan/work because the default user account in the Jupyter base images is named jovyan (a playful reference to Jupiter). $(pwd) is shell substitution that fills in the "present working directory" as the source, no matter where or how you saved the repository.
There you have it! Let’s get working!
Bonus: Automate your docker image builds
It is great that you have got Docker to create your standard environment for you. Now let's take it one step further! Wouldn't it be great if Docker updated the image in the repository whenever there are changes to the code? That way, the "latest" tag actually means what it says.
It's a cinch to set up if you are publishing on Docker Hub: link your GitHub source repository to your Docker Hub repository and configure automated builds. This way, every time there is a push to your repository, an image is built automatically, providing the latest image to everyone using the repository.
It's fun to explore new technologies, but you must consider whether they will actually help your workflow. In this case, Docker lets you take your environment with you and share it with anyone, on any computer. You will spend less time figuring out how to make your tools work, and more time making your tools work for you. Do you see any other uses for Docker in your work?