For the past three years, I have heard a lot of buzz about Docker containers. I wanted to figure out what this technology is and how it could make a developer or data scientist more productive.
I have tried to convey my findings in this blog so you don’t need to parse all the information out there. Let’s get started.
Docker is an open-source project based on Linux containers. It uses Linux kernel features like namespaces and control groups to create containers on top of an operating system. Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
We can think of Docker containers as lightweight virtual machines that contain everything you need to run an application. Even giants like Google, Amazon, and VMware have built services to support it. That’s all you need to know about Docker for now.
Containers aren’t new to the tech world; Google has been using its own container technology for years.
While a virtual machine (like those from VMware) provides hardware-level virtualization, a container provides operating-system-level virtualization. The big difference between a VM and a container is that containers share the host machine’s kernel with other co-hosted containers.
Who’s using Docker?
Docker is a tool that is designed to benefit both developers and system administrators, making it a part of many DevOps (developers + operations) toolchains.
For developers, it means that they can focus on writing code without worrying about the system that it will ultimately be running on.
For the operations team, Docker gives flexibility and potentially reduces the number of systems needed, thanks to its small footprint and OS-level virtualization.
Why do you need Docker?
Docker containers running on a single machine share that machine’s operating system kernel; they start instantly and use less compute and RAM. Images are constructed from filesystem layers and share common files. This minimizes disk usage, and image downloads are much faster.
Docker is based on open standards and runs on all major Linux distributions, Microsoft Windows, and any infrastructure, including VMs, bare metal, and the cloud.
Docker containers isolate applications from one another and from the underlying infrastructure. Docker provides strong default isolation, limiting issues to a single container instead of the entire machine.
As a data scientist working in machine learning, being able to rapidly change environments can significantly affect your productivity. Data science work often begins with data cleaning, data transformation, and model building. This work usually happens on your laptop, but there often comes a moment when different compute resources, such as more CPUs or more RAM, could speed up your work. Docker simplifies the process of porting your environment to a remote machine or cloud environment. You can even participate in Kaggle competitions by porting your environment to a cloud service like AWS or Google Cloud. As the common saying goes: “Build once, run anywhere.”
Image: A blueprint for what you want to build.
Example: Ubuntu + Spark + Python and a running Jupyter server.
Container: An instantiation of an image that you have brought to a running state. You can have multiple containers running from the same image.
Dockerfile: A cookbook for creating an image. A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image.
DockerHub/Image Registry: A place where users can share their images publicly or privately. Users can pull existing images and use them locally to create containers.
Commit: Like Git, Docker offers version control. Docker containers are generally stateless unless you explicitly make them stateful. You can save the state of your container at any time as a new, versioned image by committing the changes.
I will be using the above terminology for the rest of the tutorial; please refer back to these terms if you get lost somewhere.
You can download and install the free Docker Community Edition here.
Create Your First Docker Image
Get ready to get your hands dirty by creating your first Docker image. Let’s go through the Dockerfile below slowly. With its help, you can create a PySpark setup (local mode) with Jupyter Notebook enabled. The same pattern works for most Python data science packages as well.
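The Dockerfile itself is not reproduced in this text, so here is a minimal sketch consistent with the walk-through that follows (an Ubuntu 16.04 base with Python, PySpark, and Jupyter). The package list, paths, and maintainer label are assumptions for illustration, not the author’s exact file:

```dockerfile
# Base image: a minimal Ubuntu 16.04 installation
FROM ubuntu:16.04

# Optional metadata (name and email are placeholders)
LABEL maintainer="Your Name <you@example.com>"

# Install common tools plus Python and Java (Spark needs a JVM)
RUN apt-get update && apt-get install -y \
        vim wget git python3 python3-pip openjdk-8-jdk && \
    rm -rf /var/lib/apt/lists/*

# Install PySpark (local mode) and Jupyter
RUN pip3 install pyspark jupyter

# Environment variables picked up by PySpark
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PYSPARK_PYTHON=python3

# 8888 for Jupyter notebooks, 4040 for the Spark UI
EXPOSE 8888 4040

# Work from the application directory
WORKDIR /app

# Copy local files into the image
ADD . /app

# Launch Jupyter when the container starts
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```

Each instruction is explained step by step in the sections that follow.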
The FROM statement is the first line of a Dockerfile. It specifies the base image for your application. Here we are using ubuntu:16.04 as the base image. This image is a minimal installation, which means it doesn’t include all the packages a regular Ubuntu install contains.
When Docker executes this statement, it looks for the image locally (on your system); if it is not present, Docker pulls the official Ubuntu image from DockerHub/the image registry. You can also build a container on top of a pre-built application image, such as the Anaconda image, and you can push your own images to DockerHub.
It’s worth mentioning that image versions are specified after the colon (:). In our ubuntu:16.04 image, 16.04 is the version tag; you can set your own tag, such as latest, when building your image.
An important note about Docker images: be cautious when pulling images from DockerHub, as random images from unknown publishers could potentially damage your system. As a best practice, use official images from the respective products.
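In our case, the first line simply pins the official base image and its version tag:

```dockerfile
FROM ubuntu:16.04
```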
This statement adds metadata to your image and is completely optional. I add it so that others know who to contact about the image, and so I can search for my Docker containers, especially when there are many of them running concurrently on a server.
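Assuming the metadata is set with the LABEL instruction (the name and email below are placeholders), it looks like this:

```dockerfile
LABEL maintainer="Your Name <you@example.com>"
```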
The RUN command is used to install any packages the image requires. For example, you can run apt-get update and install wget, git, and so on. As I mentioned before, the base image ships with minimal packages, so we need to install the common tools a user needs, such as vim, wget, and git, which are required in later steps of the Dockerfile.
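A RUN instruction along these lines installs those tools (the exact package list is illustrative); chaining the commands with && keeps everything in a single image layer:

```dockerfile
RUN apt-get update && apt-get install -y vim wget git && \
    rm -rf /var/lib/apt/lists/*
```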
Most Linux users are familiar with environment variables. The ENV statement is used to set environment variables inside the container.
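For example, a PySpark setup might set variables like these (the exact paths are assumptions for illustration):

```dockerfile
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV PYSPARK_PYTHON=python3
```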
A Docker container generally doesn’t expose its ports to the outside world, or even to the local system you are working on. So we need to explicitly declare the required ports; in our case, I am exposing ports 8888 and 4040 for Jupyter notebooks and the Spark UI. (Note that EXPOSE only documents the ports; you still publish them with -p when you run the container.)
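In Dockerfile form:

```dockerfile
# 8888 for Jupyter notebooks, 4040 for the Spark UI
EXPOSE 8888 4040
```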
The WORKDIR statement sets a directory as the current working directory. This comes in handy when you want to issue a command from the application directory.
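For instance, assuming the application lives in /app inside the container:

```dockerfile
WORKDIR /app
```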
The ADD statement copies your local files into the Docker image. You can copy folders as well.
From the Docker documentation: the ADD instruction copies files and folders from <src> to the <dest> location in the image’s filesystem.
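A typical usage, copying the whole build context into the assumed /app directory:

```dockerfile
ADD . /app
```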
The CMD statement specifies the command to execute when the container starts, and usually appears at the end of the Dockerfile.
There can only be one CMD instruction in a Dockerfile. If you list more than one CMD then only the last CMD will take effect.
The main purpose of a CMD is to provide defaults for an executing container. These defaults can include an executable, or they can omit the executable, in which case you must specify an ENTRYPOINT instruction as well.
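In our case, the default command starts the Jupyter server; the exact flags below are a sketch, not necessarily the author’s:

```dockerfile
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```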
Building a Docker Image
We are now ready to build an image from the Dockerfile recipe. You can accomplish this with the command below.
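A typical build invocation looks like this (pyspark-notebook is a placeholder image name; the trailing dot tells Docker to use the current directory, which contains the Dockerfile, as the build context):

```shell
docker build -t pyspark-notebook:latest .
```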
The above command builds a Docker image with latest as the version tag. Please note that this builds a Docker image, not a container (see the terminology at the beginning of this post if you don’t remember the difference).
Run Container from Docker Image
Finally, we are ready to run our shiny new first container. The command below publishes port 8888 to the outside world so that we can access the Jupyter notebook.
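Assuming the image was tagged pyspark-notebook, the run command would look like this; -p maps a container port to a host port:

```shell
docker run -it -p 8888:8888 -p 4040:4040 pyspark-notebook:latest
```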
After the command executes, the Jupyter notebook server is up and running. Copy the URL it prints and paste it into your browser to open the notebook.
List running containers
To list the running containers in your Docker environment, use the command below.
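The command is docker ps:

```shell
docker ps        # running containers only
docker ps -a     # all containers, including stopped ones
```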
Push your docker image to Dockerhub
When you decide to share your work publicly or with your colleagues, just push the image to DockerHub, a public repository for images.
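Pushing requires logging in and tagging the image with your DockerHub username (the names below are placeholders):

```shell
docker login
docker tag pyspark-notebook:latest <your-username>/pyspark-notebook:latest
docker push <your-username>/pyspark-notebook:latest
```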
Helpful docker commands
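A few commands I reach for most often:

```shell
docker images                        # list local images
docker stop <container-id>           # stop a running container
docker rm <container-id>             # remove a stopped container
docker rmi <image-id>                # remove an image
docker logs <container-id>           # view a container's output
docker exec -it <container-id> bash  # open a shell inside a running container
docker commit <container-id> <name>  # save a container's state as a new image
```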