In my previous post, I proposed a way to easily run large-scale machine learning tasks in the cloud using Docker containers and Azure Batch. I also use this approach at work for some of my projects. One thing I have come to realize is that the size of the container image can grow very quickly as we add more functionality to the ML training task.
Using open source tools such as scikit-learn and nltk brings additional dependencies into the container image. For example, some of us may use Miniconda, but it can easily add a few hundred MB to the Docker image. The Ubuntu 16.04 base image is about 120MB, but very quickly I saw my container image grow beyond 1GB, and then to 3GB after installing some other tools.
The Azure Batch start task takes longer and longer, because pulling a 3GB image needs significantly more time than pulling one of a few MB. Therefore, I decided to reduce the container image size.
Reduce Layers
The first attempt is to reduce the number of layers in the Dockerfile above. For example, combining the RUN commands removes a lot of layers, which leads to a smaller image, especially when temporary files are cleaned up in the same command that created them. But the size is still 2.x GB. What’s next?
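To make the layer-merging idea concrete, here is a minimal sketch, not the Dockerfile from the original project. The package names (build-essential, wget, ca-certificates) are placeholders chosen for illustration. The point is only the pattern: update, install, and clean up inside a single RUN instruction, so the apt package lists are never committed to a layer.

```dockerfile
# Base image taken from the post; everything below is an illustrative assumption.
FROM ubuntu:16.04

# One RUN instruction produces one layer. Because the cleanup happens in the
# same instruction as the install, the apt package lists never end up in the image.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
        build-essential \
        wget \
        ca-certificates \
 && rm -rf /var/lib/apt/lists/*
```

If the same steps were split across three RUN instructions, the files deleted in the last one would still be stored in the earlier layers, so the image would not get any smaller.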
The Dockerfile above installs Miniconda, but essentially we only use the standard Python library plus numpy, scipy, etc. from the conda install command. Clearly, this is a big chunk to remove, so we decided to use a Debian-based Python base image instead. This saves about 1GB, and the image size is now 1.3GB. In this round, we have to pin the versions of numpy, scipy, etc., because some newer versions, such as numpy 1.14.3, are not compatible with matplotlib at runtime. Finding and matching the right versions takes some debugging, which Miniconda would otherwise do for you, but it is still worth the effort to cut 1GB from the image.
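A rough sketch of what the Miniconda-free variant could look like is shown below. It assumes the python:3.6 tag (the post only says "a Python base image from Debian") and a hypothetical requirements.txt next to the Dockerfile in which every package is pinned with `package==version`. The actual versions that work together are not given in the post, so none are shown here.

```dockerfile
# Assumed tag: the official Debian-based Python image.
FROM python:3.6

# requirements.txt is a hypothetical file that pins every package,
# for example one line each for numpy, scipy, and matplotlib in the
# form "package==version", using versions you have verified together.
COPY requirements.txt /tmp/requirements.txt

# --no-cache-dir keeps pip's download cache out of the image layer.
RUN pip install --no-cache-dir -r /tmp/requirements.txt \
 && rm /tmp/requirements.txt
```

Keeping the pins in one file makes it easier to bisect a bad combination: change one version, rebuild, and rerun the training script.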
At this point, I install only the libraries I really need at runtime, so the image is about as small as this base allows. But I still wondered whether I could get below 1GB. There is a way! The Alpine Linux project offers a 5MB bare-minimum Linux environment to start from. Yes, it is 5MB, compared to the 115MB Ubuntu image. Python also has an Alpine-based image, python:3.6-alpine, which is 75MB, compared to 918MB for the Debian-based Python image. If I can use Alpine Linux, I remove another big chunk.
Since Alpine Linux has its own package management system, it took me some time to satisfy all the build and install dependencies and get LightGBM running the same way as in the Ubuntu container. Now the size is 834MB. The effort described in this post saves about 2.5GB of overhead when the Azure Batch start task pulls the container image.
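For readers who want to try the Alpine route, the following is a sketch of the general pattern rather than the exact Dockerfile used here. The apk package names are assumptions and the list is probably incomplete for a full LightGBM build, so expect to add packages as compile errors appear. As before, requirements.txt is a hypothetical file with pinned versions.

```dockerfile
FROM python:3.6-alpine

# Runtime shared libraries that must stay in the final image.
# These package names are assumptions; adjust them to what your
# compiled extensions actually link against.
RUN apk add --no-cache libstdc++ libgomp openblas

COPY requirements.txt /tmp/requirements.txt

# Compilers and headers are grouped under a virtual package (.build-deps)
# so they can be removed in the same layer once pip has finished
# building the packages from source.
RUN apk add --no-cache --virtual .build-deps \
        build-base cmake gfortran openblas-dev \
 && pip install --no-cache-dir -r /tmp/requirements.txt \
 && apk del .build-deps
```

Alpine uses musl instead of glibc, so many Python packages have to be compiled from source during the build. This is why the build dependencies are needed at all, and why removing them in the same RUN instruction matters for the final size.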