When you process data, you often write a script or notebook, load the data, and run your analysis. Once you start improving the script, you want to put it in something like a git repo to get version control.
Eventually your data process becomes a regular routine, and you turn it into something like a cronjob that runs by itself.
Your problems start when you need to migrate the process from your development environment to a production environment, because to do so you have to replicate the environment and install every dependency your script may have.
We have learned that Docker can be of great help in these situations, and we want to make the case that using Docker right from the start is both easy and useful.
Assume you have some data that you want to process using pandas. For our example, the data is stored in an input.json, and we simply want to use the pandas DataFrame describe method, which computes quantities like the mean, quantiles, and min/max values. Here is how the code could look:
import pandas as pd

# Read the input data, compute summary statistics, and write them back out
df = pd.read_json('input/input.json')
processed = df.describe()
processed.to_json('output/processed.json')
Now, instead of using the script as it is, we create a Docker image from it. We like to use Anaconda to install our dependencies, which is why we derive our Docker images from the Anaconda base image.
FROM continuumio/anaconda3
WORKDIR /app
ADD . /app
RUN conda install -y pandas
CMD python app.py
Finally, you can build and run the container just as if you were running the script.
docker build -t describe .
For our example, we have to mount the folders from which the input data is read and to which the output data is written, and that's basically it.
docker run --rm -v /path/to/input:/app/input -v /path/to/output:/app/output describe
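This also closes the loop on the cronjob idea from the beginning: the scheduled routine can now invoke the container instead of the bare script. As a sketch, a crontab entry running the container every night at 2am might look like this (the schedule and paths are placeholders):

0 2 * * * docker run --rm -v /path/to/input:/app/input -v /path/to/output:/app/output describe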
Here is the full example on our GitHub. The image can now be pushed to a Docker registry or stored in a tar file, so you can deploy it by either pulling from the registry or loading the tar file on any machine you wish.
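As a sketch of that deployment step (the repository name myrepo/describe is a placeholder for your own registry path):

docker tag describe myrepo/describe
docker push myrepo/describe

Or, to go through a tar file instead of a registry:

docker save -o describe.tar describe
docker load -i describe.tar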