05. April 2019

How to Set Up Your Python Data Science Projects to Save You Hassle, Time & Money

A minimal setup to let you crunch numbers like a pro.

In this article you will learn everything you need to know about Python environments, package management, folder structure, tests and how to deploy your project to production. It will save you a lot of hassle and time.

Coming from the Ruby and JavaScript community I had some problems figuring out how to do a proper Python project setup.

It’s not the lack of resources out there, it’s the ubiquity of different approaches. Here is a short list of virtual environment and package management solutions — let alone the mystery of the setup.py:

  • Virtualenv
  • Pipenv
  • Anaconda / Conda environments
  • Pip
  • Conda channels

Another big question mark was the general project structure, e.g. the files and folders every Python project should have, and hints on how to organize your own application code. I struggled with how to use modules, fought against the PYTHONPATH, and again faced the mysteries of the setup.py.

The following should be seen as my personal state of thought on a minimal Python setup. I’m happy for any hints on how to make it even better and any corrections where I got things wrong.

The Environment

First let me explain why you want to have your project in its own environment. Python is so widely used that almost all operating systems ship a version of Python by default. This used to be Python 2.7, and probably will be for a couple more years, until it is eventually replaced by a 3.x version. What if you start a new project and want to use 3.6 or even 3.7? You will need to install Python 3.7, and now the nightmare begins. To execute your program you need to remember that it runs with Python 3.7:

# python 3.x
$ pip3 install pandas
$ python3 main.py

# python 2.7
$ pip install pandas
$ python main.py

This becomes especially annoying if you also work in a Python 2.7 project and have to think about which version you are working with. Virtual environments to the rescue:

# python 3.x
$ conda activate python-3-env
(python-3-env) $ pip install pandas
(python-3-env) $ python main.py

# python 2.7
$ conda activate python-2-env
(python-2-env) $ pip install pandas
(python-2-env) $ python main.py

# Note that conda displays the active environment

The virtual environment is “a self-contained directory tree that contains a Python installation for a particular version of Python, plus a number of additional packages”. What this means is that if you activate the environment you will not need to think about which version your project is running on but just use python and pip and it will just work.
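If you ever doubt which interpreter an activated environment actually gives you, you can ask Python itself. The following is a minimal sketch using only the standard library; the function name describe_interpreter is my own and not part of any tool mentioned here.

```python
import sys


def describe_interpreter():
    """Return basic facts about the Python interpreter running this code."""
    return {
        # Full path of the python binary; inside an activated environment
        # this should point into that environment's directory.
        "executable": sys.executable,
        # Version as MAJOR.MINOR.PATCH, e.g. "3.7.3".
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Root directory of the installation or environment.
        "prefix": sys.prefix,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Run it once with and once without the environment activated; the executable and prefix entries should differ.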

Which brings us to the next reason to use an environment: Dependencies. Have you ever heard about the term Dependency hell? It has its own Wikipedia article.

Dependency hell is a colloquial term for the frustration of some software users who have installed software packages which have dependencies on specific versions of other software packages. — Michael Jang (2006)

Let’s say you have two projects that need pandas. When you started the first project you pip installed pandas 0.19.0, which was the most recent version at that time. In the second project you also pip installed pandas, but by then 0.23.3 was the most recent version. Both rely on numpy, but on different versions of it. What happens if you import numpy or pandas in your project? Which version will it import? Welcome to dependency hell! The unexpected import of the wrong version could break your production system or give your co-worker a headache because she or he has yet another version of pandas and numpy installed on their system.

Because the environment is self-contained and isolated, it will only have one version of pandas and numpy installed. You can go to the other project and switch to the environment that contains the old 0.19.0 version of pandas. This isolation makes it super easy to work in different projects. Another benefit: if you switch projects within your company, or change employers, you can just delete your now useless environments and move on to new challenges on the horizon.
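To see which versions the currently active environment really contains, you can query the installed package metadata. This is a small sketch, assuming Python 3.8 or newer (where importlib.metadata is part of the standard library); the helper name installed_versions is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    print(installed_versions(["pandas", "numpy"]))
```

Running this in each of your environments should print different versions, which is exactly the isolation we are after.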

Before we go into more details on how to manage the dependencies in an environment let’s figure out how to install virtual environments.

Miniconda environments

A friend of mine told me that the way to go is to use Anaconda, or more precisely Miniconda. To be honest, I never tried Virtualenv or Pipenv because the Conda setup worked out of the box and it felt familiar to list all the dependencies in a file, as I do in my Gemfile or package.json.

The docs for installing Miniconda are a bit cluttered and make it look like it’s hard to set it up. It’s not! Here are the commands you need, depending on your OS.

Linux

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p ~/miniconda
rm ~/miniconda.sh
export PATH="$HOME/miniconda/bin:$PATH"
conda init

Mac OS X

curl -fSL https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ~/miniconda.sh
bash ~/miniconda.sh -b -p ~/miniconda
rm ~/miniconda.sh
export PATH="$HOME/miniconda/bin:$PATH"
conda init

Package Management

Instead of using the Conda channels, which are not always up to date, we take advantage of Conda’s ability to install pip packages. Since pip is the most common way to distribute Python packages, this gives us access to the most recent releases. My recommendation is to pin your package versions. See the environment.yml section below and read about Semantic Versioning, to get a basic understanding of what

MAJOR.MINOR.PATCH

in a version number stands for.
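To make the three parts a bit more tangible, here is a small sketch that splits a version string and names the most significant part that changed between two versions. It assumes plain three-part numeric versions (no suffixes like rc1), and the names Version, parse_version and bump_kind are made up for this example.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def bump_kind(old, new):
    """Name the most significant component that differs between two versions."""
    a, b = parse_version(old), parse_version(new)
    if a.major != b.major:
        return "major"
    if a.minor != b.minor:
        return "minor"
    if a.patch != b.patch:
        return "patch"
    return "none"


if __name__ == "__main__":
    # The two pandas versions from the dependency example differ in MINOR.
    print(bump_kind("0.19.0", "0.23.3"))
```

By the Semantic Versioning convention, patch bumps are meant to be safe bug fixes, while major bumps may break your code.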

I would also recommend never using the Anaconda Navigator to manage the packages in an environment; always update your environment.yml file instead. This way your teammates are aware of all the required packages and just need to run:

conda env update -f environment.yml

to be up to date.

Project Structure

I provided an example repository on GitHub. Get over there and clone it. Follow the instructions in the README to hack around.

You can use it as a blueprint for your own projects. Just pay attention to the first section in the README and replace the environment and project name with your own.

GitHub Repository

Our project will have the following folders and files:

tree --dirsfirst -aCFL 1
.
├── .git/
├── data/
├── notebooks/
├── src/
├── tests/
├── .gitignore
├── .travis.yml
├── README.md
├── environment.yml
├── setup.cfg
└── setup.py*

Data

All the data you work with in the project should live here. The data folder has a .gitignore entry so that your valuable and probably massive data is not accidentally pushed to git. The better you organize the folder, the easier it is to work with the data. The README should outline how to get the data and where to save it.

I included a utils module in the GitHub repo. The functions you will find there will help you and your team access the data without fighting against different paths in notebooks and the code base. See the example.
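To give an idea of what such path helpers can look like, here is a hypothetical sketch. It is not the code from the repository: the function names find_project_root and data_path are my own, and I assume that the project root can be recognized by its environment.yml.

```python
from pathlib import Path


def find_project_root(start, marker="environment.yml"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker} found above {start}")


def data_path(*parts, start="."):
    """Build an absolute path inside the project's data folder."""
    return find_project_root(start) / "data" / Path(*parts)


if __name__ == "__main__":
    try:
        # Resolves to the same file from a notebook folder or from src.
        print(data_path("raw", "sales.csv"))
    except FileNotFoundError as error:
        print(error)
```

The point of the design is that nobody hard-codes relative paths like ../../data, so the same call works no matter where the notebook or module lives.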

Notebooks

We all love Jupyter notebooks. Here is the place for all your notebook magic. You can organize them by topics, data sets, tasks or however it fits best for you.

If you work with self-written modules in your notebooks — which you should — it’s highly recommended to have your first cell contain this:

%load_ext autoreload
%autoreload 2

This will ensure that if you change your code, Jupyter notebooks will reload the changes without restarting the kernel. See this example.

Update: As Philipp pointed out in the comments you can add this setting to your IPython config so all your notebooks reload. Here are the instructions to make it work:

# Create your IPython profile if you don't have one
ipython profile create

# Open the config in your editor (I use TextMate aka mate)
mate ~/.ipython/profile_default/ipython_config.py

# Search for exec_lines and replace it with
c.InteractiveShellApp.exec_lines = [
    '%load_ext autoreload',
    '%autoreload 2',
]

# Save and restart your notebook servers

src

This is the home of all your modules. As mentioned above, I included a tiny utils module that helps with absolute and relative paths. See the code and the examples.

Tests

The tests folder should mirror your src folder structure and contain all the necessary test files. More on this later.

The README

I consider the README one of the most crucial parts of a software project. It is the entry point for all new developers and should outline the basic steps to get you as a developer up and running. In my career I have witnessed plenty of time being wasted on project setups, which always seemed to be a frustrating thing to do. The cause was always outdated information in the README, or sometimes no information at all on how to do the setup. So try to keep your README up to date and add information on how to solve common problems if you or one of your teammates run into any.

Every README should have at least the following sections; the Data section is specific to data science projects:

  1. Setup
  2. Usage
  3. Data
  4. Tests

If you use GitHub you can take advantage of the fact that they render READMEs for every folder. That means you can have a specific README.md in the data folder that explains how to get the data and how it should be organized. Link from the project README to this specific data README so your project README doesn’t get bloated. See the example README.

Ok, no more README — I promise!

The environment.yml

Note that we pinned all our dependencies to the latest version. The * indicates that we always want the latest patch releases, but not major or minor releases, since these could break your entire code base. Depending on the package, you may also allow minor updates. If you are not sure, stick to patch releases and apply minor version bumps manually.
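The effect of such a wildcard pin can be illustrated with a few lines of Python. This is a simplified sketch of the idea, not how conda or pip implement version matching; the function matches_pin is made up for this example and only compares versions with the same number of parts.

```python
def matches_pin(version, pin):
    """Check a dotted version against a pin whose parts may be '*' wildcards."""
    version_parts = version.split(".")
    pin_parts = pin.split(".")
    if len(version_parts) != len(pin_parts):
        return False
    # Every part must be equal unless the pin says "anything goes" with '*'.
    return all(p == "*" or p == v for v, p in zip(version_parts, pin_parts))


if __name__ == "__main__":
    print(matches_pin("0.23.4", "0.23.*"))  # a new patch release is accepted
    print(matches_pin("0.24.0", "0.23.*"))  # a new minor release is not
```

So with a pin like 0.23.* an environment update may pull in a newer patch release, but never jumps to the next minor version on its own.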

Whenever you make changes to this file you will need to run

conda env update -f environment.yml

to install the new packages. This also holds true if you get an import error after pulling from GitHub. Probably one of your teammates added a new package that is not yet installed in your local environment.

One more thing to notice is the first pip entry:

dependencies:
  - pip:
    - '--editable=.'

It will execute the setup.py and make all your modules in the src directory available in your project. The effect is that you don’t need to import src.utils but just import utils. It will also create a data_science_project.egg-info in the src directory which you should NOT delete. It is the reason why the imports work. If you accidentally deleted it just run

conda env update -f environment.yml

again. It will recreate it for you.
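If you are unsure whether the editable install worked, you can check whether Python is able to find a module without importing it. A minimal sketch using the standard library; the helper name is_importable is my own, and it is only meant for top-level module names such as utils.

```python
from importlib.util import find_spec


def is_importable(module_name):
    """Return True if `import module_name` would find a top-level module."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # With the environment active, this should print True for the utils
    # module in src once the editable install is in place.
    print(is_importable("utils"))
```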

The setup.py

I haven’t solved all the mysteries of the setup.py yet. But I figured out that if I have the '--editable=.' in the environment.yml and the following keyword arguments for the setup function, all my imports work, which is enough for the moment.

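from setuptools import find_packages, setup

setup(
    name='data-science-project',
    # Look for packages inside the src directory ...
    packages=find_packages('src'),
    # ... and tell setuptools that src is the root of all packages.
    package_dir={'': 'src'},
)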

Testing

Whole books have been written about testing. I can’t promise to add a lot of value here; I just encourage you not to avoid it. Start writing tests early so your confidence in the code grows. You will most certainly not regret it in the future.
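To show how little is needed to get started, here is a tiny sketch of a function together with its tests. The helper normalize_column_name is hypothetical and not part of the example repository, and I assume a test runner such as pytest, which collects functions whose names start with test_ from files named test_*.py.

```python
def normalize_column_name(name):
    """Hypothetical helper: trim, lower-case and replace spaces in a column name."""
    return name.strip().lower().replace(" ", "_")


def test_normalize_column_name_replaces_spaces():
    assert normalize_column_name("  Order Date ") == "order_date"


def test_normalize_column_name_keeps_clean_names():
    assert normalize_column_name("price") == "price"
```

In a real project the function would live in src and the tests in the matching file under tests; they share one block here only to keep the example self-contained.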

The only thing I would like to highlight is that we can take full advantage of our Conda environment to have the same isolated environment for our tests. The example repository contains a .travis.yml that does what we have already done to get the “exact” same environment. Whenever you open a Pull Request to the repository Travis will run the tests for you.

install:
  # Install Miniconda
  - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  # Create the environment
  - conda env create -f environment.yml
  - source activate data-science-project

Deployment

As the deployment process is different in every project and heavily depends on the deployment targets (e.g. classic server, container or cloud functions), I’ll leave it for another post and not go into great detail.

The simplest example for classic server deployments I came up with boils down to this:

Closing Thoughts

I used this setup for a two-week Python programming course to kick off a three-month Data Science Bootcamp for Neue Fische in Hamburg.

With this setup the students could develop in a company-like environment. Utilising GitHub and Travis CI they were able to open Pull Requests, do Reviews and wait for CI to become green.

The setup already proved itself useful, but still lacks some more sophisticated requirements of software development projects — like configuration and deployment environments, or the challenges of cloud deployments (e.g. classic server, container or cloud functions). Nevertheless, it’s a solid base to improve upon and because of its minimalistic nature, also easy to extend.

Author: Manuel Wiedenmann