Be Aware Of These Python Project Management Conventions
Discussing packaging, modules, imports, dependency management and virtual envs, linting, formatting, and versioning of Python projects.
Proper structuring of Python projects can often be a bit hard to grasp, especially for new Python developers. It can even be perplexing for people who’ve been in the arena for a while.
This article is all about Python packaging, dependency management, and how to navigate these concepts in building bigger and more complex projects involving data pipelines.
As we dive in, let’s first discuss…
What are Python packages?
A Python package is a collection of files and directories that includes the code, documentation, and other necessary files that make up our project. We use packages to facilitate sharing and reusing code within a project.
When we want to reuse complex code, we choose to use Python packages instead of script files or Jupyter notebooks. Script files can become cluttered and difficult to maintain, while notebooks are typically used for exploratory work and are not easily reusable.
Python projects may consist of multiple modules, each containing a specific set of related functions and variables built with reusability in mind. These reusable modules can be embedded into your own code with specific kinds of import statements, which we’ll get into later.
The Standard Package Manager
Pip is the standard package manager built into Python, and it can install packages from many different sources.

Pip automates package management by first resolving all dependencies and then installing the requested packages into either:

1. The default global environment that’s on your system path, or,
2. A virtual environment that you’ve specifically defined.
If you’ve used `pip install` before, then you’ve installed a package through the Python Package Index (PyPI).
It’s best to avoid installing packages into global environments because if you do, you’re making that package accessible to all projects, and there are high odds of dependency conflicts in the future.
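To see what that looks like in practice, here’s a minimal sketch of creating and using a project-local virtual environment with the standard `venv` module (the package name `requests` is just an example):

```shell
# create an isolated environment in a .venv directory
python -m venv .venv

# activate it (on Windows, use .venv\Scripts\activate instead)
source .venv/bin/activate

# pip now installs into .venv, not the global environment
pip install requests
```

Deleting the `.venv` directory removes the environment and everything installed into it, leaving your global interpreter untouched.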
The difference between modules and packages
All Python code written within a single `.py` script/file is a module. Individual modules can be cobbled together like building blocks to create a larger project.
As your codebase expands, it can become challenging to effectively manage and maintain all the code contained within a single module.
Packages offer a solution to this problem by allowing you to organize and divide your code into multiple modules, all while maintaining a sense of organization and reusability.
If you have, for instance, a `src/` directory for your project, a package can simply be created by creating a new directory within it.
So, this can be your new project structure:

```
src/
├── main.py
└── mypackage/
    ├── __init__.py
    └── mymodule.py
```
The `__init__.py` is a special file that tells the Python interpreter that `mypackage` is a Python package. It serves as an entry point to the package and is the first file to be executed when the package is imported.
Now, if your `mymodule.py` has a function such as `myfunction()`, and you want to import it into, for example, `main.py`, you can do:
```python
import mypackage.mymodule as mod

# do something with the function
mod.myfunction()
```
The special trick associated with `__init__.py` is that you can further simplify the import statement above by including this line of code in your `__init__.py`:
```python
from .mymodule import myfunction
```
Now, in your `main.py` import, you can do:
```python
from mypackage import myfunction

myfunction()
```
Highly convenient, isn’t it? It’s crucial to understand how packages and modules can be imported, and I hope this makes it clearer.
How to use virtual environments?
I’ve been a big proponent of Pipenv, but recently I’ve started recommending Poetry for nearly all use cases. Not only does it help you avoid a ton of headaches associated with dependency management, but it also streamlines the packaging and/or deployment of a project further down the line.
Using Poetry is like playing Python dependency management on easy mode.
One of the core files for handling dependencies is the `pyproject.toml` file. It’s a configuration file used by pip to install your package and its dependencies, and an alternative to the older `setup.py` file. Its format is much simpler than `setup.py`, making it easier to read and maintain.
The good news is that Poetry provides a simple and concise syntax for specifying dependencies and version constraints in your `pyproject.toml` file, and manages that file for you automatically.
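For illustration, a minimal Poetry-managed `pyproject.toml` might look like this (the project name, author, and dependencies are placeholders, not conventions Poetry requires):

```toml
[tool.poetry]
name = "my-poetry-project"
version = "0.1.0"
description = "A short description of the project"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"       # caret constraint: >=3.10, <4.0
requests = "^2.28"     # >=2.28.0, <3.0.0

[tool.poetry.group.dev.dependencies]
pytest = "^7.2"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

The caret (`^`) constraints are Poetry’s shorthand for “any compatible release,” which pairs naturally with the SemVer scheme discussed later.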
A Poetry-initiated project may look like this:

```
my-poetry-project
├── my_poetry_project
│   └── module1/
│       ├── __init__.py
│       └── mymodule.py
├── pyproject.toml
├── README.rst
└── tests
    ├── __init__.py
    └── test_my_poetry_project.py
```
When using Node.js, a `package-lock.json` file is created to lock dependency versions; similarly, Poetry creates a `poetry.lock` file to lock the versions of the dependencies it installs from PyPI.
Storing the version of every installed package helps Poetry resolve dependency conflicts while installing and updating packages and ensures consistency and repeatability of the process.
Your virtual environments are easily created and used with dedicated commands, and it’s insanely helpful to know that the versions are locked unless updated manually.
You can easily use a Makefile to execute commands for testing, updating, and installing regular and dev dependencies as needed.
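As a sketch, such a Makefile might wrap common Poetry commands like this (the target names are illustrative, and the `--only main` flag assumes Poetry 1.2+):

```makefile
install:   ## install runtime dependencies only
	poetry install --only main

dev:       ## install runtime and dev dependencies
	poetry install

update:    ## update dependencies and refresh poetry.lock
	poetry update

test:      ## run the test suite inside the project's venv
	poetry run pytest
```

With this in place, `make test` or `make update` becomes a single memorable command for everyone on the team.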
Collaborating on Projects
Structure
Version control plays a major role in ensuring a team is using the correct version of the code, and the correct dependencies, when working within a project.
Using Git lets your team track changes, collaborate on specific versions of the code, work independently on those versions, and merge them without creating conflicts with other members of the team. The complete history of the codebase, including changes made by every developer who’s ever made a commit on the project, is very useful for understanding the evolution of the code.
Typically, the project also contains a README file that provides a brief introduction to your project.
More comprehensive documentation can be maintained explicitly, using a documentation builder tool like Sphinx.
Testing in Python
Engaging in code testing provides numerous advantages. It boosts everyone’s confidence in the code’s expected behaviour and ensures that modifications to the code will not result in any regressions.
Pytest is a popular testing library in Python that guides you towards manifest dependency declarations that are still reusable through fixtures. Pytest fixtures are functions that can provide data, test doubles, or establish a system state for the test suite. Any test that requires a fixture must explicitly include this fixture function as an argument to the test function.
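Here’s a minimal sketch of that pattern (the fixture name, sample data, and `total_value` helper are made up for illustration; it assumes pytest is installed):

```python
import pytest

@pytest.fixture
def sample_records():
    # provides reusable test data to any test that asks for it by name
    return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

def total_value(records):
    # the code under test: sums the "value" field of each record
    return sum(r["value"] for r in records)

def test_total_value(sample_records):
    # pytest sees the "sample_records" parameter and injects
    # the fixture's return value automatically
    assert total_value(sample_records) == 30
```

Running `pytest` on this file collects `test_total_value` and supplies `sample_records` without any manual wiring, which is what makes fixtures reusable across a test suite.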
Linting and formatting in Python
Linting is a procedure that identifies bugs and style-related errors in your code. This process is performed by analysis tools called ‘linters’, which are widely accessible for all major programming languages.
Linters can highlight issues and violations of coding style rules in your code, similar to how a spell checker works for regular text.
Moreover, there are various ‘auto-formatters’ available alongside linters, which can carry out these checks for you and even make necessary changes automatically. Let’s take a look at two of these popular tools.
- Pylint: this library helps you look for errors and enforces a coding standard that is close to PEP8.
- Flake8: this library checks your source code for errors and violations of some of the PEP8 style conventions.
They are both very similar in basic functions and are even available as extensions within a code editor like VSCode.
An auto-formatter like Black is incredible for formatting your code and keeping it consistent within the entire ecosystem of your team or organization. Another great tool is isort, which automatically sorts import statements as needed.
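Both tools can be configured from the same `pyproject.toml` file; here’s a sketch with commonly used values (these are examples, not required settings):

```toml
[tool.black]
line-length = 88           # Black's default line length
target-version = ["py310"]

[tool.isort]
profile = "black"          # make isort's output compatible with Black
```

The `profile = "black"` setting matters: without it, isort and Black can disagree about how to wrap long import lines and undo each other’s changes.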
Versioning of your project
Like with linting, having a good versioning system helps speed up the development process, keep things organized, and avoid future headaches for your project.
Version numbers are important because:

- They help distinguish one version from another and communicate changes implicitly.
- Managing package dependencies becomes important when you have one version of the project in development, another in staging, and yet another in production; versioning is incredibly important in that respect.
One of the best versioning paradigms is the SemVer scheme, which goes as follows:
- Major release: like a 3.0.0, stating backwards incompatibility with previous versions like 2.9.0 or 2.0.0.
- Minor release: 3.1.0, stating the addition of features without breaking compatibility with previous versions (3.1.0 and 3.0.0, for example).
- Patch release: 3.1.1, stating a minor bug fix or security patch that is also backwards compatible with all releases within the same major release (i.e., 3.0.0 onwards).
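Since these version numbers are just ordered triples, they’re easy to compare programmatically. A minimal sketch (the `parse_version` helper is hypothetical, written for illustration rather than taken from any library):

```python
def parse_version(version):
    # split "3.1.1" into the integer tuple (3, 1, 1)
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

# tuples compare element-wise, which matches SemVer precedence
# for the major.minor.patch fields
assert parse_version("3.1.1") > parse_version("3.1.0")   # patch release
assert parse_version("3.1.0") > parse_version("3.0.0")   # minor release
assert parse_version("3.0.0") > parse_version("2.9.0")   # major release
```

Comparing strings directly would get this wrong (`"3.10.0" < "3.9.0"` lexicographically), which is exactly why the numeric tuple comparison matters.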
Thank you for reading! Here are some resources for further information on some of the topics we’ve discussed earlier:
What did you think of this issue? Let me know here in a comment or on my Twitter.