Structure of a python package#

Python Zen: “Namespaces are one honking great idea - let’s do more of those!”

We introduced the concept of python modules in a previous unit. Today we go into more detail and introduce Python “packages”, which contain more than one module and are the structure used by all larger Python libraries.

Revision: modules, namespaces and scopes#

Python modules are simply *.py files. They can contain executable statements as well as function definitions, in any order.

For example, a module called mymodule.py could look like:

import numpy as np

pi = 3.14
print('Module top level 1')

def circle_area(radius):
    """A cool function"""
    print('In function')
    return pi * radius**2

print('Module top level 2')

if __name__ == '__main__':
    
    print('In main script')
    print('Area of circle: {}'.format(circle_area(10)))

Exercise 23

Can you predict what will be printed on screen if (i) you run import mymodule in a python interpreter, or (ii) run python mymodule.py from the command line? If not, try it yourself!

The example should be self-explanatory, and we will discuss it in class. If you have any questions at this point, don’t hesitate to ask me! See this tutorial from RealPython for more details.

Continue to read only after the mechanisms happening in the example above are fully understood.

OK? Let’s go on then.

More on scopes#

Here is a more intricate example for mymodule.py:

def print_n():
    print('The number N in the function is: {}'.format(N))

N = 10
print_n()

Exercise 24

Will import mymodule run or will it fail with an error? Think about it for a little while, and if you don’t know for sure, try it out!

(spoiler protection)

Why does this example work, even though N is defined below the function definition? Here again, the order in which things happen when the module is imported is the important trick:

  1. the function print_n is detected and interpreted, but not executed (nobody called it)

  2. the variable N is assigned the value 10. It is now available at the module scope, i.e. an external module import would allow the command from mymodule import N to run without error

  3. The function print_n is called. In this function, the interpreter first looks for a variable called N in the local scope (i.e. at the function level). It doesn’t find one, and therefore falls back to the module level. Nice! A variable called N is found at the module level and printed.
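
For example, once mymodule.py (the version above) is available, an external script or interpreter session can reach the module level name directly. A small sketch; note that the import itself already runs the module’s top level code, which calls print_n once:

from mymodule import N, print_n

print(N)   # 10
print_n()  # The number N in the function is: 10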

Note that this will not always work. See the following example which builds upon the previous one:

def print_n():
    print('The number N in the function is: {}'.format(N))
    N += 1

N = 10
print_n()
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
/tmp/ipykernel_16897/2404790389.py in <cell line: 6>()
      4 
      5 N = 10
----> 6 print_n()

/tmp/ipykernel_16897/2404790389.py in print_n()
      1 def print_n():
----> 2     print('The number N in the function is: {}'.format(N))
      3     N += 1
      4 
      5 N = 10

UnboundLocalError: local variable 'N' referenced before assignment

So how is this example different from the one above, which worked fine as explained? We just added a line below the one that used to work before. Because N is now assigned to somewhere inside the function, the python interpreter treats N as a local variable throughout the whole function body, shadowing the module level one. This is decided when the function is defined, and the error is raised at execution time, independently of whether or not a global variable N is available.
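
If you do need the failing example to work, the usual fix is not the global statement (covered below) but simply passing the value around explicitly. A minimal sketch (the names are just illustrative):

def print_n(n):
    # n is a true local variable here: it is passed in as an argument
    print('The number N in the function is: {}'.format(n))
    return n + 1

N = 10
N = print_n(N)  # prints 10; N is now 11 at the module level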

Even if it might work to define module level variables anywhere, the recommended order of code in a module is:

  1. import statements

  2. module level variable definitions

  3. functions

  4. if necessary (rare): one or more function calls (e.g. to initialize some global values) and, last, the if __name__ == '__main__': block
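
A minimal sketch of a module following this recommended order could look like this (the names are placeholders):

# 1. import statements
import numpy as np

# 2. module level variable definitions
DEFAULT_RADIUS = 10

# 3. functions
def circle_area(radius):
    return np.pi * radius**2

# 4. (rare) function calls and the __main__ block
if __name__ == '__main__':
    print(circle_area(DEFAULT_RADIUS))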

But… If module functions can read variables at the module level, can they also change them? Here is another example messing around with global and local variables:

x = 2
y = 1

def func(x):
    x = x + y
    return x

print(func(3))
print(func(x))
print(x)
print(y)
4
3
2
1

What can we learn from this example? That the local (function) scope variable x has nothing to do with the global scope variable x. For the python interpreter the two are unrelated, and the fact that they share a name is irrelevant. What is relevant is the scope they are attached to (if you want to know which variables are currently available in your scope, check the built-in functions dir(), globals() and locals()).
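
A quick way to convince yourself of this is to ask the interpreter directly (a small sketch using the built-ins mentioned above):

x = 2

def func(x):
    # the argument x lives in the local (function) scope
    print('local x :', locals()['x'])
    # the module level x is untouched
    print('global x:', globals()['x'])

func(3)
# local x : 3
# global x: 2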

The global statement#

In very special cases, it might be useful for a function to change the value of a global variable. Examples include package-level parameter sets such as options and model parameters, which change the behavior of the package after being set. For example, see numpy’s set_printoptions function:

import numpy as np
a = np.array([1.123456789])
print(a)
np.set_printoptions(precision=4)
print(a)
[1.12345679]
[1.1235]

We changed the value of a variable at the module level (we don’t know its name but it isn’t relevant here) which is now taken into account by the numpy print function.

Let’s say we’d like to have a counter of the number of times a function has been called. We can do this with the following syntax:

count = 0

def func():
    global count  # without this, count would be local!
    count += 1

func()
func()
print(count)
2

Note that in practice, global variables that need updating are rarely single integers or floats like in this example. The reasons for this will be explained later on, once you’ve learned more about python packages and the import system.
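
One common pattern (just a sketch, not the only way to do it) is to store such settings in a mutable container, for example a dictionary in a dedicated configuration module (like the cfg.py module you will meet below). Updating the container in place does not rebind any module level name, so no global statement is needed:

# illustrative settings container, e.g. in a cfg.py module
settings = {'n_calls': 0, 'precision': 8}

def func():
    # mutating the dictionary in place: no global statement required
    settings['n_calls'] += 1

func()
func()
print(settings['n_calls'])  # 2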

Are global variables truly “global” in python?#

If by this question we mean "global" as in available everywhere, in any python script or interpreter, the short answer is no: there is no such thing as a "global" variable in python. The term "global variable" in python always refers to the module level, while "local variables" refer to the innermost scope (often, a function).

If you want to have access to a module’s top-level variable (or function), you have to import it. This system ensures very clear and unpolluted namespaces, where everything can be traced to its source:

import numpy
import math
import scipy

print(math.pi, 'from the math module')
print(numpy.pi, 'from the numpy package')
print(scipy.pi, 'from the scipy package')
3.141592653589793 from the math module
3.141592653589793 from the numpy package
3.141592653589793 from the scipy package

The only exception to the import rule is built-in functions, which are available everywhere and have their own scope. If you want to know more about the four different python scopes, read this blog post by Sebastian Raschka: A Beginner’s Guide to Python’s Namespaces, Scope Resolution, and the LEGB Rule.
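
As a small illustration, built-in names like len live in their own builtins namespace, which is searched last:

import builtins

# len is available without any import...
print(len([1, 2, 3]))       # 3
# ...because it lives in the builtins namespace
print(builtins.len is len)  # True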

Packages#

From the documentation:

Packages are a way of structuring Python’s module namespace by using “dotted module names”. For example, the module name A.B designates a submodule named B in a package named A. Just like the use of modules saves the authors of different modules from having to worry about each other’s global variable names, the use of dotted module names saves the authors of multi-module packages like NumPy or Xarray from having to worry about each other’s module names.

Packages can also be used to organize bigger projects into thematic groups. SciPy for example has more than 12 subpackages, some of them organized into sub-subpackages.
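
For example, with scipy installed, the dotted syntax lets you address a submodule explicitly:

# the dotted name scipy.ndimage designates the submodule
# ndimage inside the scipy package
import scipy.ndimage
from scipy import ndimage  # equivalent, with a shorter name to type

print(scipy.ndimage is ndimage)  # True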

Now read the few paragraphs from the python documentation on packages (estimated reading time: 5 min).

You may be asking yourself: what can I use packages for? Well, for one, it is necessary to understand their basic structure in order to be able to read someone else’s code, for example in the xarray library. Second, because I think that you have everything to gain by organizing your personal code (analysis routines, plot functions…) into a single package that you can import from anywhere (e.g. in a Jupyter Notebook or from a different working directory). This is why the ClimVis project will have you write a small package.

The structure of a package#

I’ve written a simple package template called “scispack” to help you get started with your future packages. You will find the code on github, together with a link to download it (green button in the top right). It is released in the public domain, feel free to use it for your projects. You will have more time to get familiar with it during the assignments (the climvis package is based on it). Here, I’ll simply repeat its basic structure:

Directory root (./)

  • .gitignore: for git users only

  • LICENSE.txt: always license your code

  • README.md: this page

  • setup.py: this is what makes your package installable by pip. It contains a set of simple instructions regarding e.g. the name of the package, its version number, or where to find command line scripts (a minimal sketch is shown after this listing)

The actual package (./scispack)

  • __init__.py: tells python that the directory is a package and enables the “dotted module names” import syntax. It is often empty, but here we added some entry points to the package’s API and the version string.

  • cfg.py, utils.py and cli.py: the modules

  • cli.py: entry point for the command line interface

  • cfg.py: container module for the package parameters and constants

The tests (./scispack/tests)

One test file per module. Their names have to start with test_ in order to be detected by pytest.
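
To give you an idea, a minimal setup.py could look like the sketch below. This is only a sketch: the real template contains more metadata (author, license, entry points for the command line interface, etc.):

# ./setup.py -- minimal sketch
from setuptools import setup, find_packages

setup(
    name='scispack',
    version='0.0.1',
    packages=find_packages(),  # finds ./scispack (and its tests)
)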

Installing a local package#

By starting your python interpreter from the root directory of a package (for example the template package) you will have access to the familiar syntax (e.g. from scispack.utils import area_of_circle). But if you start an interpreter from anywhere else the package won’t be available.

Remember what we wrote about the python sys.path a couple of weeks ago? In order to “install” the package we have two options:

  1. we add the path to the package directory to sys.path (see the sketch after this list)

  2. we copy the package into a folder already listed in sys.path
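
To illustrate option 1 (a throwaway sketch; the path is hypothetical and the change only lasts for the current interpreter session):

import sys

# option 1: tell this interpreter session where the package lives
sys.path.append('/path/to/the/scispack/root/directory')  # hypothetical path

from scispack.utils import area_of_circle  # now works from anywhere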

Of the two solutions, number 2 is by far the easiest and most sustainable. In fact, this is what happens when you do pip install packagename or conda install packagename. The two commands are very similar in that they look for the package in an online repository, download it, and copy it into a folder already listed in the current environment’s sys.path. If you want to know where an installed package is located, you can do:

import numpy
numpy.__file__
'/home/mowglie/.miniconda3/envs/py3/lib/python3.10/site-packages/numpy/__init__.py'

The same installation options are available for our self-made package. The simplest is to navigate to our package’s root directory and run:

$ pip install -e .

The pip install . command will look for a setup.py file in the current folder (this is why the dot . is used) and if found, use it to determine the package’s name and other installation options. The -e optional argument installs the package in “editable” or “development” mode. In simple terms, this option will create a symbolic link to the package directory instead of copying the files. Therefore, any changes to the code will always be available the next time you open a new python interpreter.
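
Once installed this way, the package can be imported from any directory. A quick check, mirroring the numpy example above:

import scispack
print(scispack.__file__)
# with an editable install, this path points to your working copy
# of the code, not to a copy in site-packages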

Tip

pip install -e . is the recommended way to install any local package, in pip or in conda environments. At the university (or on computers where you don’t have super-user permissions), use pip install --user -e .

Advanced applications: packaging and sharing code#

The simple template you are going to start with in ClimVis is not fundamentally different from larger packages like xarray or numpy. There are a couple of big differences though, and I’ll list some of them here:

  • rioxarray (as a pure python package) has a setup.py much like yours. It will have some more options related to the version control of the package and will have a separate folder for HTML documentation. The rest of the root directory files are related to testing and continuous integration.

  • numpy (as a mix of python and C code) is considerably more complex to deploy. Installing a development version of numpy implies compiling C code, which is easy to do on linux machines but takes quite some time.

Tools like pip or conda hide all these things from the users, fortunately. They ship pre-compiled binaries and take care of most of the details in the background. This hasn’t always been that easy though, and an xkcd comic reminds us that installing python packages can still be a mess sometimes:

(image: xkcd comic about the Python packaging and environment mess)

Take home points#

  • python has very clear rules regarding the scope of variables, leading to clearly defined namespaces

  • there are no "truly global" variables in python, only namespaces

  • a package is a way to organize several modules under the same namespace. It allows you to construct nested imports, like from A.B import C

  • making your package installable simply requires complying with a couple of simple rules, including defining a setup.py installation file in the root directory

  • I recommend installing local packages with pip install -e .