# Structure of a Python package
> Python Zen: “Namespaces are one honking great idea - let’s do more of those!”
We introduced the concept of Python modules in a previous unit. Today we go into more detail and introduce Python “packages”, which contain more than one module and are the structure used by all larger Python libraries.
## Revision: modules, namespaces and scopes
Python modules are simply `*.py` files. They can contain executable statements as well as function definitions, in any order. For example, a module called `mymodule.py` could look like:
```python
import numpy as np

pi = 3.14

print('Module top level 1')

def circle_area(radius):
    """A cool function"""
    print('In function')
    return pi * radius**2

print('Module top level 2')

if __name__ == '__main__':
    print('In main script')
    print('Area of circle: {}'.format(circle_area(10)))
```
Can you predict what will be printed on screen if (i) you run `import mymodule` in a Python interpreter, or (ii) run `python mymodule.py` from the command line? If not, try it yourself!
The example should be self-explanatory, and we will discuss it in class. If you have any questions at this point, don’t hesitate to ask me! See this tutorial from RealPython for more details.
Continue reading only once the mechanisms at play in the example above are fully understood.
…
OK? Let’s go on then.
## More on scopes
Here is a more intricate example for `mymodule.py`:
```python
def print_n():
    print('The number N in the function is: {}'.format(N))

N = 10

print_n()
```
Will `import mymodule` run, or will it fail with an error? Think about it for a little while, and if you don’t know for sure, try it out!
…
(spoiler protection)
…
Why does this example work, even though `N` is defined below the function definition? Here again, the order in which things happen when the module is imported is the important trick:
1. The function `print_n` is detected and interpreted, but not executed (nobody called it).
2. The variable `N` is assigned the value `10`. It is now available at the module scope, i.e. an external module import would allow the command `from mymodule import N` to run without error.
3. The function `print_n` is called. In this function, the interpreter looks for a local scope variable (i.e. at the function level) called `N`. It doesn’t find it, and therefore looks at the module level. Nice! A variable called `N` is found at the module level and printed.
Note that this will not always work. See the following example which builds upon the previous one:
```python
def print_n():
    print('The number N in the function is: {}'.format(N))
    N += 1

N = 10

print_n()
```
```
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[1], line 6
      3     N += 1
      5 N = 10
----> 6 print_n()

Cell In[1], line 2, in print_n()
      1 def print_n():
----> 2     print('The number N in the function is: {}'.format(N))
      3     N += 1

UnboundLocalError: cannot access local variable 'N' where it is not associated with a value
```
So how is this example different from the one above, which worked fine as explained? We just added a line below the one that used to work before. So now there is a variable `N` in the function, and it overrides the module-level one. The Python interpreter detects that the function assigns to this variable, makes the name local to the function, and raises an error at execution, regardless of whether or not a global variable `N` is available.
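This decision is made when the function is compiled, not when it runs: a name assigned anywhere in a function body is local for the whole function. A minimal sketch:

```python
N = 10

def read_only():
    # No assignment to N anywhere in this function:
    # N resolves to the module level
    return N

def assigns_below():
    # N is assigned further down, so Python treats the name as local
    # for the whole function body; this line raises UnboundLocalError
    value = N
    N = 20
    return value

print(read_only())  # 10

try:
    assigns_below()
except UnboundLocalError as err:
    print('Caught:', err)
```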
Even though it might work to define module-level variables anywhere, the recommended order of code in a module is:

1. import statements
2. module-level variable definitions
3. functions
4. if necessary (rare): one or more function calls (e.g. to initialize some global values)
5. the `if __name__ == '__main__':` block
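Put together, a minimal module following this order could look like the sketch below (all names are illustrative):

```python
"""Example module illustrating the recommended code order."""

# 1. import statements
import math

# 2. module-level variable definitions
DEFAULT_RADIUS = 1.0

# 3. functions
def circle_area(radius=DEFAULT_RADIUS):
    """Return the area of a circle of the given radius."""
    return math.pi * radius ** 2

# 4. (rare) function calls to initialize global values would go here

# 5. the main entry point
if __name__ == '__main__':
    print('Area of circle: {}'.format(circle_area(10)))
```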
But… If module functions can read variables at the module level, can they also change them? Here is another example messing around with global and local variables:
```python
x = 2
y = 1

def func(x):
    x = x + y
    return x

print(func(3))
print(func(x))
print(x)
print(y)
```
```
4
3
2
1
```
What can we learn from this example? That the local (function) scope variable `x` has nothing to do with the global scope variable `x`. For the Python interpreter the two are unrelated, and their shared name is irrelevant. What is relevant is which scope they are attached to (if you want to know which variables are currently available in your scope, check the built-in functions `dir()`, `globals()` and `locals()`).
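These introspection tools make the two separate `x` variables visible. A minimal sketch:

```python
x = 2

def func(x):
    # At this moment the local scope contains exactly one name:
    # the parameter x, unrelated to the module-level x
    names = sorted(locals())
    return x + 1, names

result, names = func(10)
print(result)           # 11
print(names)            # ['x']
print(globals()['x'])   # 2: the module-level x is untouched
```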
## The `global` statement
In very special cases, it might be useful for a function to change the value of a global variable. Examples include package level parameter sets such as options and model parameters which will change the behavior of the package after being set. For example, see numpy’s set_printoptions function:
```python
import numpy as np

a = np.array([1.123456789])
print(a)

np.set_printoptions(precision=4)
print(a)
```
```
[1.12345679]
[1.1235]
```
We changed the value of a variable at the numpy module level (we don’t know its name, but that isn’t relevant here), which is now taken into account by numpy’s print function.
Let’s say we’d like to have a counter of the number of times a function has been called. We can do this with the following syntax:
```python
count = 0

def func():
    global count  # without this, count would be local!
    count += 1

func()
func()
print(count)
```
```
2
```
Note that in practice, global variables that need updating are rarely single integers or floats like in this example. The reasons for this will be explained later on, once you’ve learned more about python packages and the import system.
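To give a flavor of what this looks like in practice (with a hypothetical `PARAMS` container): module-level state usually lives in a mutable object such as a dict. Mutating the object in place does not rebind the name, so no `global` statement is needed; `global` is only required when a function rebinds the name itself.

```python
# Hypothetical package-level parameter container
PARAMS = {'n_calls': 0, 'precision': 8}

def func():
    # This mutates the dict in place; the name PARAMS is never
    # rebound, so no 'global' statement is required
    PARAMS['n_calls'] += 1

func()
func()
print(PARAMS['n_calls'])  # 2
```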
## Are global variables truly “global” in Python?
If by this question we mean “global” as in available everywhere in any Python script or interpreter, the short answer is no, there is no such thing as a “global” variable in Python. The term “global variable” in Python always refers to the module level, while “local variables” refer to the embedded scope (often, a function).
If you want to have access to a module’s top-level variable (or function), you have to import it. This system ensures very clear and unpolluted namespaces, where everything can be traced to its source:
```python
import numpy
import math
import scipy

print(math.pi, 'from the math module')
print(numpy.pi, 'from the numpy package')
print(scipy.pi, 'from the scipy package')
```
```
3.141592653589793 from the math module
3.141592653589793 from the numpy package
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.mambaforge/envs/oggm_env/lib/python3.12/site-packages/scipy/__init__.py:137, in __getattr__(name)
    136 try:
--> 137     return globals()[name]
    138 except KeyError:

KeyError: 'pi'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[5], line 7
      5 print(math.pi, 'from the math module')
      6 print(numpy.pi, 'from the numpy package')
----> 7 print(scipy.pi, 'from the scipy package')

File ~/.mambaforge/envs/oggm_env/lib/python3.12/site-packages/scipy/__init__.py:139, in __getattr__(name)
    137     return globals()[name]
    138 except KeyError:
--> 139     raise AttributeError(
    140         f"Module 'scipy' has no attribute '{name}'"
    141     )

AttributeError: Module 'scipy' has no attribute 'pi'
```
The only exception to the import rule is built-in functions, which are available everywhere and have their own scope. If you want to know more about the four different Python scopes, read this blog post by Sebastian Raschka: A Beginner’s Guide to Python’s Namespaces, Scope Resolution, and the LEGB Rule.
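The built-in scope can be inspected through the standard `builtins` module. A quick sketch:

```python
import builtins

# Built-in names live in their own namespace, consulted last
# in the name lookup (the 'B' of the LEGB rule)
print(builtins.len is len)        # True
print('print' in dir(builtins))   # True

# A module-level name shadows the built-in of the same name...
len = lambda obj: 'shadowed!'
print(len([1, 2, 3]))             # shadowed!

# ...and deleting it makes the built-in visible again
del len
print(len([1, 2, 3]))             # 3
```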
## Packages
From the documentation:
> Packages are a way of structuring Python’s module namespace by using “dotted module names”. For example, the module name `A.B` designates a submodule named `B` in a package named `A`. Just like the use of modules saves the authors of different modules from having to worry about each other’s global variable names, the use of dotted module names saves the authors of multi-module packages like NumPy or Xarray from having to worry about each other’s module names.
Packages can also be used to organize bigger projects into thematic groups. SciPy for example has more than 12 subpackages, some of them being organized in sub-subpackages.
Now read the few paragraphs from the Python documentation on packages (estimated reading time: 5 min).
You may be asking yourself: what can I use packages for? Well, for one, it is necessary to understand their basic structure in order to be able to read someone else’s code, for example in the `xarray` library. Second, I think you have everything to win by organizing your personal code (analysis routines, plot functions…) into a single package that you can import from anywhere (e.g. in a Jupyter Notebook or from a different working directory). This is why the ClimVis project will bring you to write a small package.
## The structure of a package
I’ve written a simple package template called “scispack” to help you get started with your future packages. You will find the code on GitHub, along with a link to download it (green button in the top right). It is released in the public domain: feel free to use it for your projects. You will have more time to get familiar with it during the assignments (the `climvis` package is based on it). Here, I’ll simply repeat its basic structure:
**Directory root (`./`)**

- `.gitignore`: for git users only
- `LICENSE.txt`: always license your code
- `README.md`: this page
- `setup.py`: this is what makes your package installable by `pip`. It contains a set of simple instructions regarding e.g. the name of the package, its version number, or where to find command line scripts

**The actual package (`./scispack`)**

- `__init__.py`: tells Python that the directory is a package and enables the “dotted module names” import syntax. It is often empty, but here we added some entry points to the package’s API and the version string.
- `cfg.py`, `utils.py` and `cli.py`: the modules
- `cli.py`: entry point for the command line interface
- `cfg.py`: container module for the package parameters and constants

**The tests (`./scispack/tests`)**

- One test file per module. Their name has to start with `test_` in order to be detected by pytest.
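For illustration, a test module in this folder could look like the sketch below (a stand-in `area_of_circle` is defined here so the snippet is self-contained; in the real template the function would be imported from `scispack.utils`):

```python
# scispack/tests/test_utils.py (sketch)
import math

def area_of_circle(radius):
    """Stand-in for the package function, defined here for illustration."""
    return math.pi * radius ** 2

def test_area_of_circle():
    # pytest collects functions whose name starts with 'test_'
    assert area_of_circle(1) == math.pi
    assert abs(area_of_circle(2) - 4 * math.pi) < 1e-12
```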
## Installing a local package
By starting your Python interpreter from the root directory of a package (for example the template package) you will have access to the familiar syntax (e.g. `from scispack.utils import area_of_circle`). But if you start an interpreter from anywhere else, the package won’t be available.
Remember what we wrote about Python’s `sys.path` a couple of weeks ago? In order to “install” the package, we have two options:

1. we add the path to the package directory to `sys.path`
2. we copy the package into a folder already listed in `sys.path`
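You can print the folders Python searches; the `site-packages` directory of the active environment (where pip and conda copy packages) will be among them:

```python
import sys

# The interpreter searches these folders, in order, when importing
for path in sys.path:
    print(path)
```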
Of the two solutions, number 2 is by far the easiest and most sustainable. In fact, this is what happens when you run `pip install packagename` or `conda install packagename`. The two commands are very similar in that they look for the package in an online repository, download it, and copy it into the current environment’s `sys.path`. If you want to know where an installed package is located, you can do:
```python
import numpy
numpy.__file__
```
```
'/Users/uu23343/.mambaforge/envs/oggm_env/lib/python3.12/site-packages/numpy/__init__.py'
```
The same installation options are offered to us for our self-made package. The simplest is to navigate to our package’s root directory and run:
```shell
$ pip install -e .
```
The `pip install .` command will look for a `setup.py` file in the current folder (this is why the dot `.` is used) and, if found, use it to determine the package’s name and other installation options. The `-e` optional argument installs the package in “editable” or “development” mode. In simple terms, this option creates a symbolic link to the package directory instead of copying the files. Therefore, any changes to the code will be available the next time you open a new Python interpreter.
> **Tip:** `pip install -e .` is the recommended way to install any local package, in pip or in conda environments. At the university (or on computers where you don’t have super-user permissions), use `pip install --user -e .`
## Advanced applications: packaging and sharing code
The simple template you are going to start with in ClimVis is not fundamentally different from larger packages like xarray or numpy. There are a couple of big differences though, and I’ll list some of them here:
- rioxarray (as a pure Python package) has a `setup.py` much like yours. It has some more options related to the version control of the package, and a separate folder for the HTML documentation. The rest of the root directory files are related to testing and continuous integration.
- numpy (as a mix of Python and C code) is quite a bit more complex to deploy. Installing a development version of numpy implies compiling some C code, which is quite easy to do on Linux machines but takes quite some time.
Tools like `pip` or `conda` fortunately hide all these things from the users. They ship pre-compiled binaries and take care of most of the details in the background. This hasn’t always been so easy though, and a recent post by xkcd reminds us that installing Python packages can still be a mess sometimes:
## Take home points
- Python has very clear rules regarding the scope of variables, leading to clearly defined namespaces
- there are no “truly global” variables in Python, only namespaces
- a package is a way to organize several modules under the same namespace. It allows the construction of nested modules, like `from A.B import C`
- making your package installable simply requires complying with a couple of simple rules, including defining a `setup.py` installation file in the root directory
- I recommend installing local packages with `pip install -e .`