Object-Oriented Programming: why?

In the first OOP unit you learned the basic semantics of OOP in Python. This week we will provide concrete examples of the use of objects in Python (and other OOP languages) and discuss arguments for and against the use of OOP in Python.

Introduction

OOP is a tool that, when used wisely, can help you structure your programs in a way that may be more readable, easier to maintain and more flexible than purely procedural programs. But why is that so? In this class we will provide examples for five core concepts of OOP:

  • abstraction
  • encapsulation
  • modularity
  • polymorphism
  • inheritance

Abstraction

Data abstraction refers to the separation between the abstract properties of an object and its internal representation. By giving a name to things and hiding unnecessary details from the user, objects provide an intuitive interface to concepts which might be extremely complex internally.

Going back to our examples from last week: we used the term "objects" in programming as a surrogate for actual "objects" in the real world: a cat, a pen, a car... These objects have a "state" (in OOP: attributes) and perform actions (in OOP: methods). For a pen, the attributes could be: the color of the ink, the volume of remaining ink, the size of the point, etc. The actions (methods) could be: write(), fill_ink(), etc.

Leaving the real world and entering the programming world: if a concept in your program is easily describable in terms of "state" and "actions", it might be a good candidate for writing a class.
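To make the pen example concrete, here is a minimal sketch of what such a class could look like. The class name Pen, the default values and the ink cost per character are made up for illustration:

```python
class Pen:
    """A pen described by its state (attributes) and its actions (methods)."""

    def __init__(self, color='blue', ink_volume=1.0, point_size=0.5):
        self.color = color            # color of the ink
        self.ink_volume = ink_volume  # remaining ink (arbitrary units)
        self.point_size = point_size  # size of the point (mm)

    def write(self, text):
        """Return the written text and use up some ink."""
        cost = 0.01 * len(text)  # made-up ink cost per character
        if cost > self.ink_volume:
            raise ValueError('Not enough ink left!')
        self.ink_volume -= cost
        return text

    def fill_ink(self, volume=1.0):
        """Refill the pen to the given volume."""
        self.ink_volume = volume


pen = Pen(color='black')
pen.write('hello!')
print(pen.color, round(pen.ink_volume, 2))  # black 0.94
```

The user of the pen only needs to know about write() and fill_ink(): how the ink consumption is computed is a detail of the class.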

Let's take a widely used object in Python, an instance of the class str (a string):

In [1]:
a = 'hello!'

The "state" of our object is relatively simple to describe: it is the text (a sequence of characters) stored in the object. We have access to this state (we can print its values) but the way these values are stored in memory is abstracted away. We don't care about the details, we just want a string. Now, a string provides many actions:

In [2]:
a.capitalize()
Out[2]:
'Hello!'
In [3]:
a.capitalize().istitle()
Out[3]:
True
In [4]:
a.split('l')
Out[4]:
['he', '', 'o!']

Abstractions should be as simple and well defined as possible. Sometimes there is more than one possible way to provide an abstraction to the user, and it becomes a debate among the programmers of a project whether these abstractions are useful or not.

Well-defined abstractions can be composed together. A good example is provided by the xarray library: an xarray.Dataset is composed of several xarray.DataArray objects. Each xarray.DataArray stores data (a numpy.ndarray object) together with coordinates (other numpy.ndarray objects) and attributes (units, name, etc.). This chain of abstractions is possible only if each of these concepts has a clearly defined role: xarray does not mess around with the numbers in arrays; numpy does. Conversely, numpy does not care whether an array has coordinates or not; xarray does.
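The idea of composing abstractions can be sketched with two toy classes. These are hypothetical stand-ins written in pure Python to illustrate the principle; they are not how xarray is implemented, and the names LabeledArray, LabeledDataset and value_at are invented for this example:

```python
class LabeledArray:
    """Toy stand-in for the idea behind xarray.DataArray (hypothetical)."""

    def __init__(self, data, coords, attrs=None):
        if len(data) != len(coords):
            raise ValueError('data and coords must have the same length')
        self.data = list(data)          # the numbers (xarray delegates this to numpy)
        self.coords = list(coords)      # one coordinate label per value
        self.attrs = dict(attrs or {})  # units, name, etc.

    def value_at(self, coord):
        """Return the value stored at a given coordinate label."""
        return self.data[self.coords.index(coord)]


class LabeledDataset:
    """Toy stand-in for the idea behind xarray.Dataset (hypothetical)."""

    def __init__(self, **arrays):
        self.arrays = arrays  # a named collection of LabeledArray objects

    def __getitem__(self, name):
        return self.arrays[name]


temp = LabeledArray([271.3, 274.8], coords=[2000, 2001], attrs={'units': 'K'})
prcp = LabeledArray([812., 790.], coords=[2000, 2001], attrs={'units': 'mm'})
ds = LabeledDataset(temp=temp, prcp=prcp)
print(ds['temp'].value_at(2001), ds['temp'].attrs['units'])  # 274.8 K
```

Each class has one job: the list stores the numbers, LabeledArray attaches labels to them, and LabeledDataset groups several labeled arrays under a name.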

Encapsulation

Encapsulation is tied to the concept of abstraction. By hiding the internal implementation of a class behind a defined interface, users of the class do not need to know details about the internals of the class to use it. The implementation of the class can be changed (or internal data can be modified) without having to change the code of the users of the class.
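As a minimal sketch of this idea, consider a hypothetical Thermometer class (the name and the choice of units are assumptions made for this example). Users only read and set the temperature in degrees Celsius, while the class is free to store the value however it likes:

```python
class Thermometer:
    """Hypothetical class: the public interface is the `celsius` property."""

    def __init__(self, celsius):
        # Internal detail: the value is stored in Kelvin.
        # Users of the class never need to know about it.
        self._kelvin = celsius + 273.15

    @property
    def celsius(self):
        """Temperature in degrees Celsius (computed on access)."""
        return self._kelvin - 273.15

    @celsius.setter
    def celsius(self, value):
        self._kelvin = value + 273.15


t = Thermometer(20.0)
t.celsius = 25.0
print(round(t.celsius, 2))  # 25.0
```

If we later decide to store the value in degrees Celsius directly, only the body of the class changes: code using t.celsius keeps working unmodified.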

In Python, encapsulation is more difficult to enforce than in other languages like Java. Indeed, Java implements the concept of private methods and attributes, which are hidden from the user by definition. In Python, nothing is hidden from the user. However, developers make use of conventions to inform the users that a method or attribute is meant to be used by the class alone, not by users. Let's take an xarray DataArray as an example:

In [5]:
import xarray as xr
import numpy as np
da = xr.DataArray([1, 2, 3])
In [6]:
np.array(dir(da))  # we use np.array to prevent lengthy print
Out[6]:
array(['T', '_DataArray__default', '_DataArray__default_name',
       '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__array__',
       '__array_priority__', '__array_ufunc__', '__array_wrap__',
       '__bool__', '__class__', '__complex__', '__contains__', '__copy__',
       '__dask_graph__', '__dask_keys__', '__dask_optimize__',
       '__dask_postcompute__', '__dask_postpersist__',
       '__dask_scheduler__', '__deepcopy__', '__delattr__', '__delitem__',
       '__dict__', '__dir__', '__div__', '__doc__', '__enter__', '__eq__',
       '__exit__', '__float__', '__floordiv__', '__format__', '__ge__',
       '__getattr__', '__getattribute__', '__getitem__', '__gt__',
       '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__',
       '__imul__', '__init__', '__int__', '__invert__', '__ior__',
       '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__',
       '__le__', '__len__', '__long__', '__lt__', '__mod__', '__module__',
       '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__',
       '__pos__', '__pow__', '__radd__', '__rand__', '__reduce__',
       '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmod__',
       '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__',
       '__rxor__', '__setattr__', '__setitem__', '__sizeof__', '__str__',
       '__sub__', '__subclasshook__', '__truediv__', '__weakref__',
       '__xor__', '_all_compat', '_attr_sources', '_binary_op',
       '_calc_assign_results', '_copy_attrs_from',
       '_cum_extra_args_docstring', '_dask_finalize',
       '_from_temp_dataset', '_get_axis_num', '_getitem_coord',
       '_groupby_cls', '_in_memory', '_initialized', '_inplace_binary_op',
       '_ipython_key_completions_', '_item_key_to_dict', '_item_sources',
       '_iter', '_level_coords', '_reduce_extra_args_docstring',
       '_reduce_method', '_replace', '_replace_indexes',
       '_replace_maybe_drop_dims', '_resample_cls',
       '_resample_immediately', '_result_name', '_rolling_cls',
       '_title_for_slice', '_to_dataset_split', '_to_dataset_whole',
       '_to_temp_dataset', '_unary_op', 'all', 'any', 'argmax', 'argmin',
       'argsort', 'assign_attrs', 'assign_coords', 'astype', 'attrs',
       'bfill', 'broadcast_equals', 'chunk', 'chunks', 'clip', 'close',
       'combine_first', 'compute', 'conj', 'conjugate', 'coords', 'copy',
       'count', 'cumprod', 'cumsum', 'data', 'diff', 'differentiate',
       'dim_0', 'dims', 'dot', 'drop', 'dropna', 'dt', 'dtype',
       'encoding', 'equals', 'expand_dims', 'ffill', 'fillna',
       'from_cdms2', 'from_dict', 'from_iris', 'from_series',
       'get_axis_num', 'get_index', 'groupby', 'groupby_bins',
       'identical', 'imag', 'indexes', 'interp', 'interp_like',
       'interpolate_na', 'isel', 'isel_points', 'isin', 'isnull', 'item',
       'load', 'loc', 'max', 'mean', 'median', 'min', 'name', 'nbytes',
       'ndim', 'notnull', 'persist', 'pipe', 'plot', 'prod', 'quantile',
       'rank', 'real', 'reduce', 'reindex', 'reindex_like', 'rename',
       'reorder_levels', 'resample', 'reset_coords', 'reset_index',
       'roll', 'rolling', 'round', 'searchsorted', 'sel', 'sel_points',
       'set_index', 'shape', 'shift', 'size', 'sizes', 'sortby',
       'squeeze', 'stack', 'std', 'sum', 'swap_dims', 'to_cdms2',
       'to_dataframe', 'to_dataset', 'to_dict', 'to_index', 'to_iris',
       'to_masked_array', 'to_netcdf', 'to_pandas', 'to_series',
       'transpose', 'unstack', 'values', 'var', 'variable', 'where'],
      dtype='<U28')

In this long list of methods and attributes, some are part of the public, documented interface. For example:

In [7]:
da.values
Out[7]:
array([1, 2, 3])

Other methods/attributes start with one underscore. This underscore has no special meaning in the Python language other than a warning to the users: "if you use this method/attribute, do it at your own risk". For example:

In [8]:
da._in_memory
Out[8]:
True

_in_memory is an attribute which is meant for internal use in the class (it is called private). Setting it to another value might have unpredictable consequences, and relying on it for your own code is not recommended: the xarray developers might rename it or change it without notice.
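The convention can be illustrated with a small, hypothetical class (Account and its attributes are invented for this example). It also shows where names like _DataArray__default in the listing above come from: attributes starting with two underscores (and not ending with two) are renamed by Python to _ClassName__attribute, a mechanism called name mangling. This makes accidental access less likely, but it still does not hide anything:

```python
class Account:
    """Hypothetical class illustrating Python's naming conventions."""

    def __init__(self, balance):
        self.balance = balance   # public: part of the documented interface
        self._log = []           # one underscore: internal, use at your own risk
        self.__token = 'secret'  # two leading underscores: name-mangled


acc = Account(100)
print(acc._log)                 # []: accessible, nothing stops you
print(acc._Account__token)      # secret: reachable under its mangled name
print(hasattr(acc, '__token'))  # False: the unmangled name does not exist
```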

The methods having two trailing and leading underscores have a special meaning in Python and are part of the language specifications. We already encountered __init__ for our class instantiation, and we will talk about the others later in this chapter.

Modularity

Modularity is a technique to separate different portions of the program (modules) based on some logical boundary. Modularity is a general principle in programming, although object-oriented programming typically makes it more explicit. Taking the example of xarray.DataArray and numpy.ndarray: both classes have very clear domains of functionality. The latter shines at doing fast numerical computations on arrays, the former provides an intuitive abstraction to the internal arrays by giving names and coordinates to their axes.

Polymorphism

Polymorphism is the technique of creating multiple classes that obey the same interface. Objects from these classes can then be mixed at runtime. In other words, polymorphism originates from the fact that a certain action can have well-defined but different meanings depending on the object it applies to.

An example of polymorphism is provided by the addition in Python:

In [9]:
1 + 1
Out[9]:
2
In [10]:
1 + 1.2
Out[10]:
2.2
In [11]:
[1, 2] + [3, 4]
Out[11]:
[1, 2, 3, 4]
In [12]:
np.array([1, 2]) + [3, 4]
Out[12]:
array([4, 6])

Each of these addition operations performs a different action depending on the objects it is applied to. OOP relies on polymorphism to provide higher levels of abstraction. In our Cat and Dog example from last week, both classes provided a say_name() method: the internal implementation, however, was different in each case.
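As a reminder, here is a minimal sketch of what these two classes could look like. The exact implementation from last week may differ; the sentences returned by say_name() are made up for this example:

```python
class Cat:
    def __init__(self, name):
        self.name = name

    def say_name(self):
        return 'Meow! My name is {}.'.format(self.name)


class Dog:
    def __init__(self, name):
        self.name = name

    def say_name(self):
        return 'Woof! My name is {}.'.format(self.name)


# Objects of both classes can be mixed: the loop only relies on the
# shared interface (say_name), not on the actual class of each object.
pets = [Cat('Kitty'), Dog('Rex')]
for pet in pets:
    print(pet.say_name())
```

The loop does not need to check whether pet is a Cat or a Dog: each object knows how to say its own name.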

Many OOP languages (including Python) provide powerful tools for the purpose of polymorphism. One of them is operator overloading:

In [13]:
class ArrayList(list):
    def __add__(self, other):
        """Don't do this at home!"""
        return [a + b for a, b in zip(self, other)]

What did we just do? The class definition (class ArrayList(list)) indicates that we created a subclass of the parent class list, a well-known data type in Python. Our child class has all the attributes and methods of the original parent class:

In [14]:
a = ArrayList([1, 2, 3])
In [15]:
len(a)
Out[15]:
3

Now, we defined a method __add__, which allows us to do some Python magic: __add__ is the method that Python calls on the left operand when two objects are added together. For two lists, this means that the two statements below are equivalent:

In [16]:
[1] + [2]
Out[16]:
[1, 2]
In [17]:
[1].__add__([2])
Out[17]:
[1, 2]

Now, what does that mean for the addition on our ArrayList class? Let's try and find out:

In [18]:
a + [11, 12, 13]
Out[18]:
[12, 14, 16]

We just defined a new way to perform additions on lists! This is a very powerful mechanism, but it comes with great responsibility: it should be used with caution and only for very clear use cases. A prominent example is provided by numpy: by implementing the __add__ method on ndarray objects, the numpy developers provide a new functionality which is hidden from the user but intuitive at the same time, avoiding bad surprises.
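To see why our ArrayList deserves its "don't do this at home" warning, here is a short sketch of some of the surprises it has in store. The class is repeated so that the example is self-contained:

```python
class ArrayList(list):
    def __add__(self, other):
        return [a + b for a, b in zip(self, other)]


a = ArrayList([1, 2, 3])

# 1. The result is a plain list, not an ArrayList: a second addition
#    concatenates instead of adding element-wise
res = a + [10, 20, 30]
print(type(res).__name__, res + [1, 1, 1])  # list [11, 22, 33, 1, 1, 1]

# 2. The order of the operands matters: with a plain list on the left,
#    the usual list concatenation is used
print([10, 20, 30] + a)  # [10, 20, 30, 1, 2, 3]

# 3. zip() stops at the shortest input: the extra elements are
#    silently dropped instead of raising an error
print(a + [10])  # [11]
```

A carefully designed class would have to handle all these cases consistently, which is what numpy does for ndarray objects.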

Inheritance

TODO: describe examples from OGGM mass-balance model, duck-typing.

Take home points

TODO

What's next?