Advanced xarray#

Dimension vs non-dimension coordinates#

link

Map projections and dimension / non-dimension coordinates#

Specific to atmospheric science: be careful with your map projections! You have to make a choice: is it more useful for you to work with lat/lon, or with eastings/northings (constant grid spacing in kilometers)? Because of the map projection, if your dimension coordinates are specified as x/y and you select a box of a certain lat/lon extent, you may end up with some NaN padding on the edges of your domain. Note that you can't select a “box” with sel() if (lon, lat) are 2D coordinates. In this case you can use xarray's where with drop=True.
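The where-based selection mentioned above can be sketched as follows. This is a minimal illustration on a made-up, slightly skewed grid; the variable name t2m, the coordinate values, and the box limits are all assumptions chosen for the example, not part of the lecture data.

```python
import numpy as np
import xarray as xr

# Synthetic projected grid: x/y are the dimension coordinates (km),
# lon/lat are 2D non-dimension coordinates (made-up, skewed values).
x = np.arange(5) * 10.0
y = np.arange(4) * 10.0
xx, yy = np.meshgrid(x, y)
lon = 10.0 + 0.1 * xx + 0.02 * yy
lat = 45.0 + 0.1 * yy - 0.02 * xx

rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("y", "x"), rng.random((4, 5)))},  # hypothetical variable
    coords={
        "x": x,
        "y": y,
        "lon": (("y", "x"), lon),
        "lat": (("y", "x"), lat),
    },
)

# lon/lat are 2D, so we build a boolean mask and use where() instead of sel().
in_box = (ds.lon >= 11.0) & (ds.lon <= 13.5) & (ds.lat >= 45.5) & (ds.lat <= 47.5)

# drop=True trims rows/columns that lie entirely outside the box;
# cells inside the remaining rectangle but outside the box become NaN.
subset = ds.where(in_box, drop=True)
print(subset.sizes)
print(int(subset.t2m.isnull().sum()), "NaN-padded cell(s)")
```

Because the lat/lon box is not aligned with the x/y grid, the trimmed result is still a rectangle in x/y, and the corner that falls outside the box is filled with NaN. This is the padding described above.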

For dealing with projections, check out cartopy (plotting) and xESMF (a regridder for geospatial data that can also deal with non-dimension coordinates on an (x, y) grid): https://xesmf.readthedocs.io

netcdf to memory#

  • old-school way to check out the content of netCDF files without loading them: ncdump (ncdump -h filename.nc, where -h prints only the header, i.e. the metadata without the data values)

  • CF conventions: conventions of metadata (the form they need to have)

  • Reading is a costly operation! Thus, when we first open a netCDF file, xarray parses only the necessary information (coords, dims), but the actual data is not loaded - xarray only reports how many values the dataset contains. This is “lazy loading/evaluation”: the values are read only at the moment when they are needed.

writing to netcdf and encoding#

Compression:

  • Files can be kept smaller (by a factor of 2) by storing values as “short” integers, i.e. 16 bits, instead of 32-bit floats (not to be confused with 16-bit floats, which are referred to as “half-precision”). Two more attributes are then stored, a scale factor and an offset, which are used to ‘unpack’ the data whilst loading. This is lossy compression - precision is lost because of how the data is stored. The scale factor and offset are chosen so that this loss is minimal while still spanning the full range of the data.

  • lossless compression: zip/gzip - these exploit redundancy in the stored data. NetCDF is a “naive” data format - it doesn’t take redundancy into account - each value is stored individually, no matter how many times it is repeated, whether all values are NaNs or unique floats. Lossless compression, on the other hand, exploits these non-unique values: the more repeated values there are, the more compression we can get.

    • Advantages: less space on the disk

    • Disadvantages: we can’t read just parts of the data (or do it way slower)

  • lossy + lossless compression can be combined.
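The packing arithmetic described above can be sketched with plain numpy. This is an illustration under assumptions: the data is made up, and the formulas for the scale factor and offset are one common choice (map the data range onto the int16 range, keeping one value free as a fill value), not necessarily what any particular tool uses.

```python
import numpy as np

# Made-up float32 field, e.g. temperatures in kelvin.
rng = np.random.default_rng(0)
data = rng.uniform(250.0, 310.0, size=1000).astype(np.float32)
data64 = data.astype(np.float64)  # do the arithmetic in double precision

# Map [min, max] onto [-32767, 32767]; -32768 stays free as a fill value.
n_levels = 2**16 - 2
scale_factor = (data64.max() - data64.min()) / n_levels
add_offset = (data64.max() + data64.min()) / 2

# Pack: float -> int16 (this is the lossy step).
packed = np.round((data64 - add_offset) / scale_factor).astype(np.int16)

# Unpack: what a reader does with the two stored attributes.
unpacked = packed * scale_factor + add_offset

max_error = float(np.abs(unpacked - data64).max())
print("bytes before/after:", data.nbytes, packed.nbytes)
print("max abs error:", max_error, "(about scale_factor / 2 at most)")

# With xarray, the same idea (plus lossless zlib compression) is requested
# through the `encoding` argument. "t2m" and the file name are hypothetical.
encoding = {
    "t2m": {
        "dtype": "int16",
        "scale_factor": float(scale_factor),
        "add_offset": float(add_offset),
        "_FillValue": -32768,
        "zlib": True,
        "complevel": 4,
    }
}
# ds.to_netcdf("packed.nc", encoding=encoding)  # ds: a Dataset containing "t2m"
```

The rounding error is bounded by roughly half the scale factor, which is why choosing the scale factor from the actual data range keeps the loss small.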

Xarray docs on the subject

Dask#

Big buzzword in the big-data community! (Big data = the data you need to load does not fit in your memory).

By default, xarray does not use dask. Among other things which can make our lives easier, dask can run computations in parallel in the background (using multiple threads or processes).

Dask allows us to do operations on data which does not fit in memory by:

  • delaying evaluation to the latest possible moment

  • chunking the data (e.g. ds = ds.chunk({'time':12}) divides the dataset into chunks of 12 time steps each, i.e. individual slices along the time dimension)

When we request an operation on a dask array, dask only records the steps necessary for the calculation (e.g. first add 2, then multiply by 3…). Only once we e.g. plot the data are all the operations actually evaluated.

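  • To check out the pipeline, we can easily visualize the graph: chunked_ds.data.visualize('image.svg') (here chunked_ds must be a chunked DataArray, since only a DataArray exposes the underlying dask array as .data; graphviz needs to be installed). This is useful for understanding what’s going on - what are all the steps of our computation.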

  • if we use xr.open_mfdataset() (i.e. open multi-file dataset), dask is used automatically and the data is chunked by file.
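Chunking and delayed evaluation can be tried out on a small in-memory dataset. The sketch below assumes dask is installed; the variable name t2m and the array sizes are made up for illustration.

```python
import numpy as np
import xarray as xr

# Small made-up dataset with 48 time steps.
ds = xr.Dataset(
    {"t2m": (("time", "x"), np.ones((48, 10)))},  # hypothetical variable
    coords={"time": np.arange(48), "x": np.arange(10)},
)

# Chunks of 12 time steps each -> 4 chunks along "time".
chunked = ds.chunk({"time": 12})
print(chunked.t2m.chunks)  # ((12, 12, 12, 12), (10,))

# Nothing is computed here: dask only records the steps in a task graph.
lazy = (chunked.t2m + 2) * 3
print(type(lazy.data))  # a dask array, not a numpy array

# The graph is evaluated only when the values are needed,
# e.g. on .compute(), .load(), .values or when plotting.
result = lazy.mean("time").compute()
print(float(result[0]))
```

Note that the number passed to chunk() is the chunk size, not the number of chunks, so 48 time steps with a size of 12 give four chunks.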

We will talk about the homework, and this PR and this post

The place to be for big data analytics: https://pangeo.io

The vision for the future of computing with large datasets: the idea of Pangeo is not to download large amounts of data, but to store the data in only one place (the cloud) and provide access to the people who want to use it. Copernicus data: CDS toolbox

Extending xarray#

We will then talk about xarray accessors: https://docs.xarray.dev/en/stable/internals/extending-xarray.html
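As a preview, here is a minimal accessor sketch in the spirit of the example in the xarray documentation. The accessor name demo_geo and the center property are hypothetical names chosen for this illustration.

```python
import numpy as np
import xarray as xr


# The decorator registers the class under ds.demo_geo for every Dataset.
@xr.register_dataset_accessor("demo_geo")  # hypothetical accessor name
class DemoGeoAccessor:
    def __init__(self, xarray_obj):
        # xarray passes in the Dataset the accessor is attached to.
        self._obj = xarray_obj

    @property
    def center(self):
        """Return the (lon, lat) mean of the dataset's coordinates."""
        return float(self._obj.lon.mean()), float(self._obj.lat.mean())


ds = xr.Dataset(
    coords={"lon": np.array([10.0, 12.0]), "lat": np.array([45.0, 47.0])}
)
center = ds.demo_geo.center
print(center)
```

The accessor keeps custom functionality in its own namespace (ds.demo_geo.…) instead of subclassing Dataset, which is the approach the xarray documentation recommends for extensions.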

And this will be the introduction to decorators in python: https://realpython.com/primer-on-python-decorators
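To give a first impression of what a decorator is, here is a small, self-contained sketch. The names timed, add and last_elapsed are made up for this example; the pattern (a function that takes a function and returns a wrapped version of it) is the same one used by register_dataset_accessor for classes.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the last call to func took."""

    @functools.wraps(func)  # keep func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    return wrapper


@timed  # equivalent to: add = timed(add)
def add(a, b):
    return a + b


total = add(2, 3)
print(total, f"computed in {add.last_elapsed:.6f} s")
```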