Advanced xarray#
Dimension vs non-dimension coordinates#
Map projections and dimension / non-dimension coordinates#
Specific to atmospheric science: be careful about your map projections! You have to make a choice: is it more useful for you to work with lat/lon, or with eastings/northings (constant grid spacing in kilometers)?
Because of the map projection, selecting a box of a given lat/lon extent can leave some NaN padding at the edges of your domain if your dimension coordinates are x/y. Note that you can't select a “box” directly if (lon, lat) are 2D coordinates. In this case you can use xarray's where with drop=True.
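The box selection described above can be sketched as follows. This is a minimal illustration, not a recipe for real data: the grid, the variable name temperature and the formula used for lon/lat are all made up to produce a curved grid and do not correspond to a real map projection.

```python
import numpy as np
import xarray as xr

# Hypothetical projected grid: x/y (in km) are the dimension coordinates,
# lon/lat are 2D non-dimension coordinates. The lon/lat formula below is
# invented for illustration only; it is not a real projection.
x = np.arange(0.0, 500.0, 50.0)
y = np.arange(0.0, 400.0, 50.0)
xx, yy = np.meshgrid(x, y)
lon = 5.0 + xx / 70.0 + yy / 400.0
lat = 45.0 + yy / 110.0 - xx / 900.0

da = xr.DataArray(
    np.random.default_rng(0).random((y.size, x.size)),
    dims=("y", "x"),
    coords={"x": x, "y": y, "lon": (("y", "x"), lon), "lat": (("y", "x"), lat)},
    name="temperature",
)

# Boolean mask of the grid points that fall inside the lat/lon box.
in_box = (da.lon >= 7.0) & (da.lon <= 10.0) & (da.lat >= 46.0) & (da.lat <= 48.0)

# where() masks everything outside the box with NaN; drop=True then trims
# the rows/columns along x and y that contain no selected point at all.
box = da.where(in_box, drop=True)

print(da.shape, "->", box.shape)
print("NaN-padded points inside the trimmed domain:", int(box.isnull().sum()))
```

Because the box is rectangular in lon/lat but not in x/y, the trimmed result still contains NaN values in its corners, which is the padding mentioned above.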
For dealing with projections, check out cartopy (plotting) and xESMF (a regridder for geospatial data that can also deal with (x, y) non-dimension coordinates): https://xesmf.readthedocs.io
netcdf to memory#
Old-school way to check out the content of netcdf files without loading them: ncdump (ncdump -h filename.nc, where -h stands for header: only the header information, i.e. dimensions, variables and attributes, is printed, not the data)
CF conventions: conventions for metadata (the form it needs to have)
Reading is a costly operation! Thus, when we first open a netcdf file, xarray parses only the necessary information (coords, dims), but the actual data is not loaded: it only reports how many values the dataset contains (“lazy loading/evaluation”: the values are read only at the moment they are needed).
writing to netcdf and encoding#
Compression:
Files can be kept smaller (by a factor of 2) by storing the data as “short” integers, i.e. 16 bits, instead of 32-bit floats (not to be confused with 16-bit floats, which are referred to as “half-precision”). For this, 2 more attributes are stored, a scale factor and an offset, which are used to ‘unpack’ the data while loading. This is lossy compression: precision is lost because of how the data is stored. The scale factor and offset are chosen so that this loss is minimal while still spanning the full range of the data.
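To make the packing concrete, here is a small sketch in plain numpy. The data and the choice to reserve -32768 as a fill value are assumptions made for illustration. In xarray, the same idea is normally expressed through the per-variable encoding passed to to_netcdf (keys such as dtype, scale_factor, add_offset and _FillValue) rather than written by hand like this.

```python
import numpy as np

# Hypothetical float32 field, e.g. temperatures in Kelvin.
data = np.linspace(250.0, 310.0, 1000, dtype=np.float32)

# Map [min, max] onto the int16 range -32767..32767, keeping -32768 free
# so that it could serve as a fill value (an assumption of this sketch).
vmin, vmax = float(data.min()), float(data.max())
scale_factor = (vmax - vmin) / (2**16 - 2)
add_offset = (vmax + vmin) / 2

# Pack: shift, scale, round to the nearest integer, store as 16-bit ints.
packed = np.round((data.astype(np.float64) - add_offset) / scale_factor).astype(np.int16)

# Unpack: this is what happens again when the data is loaded.
unpacked = packed * scale_factor + add_offset

max_error = float(np.abs(unpacked - data).max())
print("bytes:", data.nbytes, "->", packed.nbytes)
print("scale_factor:", scale_factor, "max abs error:", max_error)
```

The rounding step is where precision is lost: the unpacked values differ from the originals by at most half a scale factor.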
Lossless compression: zip/gzip. These exploit redundancy in the stored data. Netcdf is a “naive” data format: it doesn’t take redundancy into account. Each value is stored individually, no matter how many times it is repeated, whether all values are NaNs or unique floats. Lossless compression, on the other hand, exploits these non-unique values: the more repeated values there are, the more compression we can get.
Advantages: less space on the disk
Disadvantages: we can’t read just parts of the data (or only much more slowly)
lossy + lossless compression can be combined.
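The effect of redundancy on lossless compression can be seen with Python's built-in zlib module (the algorithm behind gzip). The two arrays below are invented for illustration: one holds random 32-bit floats, the other a single repeated value, as in a fully masked field.

```python
import random
import struct
import zlib

n = 10_000
random.seed(0)
unique_values = [random.random() for _ in range(n)]  # hardly any redundancy
repeated_values = [0.0] * n                          # fully redundant


def as_float32_bytes(values):
    """Store every value individually as a 32-bit float (the 'naive' layout)."""
    return struct.pack(f"{len(values)}f", *values)


raw_unique = as_float32_bytes(unique_values)
raw_repeated = as_float32_bytes(repeated_values)

size_unique = len(zlib.compress(raw_unique))
size_repeated = len(zlib.compress(raw_repeated))

print("uncompressed:", len(raw_unique), "bytes each")
print("compressed, unique values:  ", size_unique, "bytes")
print("compressed, repeated values:", size_repeated, "bytes")

# Lossless: decompressing gives back exactly the original bytes.
roundtrip_ok = zlib.decompress(zlib.compress(raw_unique)) == raw_unique
```

Both arrays take the same space uncompressed, but only the repeated values shrink substantially.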
Dask#
Big buzzword in the big-data community! (Big data = the data you need to load does not fit in your memory).
By default, xarray does not use dask. Among other things which can make our lives easier, dask runs computations in parallel in the background (using multiple threads or processes).
Dask allows us to do operations on data which does not fit in memory by:
delaying evaluation to the latest possible moment
chunking the data (e.g. ds = ds.chunk({'time': 12}) divides the dataset into chunks of 12 time steps each, i.e. individual slices along the time dimension)
When we request an operation on a dask array, dask only records the steps necessary for the calculation (e.g. first add 2, then multiply by 3…). Only once the values are actually needed, e.g. when we plot the data, are all the operations evaluated.
To check out the pipeline, we can easily visualize the task graph: chunked_ds.data.visualize('image.svg') (here .data must be the dask array underlying a single variable). This is useful for understanding what’s going on, i.e. what all the steps of our computation are.
If we use xr.open_mfdataset() (i.e. open multi-file dataset), dask is used automatically and the data is chunked by file.
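The chunking and delayed evaluation described above can be sketched as follows. The dataset and the variable name t2m are hypothetical, and the example assumes that dask is installed alongside xarray.

```python
import dask.array
import numpy as np
import xarray as xr

# Hypothetical dataset: 48 time steps on a tiny grid, all values equal to 1.
ds = xr.Dataset({"t2m": (("time", "y", "x"), np.ones((48, 4, 5)))})

# Chunks of 12 time steps each, i.e. 4 chunks along the time dimension.
chunked = ds.chunk({"time": 12})
time_chunks = chunked["t2m"].chunks[0]
print("chunk sizes along time:", time_chunks)

# Nothing is computed here: dask only records the steps (add 2, multiply by 3).
lazy = (chunked["t2m"] + 2) * 3
is_lazy = isinstance(lazy.data, dask.array.Array)
print("still a dask array:", is_lazy)

# The computation only runs when the values are requested.
result = float(lazy.mean().compute())
print("mean:", result)
```

Plotting, calling .values or calling .load() would trigger the evaluation in the same way as .compute() does here.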
We will talk about the homework, and this PR and this post
The place to be for big data analytics: https://pangeo.io
The vision for the future of computing with large datasets: the idea of Pangeo is to not download large amounts of data, but to store the data in only one place (the cloud) and provide access to the people who want to use it.
Copernicus data: CDS toolbox
Extending xarray#
We will then talk about xarray accessors: https://docs.xarray.dev/en/stable/internals/extending-xarray.html
And this will be the introduction to decorators in python: https://realpython.com/primer-on-python-decorators
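As a preview of how the two topics connect: xarray accessors are registered with a decorator. The sketch below first defines a plain decorator, then uses xr.register_dataset_accessor to attach a custom namespace to every Dataset. The names shout, greet, GeoAccessor, geo and center are chosen for illustration only, loosely following the example in the xarray documentation linked above.

```python
import functools

import xarray as xr


def shout(func):
    """A plain decorator: takes a function and returns a wrapped version."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()

    return wrapper


@shout  # equivalent to: greet = shout(greet)
def greet(name):
    return f"hello {name}"


@xr.register_dataset_accessor("geo")  # a decorator applied to a class
class GeoAccessor:
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    @property
    def center(self):
        """Mean (lon, lat) of the dataset's coordinates."""
        lon = self._obj["lon"]
        lat = self._obj["lat"]
        return (float(lon.mean()), float(lat.mean()))


ds = xr.Dataset(coords={"lon": [0.0, 10.0, 20.0], "lat": [40.0, 50.0]})
center = ds.geo.center

print(greet("xarray"))
print(center)
```

After registration, every Dataset exposes the new functionality under ds.geo, without subclassing Dataset.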