Lesson: working with netCDF data#

Last week you learned how to use the basic features of the python language with the numpy and matplotlib libraries. The purpose of this lesson is to introduce you to the main tool that you will use in the semester: xarray.

This is a dense lesson. Please do it entirely and try to remember its structure and content. This code will provide a template for your own code, and you can always come back to these examples when you’ll need them. I don’t expect you to understand all details, but I hope that you are going to get acquainted with the “xarray way” of manipulating multi-dimensional data. You will have to copy and adapt parts of the code below to complete the exercises.

Remember that I will never ask you to use tools you didn’t use in a lesson before! If this happens it was a mistake from my side, sorry!

NetCDF Files#

In order to open and plot NetCDF files, you’ll need to install the xarray and netcdf4 packages: if you haven’t done so already, follow the installation instructions!

Imports and options#

First, let’s import the tools we need. Remember why we need to import our tools? If not, ask Fabien!

# Import the tools we are going to need today:
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library
import xarray as xr  # netCDF library
import cartopy  # Map projections libary
import cartopy.crs as ccrs  # Projections list
# Some defaults:
plt.rcParams['figure.figsize'] = (12, 5)  # Default plot size

Get the data#

The data we are going to use today is from the CERES (Clouds and the Earth’s Radiant Energy System) mission. We are going to use the EBAF-TOA and the EBAF-Surface data products (both freely available on this webpage) as climatologies (i.e. monthly averages 2005-2015).

The data quality summary of these data (PDF) can be found here, and more accessible publications can be found here for TOA and here for Surface.

You can download the files here for TOA, here for Surface.

Read the data#

Most of today’s meteorological data is stored in the NetCDF format (*.nc). NetCDF files are binary files, which means that you can’t just open them in a text editor. You need a special reader for it. Nearly all the programming languages offer an interface to NetCDF. For this course we are going to use the xarray library to read the data:

# Here I downloaded the file in the "data" folder which I placed in a folder close to this notebook
# The variable name "ds" stands for "dataset"
ds = xr.open_dataset(r'../data/CERES_EBAF-TOA_Ed4.1_Clim-2005-2015.nc')

You’ll have to give an absolute or relative path to the file for this to work. For example r'C:\PATH\TO\FILE\CERES_EBAF-TOA_Ed2.8_Avg-2001-2014.nc' on windows.

Windows users: don’t forget to add the r before the path, which allows to use backlashes in the string.

Note: you can also open files without downloading them via an url. This is somehow inefficient (it will download all data in memory each time you run the notebook), but might be useful e.g. on MyBinder where you can’t download things locally. See this instructions if you are interested in doing that.

If you are getting an “HDF Error” related to NetCDF4 when reading the file, install the h5netcdf library (mamba install h5netcdf), restart your kernel, and open the file with:

ds = xr.open_dataset(fpath, engine='h5netcdf')
# See what we have
ds
<xarray.Dataset> Size: 16MB
Dimensions:            (lon: 360, lat: 180, month: 12)
Coordinates:
  * lon                (lon) float32 1kB 0.5 1.5 2.5 3.5 ... 357.5 358.5 359.5
  * lat                (lat) float32 720B -89.5 -88.5 -87.5 ... 87.5 88.5 89.5
  * month              (month) int32 48B 1 2 3 4 5 6 7 8 9 10 11 12
Data variables:
    toa_sw_all_clim    (month, lat, lon) float32 3MB ...
    toa_lw_all_clim    (month, lat, lon) float32 3MB ...
    toa_sw_clr_c_clim  (month, lat, lon) float32 3MB ...
    toa_lw_clr_c_clim  (month, lat, lon) float32 3MB ...
    solar_clim         (month, lat, lon) float32 3MB ...
Attributes:
    title:        CERES EBAF (Energy Balanced and Filled) TOA Fluxes. Monthly...
    institution:  NASA/LaRC (Langley Research Center) Hampton, Va
    Conventions:  CF-1.4
    comment:      Climatology from 07/2005 to 06/2015
    version:      Edition 4.1; Release Date May 28, 2019
    DOI:          10.5067/TERRA-AQUA/CERES/EBAF-TOA_L3B004.1

The NetCDF dataset is constituted of various elements: Dimensions, Coordinates, Variables, Attributes:

  • the dimensions specify the number of elements of each data coordinate, their names should be understandable and specific

  • the attributes provide some information about the file (metadata)

  • the variables contain the actual data. In our file there are five variables. All have the dimensions [month, lat, lon], so we can expect an array of size [12, 180, 360]

  • the coordinates locate the data in space and time

Note: you can disable the “fancy output” and replace it by old-fashioned text with the setting: xr.set_options(display_style="text")

Coordinates#

Let’s have a look at the month coordinate first:

ds.month
<xarray.DataArray 'month' (month: 12)> Size: 48B
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12], dtype=int32)
Coordinates:
  * month    (month) int32 48B 1 2 3 4 5 6 7 8 9 10 11 12
Attributes:
    long_name:      Climatological Monthly Means Based on 07/2005 to 06/2015
    standard_name:  climatology time
    climatology:    climatology_bounds
    units:          months of a climatology year
    delta_t:        0000-01-00 00:00:00

The array contains numbers from 1 to 12, they represent the months of the year. From the attribute “title”, we know that these represent the average for each month for the period 07/2005 - 06/2015.

The location coordinates are also self-explaining:

ds.lon
<xarray.DataArray 'lon' (lon: 360)> Size: 1kB
array([  0.5,   1.5,   2.5, ..., 357.5, 358.5, 359.5], dtype=float32)
Coordinates:
  * lon      (lon) float32 1kB 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
Attributes:
    long_name:      Longitude
    standard_name:  longitude
    units:          degrees_east
    valid_range:    [-180.  360.]
ds.lat
<xarray.DataArray 'lat' (lat: 180)> Size: 720B
array([-89.5, -88.5, -87.5, -86.5, -85.5, -84.5, -83.5, -82.5, -81.5, -80.5,
       -79.5, -78.5, -77.5, -76.5, -75.5, -74.5, -73.5, -72.5, -71.5, -70.5,
       -69.5, -68.5, -67.5, -66.5, -65.5, -64.5, -63.5, -62.5, -61.5, -60.5,
       -59.5, -58.5, -57.5, -56.5, -55.5, -54.5, -53.5, -52.5, -51.5, -50.5,
       -49.5, -48.5, -47.5, -46.5, -45.5, -44.5, -43.5, -42.5, -41.5, -40.5,
       -39.5, -38.5, -37.5, -36.5, -35.5, -34.5, -33.5, -32.5, -31.5, -30.5,
       -29.5, -28.5, -27.5, -26.5, -25.5, -24.5, -23.5, -22.5, -21.5, -20.5,
       -19.5, -18.5, -17.5, -16.5, -15.5, -14.5, -13.5, -12.5, -11.5, -10.5,
        -9.5,  -8.5,  -7.5,  -6.5,  -5.5,  -4.5,  -3.5,  -2.5,  -1.5,  -0.5,
         0.5,   1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,   9.5,
        10.5,  11.5,  12.5,  13.5,  14.5,  15.5,  16.5,  17.5,  18.5,  19.5,
        20.5,  21.5,  22.5,  23.5,  24.5,  25.5,  26.5,  27.5,  28.5,  29.5,
        30.5,  31.5,  32.5,  33.5,  34.5,  35.5,  36.5,  37.5,  38.5,  39.5,
        40.5,  41.5,  42.5,  43.5,  44.5,  45.5,  46.5,  47.5,  48.5,  49.5,
        50.5,  51.5,  52.5,  53.5,  54.5,  55.5,  56.5,  57.5,  58.5,  59.5,
        60.5,  61.5,  62.5,  63.5,  64.5,  65.5,  66.5,  67.5,  68.5,  69.5,
        70.5,  71.5,  72.5,  73.5,  74.5,  75.5,  76.5,  77.5,  78.5,  79.5,
        80.5,  81.5,  82.5,  83.5,  84.5,  85.5,  86.5,  87.5,  88.5,  89.5],
      dtype=float32)
Coordinates:
  * lat      (lat) float32 720B -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
Attributes:
    long_name:      Latitude
    standard_name:  latitude
    units:          degrees_north
    valid_range:    [-90.  90.]

Q: what is the spatial resolution of CERES data?

# your answer here

Variables#

Variables can also be accessed directly from the dataset:

ds.toa_sw_all_clim
<xarray.DataArray 'toa_sw_all_clim' (month: 12, lat: 180, lon: 360)> Size: 3MB
[777600 values with dtype=float32]
Coordinates:
  * lon      (lon) float32 1kB 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * lat      (lat) float32 720B -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * month    (month) int32 48B 1 2 3 4 5 6 7 8 9 10 11 12
Attributes:
    long_name:      Top of The Atmosphere Shortwave Flux, All-Sky conditions,...
    standard_name:  TOA Shortwave Flux - All-Sky
    CF_name:        toa_outgoing_shortwave_flux
    comment:        none
    units:          W m-2
    valid_min:            0.00000
    valid_max:            600.000

The attributes of a variable are extremely important, they carry the metadata and must be specified by the data provider. Here we can read in which units the variable is defined, as well as a description of the variable (the “long_name” attribute).

Q: what other information can we read from this printout? Explore the other data variables and see if you understand all of them. Note: you can expand each variable’s attributes in the html display, or use the method ds.info() to list all vars and attributes.

# your answer here

Simple analyses#

Analysing climate data is extremely easy in Python thanks to the xarray and cartopy libraries. First we are going to compute the time average of the TOA Shortwave Flux over the year:

sw_avg = ds.toa_sw_all_clim.mean(dim='month')

What did we just do? From the netcdf dataset, we took the toa_sw_all_clim variable (ds.toa_sw_all_clim) and we applied the function .mean() to it. So an equivalent formulation could be:

# Equivalent code:
sw = ds.toa_sw_all_clim
sw_avg = sw.mean(dim='month')

What is sw_avg by the way?

sw_avg
<xarray.DataArray 'toa_sw_all_clim' (lat: 180, lon: 360)> Size: 259kB
array([[122.93867, 122.93867, 122.93867, ..., 122.93867, 122.93867,
        122.93867],
       [120.55958, 120.55958, 120.55958, ..., 120.55958, 120.55958,
        120.55958],
       [119.85125, 119.85125, 119.85125, ..., 119.85125, 119.85125,
        119.85125],
       ...,
       [103.60708, 103.60708, 103.60708, ..., 103.60708, 103.60708,
        103.60708],
       [103.67892, 103.67892, 103.67892, ..., 103.67892, 103.67892,
        103.67892],
       [103.71167, 103.71167, 103.71167, ..., 103.71167, 103.71167,
        103.71167]], dtype=float32)
Coordinates:
  * lon      (lon) float32 1kB 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * lat      (lat) float32 720B -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5

So sw_avg is a 2-dimensional array of dimensions [lat, lon] (note that the month dimension has disapeared).

When we applied the mean() function, we added an argument (called a keyword argument): dim='month'. With this argument, we told the function to compute the average over the month dimension.

Let’s remove this keyword and compute the mean again:

sw.mean()
<xarray.DataArray 'toa_sw_all_clim' ()> Size: 4B
array(102.22539, dtype=float32)

Ha! We now have an array without dimensions: a single element array, also called a scalar. This is the total average over all the dimensions. We’ll come back to this later…

Note: scalar output is quite verbose in xarray… Your can print simpler scalars on screen with the .item() method:

sw.mean().item()
102.22538757324219

Q: what should we expect from the folowing commands:

sw.mean(dim='lon')
sw.mean(dim='month').mean(dim='lon')
sw.mean(dim=['month', 'lon'])

Try them out!

# Try the commands above. Do they work as expected? 

E: what is the maximum shortwave radiation value radiated back to space? And the minimum? (hint)

# your answer here

A first plot#

2d data#

We are now going to plot the average Top of The Atmosphere Shortwave Flux on a map:

# Define the map projection (how does the map look like)
ax = plt.axes(projection=ccrs.EqualEarth())
# ax is an empty plot. We now plot the variable sw_avg onto ax
sw_avg.plot(ax=ax, transform=ccrs.PlateCarree()) 
# the keyword "transform" tells the function in which projection the data is stored 
ax.coastlines(); ax.gridlines(); # Add gridlines and coastlines to the plot
../_images/14d3c1a734bf0901545f5ef2971e8b7f3134642508e740ddc6fbfebb70d814de.png

We are looking at the average TOA outgoing shorwage flux, expressed in W m\(^{-2}\). Such time averages are often writen with a bar on top of them:

\(\overline{SW_{TOA}} = temporal\_mean(SW_{TOA})\)

Q: look at the basic features of the plot. Can you explain most of the patterns that you observe? Where are the highest values? The lowest ones?

1d data#

It is equally easy to plot 1d data. In this case, we are going to compute the zonal average of sw_avg. “Zonal average” means “along a latitude circle”. It is often writen with [] or <> in formulas:

\(\left[ \overline{SW_{TOA}} \right] = zonal\_mean(temporal\_mean(SW_{TOA}))\)

Note that the two operators are commutative, i.e.:

\(\left[ \overline{SW_{TOA}} \right] = \overline{\left[ SW_{TOA} \right]}\)

Let’s compute it and plot it right away:

sw_avg.mean(dim='lon').plot();
../_images/382df3a1d619d4b1c495b1559cfe25afac3a203f617232c0bea401186a431671.png

Q: look at the basic features of the plot. Can you recognize the important features from the map above?

More data manipulation with xarray#

As you have probably noted already, xarray’s objects (called Dataset for the whole netCDF file or DataArray for single variables) are quite powerful, and can do more than much arrays know from other languages. Last week we talked about the differences between python’s lists and numpy’s arrays. Today we introduced this new object (DataArray) which is one level higher in usability.

But don’t worry if this sounds confusing at first! From now on we are going to use DataArrays only. The best thing about them is that they carry their dimension names and coordinates with them. This is the reason why it was so easy to make a plot with the right axis labels in just one command. They have very useful other properties, and we will learn these step by step.

One of the first nice properties is that they behave just like regular arrays. That is, you can do arithmetic with them. Our first task will be to compute the net energy balance at the top of the atmosphere:

\[\overline{EB_{TOA}} = \overline{SW_{In}} - \overline{SW_{TOA}} - \overline{LW_{TOA}} \approx 0\]

Arithmetics and averages on a sphere#

# Note that there are many different ways to get to the same result. For the sake of clarity we use the simple way:
eb_avg = ds.solar_clim.mean(dim='month') - ds.toa_sw_all_clim.mean(dim='month') - ds.toa_lw_all_clim.mean(dim='month')

E: plot eb_avg on a map. Why did xarray use another colormap? Describe the basic features of the plot. Where is the climate system gaining energy? Losing energy?

# your answer here

We said that the energy balance should be close to zero (balanced). Fortunately, it is easy to check:

eb_avg.mean().item()
-28.665794372558594

But, wait? This is quite far from zero!!! What’s going on here?

Well, it’s simpler than it seems. This is an anoying problem with our planet: it happens to be a sphere. (Or something close to a sphere).

So when we average without taking this into account, we get wrong results. How wrong is it? A regular plot of the data will help us to see what happens here:

eb_avg.plot();
../_images/d8e3502fc76d4dd043af178bd0ef296d76bc9868d8c8cad3392bfed99e1f9bb6.png

Which has to be compared to a sphere. When averaging [lon, lat] data, one gives too much weight to high latitudes.

Fortunately, this can be solved by noting that we have to weight each latitudinal band by the cosinus of the latitude, i.e. \(\cos \varphi\). We are going to compute a new average, but weighted this time. First, let’s make a weight array:

weight = np.cos(np.deg2rad(ds.lat))
weight = weight / weight.sum()
weight.plot();
../_images/f509395639745c3daeade522ca2dd37f00e97ec379e613ea58944f38197d5941.png

Q: can you follow each step? If not, redo each step one by one, and use the ? to get help about each of these functions!

# your answer here

Weight is an array of 180 elements, which is normalised so that it’s sum is 1. This is exactly what we need to compute a weighted average! First, we have to average over the longitudes (this is fine, because along a latitude circle all points have the same weight), and then compute the weighted average.

zonal_eb_avg = eb_avg.mean(dim='lon')  # important! Always average over longitudes first
# this averaging is needed so that the arithmetic below makes sense 
# (multiply two arrays of 180 elements together)
weighted_eb_avg = np.sum(zonal_eb_avg * weight)
weighted_eb_avg.item()
0.9690555334091187

Aaaah, this looks much better now. Not exactly zero, but much closer.

Note: the remaining value (called the residual) is a combination of measurement errors, geometrical approximations (the Earth is not a perfect sphere, see e.g. this post for a more correct implementation which yields ~1.07 instead of 0.97), and anthropogenic energy imbalance (refs 1, 2).

Data selection and multiline plots#

We have seen that DataArrays can be averaged along one dimension as follows:

eb_lon_avg = ds.solar_clim.mean(dim='lon')
eb_lon_avg
<xarray.DataArray 'solar_clim' (month: 12, lat: 180)> Size: 9kB
array([[495.7     , 495.60004 , 495.2822  , ...,   0.      ,   0.      ,
          0.      ],
       [306.      , 305.89996 , 305.69998 , ...,   0.      ,   0.      ,
          0.      ],
       [ 59.189995,  60.143776,  62.062664, ...,  22.60511 ,  20.660666,
         19.69    ],
       ...,
       [212.1     , 212.      , 211.9     , ...,   0.      ,   0.      ,
          0.      ],
       [440.7     , 440.5     , 440.28442 , ...,   0.      ,   0.      ,
          0.      ],
       [549.10004 , 549.      , 548.6267  , ...,   0.      ,   0.      ,
          0.      ]], dtype=float32)
Coordinates:
  * lat      (lat) float32 720B -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * month    (month) int32 48B 1 2 3 4 5 6 7 8 9 10 11 12

The resulting array has dimensions (month, lat). A common thing we like to do is for example select certain months, this is an easy task with xarray and the method .sel():

avg_jan = eb_lon_avg.sel(month=1)

E: plot avg_jan to make sure that it is indeed what you think it is.

# your answer here

With the help of a few commands, it is not a big deal to make a nice looking plot:

eb_lon_avg.sel(month=1).plot(label='January')
eb_lon_avg.sel(month=7).plot(label='July')
eb_lon_avg.mean(dim='month').plot(label='Annual Avg', linewidth=3)
plt.xlim(-90, 90)
plt.title('Incoming solar radiation - zonal average')
plt.legend(loc='best')
plt.ylabel('W m$^{-2}$');
../_images/3db007c4b90c726c8f749efb93bd3b66567a9e1b2613d9fcf5d54d37dac29df1.png