Lesson: organise your datasets, select data by country, droughts#

In this lesson, we go back to NetCDF files and gridded data analysis. We will combine xarray and pandas to reach our new goal today: studying droughts.

ERA5 data#

Let’s start by opening (and downloading if necessary) our now familiar ERA5 temperature dataset:

# Import the tools we are going to need today:
import matplotlib.pyplot as plt  # plotting library
import numpy as np  # numerical library
import xarray as xr  # netCDF library
import cartopy  # Map projections library
import cartopy.crs as ccrs  # Projections list
# Some defaults:
plt.rcParams['figure.figsize'] = (12, 5)  # Default plot size
ds = xr.open_dataset('../data/ERA5_LowRes_Monthly_t2m.nc')
ds
<xarray.Dataset> Size: 255MB
Dimensions:    (time: 552, latitude: 241, longitude: 480)
Coordinates:
  * time       (time) datetime64[ns] 4kB 1979-01-01 1979-02-01 ... 2024-12-01
  * latitude   (latitude) float64 2kB 90.0 89.25 88.5 ... -88.5 -89.25 -90.0
  * longitude  (longitude) float64 4kB -179.6 -178.9 -178.1 ... 178.9 179.6
Data variables:
    t2m        (time, latitude, longitude) float32 255MB ...
Attributes:
    Conventions:  CF-1.7
    institution:  European Centre for Medium-Range Weather Forecasts

So far, we have mostly worked with NetCDF datasets containing a single data variable. In practice, climate datasets often include several variables within the same file. This has some advantages, as we’ll see shortly.

Let’s now open two additional datasets: precipitation (which you already know) and PET (potential evapotranspiration), which is new and needs to be downloaded first.

ds_tp = xr.open_dataset('../data/ERA5_LowRes_Monthly_tp.nc')
ds_pet = xr.open_dataset('../data/ERA5_LowRes_Monthly_pet.nc')

Explore the ds_pet dataset.

  • What are the units and long_name of the variable pet_thornthwaite?

  • What does PET stand for?

  • Look up how the Thornthwaite method works and what input data it uses.

# Your answer here

The units of PET are not given correctly: they should read mm/month.
This is a very common issue in climate datasets. Fluxes such as precipitation or PET should be expressed per unit time (e.g. mm/day or mm/month), but metadata often omit the “per time” part.

Our ERA5 precipitation data are in m/day (but described as “m”).

Let’s convert them all to mm/month so PET and TP are comparable:

# Convert and update units of tp
days_in_month = ds_tp['time'].dt.days_in_month
tp = ds_tp['tp'] * days_in_month * 1000
tp.attrs['units'] = 'mm/month'

# Also correct tp's time - for some reason it shows up as 6am
tp['time'] = ds['time']

# Let's also read PET and update units
pet = ds_pet['pet_thornthwaite']
pet.attrs['units'] = 'mm/month'

Organise data in datasets#

Now we have three variables: t2m, pet, and tp. It is possible (and not necessarily wrong) to keep these variables in separate datasets and run your analysis workflows on each of them individually.
However, you will quickly notice that this leads to a lot of repeated code — often changing only the variable name between lines.

To avoid this (and to reduce the risk of copy-paste errors), it is often more convenient to organise all variables into a single dataset.

Let’s do that:

ds['tp'] = tp
ds['pet'] = pet
ds
<xarray.Dataset> Size: 1GB
Dimensions:    (time: 552, latitude: 241, longitude: 480)
Coordinates:
  * time       (time) datetime64[ns] 4kB 1979-01-01 1979-02-01 ... 2024-12-01
  * latitude   (latitude) float64 2kB 90.0 89.25 88.5 ... -88.5 -89.25 -90.0
  * longitude  (longitude) float64 4kB -179.6 -178.9 -178.1 ... 178.9 179.6
Data variables:
    t2m        (time, latitude, longitude) float32 255MB ...
    tp         (time, latitude, longitude) float64 511MB 9.106 9.106 ... 2.01
    pet        (time, latitude, longitude) float32 255MB ...
Attributes:
    Conventions:  CF-1.7
    institution:  European Centre for Medium-Range Weather Forecasts

ds now has three variables. The advantage is that methods that work on variables often work on datasets as well. See for example:

# "as" stands for "annual sums"
as_ds = ds.resample(time='YS').sum().mean(dim='time')

What operation have we just performed? Consider each step one by one to make sure you understand what’s going on. This operation makes sense for tp and pet — but does it also make sense for t2m?
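As a hint, here is a minimal sketch with synthetic monthly data (the names `monthly`, `annual_sum`, etc. are illustrative, not from the lesson) contrasting annual sums with annual means. Summing a flux such as tp or pet over a year is physically meaningful; summing a state variable such as t2m is not:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series: two years of constant values on a time axis like ours
time = pd.date_range('2000-01-01', periods=24, freq='MS')
monthly = xr.DataArray(np.ones(24), coords={'time': time}, dims='time')

# Summing makes sense for fluxes: 12 months of 1 mm/month -> 12 mm/year
annual_sum = monthly.resample(time='YS').sum()

# For a state variable like temperature, the annual *mean* is the meaningful quantity
annual_mean = monthly.resample(time='YS').mean()
```

Twelve "temperatures" of 1 unit would sum to 12, a number without physical meaning, while their mean (1) is still a temperature.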

Now plot the annual sums of tp and pet, as well as the difference tp - pet on a map. Interpret what you see.

# Your answer here

Select data by country#

To select data by country, we’ll use the regionmask package.

If you don’t already have it installed (sorry I forgot to add it to the list this year), you can add it to your environment using either conda or mamba. In a terminal, do:

conda activate qcr

And then (depending on whether you use conda or mamba):

conda install -c conda-forge regionmask

or

mamba install -c conda-forge regionmask

Once installed, you can import it as usual:

import regionmask

Regionmask helps create masks from country outlines (or any other vector boundaries). We’ll stick to countries here:

# Load built-in country polygons
countries = regionmask.defined_regions.natural_earth_v5_0_0.countries_110

# Print all available names
print(countries.names)
['Fiji', 'Tanzania', 'W. Sahara', 'Canada', 'United States of America', 'Kazakhstan', 'Uzbekistan', 'Papua New Guinea', 'Indonesia', 'Argentina', 'Chile', 'Dem. Rep. Congo', 'Somalia', 'Kenya', 'Sudan', 'Chad', 'Haiti', 'Dominican Rep.', 'Russia', 'Bahamas', 'Falkland Is.', 'Norway', 'Greenland', 'Fr. S. Antarctic Lands', 'Timor-Leste', 'South Africa', 'Lesotho', 'Mexico', 'Uruguay', 'Brazil', 'Bolivia', 'Peru', 'Colombia', 'Panama', 'Costa Rica', 'Nicaragua', 'Honduras', 'El Salvador', 'Guatemala', 'Belize', 'Venezuela', 'Guyana', 'Suriname', 'France', 'Ecuador', 'Puerto Rico', 'Jamaica', 'Cuba', 'Zimbabwe', 'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin', 'Niger', 'Nigeria', 'Cameroon', 'Togo', 'Ghana', "Côte d'Ivoire", 'Guinea', 'Guinea-Bissau', 'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Rep.', 'Congo', 'Gabon', 'Eq. Guinea', 'Zambia', 'Malawi', 'Mozambique', 'eSwatini', 'Angola', 'Burundi', 'Israel', 'Lebanon', 'Madagascar', 'Palestine', 'Gambia', 'Tunisia', 'Algeria', 'Jordan', 'United Arab Emirates', 'Qatar', 'Kuwait', 'Iraq', 'Oman', 'Vanuatu', 'Cambodia', 'Thailand', 'Laos', 'Myanmar', 'Vietnam', 'North Korea', 'South Korea', 'Mongolia', 'India', 'Bangladesh', 'Bhutan', 'Nepal', 'Pakistan', 'Afghanistan', 'Tajikistan', 'Kyrgyzstan', 'Turkmenistan', 'Iran', 'Syria', 'Armenia', 'Sweden', 'Belarus', 'Ukraine', 'Poland', 'Austria', 'Hungary', 'Moldova', 'Romania', 'Lithuania', 'Latvia', 'Estonia', 'Germany', 'Bulgaria', 'Greece', 'Turkey', 'Albania', 'Croatia', 'Switzerland', 'Luxembourg', 'Belgium', 'Netherlands', 'Portugal', 'Spain', 'Ireland', 'New Caledonia', 'Solomon Is.', 'New Zealand', 'Australia', 'Sri Lanka', 'China', 'Taiwan', 'Italy', 'Denmark', 'United Kingdom', 'Iceland', 'Azerbaijan', 'Georgia', 'Philippines', 'Malaysia', 'Brunei', 'Slovenia', 'Finland', 'Slovakia', 'Czechia', 'Eritrea', 'Japan', 'Paraguay', 'Yemen', 'Saudi Arabia', 'Antarctica', 'N. Cyprus', 'Cyprus', 'Morocco', 'Egypt', 'Libya', 'Ethiopia', 'Djibouti', 'Somaliland', 'Uganda', 'Rwanda', 'Bosnia and Herz.', 'North Macedonia', 'Serbia', 'Montenegro', 'Kosovo', 'Trinidad and Tobago', 'S. Sudan']

Let’s build a mask for all countries, same shape as our data:

# Create a mask for all countries
mask = countries.mask(ds['longitude'], ds['latitude'])
mask.plot();

And look-up what ID the UK has:

# Look up the UK’s region number
uk_id = countries.map_keys('United Kingdom')
print(uk_id) 
143

With this information, we are now able to select the data we are interested in, which in xarray means replacing all pixels outside of it with NaNs:

ds_sel = ds.where(mask == uk_id)
ds_sel['t2m'].isel(time=0).plot();

It is sometimes better to also remove all the unnecessary NaNs around our selection:

ds_uk = ds.where(mask == uk_id, drop=True)
ds_uk['t2m'].isel(time=0).plot();

And finally, let’s learn how to make a nicer map from this:

ax = plt.axes(projection=ccrs.LambertConformal(central_longitude=-3, central_latitude=50))
ax.set_extent([-8, 3, 49, 60], ccrs.PlateCarree())
ax.coastlines(); ax.gridlines(); ax.add_feature(cartopy.feature.BORDERS);
ds_uk['t2m'].isel(time=0).plot(ax=ax, transform=ccrs.PlateCarree());

You’ll be given more inputs on maps with a “recipe” soon.

Droughts and SPEI#

In this final section, we’ll take the data we’ve prepared — temperature, precipitation, and PET — and use it to analyse drought variability. This section is mostly about the code and about explaining the SPEI; we’ll focus on interpretation in the assignment.

Let’s select our time series of temperature, precipitation, and PET close to London:

dsl = ds.sel(latitude=51.5072, longitude=0.1276, method='nearest')

Note that I could also have averaged the data over the entire UK — and if I did so, I should not forget to use weighted averages, right?
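As a reminder of what such a weighted average looks like, here is a minimal sketch on a toy two-by-two grid (the values are made up purely to show the effect of cos(latitude) weighting; this is not the UK data):

```python
import numpy as np
import xarray as xr

# Toy field: one row at 60°N, one at the equator
da = xr.DataArray([[1.0, 1.0], [3.0, 3.0]],
                  coords={'latitude': [60.0, 0.0], 'longitude': [0.0, 90.0]},
                  dims=('latitude', 'longitude'))

# Grid cells shrink towards the poles: weight by cos(latitude)
weights = np.cos(np.deg2rad(da.latitude))
unweighted = da.mean().item()                  # treats all cells equally
weighted = da.weighted(weights).mean().item()  # gives equatorial cells more weight
```

The unweighted mean is 2.0, while the weighted mean (0.5·1 + 1·3) / 1.5 ≈ 2.33 leans towards the larger, equatorial cells.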

Because we now have time series data for a single grid cell, it’s convenient to convert it to a pandas DataFrame.
We don’t need the latitude and longitude columns for a single-point series, so we’ll drop them:

df = dsl.to_dataframe().drop(columns=['latitude', 'longitude'])
df['t2m'] = df['t2m'] - 273.15  # To Celsius
df
t2m tp pet
time
1979-01-01 0.474518 62.971115 1.011671
1979-02-01 1.430878 67.050934 3.770684
1979-03-01 5.016632 93.008041 20.652092
1979-04-01 7.825409 83.656311 38.842426
1979-05-01 10.742035 117.191315 64.859535
... ... ... ...
2024-08-01 18.661682 29.859543 111.808380
2024-09-01 15.060547 98.590851 72.991951
2024-10-01 12.001129 52.623749 48.645039
2024-11-01 7.781097 58.050156 23.933279
2024-12-01 7.230377 48.780441 20.503212

552 rows × 3 columns

Do you remember what t2m, tp and pet represent, and what their units are? If not, go back to the top of this notebook.

Now plot a scatter plot of t2m vs pet, and interpret what you see.

# Your answer here
df.plot.scatter(x='t2m', y='pet');

Water balance#

Now let’s compute the monthly water balance (mm/month), defined as:

\[ WB = P - PET \]

where positive values indicate a water surplus (precipitation exceeds PET), and negative values indicate a deficit.

wb = df['tp'] - df['pet']  # Water balance

wb.plot(label='Monthly water balance');
wb.rolling(window=12).mean().plot(label='12-month average');
plt.ylabel('mm/month'); plt.legend();
plt.title('Monthly water balance from ERA5 near London, UK');
  • Can you already identify periods of drought (sustained negative water balance) in the monthly data?

  • How do these appear when looking at the 12-month moving average (orange line)?

  • Which series (monthly or 12-month) would you trust more to identify long-term droughts, and why?
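If you want to go beyond eyeballing the plot, one possible way to flag sustained deficits programmatically is to label consecutive runs of negative months with pandas. This is only a sketch on a synthetic series; the 3-month minimum duration is an illustrative choice, not part of the lesson:

```python
import pandas as pd

# Synthetic monthly water balance: a six-month deficit embedded in surplus
time = pd.date_range('2000-01-01', periods=12, freq='MS')
wb = pd.Series([10, 20, -5, -10, -15, -20, -5, -2, 15, 25, 30, 10],
               index=time, dtype=float)

# Give each consecutive run of same-sign months its own id
negative = wb < 0
run_id = (negative != negative.shift()).cumsum()

# Length of the run each month belongs to (0 for positive runs)
run_lengths = negative.groupby(run_id).transform('sum')

# Keep only negative runs lasting at least 3 months
sustained = negative & (run_lengths >= 3)
drought_months = wb.index[sustained].strftime('%Y-%m').tolist()
```

On this toy series, `drought_months` picks out March to August 2000, the embedded six-month deficit.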

Introducing the Standardised Precipitation–Evapotranspiration Index (SPEI)#

So far, we’ve looked at the monthly water balance (P − PET) — the difference between water supply and loss by evaporation. This gives us a first impression of when and where water deficits (droughts) or surpluses (wet periods) occur.

However, the raw water balance has some limitations:

  • Its mean and variance depend on the local climate — a value of −50 mm/month may mean something very different in Spain than in Scotland.

  • It’s not easy to compare anomalies across regions or between seasons.

  • We don’t have a clear threshold for when conditions become “dry enough” to be considered a drought.

To overcome these issues, scientists developed the Standardised Precipitation–Evapotranspiration Index (SPEI, Vicente-Serrano et al., 2010).

What SPEI does#

SPEI transforms the raw water balance into a dimensionless index that tells us how unusual a given month’s conditions are compared to the long-term record at that location.

It does this in three main steps:

  1. Accumulate the water balance over a given time window (e.g. 1, 3, 6, or 12 months).

    • Short timescales capture short, intense dry spells.

    • Long timescales capture multi-season or hydrological droughts.

  2. Fit a probability distribution to the long-term series of accumulated water balance values
    (typically a log-logistic or Pearson type III distribution).

  3. Standardise the values (transform them into z-scores), so that:

    • 0 → average conditions

    • −1 → moderate drought (about the 15th percentile)

    • −2 → severe drought

    • +1 → unusually wet period

    • etc

The result is comparable across regions and climates — an SPEI of −2 means an equally rare dry event anywhere on Earth.
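Before we hand this over to a ready-made package, the three steps above can be sketched by hand. This simplified illustration fits a normal distribution rather than the log-logistic used by SPEI proper, and all names (`wb`, `acc`, `index`) refer to synthetic data, not our London series:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic monthly water balance series (mm/month)
rng = np.random.default_rng(42)
time = pd.date_range('1980-01-01', periods=480, freq='MS')
wb = pd.Series(rng.normal(10, 30, size=480), index=time)

# Step 1: accumulate over a 3-month window
acc = wb.rolling(window=3).sum().dropna()

# Step 2: fit a distribution to the accumulated values
# (a normal fit keeps the sketch simple; SPEI uses a log-logistic fit)
mu, sigma = stats.norm.fit(acc)

# Step 3: standardise via the fitted CDF and the standard-normal quantile
index = pd.Series(stats.norm.ppf(stats.norm.cdf(acc, mu, sigma)), index=acc.index)
```

By construction, the resulting index is approximately zero-mean with unit variance, so values near −2 are rare deficits regardless of the local mean and spread of the raw water balance.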

SPEI is now one of the most widely used indices for analysing and monitoring climatic droughts,
because it captures both precipitation deficits and the increasing evaporative demand due to warming.

Computing SPEI in Python#

We’ll use the open-source spei Python package, which implements the methods described above.

We can simply feed our water balance time series into the function and compute SPEI for different accumulation timescales (e.g. 3, 12, 24 months) — just as it is done in published drought datasets such as SPEIbase.

Let’s try it out!

Here again we need to install the package (sorry, I forgot to add it to the list this year). This time, you add it to your environment using a slightly different method than usual. In a terminal, do:

conda activate qcr

And then (regardless of whether you use conda or mamba):

pip install spei

Once installed, you can import it as usual:

import spei

The spei package allows us to compute the SPEI from the water balance at our chosen timescales:

spei3 = spei.spei(wb, timescale=3)
spei12 = spei.spei(wb, timescale=12)
spei24 = spei.spei(wb, timescale=24)

And it also comes with plotting functions:

f, ax = plt.subplots(3, 1, figsize=(12, 8), sharex=True)

# choose a colormap to your liking:
spei.plot.si(spei3, ax=ax[0], cmap="vik_r")
spei.plot.si(spei12, ax=ax[1], cmap="roma")
spei.plot.si(spei24, ax=ax[2], cmap="seismic_r")
# ax[0].set_xlim(pd.to_datetime(["1994", "1998"]))
[x.grid() for x in ax]
[ax[i].set_ylabel(n, fontsize=14) for i, n in enumerate(["SPEI3", "SPEI12", "SPEI24"])];
plt.suptitle('SPEI from ERA5 near London');

Quite pretty! We will revisit these time series and what they mean in the weekly assignment.

What’s next?#

You are now ready for this week’s assignment!