The scientific data analysis workflow#

In this final (and very short) unit I wrote down some basic principles which might come in handy when analyzing data of any kind, be it for your thesis or a class project. I’ll keep it short, but try to remember some of these tips when you do some “real” programming work in the future!

If you haven’t done so already, visit The scientific python stack for an overview of several scientific python packages that you will find useful in your studies and future career.

A typical workflow#

Almost all data analysis workflows follow the same structure:

  1. Read the data: typically from a file, e.g. a text file, a NetCDF or GeoTIFF file, or some domain-specific format.

  2. Cleanse it: remove missing data, delete useless variables, select a time period, remove outliers…

  3. Process and add value to the data: compute additional diagnostic variables out of the raw data, merge several datasets into one frame (e.g. onto a common map projection or onto the same time coordinates).

  4. Reduce the data: this is a statistical term, meaning that in order to be understood by a human the data must be “simplified”: e.g. temporal averages, smoothing, data classification, histograms…

  5. Plot or output to a table: this is the final step, and its only purpose is to make the data understandable by humans.

  6. Evaluate the final result: Happy? Stop. Not happy? Repeat. (A minimal code sketch of steps 1 to 5 follows this list.)
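
To make these steps concrete, here is a minimal sketch of the workflow with pandas. The file name and the column names (TEMP_C, etc.) are made up for illustration; your own data will of course look different.

```python
# Minimal sketch of the six workflow steps with pandas.
# File name and column names are hypothetical examples.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Read the data
df = pd.read_csv('station_data.csv', index_col=0, parse_dates=True)

# 2. Clean it: keep one year, drop missing values and obvious outliers
df = df.loc['2020'].dropna()
df = df[(df['TEMP_C'] > -60) & (df['TEMP_C'] < 60)]

# 3. Process / add value: compute an additional diagnostic variable
df['TEMP_K'] = df['TEMP_C'] + 273.15

# 4. Reduce: monthly averages are easier to grasp than hourly values
monthly = df.resample('MS').mean()

# 5. Plot
monthly['TEMP_C'].plot(title='Monthly mean temperature (°C)')
plt.show()

# 6. Evaluate the result -- and iterate if you are not happy yet
```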

Some recommendations#

With few exceptions, all your scientific workflows will follow a similar structure (the most notable one being modeling work, which is another story…). Sometimes a step is very short (e.g. when the data is very “clean” or already averaged), but in general each of these steps has to be applied at least once.

Here are some general recommendations:

  • resist the temptation to swap the order of these steps: for example, it can be appealing to reduce the data before processing it. Unless it is strictly necessary (e.g. for data management or storage reasons), don’t do it: if you find out later that you still need the high-resolution data, you’ll regret it.

  • store the output of the individual steps in well-documented intermediate files. Especially if your data is messy, storing it in clean intermediate files will spare you a lot of trouble when you go back to your code after the weekend. This is particularly true for data-intensive workflows, where repeating each step can take hours or days (see the xarray sketch after this list).

  • use conventions for metadata, and use popular data formats to store your data: CSV files are okay but not great (“what’s the unit of the variable WS_200 again?”). NetCDF is much better. All kinds of proprietary data formats (or obscure domain-specific formats) are usually a bad idea.

  • organize your code around each of the steps above (maybe one file per step, or one Jupyter Notebook section) and write formal tests, either with asserts in the code (okay) or with pytest test_ modules (better; see the test sketch after this list).

  • store your code and data in some place safe (GitHub or GitLab are a good idea for code, less so for data), and in a structure which will remind you what was done when you come back to your code six months from now. For example, use a directory structure following each step: 0_RAW_DATA, 1_FILTERED_DATA, 2_PROCESSED, etc.

  • have someone look at your code, or use code that already works: always ask yourself the same question: “Am I the first person facing this problem? Where could I find someone who has done that already?”. If you find yourself implementing a function to compute relative humidity out of water vapor pressure, you probably haven’t searched long enough.

  • write tests: I did say that already, but I’m saying it again: write tests
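
As an illustration of the intermediate-file and metadata recommendations, here is a sketch of how one step’s output could be stored as a self-describing NetCDF file with xarray. The variable name, units, directory and file name are made-up examples, not a prescribed convention.

```python
# Sketch: store an intermediate result as a self-describing NetCDF file.
# Names, units and paths are hypothetical examples.
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2020-01-01', periods=365, freq='D')
ds = xr.Dataset(
    {'wind_speed': ('time', np.random.rand(365) * 10)},
    coords={'time': time},
)

# Metadata answers the "what's the unit of this variable again?" question
ds['wind_speed'].attrs['units'] = 'm s-1'
ds['wind_speed'].attrs['long_name'] = '10 m wind speed'
ds.attrs['history'] = 'Step 2: cleaned and filtered raw station data'

# A directory per workflow step keeps the project self-explanatory
Path('1_FILTERED_DATA').mkdir(exist_ok=True)
ds.to_netcdf('1_FILTERED_DATA/station_filtered.nc')

# The next step (or the next script) simply reads it back:
ds = xr.open_dataset('1_FILTERED_DATA/station_filtered.nc')
```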
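
And since tests are worth repeating: here is a sketch of what a small pytest module (e.g. a file called test_processing.py) could look like. The function being tested is a made-up example; in practice you would import it from your own processing code and run the tests with the pytest command.

```python
# Sketch of a pytest test module. The tested function is a hypothetical
# example -- normally it would live in (and be imported from) your own code.
import numpy as np


def celsius_to_kelvin(temp_c):
    """Convert a temperature (scalar or array) from degrees Celsius to Kelvin."""
    return np.asarray(temp_c) + 273.15


def test_celsius_to_kelvin_scalar():
    # A plain assert is enough for exact scalar cases
    assert celsius_to_kelvin(0) == 273.15


def test_celsius_to_kelvin_array():
    # Use numpy's testing helpers for array and floating point comparisons
    out = celsius_to_kelvin([0, 100])
    np.testing.assert_allclose(out, [273.15, 373.15])
```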