Data Management in the Ocean, Weather, and Climate Sciences

Data Provenance

Learning Objectives

  • Record data processing steps in the history attribute of netCDF files
  • Include this provenance tracking in a personal data management plan

Now that we've developed some code for reading in zonal and meridional surface current data and calculating the speed, the logical next step is to put that code in a script called calc_current_speed.py so that we can repeat the process quickly and easily.

The output of that script will be a new netCDF file containing the current speed data. Before we go ahead and start creating new files, however, there's one more thing to consider. Looking closely at the global attributes of IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc, you can see that the entire history of the file, all the way back to its initial creation, has been recorded in the history attribute.

print(acorn_DATA.history)
2012-10-09T03:31:35 Convert totl_TURQ_2012_10_08_1500.tuv to netcdf format using CODAR_Convert_File.
2012-10-09T03:31:36 Write CODAR file totl_TURQ_2012_10_08_1500.tuv. Modification of the NetCDF format by eMII to visualise the data using ncWMS

This practice of recording the history of the file ensures the provenance of the data. In other words, a complete record of everything that has been done to the data is stored with the data, which avoids any confusion in the event that the data is ever moved, passed around to different users, or viewed by its creator many months later.

If we want to create our own entry for the history attribute, we'll need to be able to generate a:

  • Time stamp
  • Record of what was entered at the command line in order to execute calc_current_speed.py
  • Method of indicating which version of the script was run (i.e. because the script is in our git repository)

Time stamp

The datetime module (part of the Python standard library) can be used to find out the current date and time:

import datetime
 
time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
print(time_stamp)
2015-02-17T20:26:25

The strftime method can be used to customise the appearance of a datetime object; in this case we've made it match the other time stamps in our data file.
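
The same format codes work in reverse: the strptime method parses an existing time stamp (like those in the history attribute) back into a datetime object. A quick illustration:

time_obj = datetime.datetime.strptime("2012-10-09T03:31:35", "%Y-%m-%dT%H:%M:%S")
print(time_obj.year)
2012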

Command line record

In the lesson on command line programs we met sys.argv, which contains all the arguments entered by the user at the command line:

import sys
print(sys.argv)
['/Applications/anaconda/envs/default/lib/python3.5/site-packages/ipykernel/__main__.py', '-f', '/Users/irv033/Library/Jupyter/runtime/kernel-c5b2a29d-9991-412c-83a1-2e69528034a0.json']

You can see that a number of command line arguments were used in launching this IPython notebook. To join all these list elements up, we can use the join method that belongs to Python strings:

args = " ".join(sys.argv)
print(args)
/Applications/anaconda/envs/default/lib/python3.5/site-packages/ipykernel/__main__.py -f /Users/irv033/Library/Jupyter/runtime/kernel-c5b2a29d-9991-412c-83a1-2e69528034a0.json

While this list of arguments is very useful, it doesn't tell us which Python installation was used to execute those arguments. The sys library can help us out here too:

exe = sys.executable
print(exe)
/Applications/anaconda/envs/default/bin/python

Git hash

In the Software Carpentry lessons on git we learned that each commit is associated with a unique 40-character identifier known as a hash. We can use the GitPython library (imported as git) to get the hash of the most recent commit in the repository that contains our script:

from git import Repo
import os

git_hash = Repo(os.getcwd()).heads[0].commit
print(git_hash)
bec6077a7d8e6d5b0b9370ca9cca6a071ed62b32
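
One caveat: Repo raises an error if the directory we give it isn't part of a git repository, so if a script might be run from elsewhere it's worth wrapping the lookup in a small helper function. A minimal sketch (get_git_hash is a hypothetical helper of our own, not part of GitPython):

import git

def get_git_hash(directory):
    """Return the short hash of the latest commit,
    or None if directory is not in a git repository."""
    try:
        # search_parent_directories=True also looks in the directories above
        repo = git.Repo(directory, search_parent_directories=True)
        commit = repo.heads[0].commit
    except git.InvalidGitRepositoryError:
        return None
    return str(commit)[0:7]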

We can now put all this information together for our history entry:

entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7])
print(entry)
2017-02-05T18:00:53: /Applications/anaconda/envs/default/bin/python /Applications/anaconda/envs/default/lib/python3.5/site-packages/ipykernel/__main__.py -f /Users/irv033/Library/Jupyter/runtime/kernel-c5b2a29d-9991-412c-83a1-2e69528034a0.json (Git hash: bec6077)
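
To record this entry in a new netCDF file, we simply prepend it to the existing history attribute, so that the most recent processing step appears first (this is exactly what the script below does):

new_history = entry + '\n' + acorn_DATA.history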

Putting it all together

So far we've been experimenting in the IPython notebook to familiarise ourselves with netCDF4 and the other Python libraries that might be useful for calculating the surface current speed. We should now go ahead and write a script, so we can repeat the process with a single entry at the command line:

$ cat code/calc_current_speed.py
import os, sys, argparse
import datetime
from git import Repo
from netCDF4 import Dataset


def calc_speed(u, v):
    """Calculate the speed"""

    speed = (u**2 + v**2)**0.5

    return speed


def copy_dimensions(infile, outfile):
    """Copy the dimensions of the infile to the outfile"""
        
    for dimName, dimData in iter(infile.dimensions.items()):
        outfile.createDimension(dimName, len(dimData))


def copy_variables(infile, outfile):
    """Create variables corresponding to the file dimensions 
    by copying from infile"""
            
    for var_name in ['TIME', 'LATITUDE', 'LONGITUDE']:
        varin = infile.variables[var_name]
        outVar = outfile.createVariable(var_name, varin.datatype, 
                                        varin.dimensions, 
                                        fill_value=varin._FillValue)
        outVar[:] = varin[:]
            
        var_atts = {}
        for att in varin.ncattrs():
            if att != '_FillValue':
                var_atts[att] = varin.getncattr(att)
        outVar.setncatts(var_atts)


def create_history():
    """Create the new entry for the global history file attribute"""

    time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    exe = sys.executable
    args = " ".join(sys.argv)
    git_hash = Repo(os.getcwd()).heads[0].commit

    return """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7])


def read_data(ifile, uVar, vVar):
    """Read data from ifile corresponding to the U and V variable"""

    input_DATA = Dataset(ifile)
    uData = input_DATA.variables[uVar][:]
    vData = input_DATA.variables[vVar][:]

    return uData, vData, input_DATA


def set_global_atts(infile, outfile):
    """Set the global attributes for outfile.
        
    Note that the global attributes are simply copied from
    infile and the history attribute updated accordingly.
        
    """
        
    global_atts = {}
    for att in infile.ncattrs():
        global_atts[att] = infile.getncattr(att)
        
    new_history = create_history()
    global_atts['history'] = """%s\n%s""" %(new_history,  global_atts['history'])
    outfile.setncatts(global_atts)


def write_speed(infile, outfile, spData):
    """Write the current speed data to outfile"""
        
    u = infile.variables['UCUR']   
    spcur = outfile.createVariable('SPCUR', u.datatype, u.dimensions, fill_value=u._FillValue)
    spcur[:,:,:] = spData    
    
    spcur.standard_name = 'sea_water_speed'
    spcur.long_name = 'sea water speed'
    spcur.units = u.units
    spcur.coordinates = u.coordinates
    

def main(inargs):
    """Run the program"""
    
    inFile = inargs.infile
    uVar = inargs.uvar
    vVar = inargs.vvar
    outfile_name = inargs.outfile

    # Read input data 
    uData, vData, input_DATA = read_data(inFile, uVar, vVar)
    
    # Calculate the current speed
    spData = calc_speed(uData, vData)
    
    # Write the output file
    outfile = Dataset(outfile_name, 'w', format='NETCDF4')
    set_global_atts(input_DATA, outfile)
    copy_dimensions(input_DATA, outfile)
    copy_variables(input_DATA, outfile)
    write_speed(input_DATA, outfile, spData) 
    
    outfile.close()


if __name__ == '__main__':

    extra_info ="""example:
  python calc_current_speed.py http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz UCUR VCUR IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc 

author:
  Damien Irving, irving.damien@gmail.com

"""

    description='Calculate the current speed'
    parser = argparse.ArgumentParser(description=description,
                                     epilog=extra_info,
                                     formatter_class=argparse.RawDescriptionHelpFormatter)

    parser.add_argument("infile", type=str, help="Input file name")
    parser.add_argument("uvar", type=str, help="Name of the zonal flow variable")
    parser.add_argument("vvar", type=str, help="Name of the meridional flow variable")
    parser.add_argument("outfile", type=str, help="Output file name")

    args = parser.parse_args()            

    print('Input file: ', args.infile)
    print('Output file: ', args.outfile)

    main(args)

Conserving metadata

It took four separate functions in calc_current_speed.py to create the output file, because we had to copy the dimensions and most of the global and variable attributes from the original file to the new file. This is such a common problem that both the Iris and xarray libraries conserve metadata whenever possible. When xarray is used to read a netCDF file, for instance, the data are stored as an xarray.DataArray (as opposed to a numpy.ndarray). These special data arrays carry their dimension information and variable attributes with them, which means you don't have to retrieve them manually. Iris and xarray also come with a bunch of convenience functions for typical weather/climate/ocean tasks (calculating climatologies, anomalies, etc.), which can be a pain to do with numpy.
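
For instance, here's a rough sketch of how the core of calc_current_speed.py might look using xarray (assuming the input file has been downloaded locally). Note that xarray drops variable attributes during arithmetic unless you re-attach them (or use xr.set_options(keep_attrs=True)), and you would still need to add your own entry to the global history attribute:

import xarray as xr

infile = 'IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc'

ds = xr.open_dataset(infile)
speed = (ds['UCUR'] ** 2 + ds['VCUR'] ** 2) ** 0.5  # dimensions and coordinates ride along automatically
speed = speed.rename('SPCUR')
speed.attrs = {'standard_name': 'sea_water_speed',
               'long_name': 'sea water speed',
               'units': ds['UCUR'].units}
speed.to_netcdf('IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc')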

Using the help information that the argparse library provides, we can now go ahead and run this script:

$ python calc_current_speed.py -h
usage: calc_current_speed.py [-h] infile uvar vvar outfile

Calculate the current speed

positional arguments:
  infile      Input file name
  uvar        Name of the zonal flow variable
  vvar        Name of the meridional flow variable
  outfile     Output file name

optional arguments:
  -h, --help  show this help message and exit

example:
  python calc_current_speed.py http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz UCUR VCUR IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc 

author:
  Damien Irving, irving.damien@gmail.com
$ python calc_current_speed.py http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz UCUR VCUR IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc 
Input file:  http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz
Output file:  IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc

The finished product

We can now inspect the attributes in our new file:

$ ncdump -h IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc
netcdf IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z {
dimensions:
    I = 55 ;
    J = 57 ;
    TIME = 493 ;
variables:
    double TIME(TIME) ;
        TIME:_FillValue = -9999. ;
        TIME:calendar = "gregorian" ;
        TIME:comment = "Given time lays at the middle of the averaging time period." ;
        TIME:long_name = "time" ;
        TIME:axis = "T" ;
        TIME:units = "days since 1950-01-01 00:00:00" ;
        TIME:standard_name = "time" ;
        TIME:local_time_zone = 8. ;
        TIME:valid_min = 0. ;
        TIME:valid_max = 999999. ;
    double LATITUDE(I, J) ;
        LATITUDE:_FillValue = 9999. ;
        LATITUDE:long_name = "latitude" ;
        LATITUDE:axis = "Y" ;
        LATITUDE:units = "degrees_north" ;
        LATITUDE:reference_datum = "geographical coordinates, WGS84 projection" ;
        LATITUDE:standard_name = "latitude" ;
        LATITUDE:valid_min = -90. ;
        LATITUDE:valid_max = 90. ;
    double LONGITUDE(I, J) ;
        LONGITUDE:_FillValue = 9999. ;
        LONGITUDE:long_name = "longitude" ;
        LONGITUDE:axis = "X" ;
        LONGITUDE:units = "degrees_east" ;
        LONGITUDE:reference_datum = "geographical coordinates, WGS84 projection" ;
        LONGITUDE:standard_name = "longitude" ;
        LONGITUDE:valid_min = -180. ;
        LONGITUDE:valid_max = 180. ;
    float SPCUR(TIME, I, J) ;
        SPCUR:_FillValue = 9999.f ;
        SPCUR:standard_name = "sea_water_speed" ;
        SPCUR:long_name = "sea water speed" ;
        SPCUR:units = "m s-1" ;
        SPCUR:coordinates = "TIME LATITUDE LONGITUDE" ;

// global attributes:
        :keywords = "Oceans" ;
        :geospatial_lat_min = -29.5386582, -29.5927878, -29.6469169, -29.7010456, -29.7551738, -29.8093016, -29.863429, -29.9175559, -29.9716823, -30.0258084, -30.0799339, -30.1340591, -30.1881837, -30.242308, -30.2964318, -30.3505551, -30.404678, -30.4588004, -30.5129224, -30.5670439, -30.621165, -30.6752857, -30.7294059, -30.7835256, -30.8376449, -30.8917637, -30.9458821, -31., -31.0541175, -31.1082345, -31.162351, -31.2164671, -31.2705828, -31.324698, -31.3788127, -31.432927, -31.4870408, -31.5411542, -31.5952671, -31.6493795, -31.7034915, -31.7576031, -31.8117141, -31.8658247, -31.9199349, -31.9740446, -32.0281538, -32.0822625, -32.1363708, -32.1904787, -32.244586, -32.2986929, -32.3527994, -32.4069054, -32.4610109 ;
        :abstract = "These data have not been quality controlled. The ACORN facility is producing NetCDF files with radials data for each station every ten minutes.  Radials represent the surface sea water state component  along the radial direction from the receiver antenna  and are calculated from the shift of an area under  the bragg peaks in a Beam Power Spectrum.  The radial values have been calculated using software provided  by the manufacturer of the instrument. eMII is using a Matlab program to read all the netcdf files with radial data for two different stations  and produce a one hour average product with U and V components of the current. The final product is produced on a regular geographic (latitude longitude) grid More information on the data processing is available through the IMOS MEST  http://imosmest.aodn.org.au/geonetwork/srv/en/main.home. This file is an aggregation over a month of distinct hourly averaged files." ;
        :geospatial_lon_min = 112.3100111, 112.3090067, 112.3080002, 112.3069914, 112.3059805, 112.3049674, 112.3039521, 112.3029346, 112.3019149, 112.3008929, 112.2998688, 112.2988424, 112.2978139, 112.2967831, 112.29575, 112.2947147, 112.2936772, 112.2926374, 112.2915953, 112.290551, 112.2895044, 112.2884556, 112.2874044, 112.286351, 112.2852953, 112.2842373, 112.283177, 112.2821143, 112.2810494, 112.2799821, 112.2789125, 112.2778406, 112.2767663, 112.2756897, 112.2746108, 112.2735294, 112.2724458, 112.2713597, 112.2702713, 112.2691805, 112.2680873, 112.2669917, 112.2658937, 112.2647933, 112.2636905, 112.2625853, 112.2614777, 112.2603676, 112.2592551, 112.2581402, 112.2570228, 112.2559029, 112.2547806, 112.2536558, 112.2525286 ;
        :geospatial_vertical_max = 0. ;
        :geospatial_lon_units = "degrees_east" ;
        :title = "IMOS ACORN Turquoise Coast site (WA) (TURQ), monthly aggregation of one hour averaged current data" ;
        :source = "Terrestrial HF radar" ;
        :citation = "Citation to be used in publications should follow the format:\"IMOS.[year-of-data-download],[Title],[Data access URL],accessed [date-of-access]\"" ;
        :id = "IMOS/ACORN/au/TURQ/2012-10-08T15:00:00Z.sea_state" ;
        :acknowledgement = "Data was sourced from Integrated Marine Observing System (IMOS) - an initiative of the Australian Government being conducted as part of the National Calloborative Research Infrastructure Strategy." ;
        :geospatial_lat_units = "degrees_north" ;
        :institution = "Australian Coastal Ocean Radar Network (ACORN)" ;
        :acknowledgment = "Data was sourced from the Integrated Marine Observing System (IMOS) - IMOS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS) and the Super Science Initiative (SSI)." ;
        :file_version_quality_control = "Data in this file has not been quality controlled" ;
        :project = "Integrated Marine Observing System (IMOS)" ;
        :author_email = "laurent.besnard@utas.edu.au" ;
        :date_modified = "2012-10-30T16:31:50Z" ;
        :time_coverage_start = "2012-10-01T00:00:00Z" ;
        :institution_references = "http://www.imos.org.au/acorn.html" ;
        :geospatial_lon_max = 115.7758038, 115.7766744, 115.7775469, 115.7784212, 115.7792975, 115.7801756, 115.7810557, 115.7819376, 115.7828215, 115.7837073, 115.784595, 115.7854846, 115.7863762, 115.7872697, 115.7881651, 115.7890625, 115.7899618, 115.7908631, 115.7917663, 115.7926715, 115.7935787, 115.7944878, 115.7953989, 115.796312, 115.7972271, 115.7981441, 115.7990632, 115.7999843, 115.8009074, 115.8018325, 115.8027596, 115.8036887, 115.8046199, 115.8055531, 115.8064883, 115.8074256, 115.8083649, 115.8093063, 115.8102497, 115.8111952, 115.8121427, 115.8130924, 115.8140441, 115.8149979, 115.8159538, 115.8169118, 115.8178719, 115.8188341, 115.8197984, 115.8207648, 115.8217334, 115.822704, 115.8236768, 115.8246518, 115.8256289 ;
        :Conventions = "IMOS version 1.3" ;
        :geospatial_vertical_units = "m" ;
        :principal_investigator = "Wyatt, Lucy" ;
        :ssr_Stations = "SeaBird (SBRD), Cervantes (CRVT)" ;
        :geospatial_lat_max = -29.5252994, -29.5794147, -29.6335296, -29.687644, -29.741758, -29.7958715, -29.8499845, -29.904097, -29.9582091, -30.0123207, -30.0664318, -30.1205425, -30.1746527, -30.2287624, -30.2828717, -30.3369805, -30.3910888, -30.4451966, -30.499304, -30.5534109, -30.6075173, -30.6616232, -30.7157287, -30.7698337, -30.8239382, -30.8780423, -30.9321458, -30.9862489, -31.0403515, -31.0944536, -31.1485553, -31.2026564, -31.2567571, -31.3108573, -31.364957, -31.4190563, -31.473155, -31.5272533, -31.5813511, -31.6354484, -31.6895452, -31.7436416, -31.7977374, -31.8518328, -31.9059277, -31.960022, -32.0141159, -32.0682094, -32.1223023, -32.1763947, -32.2304867, -32.2845781, -32.3386691, -32.3927596, -32.4468495 ;
        :netcdf_version = "3.6" ;
        :naming_authority = "Integrated Marine Observing System (IMOS)" ;
        :data_centre_email = "info@emii.org.au" ;
        :quality_control_set = "1" ;
        :data_centre = "eMarine Information Infrastructure (eMII)" ;
        :comment = "These data have not been quality controlled. They represent values calculated using software provided by CODAR Ocean Sensors. The file has been modified by eMII in order to visualise the data using the ncWMS software.This NetCDF file has been created using the IMOS NetCDF User Manual v1.2. A copy of the document is available at http://imos.org.au/facility_manuals.html" ;
        :featureType = "grid" ;
        string :history = "2017-02-05T18:12:00: /Applications/anaconda/bin/python code/calc_current_speed.py http://thredds.aodn.org.au/thredds/dodsC/IMOS/eMII/demos/ACORN/monthly_gridded_1h-avg-current-map_non-QC/TURQ/2012/IMOS_ACORN_V_20121001T000000Z_TURQ_FV00_monthly-1-hour-avg_END-20121029T180000Z_C-20121030T160000Z.nc.gz UCUR VCUR IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc (Git hash: bec6077)\n2012-10-09T03:31:35 Convert totl_TURQ_2012_10_08_1500.tuv to netcdf format using CODAR_Convert_File.\n2012-10-09T03:31:36 Write CODAR file totl_TURQ_2012_10_08_1500.tuv. Modification of the NetCDF format by eMII to visualise the data using ncWMS" ;
        :data_center_email = "info@emii.org.au" ;
        :geospatial_vertical_min = 0. ;
        :local_time_zone = 8. ;
        :date_created = "2012-10-30T16:31:50Z" ;
        :file_version = "Level 0 - Raw data" ;
        :data_center = "eMarine Information Infrastructure (eMII)" ;
        :instrument = "CODAR Ocean Sensors/SeaSonde" ;
        :site_code = "TURQ, Turqoise Coast" ;
        :distribution_statement = "Data may be re-used, provided that related metadata explaining the data has been reviewed by the user, and the data is appropriately acknowledged. Data, products and services from IMOS are provided \"as is\" without any warranty as to fitness for a particular purpose." ;
        :author = "Besnard, Laurent" ;
        :time_coverage_duration = "PT1H19M" ;
        :time_coverage_end = "2012-10-29T18:00:00Z" ;
}
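
We can also check the new history attribute from within Python, just as we did for the original file:

from netCDF4 import Dataset

new_DATA = Dataset('IMOS_ACORN_SPCUR_20121001T000000Z_TURQ_monthly-1-hour-avg_END-20121029T180000Z.nc')
print(new_DATA.history)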

There are a lot of global and variable attributes in our output file, because we copied everything from the input file. It might be tempting to edit calc_current_speed.py, or to use the ncatted command line utility (which, like ncdump, comes with the NCO install), to clean up (i.e. delete) some of these attributes, but we should resist that urge. The id global attribute, for instance, makes little sense to anyone who doesn't work at IMOS. While this might seem like a reasonable argument for deleting it, once an attribute is deleted it's gone forever. The id attribute takes up a negligible amount of space, so why not leave it there, just in case? When in doubt, keep metadata.

A provenance plan

Does your data management plan from the first challenge adequately address this issue of data provenance? If not, go ahead and add to your plan now. Things to consider include:

  • How to capture and store metadata relating to output files that aren't self-describing (i.e. unlike .nc files, formats like .csv or .png don't store things like global and variable attributes within them); one simple approach is sketched below
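
For example, you could write the history entry to a companion text file whenever you create a figure or a .csv file. A minimal sketch, reusing the create_history function from calc_current_speed.py (write_metadata is a hypothetical helper of our own, not from any library):

def write_metadata(output_file, history_entry):
    """Write the history entry to a companion .txt file
    (e.g. speed.png is accompanied by speed.png.txt)."""
    with open(output_file + '.txt', 'w') as f:
        f.write(history_entry)

write_metadata('speed.png', create_history())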

Discuss the additions you've made to your plan with your partner.