MeteoIODoc 20240411.d3bdb3cb

Format

This plugins offers a flexible way to read Comma Separated Values (CSV) files. It is however assumed that:

  • each line contains a data record (or is an empty line)
  • each line contains the same number of fields;
  • a single character is consistently used through the file as field delimiter (to split each record into fields);
  • missing data are represented by an empty value, so two delimiters follow directly each other or by a special value (see NODATA in Metadata extraction);
  • the file may contain a header that may contain additional information (metadata), see below.

In order to reduce the amount of manual configuration, it is possible to extract metadata from the headers or the filename, such as the station name, ID, coordinates, etc

Units

**The final units MUST be coherent derived SI units** (section 2.3.4 in the SI-Brochure). If not, the conversion offsets/factors must be provided to convert the data back to SI (see required keywords below) or the units declared (in the headers) and supported by this plugin.

Keywords

This plugin uses the keywords described below, in the [Input] section. First, there are some general plugin options:

  • COORDSYS: coordinate system (see Coords);
  • COORDPARAM: extra coordinates parameters (see Coords);
  • TIME_ZONE: the timezone that should be used to interpret the dates/times (default: 0);
  • METEOPATH: the directory where the data files are available (mandatory);
  • METEOPATH_RECURSIVE: if set to true, the scanning of METEOPATH is performed recursively (default: false);
  • CSV_FILE_EXTENSION: When scanning the whole directory, look for these files (default: .csv). Note that this matching isn't restricted to the end of the file name so if you had files stat1_jan.csv, stat1_feb.csv and stat2_jan.csv you could select January's data by putting "_jan" here;
  • CSV_SILENT_ERRORS: if set to true, lines that can not be read will be silently ignored (default: false, has priority over CSV_ERRORS_TO_NODATA);
  • CSV_ERRORS_TO_NODATA: if true, unparseable fields (like text fields) are set to nodata, but the rest of the line is kept (default: false).

You can now describe the specific format for all files (prefixing the following keys by "CSV_") or for each particular file (prefixing the following keys by "CSV#_" where "#" represents the station index). Of course, you can mix keys that are defined for all files with some keys only defined for a few specific files (the keys defined for a particular station have priority over the global version).

  • CSV#_DELIMITER: field delimiter to use (default: ','), use SPACE or TAB for whitespaces (in this case, multiple whitespaces directly following each other are considered to be only one whitespace);
  • CSV#_NODATA: a space delimited list of strings (of course, this also contains numbers such as -6999) that should be interpreted as nodata (default: NAN NULL);
  • Headers handling
    • CSV#_NR_HEADERS: how many lines should be treated as headers? (default: 1);
    • CSV#_HEADER_DELIMITER: different field delimiter to use in header lines; optional
    • CSV#_HEADER_REPEAT_MK: a string that is used to signal another copy of the headers mixed with the data in the file (the matching is done anywhere in the line) (default: empty);
    • CSV#_UNITS_HEADER: header line providing the measurements units (the subset of recognized units is small, please inform us if one is missing for you); optional
    • CSV#_UNITS_OFFSET: offset to add to each value in order to convert it to SI (if also providing CSV#_UNITS, would be applied afterwards); optional
    • CSV#_UNITS_MULTIPLIER: factor to multiply each value by, in order to convert it to SI (if also providing CSV#_UNITS, would be applied afterwards); optional
  • Data parsing restrictions
    • CSV#_COMMENTS_MK: a single character to use as comments delimiter, everything after this char until the end of the line will be skipped (default: no comments);
    • CSV#_PURGE_CHARS: space delimited list of ascii characters to purge from the input, either directly given or as decimal representation or as hexadecimal representation (prefixed with 0x). Example: 0x40 13 " ;
    • CSV#_EXCLUDE_LINES: a comma delimited list of line ranges (numbers separated by a dash) or line numbers to exclude from parsing (ie the lines within these ranges will be read and discarded immediately). Example: 18 - 36, 52, 55, 167 - 189. Please note that it is not possible to mix CSV#_EXCLUDE_LINES and CSV#_ONLY_LINES and that additional spaces (for more clarity in the input, as in the provided example) can be used although they are not mandatory.
    • CSV#_ONLY_LINES: a comma delimited list of line ranges (numbers separated by a dash enclosed in spaces) or line numbers to restrict the parsing to (ie the lines outside of these ranges will be read and discarded immediately). Example: 18 - 36, 52, 55, 167 - 189. Please note that it is not possible to mix CSV#_EXCLUDE_LINES and CSV#_ONLY_LINES.
  • Fields parsing
    • CSV#_COLUMNS_HEADERS: header line to interpret as columns headers (default: 1, see also special field names);
    • CSV#_FIELDS: one line providing the columns headers (if they don't exist in the file or to overwrite them). If a field is declared as "ID" then only the lines that have the proper ID for the current station will be kept; if a field is declared as "SKIP" it will be skipped; otherwise date/time parsing fields are supported according to Date/Time parsing below (see also the special field names for more); optional
    • CSV#_FILTER_ID: if the data contains an "ID" column, which ID should be kept (all others will be rejected); default: station ID
    • CSV#_UNITS: one line providing space delimited units for each column (including the timestamp), no units is represented as "-". This is an alternative to relying on a units line in the file itself or relying on units_offset / units_multiplier. Please keep in mind that the choice of recognized units is very limited... (C, degC, cm, in, ft, F, deg, pc, % and a few others). If CSV#UNITS_OFFSET / MULTIPLIER were also provided, CSV#_UNITS would be applied first.
    • CSV#_SKIP_FIELDS: a comma-delimited list of fields to skip (first field is numbered 1, ranges such as 12 - 17 are supported as well). Keep in mind that when using parameters such as UNITS_OFFSET, the skipped field MUST be taken into consideration (since even if a field is skipped, it is still present in the file!); optional
    • CSV#_ONLY_FIELDS: a comma-delimited list of fields to keep, all others will be skipped (so this is the opposite of CSV#_SKIP_FIELDS. Please note that if using both CSV#_SKIP_FIELDS and CSV#_ONLY_FIELDS, the most restrictive interpretation will be applied: only fields that are included in the ONLY list and NOT in the SKIP list will be kept); optional
    • CSV#_NUMBER_FIELDS: prefix every given field name by its column index in the original file (this is useful to "debug" CSV files when the columns' content don't match what was expected. Please note that special fields related to date/time, station ID or SKIP are left unchanged); optional
    • CSV#_FIELDS_POSTFIX: postfix every given field name by the provided string (this is also to "debug" CSV files); optional
  • Date/Time parsing. There are several possibilities: the date/time is provided as one or two strings; as a purely decimal number following a given representation; as each component as a separate column.
    • Date/Time as string(s):
      • CSV#_DATETIME_SPEC: mixed date and time format specification (default is ISO_8601: YYYY-MM-DDTHH24:MI:SS);
      • CSV#_DATE_SPEC: date format specification (default: YYYY_MM_DD);
      • CSV#_TIME_SPEC: time format specification (default: HH24:MI:SS);
    • Date/Time decimal representation:
      • CSV#_DECIMALDATE_TYPE: the numerical representation that is used, one of EXCEL, JULIAN, MJULIAN, MATLAB, RFC868 or UNIX (see decimal date representations);
    • Date/Time as separate components:
      • the fields must be named (either from the headers or through the CSV#_FIELDS key) as YEAR, YEAR_2DIGITS (only the last 2 digits of the year, numbers before 40 will be converted to years after 2000), JDAY (number of days since the beginning of the year), MONTH, DAY, NTIME (numerical representation of time, for example 952 for 09:52), HOURS, MINUTES, SECONDS (if minutes or seconds are missing, they will be assumed to be zero). See special field names for accepted synonyms;
      • if/when no year component is provided, it is possible to define a fallback year with the CSV#_FALLBACK_YEAR key;
      • when using CSV#_FALLBACK_YEAR, it will by default assume that all data for times greater than 1st October that appear before data belonging to times before 1st of October are actually data from the year before. Please set CSV#_FALLBACK_AUTO_WRAP to false if this is not desired.
  • Metadata
    • CSV#_NAME: a descriptive station name to use (if provided, has priority over the special headers);
    • CSV#_ID: the (short) station id to use (if provided, has priority over the special headers);
    • CSV#_SLOPE: the slope angle in degrees at the station (if providing a slope, also provide an azimuth);
    • CSV#_AZIMUTH: the slope azimuth in degrees from North at the station ;
    • CSV#_SPECIAL_HEADERS: description of how to extract more metadata out of the headers; optional
    • CSV#_FILENAME_SPEC: pattern to parse the filename and extract metadata out of it; optional
    • The following two keys provide mandatory data for each station, therefore there is no "global" version and they must be defined:
      • STATION#: input filename (in METEOPATH). As many meteofiles as needed may be specified. If nothing is specified, the METEOPATH directory will be scanned for files with the extension specified in CSV_FILE_EXTENSION;
      • POSITION#: coordinates of the station (default: reading key "POSITION", see Coords() for the syntax). This key can only be omitted if lat/lon/altitude are provided in the file name or in the headers (see CSV#_FILENAME_SPEC and CSV#_SPECIAL_HEADERS);

If no ID has been provided, an automatic station ID will be generated as "ID{n}" where n is the current station's index. Regarding the units handling, it is only performed through either the CSV_UNITS_OFFSET key or the CSV_UNITS_OFFSET / CSV_UNITS_MULTIPLIER keys. These keys expect a value for each column of the file, including the date and time.

Special field names

When reading the field names, either from a file header or as provided in the configuration file with the CSV#_FIELDS key, the fields will be attributed to variables bearing the same names. But some field names will be recognized and automatically interpreted as either known internal parameter names (see this list) or date/time parameters or stationID the data belongs to. Besides MeteoIO's internal parameter names, the following field names are also automatically recognized (synonyms are separated by ',' while different parameters are separated by ';'):

  • TIMESTAMP, TS, DATETIME; DATE; TIME; YEAR; JDAY, JDN, YDAY, DAY_OF_YEAR, DOY; MONTH; DAY; NTIME; HOUR, HOURS; MINUTE, MINUTES; SECOND, SECONDS; ID, STATIONID;
  • TEMPERATURE_AIR, AIRTEMP; SOIL_TEMPERATURE, SOILTEMP; PRECIPITATION, PREC; REFLECTED_RADIATION; INCOMING_RADIATION, INCOMINGSHORTWAVERADIATION; WIND_DIRE CTION, WD; RELATIVE_HUMIDITY, RELATIVEHUMIDITY; WIND_VELOCITY, WS; PRESSURE, STATIONPRESSURE; INCOMING_LONGWAVE, INCOMINGLONGWAVERADIATION; SNOWSURFACETEMPERATURE; WS_MAX;
Note
Since most parameter won't have names that are recognized by MeteoIO, it is advised to map them to MeteoIO's internal names. This is done either by using the CSV_FIELDS key or using the EditingMove feature of the Input Data Editing stage.

Date and time specification

In order to be able to read any date and time format, the format has to be provided in the configuration file. This is provided as a string containing the following special markers:

  • YYYY: the 4 digits year;
  • YY: the 2 digits year (using 40 as cutoff year: date greater than 40 get converted to 1900+year, otherwise to 2000+year);
  • MM: the two digits month;
  • DD: the two digits day;
  • HH24: the two digits hour of the day (0-24);
  • MI: the two digits minutes (0-59);
  • SS: the number of seconds (0-59.98), that can be decimal;
  • TZ: the numerical timezone as offset to GMT (see note below).

Any other character is interpreted as itself, present in the string. It is possible to either provide a combined datetime field (so date and time are combined into one single field) or date and time as two different fields. For example:

  • YYYY-MM-DDTHH24:MI:SS described an ISO 8601 datetime field;
  • MM/DD/YYYY described an anglo-saxon date;
  • DD.MM.YYYY HH24:MI:SS is for a Swiss formatted datetime.
Note
When providing a timezone field, it must appear at the end of the string. it can either be numerical (such as "+1.") or an abbreviation such as "CET" (see https://en.wikipedia.org/wiki/List_of_time_zone_abbreviations).

When this plugin identifies the fields by their column headers, it will look for TIMESTAMP or DATETIME for a combined date and time field, or DATE or TIME for (respectively) a date and time field. Usually, other labels will not be recognized.

Metadata extraction

Since there is no unified way of providing metadata (such as the location, station name, etc) in CSV files, this information has to be either provided in the configuration file (see Configuration keywords) or extracted out of either the file name or the file headers. A specific syntax allows to describe where to find which metadata field type.

Metadata fields types

The following field types are supported:

  • NAME;
  • ID (this will be used as a handle for the station);
  • ALT (for the altitude);
  • LON (for the longitude);
  • LAT (for the latitude);
  • EASTING (as per your input coordinate system);
  • NORTHING (if LON/LAT is not used);
  • SLOPE (in degrees);
  • AZI (for the slope azimuth, in degree as read from a compass);
  • NODATA (string to interpret as nodata);
  • PARAM (the extracted metadata will replace the PARAM field either as found in the file's headers or in the CSV_FIELDS user configuration key);
  • SKIP or - (skip this field).

If ID or NAME appear more than once in one specification string, their multiple values will be appended.

Header metadata extraction

This is performed with the "CSV#_SPECIAL_HEADERS" configuration key. This key is followed by as many metadata specifications as necessary, of the form {field}:{line}:{column}.

Therefore, if the station name is available on line 1, column 3 and the station id on line 2, column 5, the configuration would be:

CSV_SPECIAL_HEADERS = name:1:3 id:2:5

Filename metadata extraction

This is performed with the "CSV#_FILENAME_SPEC" configuration key. This key is followed by the metadata specification that will be applied to identify the information to extract as well as substrings that are used as "markers" delimiting the different fields (enclosed within {}).

For example, to parse the filename "H0118_Generoso-Calmasino_-_Precipitation.csv" use (please note that the extension is NOT provided):

CSV_FILENAME_SPEC = {ID}_{NAME}-{SKIP}_-_{PARAM}

If the CSV_FIELDS key is also present, it will have priority. Therefore, it is possible to define one CSV_FILENAME_SPEC for several files and only define CSV#_FIELDS for the files that would require a different handling (for example because their parameter would not be recognized). Moreover, it is possible to set "AUTOMERGE" to "true" in the [InputEditing] section, so all files leading to the same station ID will be merged together into one single station.

Note
Obviously, the {PARAM} metadata field type can only be used for files that contain the time information (either as datetime or separate date and time) and one meteorological parameter. If there are multiple (potentially unimportant) parameters in your file you have to set CSV_SINGLE_PARAM_INDEX to the column number matching your parameter.

Examples

This section contains some exemplary CSV files together with the INI configuration to read them.

In order to read a bulletin file downloaded from IDAWEB, you need the following configuration:

CSV

"COLUMN TO SKIP" "TIMESTAMP" "COL 1" "COL 2" "COL 3" "COL TO SKIP" "COL TO SKIP" "COL 4" "COL TO SKIP" "COL TO SKIP" "COL 5"
000 2023082201 1 2 3 000 000 4 000 000 5
000 2023082202 1 2 3 000 000 4 000 000 5
000 2023082203 1 2 3 000 000 4 000 000 5

INI

METEO = CSV
METEOPATH = ./input/meteo
CSV_DELIMITER = SPACE
CSV_NR_HEADERS = 1
CSV_COLUMNS_HEADERS = 1
CSV_DATETIME_SPEC = YYYYMMDDHH24
CSV_NODATA = -
STATION1 = IDA_station1.csv
POSITION1 = latlon (46.80284, 9.77726, 2418)
CSV1_NAME = TEST
CSV1_ID = myID
CSV1_FIELDS = SKIP TIMESTAMP HS RSWR TA SKIP SKIP RH SKIP SKIP ILWR
CSV1_UNITS_OFFSET = 0 0 0 0 273.15 0 0 0 0 0 0
CSV1_UNITS_MULTIPLIER = 1 1 0.01 1 1 1 1 0.01 1 1 1

In order to read a CSV file produced by a Campbell data logger with Swiss-formatted timestamps, you need the following configuration:

CSV

An arbitrary header line
"Some","Information","About","the","data","arbitrary","no of columns","not used for parsing"
"TIMESTAMP","COLUMN_NAME 1","COLUMN_NAME 2","COLUMN_NAME 3","COLUMN_NAME 4","COLUMN_NAME 5","COLUMN_NAME 6"
"TS","unit 1","unit 2","unit 3","unit 4","unit 5","unit 6"
04.05.2023 13:00:00,0,300,600,900,1200,1500
04.05.2023 14:00:00,0,300,600,900,1200,1500
04.05.2023 15:00:00,0,300,600,900,1200,1500
04.05.2023 16:00:00,0,300,600,900,1200,1500

INI

METEO = CSV
METEOPATH = ./input/meteo
CSV_NR_HEADERS = 4
CSV_COLUMNS_HEADERS = 2
CSV_UNITS_HEADERS = 3
CSV_DATETIME_SPEC = DD.MM.YYYY HH24:MI:SS
CSV_SPECIAL_HEADERS = name:1:2 id:1:4
STATION1 = DisMa_DisEx.csv
POSITION1 = latlon 46.810325 9.806657 2060
CSV1_ID = DIS4

For a logger produced csv file with repeating headers and quoted timestamps, you need the following configuration:

CSV

# a information header line that is skipped
"Some","Information","About","the","data","arbitrary","no of columns","not used for parsing"
"TIMESTAMP","COLUMN_NAME 1","COLUMN_NAME 2","COLUMN_NAME 3","COLUMN_NAME 4","COLUMN_NAME 5","COLUMN_NAME 6"
"TS","unit 1","unit 2","unit 3","unit 4","unit 5","unit 6"
"2023-01-11 13:30:00",0,300,600,900,1200,1500
"2023-01-11 13:40:00",0,300,600,900,1200,1500
"2023-01-11 13:50:00",0,300,600,900,1200,1500
"2023-01-11 14:00:00",0,300,600,900,1200,1500
# a information header line that is skipped
"Some","Information","About","the","data","arbitrary","no of columns","not used for parsing"
"TIMESTAMP","COLUMN_NAME 1","COLUMN_NAME 2","COLUMN_NAME 3","COLUMN_NAME 4","COLUMN_NAME 5","COLUMN_NAME 6"
"TS","unit 1","unit 2","unit 3","unit 4","unit 5","unit 6"
"2023-01-11 15:40:00",0,300,600,900,1200,1500
"2023-01-11 15:50:00",0,300,600,900,1200,1500
"2023-01-11 16:00:00",0,300,600,900,1200,1500

INI

METEO = CSV
METEOPATH = /path/to/input
CSV_DELIMITER = ,
CSV_NR_HEADERS = 3
CSV_HEADER_REPEAT_MK = #
CSV_UNITS_SOURCE = FROM HEADERS
CSV_UNITS_HEADERS = 3
CSV_COLUMNS_HEADERS = 2
CSV_TIMESTAMP = COMBINED
CSV_DATETIME_SPEC = "YYYY-MM-DD HH24:MI:SS"
POSITION = xy(198754, 723458,2200)

In order to read a set of files each containing only one parameter and merge them together (see input data editing for more on the merge feature), extracting the station ID, name and meteorological parameter from the filename:

[Input]
METEO = CSV
METEOPATH = ./input/meteo
CSV_DELIMITER = ;
CSV_HEADER_LINES = 1
CSV_DATE_SPEC = DD/MM/YYYY
CSV_TIME_SPEC = HH24:MI
POSITION = latlon (46.8, 9.80, 1700)
CSV_FILENAME_SPEC = {ID}_{NAME}_-_{SKIP}-{PARAM}
CSV_COLUMNS_HEADERS = 1
STATION1 = H0118_Generoso_-_Calmasino_precipitation.csv
STATION2 = H0118_Generoso_-_Calmasino_temperature.csv #the parameter name is ambiguous, it will not be recognized
CSV2_FIELDS = DATE TIME TA #so we define the parameter manually
CSV2_UNITS_OFFSET = 0 0 273.15
STATION3 = H0118_Generoso_-_Calmasino_reflected_solar_radiation.csv
STATION4 = H0118_Generoso_-_Calmasino_relative_humidity.csv
STATION5 = H0118_Generoso_-_Calmasino_wind_velocity.csv
[InputEditing]
AUTOMERGE = true
@ TIME
Definition: libncpp.h:33

Please note that here the file's headers will look like 'DATE;TIME;PARAM' which will allow PARAM to be replaced by the PARAM special value extracted from the file name.

In order to read a set of files and merge them together (see input data editing for more on the merge feature):

[Input]
METEO = CSV
METEOPATH = ./input/meteo
CSV_DELIMITER = ;
CSV_HEADER_LINES = 1
CSV_DATE_SPEC = DD/MM/YYYY
CSV_TIME_SPEC = HH24:MI
POSITION = latlon (46.8, 9.80, 1700)
CSV_NAME = Generoso
CSV1_FIELDS = DATE TIME PSUM HS
STATION1 = H0118_lg23456.csv
CSV2_FIELDS = DATE TIME TA
STATION2 = H0118_lg7850.csv
CSV2_UNITS_OFFSET = 0 0 273.15
CSV3_FIELDS = DATE TIME RSWR ISWR
STATION3 = H0118_lg64520.csv
CSV4_FIELDS = DATE TIME RH
STATION4 = H0118_lg45302.csv
CSV4_UNITS_MULTIPLIER = 1 1 0.01
CSV5_FIELDS = DATE TIME VW
STATION5 = H0118_wind_velocity.csv
[InputEditing]
ID1::MERGE = ID2 ID3 ID4 ID5

When reading a file containing the data from multiple stations, each line containing the station ID it applies to, it is necessary to provide the requested station ID for each new station as well as declare which is the ID field (if the headers declare an "ID" column or a "STATIONID" column, this will also work)

METEO = CSV
METEOPATH = ./input
CSV_DELIMITER = ,
CSV_NR_HEADERS = 1
CSV_COLUMNS_HEADERS = 1
CSV_DATETIME_SPEC = YYYY-MM-DD HH24:MI:SS
CSV_FIELDS = ID TIMESTAMP SKIP TA RH DW VW SKIP HS SKIP SKIP P SKIP SKIP SKIP ISWR SKIP SKIP SKIP TSG SKIP SKIP ISWR ILWR TSS
CSV_UNITS_OFFSET = 0 0 0 273.15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 273.15
CSV_UNITS_MULTIPLIER = 1 1 1 1 0.01 1 1 1 0.01 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
STATION1 = Extracted_data.csv
POSITION1 = latlon (46.8, 9.81, 1511.826)
CSV1_ID = 109
CSV1_NAME = Station109
STATION2 = Extracted_data.csv
POSITION2 = xy (45.8018, 9.82, 111.826)
CSV2_ID = 105
CSV2_NAME = Station105

Debugging CSV files reading

Unfortunately, there are some cases when the data in one or multiple files does not match the expected field content. This can happen with the data being spread over multiple files and the column order changing between files or even with the column order changing within a given file. Depending on the amount of data and the number of files, this can be quite cumbersome to correct, if even possible at all. Here is some rough procedure to help correcting such issues:

  1. Split the data at the time of changes. If the data is provided for example in yearly files and such changes appear between files, created one station ID per file (such as {base_id}_{year}). If such changes appear within one file, split the file by reading it multiple times (with multiple STATION# statements) with different lines exclusions commands and attribute a different station ID for each.
  2. Name each column with its column index, using the CSV#_NUMBER_FIELDS key. You can combine it with CSV#_FIELDS_POSTFIX in order to also have an indication of the origin of the data (such as the year or the line ranges).
  3. Plot all the fields for each station ID (like a contact print or as individual plots) with the field name clearly visible. It is often not that hard to distinguish between different meteorological parameters (such as between snow height and wind speed), so it will be possible to quickly spot obvious mis-labeled fields.
  4. Load successive data periods together on the same plot, select one parameter from one period and cycle through the parameters of the other period until you find that there is a smooth transition between the two periods. The column index in the field names (as advised to do in (2)) allow to know which field is at which index for which period.