Five phases of data publishing (2)


Phase 2: Format and QC data tables



Although we accept most file formats, we recommend tabular data (comma- or tab-delimited ASCII text) and geospatial data types. For multi-year observations, we strongly encourage you to compile your tabular data into a single comma- or tab-delimited file. If you have trouble with this, we are here to help. Geospatial data files should be compressed into one or more .zip archives.

Below are basic rules for scientists preparing data for archiving: Use consistent data organization. You may be planning to submit your data to us manually in several tables (e.g., organized by year). If so, each table must have the same structure; that is, the attributes must appear in the same order and have identical names in all the tables so that we can write code to process your data. Run quality control checks on your data to ensure they are ready for publication. Keep track of these steps so others know what has been done to the data.
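The "same structure" rule can be checked before submission. The following is a minimal sketch in Python, not EDI's own processing code: the function name `combine_tables`, the column names, and the sample values are all hypothetical. It compares the header of each yearly table against the first one and, if they match in both names and order, stacks the rows into a single table.

```python
import csv
import io


def combine_tables(tables):
    """Combine CSV tables that share an identical header into one list of rows.

    `tables` maps a label (e.g., a year) to CSV text. A ValueError is raised
    if any header differs from the first table's header in names or order.
    """
    combined = []
    expected = None
    for label, text in tables.items():
        rows = list(csv.reader(io.StringIO(text)))
        header, body = rows[0], rows[1:]
        if expected is None:
            # The first table defines the structure all others must follow.
            expected = header
            combined.append(header)
        elif header != expected:
            raise ValueError(
                f"table {label!r} has header {header}, expected {expected}"
            )
        combined.extend(body)
    return combined


# Hypothetical example data: two yearly tables with the same structure.
tables = {
    "2019": "date,site,nitrate_mg_l\n2019-06-01,A,0.12\n",
    "2020": "date,site,nitrate_mg_l\n2020-06-01,A,0.15\n",
}
combined = combine_tables(tables)
for row in combined:
    print(",".join(row))
```

Raising an error on the first mismatch, rather than silently reordering columns, keeps the decision about how to reconcile the tables with the person who knows the data.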

  1. “Consistent formatting” also means:
    • Columns with the same name across tables have the same
      • unit
      • precision
      • date format
        • if one table has the ‘date’ column formatted as YYYY-MM-DD, then all should have this format
        • EDI recommends dates be in ISO 8601 format (as above), although other formats are allowed.
      • type (numeric or character)
        • e.g., do not enter a range of values (e.g., “< .02” for nitrate concentration) in a column that contains numeric values. Your data will be entered into a database, and databases reject tables with columns having mixed data types. See below under “missing values”.
  2. Be careful of character formatting (e.g., superscript) or symbols (e.g., degree, accent marks, smart quotes) within the data table. Even in fields typed as “character”, these may produce unintelligible characters during conversion or when files are emailed.
  3. Specify (in the metadata) the code you use for missing values in your tables. We recommend that missing fields (values) in data are NOT left blank. Instead, use a missing value code, which software can interpret before ingesting the data table. Multiple missing value codes are allowed in one column. You will need to specify a definition for each missing value code, e.g.,
    • “NA” = not collected
    • “trace” = trace amount (e.g., instead of “< .02” for a nitrate value)
    • “-99999” = not available (some researchers prefer to keep their missing values of the same type as the data)
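
The type, date format, and missing value rules above lend themselves to an automated check. The sketch below is illustrative only and assumes a hypothetical table layout (`date`, `site`, `nitrate_mg_l`) and a hypothetical helper named `find_problems`; the missing value codes are the three examples listed above. It flags any entry in a numeric column that is neither a number nor a declared missing value code, and any date that is not written as YYYY-MM-DD.

```python
import csv
import io
import re
from datetime import date

# Missing value codes taken from the examples above; adjust to your metadata.
MISSING_CODES = {"NA", "trace", "-99999"}


def is_number(value):
    """Return True if the string can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def is_iso_date(value):
    """Return True only for a valid calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def find_problems(text, numeric_columns, date_columns,
                  missing_codes=MISSING_CODES):
    """Return (file line number, column, value) for each entry that is
    neither valid for its column type nor a declared missing value code."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Line 1 of the file is the header, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        for column in numeric_columns:
            value = row[column]
            if value not in missing_codes and not is_number(value):
                problems.append((line_number, column, value))
        for column in date_columns:
            value = row[column]
            if value not in missing_codes and not is_iso_date(value):
                problems.append((line_number, column, value))
    return problems


# Hypothetical sample: line 3 has a non-ISO date and a "< .02" entry,
# line 5 leaves the nitrate value blank.
sample = (
    "date,site,nitrate_mg_l\n"
    "2020-06-01,A,0.12\n"
    "06/02/2020,A,< .02\n"
    "2020-06-03,A,trace\n"
    "2020-06-04,A,\n"
)
problems = find_problems(sample, ["nitrate_mg_l"], ["date"])
for line_number, column, value in problems:
    print(f"line {line_number}, column {column}: {value!r}")
```

Blank fields are reported because an empty string is neither a number nor one of the declared codes, which matches the recommendation not to leave missing values blank.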


  • See here for a video on “Creating clean data for archiving”.
  • See here for “Ecology Workshop” lessons by the Carpentries.
  • See here for programming with R and Python lessons by the Carpentries.
  • See here for a video on “How to clean and format data using R, OpenRefine, Excel”. Presentation slides are available on GitHub here.
  • Instructions for a data cleaning exercise can be found here.