Five phases of data publishing (2)


Phase 2: Format and QC data tables



Although we accept most file formats, we recommend tabular data (comma- or tab-delimited ASCII text) and geospatial data types. For multi-year observations, we strongly encourage you to compile your tabular data into a single comma- or tab-delimited file. If you have trouble with this, we are here to help. Compress your geospatial data files into one or more .zip archives.
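As an illustration of compiling multi-year tables into a single comma-delimited file, here is a minimal sketch in Python using only the standard library. The table contents, the column names (site, date, nitrate_mg_l), and the function name combine_tables are hypothetical; in practice the yearly tables would be read from your own files.

```python
import csv
import io

# Hypothetical yearly tables; in practice these would be read from files.
YEARLY_TABLES = {
    2019: "site,date,nitrate_mg_l\nA,2019-06-01,0.12\nB,2019-06-01,0.30\n",
    2020: "site,date,nitrate_mg_l\nA,2020-06-01,0.10\n",
}


def combine_tables(tables):
    """Return one comma-delimited text table built from several yearly tables."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for label in sorted(tables):
        rows = list(csv.reader(io.StringIO(tables[label])))
        if not rows:
            continue  # skip empty tables
        if header is None:
            header = rows[0]
            writer.writerow(header)  # write the header only once
        elif rows[0] != header:
            # Refuse to stack tables whose columns do not line up.
            raise ValueError(f"table {label!r} has a different header: {rows[0]}")
        writer.writerows(rows[1:])
    return out.getvalue()


combined = combine_tables(YEARLY_TABLES)
print(combined)
```

The sketch stops with an error rather than guessing when a header differs, since silently stacking mismatched columns would corrupt the combined table.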

Below are basic rules for scientists preparing data for archiving. Use consistent data organization. If you plan to submit your data to us manually in several tables (e.g., one per year), each table must have the same structure: the attributes must appear in the same order and have identical names in every table so that we can write code to process your data. Run quality control checks on your data to ensure they are ready for publication, and keep a record of these steps so that others know what has been done to the data.
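One way to check that several tables share the same structure before submitting them is to compare their column names against a reference table. The sketch below is only an illustration: the function name compare_headers, the years, and the column names are all assumptions, not part of any repository tool.

```python
def compare_headers(reference, other):
    """Describe how `other` differs from `reference` (both are lists of column names)."""
    problems = []
    missing = [name for name in reference if name not in other]
    extra = [name for name in other if name not in reference]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra and reference != other:
        problems.append("same columns, different order")
    return problems


# Hypothetical headers taken from four yearly tables.
headers_by_year = {
    2019: ["site", "date", "nitrate_mg_l"],
    2020: ["site", "date", "nitrate_mg_l"],
    2021: ["date", "site", "nitrate_mg_l"],
    2022: ["site", "date", "Nitrate"],
}

reference = headers_by_year[2019]
report = {}
for year, columns in headers_by_year.items():
    problems = compare_headers(reference, columns)
    if problems:
        report[year] = problems  # keep only the tables that need fixing

for year, problems in report.items():
    print(year, "; ".join(problems))
```

Saving the printed report alongside the data is one simple way to keep track of which checks were run.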

  “Consistent formatting” also means:
  1. Be careful with character formatting (e.g., superscript) and symbols (e.g., degree signs, accent marks, smart quotes) within the data table. Even in fields typed as “character”, these may turn into unintelligible characters during conversion or when emailed.
  2. Specify (in the metadata) the codes you use for missing values in your tables. We recommend that missing fields (values) in data are NOT left blank. Software interprets fields that contain a missing value code before ingesting the data table. More than one missing value code is allowed in a column, and you will need to specify a definition for each code, e.g.,
    • “NA” = not collected
    • “trace” = trace amount (e.g., instead of “< .02” for a nitrate value)
    • “-99999” = not available (some researchers prefer to keep their missing values of the same type as the data)
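The two points above can be checked with a short script before submission. The following is a minimal sketch, not a prescribed QC procedure: it tallies the missing value codes listed above, counts cells left blank, and flags cells containing non-ASCII characters such as degree signs or smart quotes. The function name qc_column and the sample nitrate values are hypothetical.

```python
# Missing value codes and their definitions, as they would be given in the metadata.
MISSING_CODES = {
    "NA": "not collected",
    "trace": "trace amount",
    "-99999": "not available",
}


def qc_column(values, missing_codes=MISSING_CODES):
    """Summarize one column: counts per missing code, blank cells, non-ASCII cells."""
    summary = {"missing": {}, "blank": 0, "non_ascii": []}
    for row_number, value in enumerate(values, start=1):
        cell = value.strip()
        if cell == "":
            summary["blank"] += 1  # blanks should be replaced by a defined code
        elif cell in missing_codes:
            summary["missing"][cell] = summary["missing"].get(cell, 0) + 1
        if not value.isascii():
            # Record the row so the symbol can be removed or spelled out.
            summary["non_ascii"].append((row_number, value))
    return summary


# Hypothetical nitrate column; the last value carries a stray degree sign.
nitrate = ["0.12", "NA", "trace", "", "-99999", "NA", "0.3\u00b0"]
result = qc_column(nitrate)
print(result)
```

Any blank cells or non-ASCII cells reported here would need to be fixed in the table, and every code that appears under "missing" needs a definition in the metadata.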



  • See here for a video on “Creating clean data for archiving”.
  • Publications on preparing data tables for archiving:
    • Cook et al. (2001) Best Practices for Preparing Ecological Data Sets to Share and Archive. Bulletin of the Ecological Society of America. Vol. 82, No. 2 (Apr., 2001), pp. 138-141.
    • Campbell, J.L., Rustad, L.E., Porter, J.H., Taylor, J.R., Dereszynski, E.W., Shanley, J.B., Gries, C., Henshaw, D.L., Martin, M.E., Sheldon, W.M. and Boose, E.R., 2013. Quantity is nothing without quality: Automated QA/QC for streaming environmental sensor data. BioScience, 63(7), pp.574-585.
    • Karl W. Broman & Kara H. Woo (2018) Data Organization in Spreadsheets. The American Statistician, 72:1, 2-10, DOI: 10.1080/00031305.2017.1375989.
  • See here for “Ecology Workshop” lessons by the Carpentries.
  • See here for lessons on programming with R and Python by the Carpentries.
  • See here for a video on “How to clean and format data using R, OpenRefine, Excel”. Presentation slides are available on GitHub here.
  • Instructions for the data cleaning exercise can be found here.