Five phases of data publishing (1) – draft

 
OVERVIEW 1. ORGANIZE 2. CLEAN 3. DESCRIBE 4. UPLOAD 5. CITE

Phase 1: Organize data and metadata

When publishing data in the EDI data repository, you will be producing one or more data packages, i.e. the published unit of data and metadata (information about the data).

  • A data package will be assigned a unique digital object identifier (DOI), which is used to permanently identify the data package and link to it on the web.

We recommend archiving entities using standard file formats that are likely to remain machine-readable in the future. Exceptions may exist where the community standard for processing particular data types relies on specialized file formats (binary, closed specification, etc.) or proprietary software. In these cases, it may be appropriate to archive the specialized file types and/or a copy that has been parsed into a format (e.g. ASCII) that does not require proprietary software.

  • Well-organized data in standard file formats simplify potential future use and synthesis.
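As a minimal sketch of archiving a parsed copy, the following Python snippet converts a hypothetical binary logger file into plain-text CSV that could be archived alongside the original. The record layout (a little-endian 4-byte unsigned integer timestamp followed by an 8-byte float reading), the column names, and the function name are all assumptions made for illustration, not EDI requirements.

```python
import csv
import io
import struct

# Hypothetical record layout: little-endian uint32 timestamp + float64 reading.
RECORD = struct.Struct("<Id")


def binary_to_csv(raw: bytes) -> str:
    """Parse fixed-width binary records and return them as CSV text."""
    if len(raw) % RECORD.size != 0:
        raise ValueError("input length is not a whole number of records")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp_s", "reading"])  # illustrative column names
    for timestamp, reading in RECORD.iter_unpack(raw):
        writer.writerow([timestamp, reading])
    return out.getvalue()


# Two example records packed in the assumed layout.
sample = RECORD.pack(1700000000, 12.5) + RECORD.pack(1700000600, 13.0)
csv_text = binary_to_csv(sample)
```

A real instrument format would need its own parser. The point is only that the archived copy is readable without specialized software.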

Guidelines for organizing data into data packages

Coherence

If several data units are closely related, they are best packaged together with one metadata file. Many environmental data are arranged in tables. A primary table of observations could be accompanied by a table of sampling sites and their characteristics, or by a table of taxonomic information. Data entities that share high-level metadata such as methods, sites and people can be efficiently grouped together in one data package.

  • Package together if methods and observations are similar and related. 

On the other hand, if a large sampling campaign measures many parameters, uses varying methods, or produces independent groups of data, it is often best to break the data into several packages, each accompanied by some shared metadata and some unique metadata. This is especially true of ongoing campaigns, because packages composed of discrete units can be managed or updated independently.
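The grouping rule described in this section can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the entity names, the metadata values, and the idea of comparing methods, sites and people as exact strings are all hypothetical. In practice, deciding whether data are "similar and related" is a judgment call.

```python
from collections import defaultdict

# Hypothetical entities, each described by its high-level metadata.
entities = [
    {"name": "bird_counts.csv", "methods": "point count",
     "sites": "north plots", "people": "survey team"},
    {"name": "bird_sites.csv", "methods": "point count",
     "sites": "north plots", "people": "survey team"},
    {"name": "soil_moisture.csv", "methods": "TDR probe",
     "sites": "north plots", "people": "soil lab"},
]


def group_into_packages(items):
    """Group entity names that share methods, sites and people."""
    groups = defaultdict(list)
    for item in items:
        key = (item["methods"], item["sites"], item["people"])
        groups[key].append(item["name"])
    return list(groups.values())


packages = group_into_packages(entities)
```

Here the two bird tables share all high-level metadata and land in one package, while the soil table, which has different methods and people, gets its own.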

Considerations for specific types of data and additional information/material that may belong in data packages

Detailed guidelines that ensure optimal re-usability are available for certain types of data, based on their format or acquisition method:

  • Code
  • Model-Based Datasets
  • Images and Documents as Data
  • Data in Other Repositories
  • Spatial Data
  • Data Gathered with Small Moving Platforms
  • Large Data Sets

The guidelines include recommendations that are aimed at improving documentation of data acquisition and processing to avoid misinterpretation:

Gries, C., S. Beaulieu, R.F. Brown, S. Elmendorf, H. Garritt, G. Gastil-Buhl, H. Hsieh, L. Kui, M. Martin, G. Maurer, A.T. Nguyen, J.H. Porter, A. Sapp, M. Servilla, and T.L. Whiteaker. 2021. Data Package Design for Special Cases ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/9d4c803578c3fbcb45fc23f13124d052.

Collection status

Is data collection finished, or will more data be added? Continuous instrument data is only one kind of “ongoing” data. Human-observed data (such as ecological or biogeochemical survey data) may also receive updates, albeit less regularly.

Data collection is finished

The data package(s) may be created to accompany a published paper or a student’s thesis. These datasets are not expected to have data values added later and will be archived only once. Later enhancements of the metadata are possible.

Data collection is ongoing

Many data collection projects are ongoing, so data additions are expected. Field stations may have an ongoing meteorological station or a regularly collected organism survey (e.g., for birds). There are several options for packaging ongoing data:

  • Continuous: The submitter expects to revisit the same data package, adding new data in the future and updating metadata. This is a good option for tabular data that are grouped into a single unit (table) to which the new data will be added. This will create a revision of the original data package, with a new DOI. All previous revisions of a data package are accessible and immutable.
  • Non-continuous: A new package is created for each logical unit (e.g., a summer sampling season), regardless of similarities or differences in methods.
  • Hybrid: A new entity is created for each logical unit (e.g., year) but the entity is added to an existing package with shared resource-level metadata. This will create a revision of the original data package, with a new DOI. All previous revisions of a data package are accessible and immutable.
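The three options above differ in what changes when new data arrive. The following Python sketch models that difference. It is an illustration only: the `Package` class, the function names, and the example identifiers are hypothetical and do not represent the repository's API, and DOI assignment is not modeled.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Package:
    """Illustrative stand-in for one revision of a data package."""
    identifier: str  # hypothetical package name
    revision: int = 1
    entities: dict = field(default_factory=dict)  # entity name -> list of rows


def continuous(pkg, entity, new_rows):
    """Append rows to an existing entity, producing a new revision."""
    entities = dict(pkg.entities)
    entities[entity] = pkg.entities[entity] + list(new_rows)
    return replace(pkg, revision=pkg.revision + 1, entities=entities)


def hybrid(pkg, new_entity, rows):
    """Add a new entity to an existing package, producing a new revision."""
    if new_entity in pkg.entities:
        raise ValueError(f"entity {new_entity!r} already exists")
    entities = dict(pkg.entities)
    entities[new_entity] = list(rows)
    return replace(pkg, revision=pkg.revision + 1, entities=entities)


def non_continuous(identifier, entity, rows):
    """Create a separate package for one logical unit."""
    return Package(identifier, 1, {entity: list(rows)})


v1 = Package("bird_survey", 1, {"counts": [("2022", 14)]})
v2 = continuous(v1, "counts", [("2023", 17)])
v3 = hybrid(v2, "counts_2024", [("2024", 11)])
separate = non_continuous("bird_survey_summer_2024", "counts", [("2024", 11)])
```

In this model, `continuous` and `hybrid` return a new revision and leave the earlier one untouched, mirroring the statement that previous revisions remain accessible and immutable. `non_continuous` starts a fresh package at revision 1.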

Guidelines are available for adjusting metadata for ongoing datasets and for addressing problems.