Phase 1: Organize data and metadata
When publishing data in the EDI data repository, you will be producing one or more data packages, i.e. the published unit of data and metadata (information about the data).
- A data package will be assigned a unique digital object identifier (DOI), which is used to permanently identify the data package and link to it on the web.
We recommend archiving entities in standard file formats that are likely to remain machine-readable in the future. Exceptions may exist where the community standard for processing particular data types relies on specialized file formats (binary, closed specification, etc.) or proprietary software. In these cases, it may be appropriate to archive the specialized file types and/or a copy that has been parsed into a format (e.g., ASCII) that does not require proprietary software.
- Well organized data in standard file formats simplify potential future use and synthesis.
Guidelines for organizing data into data packages
If several data units are closely related, they are best packaged together with one metadata file. Many environmental data are arranged in tables. A primary table of observations could be accompanied by a table of sampling sites and their characteristics, or by a table of taxonomic information. Data entities that share high-level metadata such as methods, sites, and people can be efficiently grouped together in one data package.
- Package together if methods and observations are similar and related.
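As an illustration of why closely related tables belong in one package, the following minimal sketch joins a primary observation table to its accompanying site table. The table contents, column names (`site_id`, `water_temp_c`, and so on), and the helper functions are hypothetical and chosen only for this example; they are not part of any EDI tool.

```python
import csv
import io

# Hypothetical example tables; all names and values are illustrative assumptions.
OBSERVATIONS_CSV = """site_id,date,water_temp_c
A,2021-06-01,14.2
B,2021-06-01,15.8
A,2021-06-02,14.6
"""

SITES_CSV = """site_id,site_name,latitude,longitude
A,North Inlet,43.10,-89.40
B,South Bay,43.05,-89.42
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_site(observations, sites):
    """Attach site characteristics to each observation via the shared site_id."""
    site_lookup = {row["site_id"]: row for row in sites}
    joined = []
    for obs in observations:
        # A KeyError here means an observation refers to an undocumented site.
        site = site_lookup[obs["site_id"]]
        joined.append({**obs, **site})
    return joined


joined_rows = join_on_site(read_rows(OBSERVATIONS_CSV), read_rows(SITES_CSV))
for row in joined_rows:
    print(row["date"], row["site_name"], row["water_temp_c"])
```

Because both tables travel in the same package, a data user can rely on the shared key column being documented once in a single metadata file.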
On the other hand, if a large sampling campaign measures many parameters, uses varying methods, or produces independent groups of data, it is often best to break the data into several packages, each accompanied by some shared metadata and some unique metadata. This is especially true of ongoing campaigns, because packages composed of discrete units can be managed or updated independently.
Considerations for specific types of data and additional information/material that may belong in data packages
Detailed guidelines aimed at improving the documentation of data acquisition and processing are available for certain types of data, based on their format or acquisition method:
- Model-Based Datasets
- Images and Documents as Data
- Data in Other Repositories
- Spatial Data
- Data Gathered with Small Moving Platforms
- Large Data Sets
The guidelines help to avoid misinterpretation of the data and ensure optimal reusability. They are published in the EDI data repository and can be downloaded as a PDF file:
Gries, C., S. Beaulieu, R.F. Brown, S. Elmendorf, H. Garritt, G. Gastil-Buhl, H. Hsieh, L. Kui, M. Martin, G. Maurer, A.T. Nguyen, J.H. Porter, A. Sapp, M. Servilla, and T.L. Whiteaker. 2021. Data Package Design for Special Cases ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/9d4c803578c3fbcb45fc23f13124d052.
Is data collection finished, or will more data be added? Continuous instrument data are only one kind of “ongoing” data. Human-observed data (such as ecological or biogeochemical survey data) may also be updated, albeit less regularly.
Data collection is finished
The data package(s) may be created to accompany a published paper or a student’s thesis. These datasets are not expected to have data values added later and will be archived only once. Later enhancements of the metadata are possible.
Data collection is ongoing
Many data collection projects are ongoing, so data additions are expected. For example, a field station may operate a permanent meteorological station or conduct a regularly collected organism survey (e.g., for birds).
Options for data packaging arrangements for ongoing data:
Continuous: The submitter expects to revisit the same data package, adding new data in the future and updating metadata. This is a good option for tabular data that are grouped into a single unit (table) to which the new data will be added. Each update creates a revision of the original data package, with a new DOI. All previous revisions of a data package remain accessible and immutable.
- Disadvantage: More work for the data creators whenever something changes, because they must ‘pre-integrate’ the data themselves.
- Examples: knb-lter-mcr.7, knb-lter-bnz.212
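The ‘pre-integration’ step of the continuous option can be sketched as follows: new rows are appended to the existing table only if their columns match, so that the table still fits the structure described in the metadata. This is a minimal sketch under stated assumptions; the function `append_rows`, the column names, and the values are hypothetical, and uploading the resulting revision to the repository is not shown.

```python
import csv
import io


def append_rows(existing_csv, new_csv):
    """Return CSV text holding the existing rows followed by the new rows.

    Raises ValueError if the column headers differ, since appended data
    must fit the table structure already described in the metadata.
    """
    existing = csv.DictReader(io.StringIO(existing_csv))
    new = csv.DictReader(io.StringIO(new_csv))
    if existing.fieldnames != new.fieldnames:
        raise ValueError("column headers differ between existing and new data")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=existing.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(existing)
    writer.writerows(new)
    return out.getvalue()


# Hypothetical meteorological table and a later addition to it.
PREVIOUS_TABLE = "date,air_temp_c\n2020-01-01,-3.5\n2020-01-02,-1.0\n"
NEW_DATA = "date,air_temp_c\n2020-01-03,0.4\n"

updated_table = append_rows(PREVIOUS_TABLE, NEW_DATA)
print(updated_table)
```

Checking the headers before appending is one simple way to catch a changed column layout early, which is exactly the kind of change that creates extra work under this option.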
Non-continuous: A new package is created for each logical unit (e.g., a summer sampling season), regardless of similarities or differences in methods.
- Disadvantage: User must find, download and integrate many data packages to create a time series.
- Example: PISCO instrument data (see DataONE.org)
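The integration burden that the non-continuous option places on data users might look like the sketch below, which combines tables from several packages into one time series. It assumes that every table shares the same columns and uses ISO 8601 dates; the function `integrate_tables`, the column names, and the values are hypothetical, and downloading the packages is not shown.

```python
import csv
import io


def integrate_tables(csv_texts, date_column="date"):
    """Combine rows from several CSV tables and sort them by a date column."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(rows, key=lambda row: row[date_column])


# Hypothetical tables, one per summer sampling season (one per data package).
SUMMER_2020 = "date,bird_count\n2020-06-15,12\n2020-07-15,9\n"
SUMMER_2019 = "date,bird_count\n2019-06-15,7\n2019-07-15,11\n"

series = integrate_tables([SUMMER_2020, SUMMER_2019])
for row in series:
    print(row["date"], row["bird_count"])
```

In practice the tables may differ in columns or units between seasons, in which case the user must harmonize them before combining, which adds to the effort noted above.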
Hybrid: A new entity is created for each logical unit (e.g., year), but the entity is added to an existing package with shared resource-level metadata. Each addition creates a revision of the original data package, with a new DOI. All previous revisions of a data package remain accessible and immutable.