Dataset Design

A major EDI goal is to develop recommendations for dataset design. We, like other repositories, hold thousands of diverse datasets of primary observations from research studies, and these are often not easily reused in synthesis or meta-analyses. EDI addresses this need for datasets associated with ecosystem science by considering the physical formats, data models, metadata content, and data manipulation scripts that provide maximum flexibility for reuse.

We know from experience that original, primary research datasets cannot be easily combined or synthesized until all of the data are fully understood and converted to a similar format. There are two approaches to achieving this regularity: (a) prescribe the format before data collection starts, or (b) convert primary data into a flexible standard format for reuse. Prescribed formats are impossible to impose on research studies, so we take the second approach: define a flexible intermediate format and convert primary data to that model. Our thematic standardization workshops define those formats, describe their workflows, and publish the models so that data providers are aware of the most important dataset features.
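As a rough illustration of the second approach, the sketch below converts two differently shaped primary tables into one long, one-row-per-observation layout so they can be combined. It is only a sketch under stated assumptions: the field names (`site`, `date`, `source`, `taxon`, `count`), the function `wide_to_long`, and all values are hypothetical and are not the intermediate model defined by the workshops.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical identifier columns, chosen for illustration only.
ID_FIELDS = ("site", "date")


def wide_to_long(
    rows: Iterable[Dict],
    source: str,
    id_fields: Sequence[str] = ID_FIELDS,
) -> List[Dict]:
    """Turn one-column-per-taxon records into one-row-per-observation records."""
    long_rows = []
    for row in rows:
        ids = {field: row[field] for field in id_fields}
        for column, value in row.items():
            # Skip identifier columns and missing counts.
            if column in id_fields or value in (None, ""):
                continue
            long_rows.append(
                {**ids, "source": source, "taxon": column, "count": int(value)}
            )
    return long_rows


# Two primary datasets with different taxon columns (made-up values).
study_a = [
    {"site": "A1", "date": "2017-06-06",
     "Bouteloua gracilis": 12, "Larrea tridentata": 3},
]
study_b = [
    {"site": "B7", "date": "2017-06-07",
     "Larrea tridentata": 5, "Prosopis glandulosa": ""},
]

# Once both are in the same long layout, combining them is a concatenation.
combined = wide_to_long(study_a, "study_a") + wide_to_long(study_b, "study_b")
for record in combined:
    print(record)
```

The long layout is used here because adding a taxon adds rows rather than columns, so tables from studies that recorded different taxa share the same set of fields.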

Our approach is thematic. We began with long-term studies of community composition and biodiversity, which are currently in high demand by synthesis projects. EDI will extend this process to other thematic areas.


Figure: Abstract view of dataset levels. A flexible intermediate (L1, middle) lies between datasets of primary observations (L0, left) and the aggregated views used by synthesis projects.

Workshops in this series:

Dataset design for community survey data (6-8 June 2017, UNM Albuquerque).