Thematic Standardization

EDI and other repositories hold thousands of diverse primary observations from research studies in the ecological sciences. A major EDI goal is to develop recommendations for thematic standardization, i.e., converting “raw data” packages that contain the same variables, but in various data formats and with different vocabularies, into a common data model. To derive a common data model, EDI is considering physical formats, data models, metadata content, and data manipulation scripts that provide maximum data flexibility. Data packages harmonized in this way can be reused more easily in synthesis studies or meta-analyses.

We know from experience that original, primary research data sets cannot be easily combined or synthesized until all data are completely understood and converted to a similar format. There are two approaches to achieving this data regularity: a) prescribe the format before data collection starts, or b) convert primary data into a flexible standard format for reuse. Prescribed formats are impossible to impose on research studies, so we take the second approach: define a flexible intermediate format and convert primary data to that model. Our thematic standardization workshops define those formats, describe their workflows, and publish the models so that data providers are aware of the most important data package features.

Figure 1 shows the general workflow for harmonizing data packages. Archived raw data (Level 0, L0) are converted to a common harmonized data model (Level 1, L1). The L1 data allow for straightforward data discovery and conversion into derived data products (Level 2, L2).
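
The L0 to L1 to L2 flow can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two raw tables, their column names, the L1 column set (`source`, `location`, `year`, `taxon`, `value`), and the helper functions are hypothetical and do not represent an EDI data model or tool. The sketch only shows the idea of mapping differently shaped raw tables onto one shared layout and then deriving a product from it.

```python
# Hypothetical L0 "raw" tables: same variables, different layouts and vocabularies.
raw_a = [  # long layout: one row per taxon observation
    {"site": "A1", "yr": 2019, "species": "Quercus alba", "count": 4},
    {"site": "A1", "yr": 2019, "species": "Acer rubrum", "count": 2},
]
raw_b = [  # wide layout: one column per taxon
    {"Plot": "B7", "Year": 2019, "Quercus alba": 1, "Pinus strobus": 3},
]


def harmonize_long(rows, site_key, year_key, taxon_key, value_key, source):
    """Rename the columns of a long-format L0 table to the shared L1 columns."""
    return [
        {
            "source": source,
            "location": r[site_key],
            "year": r[year_key],
            "taxon": r[taxon_key],
            "value": r[value_key],
        }
        for r in rows
    ]


def harmonize_wide(rows, site_key, year_key, source):
    """Unpivot a wide-format L0 table (one column per taxon) into L1 rows."""
    out = []
    for r in rows:
        for column, value in r.items():
            if column in (site_key, year_key):
                continue  # identifier columns, not taxa
            out.append(
                {
                    "source": source,
                    "location": r[site_key],
                    "year": r[year_key],
                    "taxon": column,
                    "value": value,
                }
            )
    return out


def richness(l1_rows):
    """Example L2 product: distinct taxa with value > 0 per (location, year)."""
    taxa = {}
    for r in l1_rows:
        if r["value"] > 0:
            taxa.setdefault((r["location"], r["year"]), set()).add(r["taxon"])
    return {key: len(names) for key, names in taxa.items()}


# L1: both packages now share one layout and can simply be concatenated.
l1 = harmonize_long(raw_a, "site", "yr", "species", "count", "pkg_a")
l1 += harmonize_wide(raw_b, "Plot", "Year", "pkg_b")

# L2: a derived product computed once over the combined L1 rows.
l2 = richness(l1)
```

Because the per-package differences are handled entirely in the two conversion helpers, the derived product is written once against the shared layout, which is the practical benefit of the intermediate level.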

Our approach is thematic. We first addressed long-term studies of community composition and biodiversity, which are currently in high demand by synthesis projects. EDI is in the process of extending this approach to meteorological and hydrological data.

Figure 1: General workflow for data package harmonization from raw data in various formats (Level 0) into a common data model (Level 1) and use of harmonized data packages in derived Level 2 data products.