Data Package Design for Community Survey Data: ecocomDP

We started our work on the data package design for community survey data at a workshop in Albuquerque, NM (6-8 June 2017).

The goals of the workshop were to examine the needs of synthesis scientists who use community survey data and to determine the features of a flexible intermediate format for these data. When input data share a similar format, aggregation and reuse can be greatly accelerated. At the workshop, we began by examining data commonly used in population and community surveys.

Work accomplished

We have a full description of our candidate model for the flexible intermediate format for ecological community data. Our short name for the model is “ecocomDP”, for “ecological community data design pattern”. The workflow developed here also serves as a general description of the process EDI is developing, which is described at our Data package design page. A GitHub repository has been set up to hold the ecocomDP R code, other material (‘man’, ‘vignettes’, etc.), and HTML documentation of the ecocomDP model itself: https://github.com/EDIorg/ecocomDP.

Below is a graphic showing all eight data objects in the model and their relationships. Optional tables are greyed out. Four tables are required: three primary tables (“observation”, “sampling_location”, and “taxon”) must be populated from primary data, and the fourth, “dataset_summary”, is created and populated from the “observation” table by code. Each of the three primary tables is linked to an optional table for ancillary information.
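As a rough illustration of how a “dataset_summary” could be derived from the “observation” table by code, here is a minimal sketch. It is written in Python for brevity and is not the ecocomDP R code. All column names and summary fields shown (`taxon_id`, `date`, `number_of_years_sampled`, and so on) are assumptions made for the example, not the model's actual definitions.

```python
from datetime import date

# Hypothetical observation rows; column names are illustrative assumptions.
observations = [
    {"observation_id": "o1", "taxon_id": "t1", "date": date(2010, 6, 1)},
    {"observation_id": "o2", "taxon_id": "t2", "date": date(2010, 6, 1)},
    {"observation_id": "o3", "taxon_id": "t1", "date": date(2012, 7, 15)},
]


def summarize(rows):
    """Derive dataset-level summary values from observation rows."""
    years = sorted({row["date"].year for row in rows})
    taxa = {row["taxon_id"] for row in rows}
    return {
        "first_year": years[0],
        "last_year": years[-1],
        "number_of_years_sampled": len(years),
        "number_of_taxa": len(taxa),
    }


# The summary is computed entirely from the observations, never hand-entered.
dataset_summary = summarize(observations)
```

Because the summary is computed rather than typed in, it can be regenerated whenever the observation table changes.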

The nesting of sampling locations (e.g., transects within areas) is accomplished by using a self-referencing table, in which a location may have a ‘parent’ which is itself a sampling location in the same table. This mechanism allows observations to be associated with a location at any level, and observations can be aggregated under groups of locations.
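The self-referencing pattern can be sketched as follows. This is a minimal Python illustration, not the ecocomDP implementation. The column name `parent_id`, the location identifiers, and the observation values are all hypothetical, and the sketch assumes the parent links contain no cycles.

```python
# Hypothetical location rows; "parent_id" is an assumed name for the
# self-reference. A parent of None marks a top-level location.
locations = {
    "area_1": {"name": "Area 1", "parent_id": None},
    "transect_a": {"name": "Transect A", "parent_id": "area_1"},
    "transect_b": {"name": "Transect B", "parent_id": "area_1"},
    "plot_a1": {"name": "Plot A1", "parent_id": "transect_a"},
}

# Observations may point at a location of any level.
observations = [
    {"location_id": "plot_a1", "value": 4},
    {"location_id": "transect_b", "value": 7},
    {"location_id": "area_1", "value": 1},
]


def ancestors(location_id):
    """Return the location itself followed by each parent up to the root."""
    chain = []
    current = location_id
    while current is not None:
        chain.append(current)
        current = locations[current]["parent_id"]
    return chain


def total_under(group_id):
    """Sum values recorded at group_id or at any location nested within it."""
    return sum(
        obs["value"]
        for obs in observations
        if group_id in ancestors(obs["location_id"])
    )
```

Calling `total_under("area_1")` aggregates across both transects and the plot, while `total_under("transect_a")` covers only that transect and its plot.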

An optional “variable_mapping” table holds variable names (string content) with their mappings to URIs in external vocabularies. URIs are essential for retrieving a full description of a variable. We have included a column, mapped_system, for the name of the mapped system itself. This may be redundant when a URI is present, but it keeps the table pattern consistent with the taxon table, where URIs are rare and the name of the mapped system is therefore essential.
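To make the role of mapped_system concrete, here is a small Python sketch with one variable mapping and one taxon mapping. Apart from `mapped_system`, every column name, vocabulary name, identifier, and URL below is a made-up placeholder.

```python
from urllib.parse import urlparse

# Hypothetical variable_mapping row: the mapped identifier is a URI.
variable_row = {
    "variable_name": "abundance",
    "mapped_system": "Example Vocabulary",
    "mapped_id": "http://example.org/vocab/abundance",
}

# Hypothetical taxon row: the identifier is only meaningful together
# with the name of the system that issued it.
taxon_row = {
    "taxon_name": "Example species",
    "mapped_system": "Example Authority",
    "mapped_id": "12345",
}


def describe(row):
    """Return the URI if mapped_id is one, otherwise 'system:id'."""
    parsed = urlparse(row["mapped_id"])
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return row["mapped_id"]
    return f'{row["mapped_system"]}:{row["mapped_id"]}'
```

Both rows share the same column pattern, so one function can handle either table. The system name only becomes necessary when the identifier is not a self-describing URI.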

Figure: Data objects in the ecocomDP model and their relationships.

A general description of the dataset “Levels” used in these workflows, with examples, can be found at our Data package design overview page. In short, Level 0 (L0) is incoming, or raw, data. Level 1 (L1) is the same data transformed to the ecocomDP model. Level 2 (L2) is data that has been further transformed or aggregated as needed by synthesis working groups. With a common, flexible intermediate (L1), the transforms for L2 can be greatly streamlined, and even facilitated by EDI.


L0 to L1 Use cases

  1. Data destined for Darwin Core Archives (as L2)
  2. Data used by two LTER working groups, “Metacommunities” and “Synchrony”
  3. NEON macroinvertebrate data

L0 to L1 transformation tools

  1. Cleaning and quality control functions for data objects (in R)
  2. Taxon functions with R-taxize, both for metadata and data values
  3. Templated EML metadata for archiving
  4. We are considering L0-to-L1 translator functions using a template:
    • Template that holds “mappings” from the L0 data and metadata to L1 tables and columns
    • Standardized metadata descriptions for the L1 data objects in ecocomDP
    • R-code to ingest and build L1
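The template idea in item 4 could look roughly like the sketch below, shown in Python rather than R and not taken from any EDI code. The L0 column names, the L1 column names, and the shape of the template are all assumptions made for illustration.

```python
# Hypothetical L0 (raw) rows as they might arrive from a source dataset.
l0_rows = [
    {"site": "A", "species": "sp1", "count": "3", "sample_date": "2015-05-01"},
    {"site": "B", "species": "sp2", "count": "5", "sample_date": "2015-05-02"},
]

# The template: L0 column -> (L1 table, L1 column). Names are assumptions.
template = {
    "site": ("observation", "location_id"),
    "species": ("observation", "taxon_id"),
    "count": ("observation", "value"),
    "sample_date": ("observation", "observation_datetime"),
}


def build_l1(rows, mapping):
    """Apply the template to each L0 row and collect records per L1 table."""
    tables = {}
    for row in rows:
        per_table = {}
        for l0_col, (table, l1_col) in mapping.items():
            per_table.setdefault(table, {})[l1_col] = row[l0_col]
        for table, record in per_table.items():
            tables.setdefault(table, []).append(record)
    return tables


l1 = build_l1(l0_rows, template)
```

Keeping the mapping in data rather than in code means a new L0 dataset would need only a new template, while the ingest function stays unchanged.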

Access to NEON data
NEON has created an export view of their macroinvertebrate data in the ecocomDP format (also available in our Git repository).

L0 to L1 Processing queue
EDI has assembled a list of candidate datasets for conversion. Work began during the June workshop and is ongoing.
https://github.com/EDIorg/ecocomDP/tree/master/documentation/processing_queue


L1 discovery and further use
EDI has created R code to discover, query, and assemble ecocomDP tables into an aggregate. To use it, clone the ecocomDP Git repository (https://github.com/EDIorg/ecocomDP).
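As a sketch of what assembling the tables into an aggregate can mean, the Python example below joins the three primary tables into flat records. It does not reproduce the EDI R functions. The key and attribute column names (`location_id`, `taxon_id`, `location_name`, `taxon_name`) and all values are assumptions.

```python
# Hypothetical L1 tables; column names and values are illustrative only.
observation = [
    {"observation_id": "o1", "location_id": "loc1", "taxon_id": "t1", "value": 3},
    {"observation_id": "o2", "location_id": "loc2", "taxon_id": "t2", "value": 5},
]
sampling_location = [
    {"location_id": "loc1", "location_name": "Site A"},
    {"location_id": "loc2", "location_name": "Site B"},
]
taxon = [
    {"taxon_id": "t1", "taxon_name": "Species one"},
    {"taxon_id": "t2", "taxon_name": "Species two"},
]


def index_by(rows, key):
    """Build a lookup from a key column to its row."""
    return {row[key]: row for row in rows}


def assemble(observation, sampling_location, taxon):
    """Join each observation with its location and taxon attributes."""
    loc = index_by(sampling_location, "location_id")
    tax = index_by(taxon, "taxon_id")
    return [
        {**obs, **loc[obs["location_id"]], **tax[obs["taxon_id"]]}
        for obs in observation
    ]


flat = assemble(observation, sampling_location, taxon)
```

Because every L1 package follows the same table pattern, the same join would apply to any converted dataset, which is what makes aggregation across packages practical.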


Work in progress
Many important L0 data are ongoing, research-grade time series. To track their updates, EDI is creating “event subscriptions” for the repository and will rerun formatting code to update the derived L1 packages. This is a potentially transformative activity: access to trusted, up-to-date, research-grade data sources is highly desired by synthesis scientists and by policy and decision makers, yet seldom realized. The ability to rerun automated workflows will go a long way toward meeting this need.


Shared findings
With LTER sites and other L0 producers: We gather feedback as we develop tools for this data model. As the tools mature, the dialogue will expand to LTER sites and provide feedback to L0 data creators on the essential data components.

With the LTER NCO: The LTER Network Communications Office (NCO) manages several synthesis working groups, and has interns to assist scientists with data cleansing for their synthesis needs. Their staff are aware of this project, so we will engage them through their scientific programming staff.

Webinar/Video discussion: EDI hosts a series of video teleconferences (VTCs), and we use that venue for updates and feedback on the ecocomDP model and tools. See the EDI Events area.