EDI will use a regular blog to keep you informed of progress on the work we started at the Albuquerque workshop (6-8 June 2017), titled “Dataset design for community survey data”. This is the first installment of that blog, and it summarizes our activities since the workshop.
The goals of the workshop were to examine the needs of synthesis scientists using community survey data, and to determine the features of a flexible intermediate for these data. When input data are in a similar format, aggregation and reuse can be greatly accelerated. It is EDI’s goal to facilitate the use of common, flexible intermediate models. In this workshop, we began by examining data commonly used in population and community surveys.
We have a full description of our candidate model for the flexible intermediate for ecological community data. Our short name for the model is “ecocomDP”, for “ecological community data design pattern”. The workflow developed at our June workshop also served as the basis for a general description of the process EDI is developing, described here: Dataset design page.
A GitHub repository has been set up to contain the ecocomDP R code and other material (‘man’, ‘vignettes’, etc.) and HTML documentation of the ecocomDP model itself: https://github.com/EDIorg/ecocomDP. Below is a graphic showing all seven data objects in the model and their relationships. Of the seven, three are required (“observation”, “sampling_location”, “taxon”). The “dataset_summary” is populated from the “observation” table by code. The “sampling_location” and “taxon” tables are each linked to an optional table for ancillary information.
The nesting of sampling locations (e.g., transects within areas) is accomplished by using a self-referencing table, in which a location may have a ‘parent’ which is itself a sampling location in the same table. This mechanism allows observations to be associated with a location at any level, and observations can be aggregated under groups of locations.
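The self-referencing idea can be illustrated with a minimal sketch. It is written in Python purely for illustration (the ecocomDP tooling itself is R code), and the column names (`location_id`, `parent_location_id`, `taxon_id`, `value`) and the example rows are assumptions made for this sketch, not the actual ecocomDP schema.

```python
# Hypothetical sampling_location table: each row may point to a parent
# row in the same table. Column names are assumed for illustration only.
sampling_location = [
    {"location_id": "area1", "parent_location_id": None},
    {"location_id": "transect1", "parent_location_id": "area1"},
    {"location_id": "transect2", "parent_location_id": "area1"},
    {"location_id": "plot1", "parent_location_id": "transect1"},
]

# Hypothetical observations, attached to locations at different levels.
observation = [
    {"location_id": "plot1", "taxon_id": "sp1", "value": 3},
    {"location_id": "transect2", "taxon_id": "sp1", "value": 5},
    {"location_id": "area1", "taxon_id": "sp2", "value": 1},
]

parent_of = {row["location_id"]: row["parent_location_id"]
             for row in sampling_location}


def ancestors(location_id):
    """Return the location itself, then each parent up to the top level."""
    chain = []
    seen = set()
    current = location_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in location hierarchy at {current!r}")
        seen.add(current)
        chain.append(current)
        current = parent_of[current]
    return chain


def total_under(location_id, taxon_id):
    """Sum values for one taxon observed at a location or anywhere nested in it."""
    return sum(
        obs["value"]
        for obs in observation
        if obs["taxon_id"] == taxon_id
        and location_id in ancestors(obs["location_id"])
    )
```

Because every location carries at most one parent reference, an observation recorded at a plot is automatically counted when aggregating at its transect or area, without storing the hierarchy in separate columns for each level.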
Work in progress
A general description of the dataset “Levels” we use in these workflows is here: Dataset design page. In short, Level 0 (L0) is incoming, or raw, data. Level 1 (L1) is the same data transformed to the ecocomDP model. Level 2 (L2) is data that has been further transformed or aggregated as needed by synthesis working groups. With a common, flexible intermediate (L1), the transforms for L2 can be greatly streamlined, and even facilitated by EDI.
L0 to L1 transformations:
We are currently working on L0-to-L1 translator functions using a template. There are two goals:
- Create ecocomDP data objects in R (up to 7)
- Build EML metadata so that the ecocomDP data objects can be archived.
The components we are developing to meet these goals are:
- Template that holds “mappings” from the L0 data and metadata to L1 tables and columns
- Standardized metadata descriptions for the L1 data objects in ecocomDP
- R code to ingest L0 data and build L1
- Taxon functions using the R package taxize, both for metadata and data values
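To make the idea of a mapping template more concrete, here is a minimal sketch of how such a template might drive an L0-to-L1 conversion. It is written in Python for illustration only (the project's translator functions are R code and are still in development). The template keys, the wide-format L0 layout, the generated identifiers, and all column names are hypothetical.

```python
# Hypothetical L0 rows: raw, wide-format data with one column per taxon.
l0_rows = [
    {"site": "A", "date": "2017-06-06", "sparrow": 4, "finch": 0},
    {"site": "B", "date": "2017-06-07", "sparrow": 1, "finch": 2},
]

# Hypothetical mapping template: says which L0 columns feed which L1 fields.
mapping = {
    "location_column": "site",
    "datetime_column": "date",
    "taxon_columns": ["sparrow", "finch"],
    "variable_name": "count",
}


def l0_to_l1(rows, mapping):
    """Reshape wide L0 rows into three linked long-format tables."""
    observation = []
    locations = {}  # location name -> generated location_id
    taxa = {}       # taxon name -> generated taxon_id
    for row in rows:
        loc_name = row[mapping["location_column"]]
        loc_id = locations.setdefault(loc_name, f"loc_{len(locations) + 1}")
        for taxon_name in mapping["taxon_columns"]:
            taxon_id = taxa.setdefault(taxon_name, f"tx_{len(taxa) + 1}")
            observation.append({
                "observation_id": f"obs_{len(observation) + 1}",
                "location_id": loc_id,
                "observation_datetime": row[mapping["datetime_column"]],
                "taxon_id": taxon_id,
                "variable_name": mapping["variable_name"],
                "value": row[taxon_name],
            })
    sampling_location = [{"location_id": i, "location_name": n}
                         for n, i in locations.items()]
    taxon = [{"taxon_id": i, "taxon_name": n} for n, i in taxa.items()]
    return {"observation": observation,
            "sampling_location": sampling_location,
            "taxon": taxon}


l1 = l0_to_l1(l0_rows, mapping)
```

Keeping the dataset-specific knowledge in the template rather than in the code means the same conversion function could, in principle, be reused across datasets by editing only the mapping.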
Use case datasets:
EDI is working with the datasets below to develop the templates and code. Work with datasets A, B and C began during the June workshop. D was received soon after (and is being prepared as a new submission to EDI). E is a compilation of community survey data, some of which is already clean and in use by the two LTER working groups, “Metacommunities” and “Synchrony”. E was added to this list because it provides a good test of nested sampling sites and data subsampling.
- A. Point-count bird censusing: Long-term monitoring of bird abundance and diversity in central Arizona-Phoenix, ongoing since 2000 (knb-lter-cap.46.13)
- B. CAP LTER: Long-term monitoring of herpetofauna along the Salt River in and near the greater Phoenix metropolitan area, ongoing since 2012 (knb-lter-cap.627.2)
- C. Ant Assemblages in Hemlock Removal Experiment at Harvard Forest since 2003 (knb-lter-hrf.118.27)
- D. McMurdo diatoms (new submission to EDI, and currently an RDB export)
- E. Santa Barbara Channel Marine BON: Integrated fish (edi.5.2)
L1 to L2 transformations: EDI will create code to assemble ecocomDP tables into an aggregate for querying, subsampling or other synthetic use. Specific components are still being worked out.
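Since the specific components are not yet settled, the following is only a rough sketch of what assembling L1 tables into an aggregate could look like. It is written in Python for illustration (EDI's code will be in R), and the table contents, column names, and the richness summary are all assumptions made for this example.

```python
# Hypothetical L1 tables; column names are assumed for illustration only.
observation = [
    {"location_id": "loc_1", "taxon_id": "tx_1", "value": 4},
    {"location_id": "loc_1", "taxon_id": "tx_2", "value": 0},
    {"location_id": "loc_2", "taxon_id": "tx_1", "value": 1},
    {"location_id": "loc_2", "taxon_id": "tx_2", "value": 2},
]
sampling_location = [
    {"location_id": "loc_1", "location_name": "A"},
    {"location_id": "loc_2", "location_name": "B"},
]
taxon = [
    {"taxon_id": "tx_1", "taxon_name": "sparrow"},
    {"taxon_id": "tx_2", "taxon_name": "finch"},
]


def assemble(observation, sampling_location, taxon):
    """Join observations to their location and taxon rows (one flat table)."""
    loc_by_id = {row["location_id"]: row for row in sampling_location}
    tax_by_id = {row["taxon_id"]: row for row in taxon}
    flat = []
    for obs in observation:
        # A KeyError here would flag an observation with a broken link.
        loc = loc_by_id[obs["location_id"]]
        tax = tax_by_id[obs["taxon_id"]]
        flat.append({**obs,
                     "location_name": loc["location_name"],
                     "taxon_name": tax["taxon_name"]})
    return flat


def richness_by_location(flat):
    """Count taxa with a value above zero at each location."""
    taxa_seen = {}
    for row in flat:
        if row["value"] > 0:
            taxa_seen.setdefault(row["location_name"], set()).add(row["taxon_name"])
    return {name: len(taxa) for name, taxa in taxa_seen.items()}


flat = assemble(observation, sampling_location, taxon)
richness = richness_by_location(flat)
```

Because every L1 dataset would share the same table layout, a join like this could be written once and applied to many datasets, which is the streamlining a common intermediate is meant to provide.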
Share findings with LTER sites: We will gather feedback from a few LTER site data managers as we develop tools for this data model. The general process is to choose a site with data we know to be of interest to these working groups (e.g., Cedar Creek). EDI will convert a few of that site's datasets, then ask about other related data, assess interest in being involved, etc. As the tools mature, the dialogue can be opened more broadly, e.g., to other LTER sites and beyond, and can provide feedback to L0 data creators on the essential data components.
Engage the NCO in this process: The LTER Network Communications Office (NCO) manages several synthesis working groups, and has interns to assist scientists with data cleansing for their synthesis needs. Their staff are aware of this project (but were unable to participate in the workshop), so we will engage them through their scientific programming staff this summer.
Webinar/Video discussion: EDI hosts a series of video teleconferences (VTCs), and we will use that venue to introduce and gather feedback on the ecocomDP model and tools. The first VTC will take place in late summer, and will be announced through the usual EDI mechanism.
Meet with working groups at ESA: Both the Metacommunities and Synchrony working groups are meeting at ESA. EDI will be involved in those meetings, to provide help on using the tools and mapping template.