The quality of a data package (data and metadata) is a reflection of how well it may be used for a specific purpose. The EDI Repository ingests EML-described datasets for presentation, syntheses and analyses. To be useful, a data package must include a minimum amount of supporting information (metadata) that details how to access the data, and adequately describes the data. For metadata to be effective, there must be strong and accurate agreement (i.e., congruence) between the metadata and the data they describe.
The EML Congruence Checker (ECC) was originally developed by the LTER Network Information Managers Committee (IMC) and adopted by EDI along with other software components (e.g., PASTA). The original working group was formed at the LTER All Scientists Meeting in 2009 at Estes Park.
- Committee Members: LTER: Margaret O’Brien (chair), Corinna Gries, Emery Boose, Dan Bahauddin, Gastil Gastil-Buhl, Jason Downing, Sven Bohm, James Brunt, Mark Servilla (software designer/developer), Duane Costa (principal developer).
- Associates: Matt Jones, Mark Schildhauer, Ben Leinfelder, Jing Tao (NCEAS Ecoinformatics programming group).
What the ECC does
By running a series of checks that examine the congruence of metadata and data, the ECC reports on details of submitted data packages to help ensure that they meet a high standard for quality. As of 2018, 40 unique checks have been implemented. The report generated by the checks (example report) becomes part of the data package in the repository. Checks differ in their effect: some must be passed for a data set to be accepted into the repository, whereas others merely provide helpful information. The checks that must be passed were carefully chosen to reflect only conditions that would make a dataset unusable under most circumstances.
Overview of checks
In addition to basic pass/fail criteria, each check’s definition includes categorization according to several features: scope, justification, response behavior, packaging aspect, and priority. Although some typologies simply facilitate organization or communication (e.g., justification, priority), having a specific, granular definition for each check meant that code was concentrated on the most salient features. Checks that would prevent insertion were considered and justified with special care. The high number of checks recorded to date reflects the complexity of datasets submitted, and the granularity allowed by EML metadata.
How to work with checks
Check results are stored in an XML document, which can be transformed for a variety of purposes, e.g., an individual report can be transformed into HTML for web presentation during evaluation of a single data package, or results from a group of reports can be aggregated.
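As a minimal sketch of the aggregation case, the Python below tallies check statuses across a group of reports. The element names (`qualityReport`, `qualityCheck`, `identifier`, `status`), the check identifiers, and the status values are assumptions made for illustration, not the ECC's documented schema; adjust them to match an actual report.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, simplified report. Real ECC reports may use different
# element names, namespaces, and status values.
SAMPLE_REPORT = """
<qualityReport>
  <qualityCheck>
    <identifier>checkA</identifier>
    <status>valid</status>
  </qualityCheck>
  <qualityCheck>
    <identifier>checkB</identifier>
    <status>warn</status>
  </qualityCheck>
  <qualityCheck>
    <identifier>checkC</identifier>
    <status>valid</status>
  </qualityCheck>
</qualityReport>
"""


def count_statuses(report_xml: str) -> Counter:
    """Count how many checks in one report ended in each status."""
    root = ET.fromstring(report_xml)
    return Counter(
        check.findtext("status", default="unknown")
        for check in root.iter("qualityCheck")
    )


def aggregate(reports: list[str]) -> Counter:
    """Sum status counts over a group of reports."""
    total = Counter()
    for report_xml in reports:
        total.update(count_statuses(report_xml))
    return total


if __name__ == "__main__":
    # Two copies of the sample stand in for two separate reports.
    print(dict(aggregate([SAMPLE_REPORT, SAMPLE_REPORT])))
```

The same parsed tree could instead be passed to an XSLT processor or a template to produce the HTML view of a single report; only the counting step is shown here.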
Modes of check execution
- Evaluate: An EML document can be evaluated by the ECC Quality Engine without adding it to the repository. Software for evaluating XML typically stops at the first error, so repeated submissions are required until all errors have been exposed. The ECC instead exposes as many errors as possible in a single run, so that the report provided to the data set submitter shows most (or even all) of the potential problems. This feature can save considerable time. Some errors do prevent subsequent checks from running; for example, a data entity cannot be evaluated if a URL or other means of accessing the data is not provided. Reports from Evaluate mode are stored for 180 days.
- Upload: When an EML document is uploaded for the purpose of being added to the repository, the ECC Quality Engine halts on the first error, saving processor time. Upon successful upload, the quality report document is stored permanently as part of the data package (associated via the resource map), and can be accessed and displayed alongside its metadata and data.
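The difference between the two modes can be sketched as follows. This is an illustration only, not the ECC Quality Engine's implementation: the `Check` and `Result` types, the check names, and the dictionary standing in for a data package are all hypothetical, and treating "error" as "failure of a required check" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str                    # hypothetical check name
    run: Callable[[dict], bool]  # returns True when the check passes
    required: bool = True        # False: informational only


@dataclass
class Result:
    name: str
    passed: bool
    required: bool


def run_checks(package: dict, checks: list[Check], mode: str) -> list[Result]:
    """Run checks in 'evaluate' mode (run every check) or
    'upload' mode (stop at the first failed required check)."""
    if mode not in ("evaluate", "upload"):
        raise ValueError(f"unknown mode: {mode}")
    results = []
    for check in checks:
        passed = check.run(package)
        results.append(Result(check.name, passed, check.required))
        if mode == "upload" and check.required and not passed:
            break  # halt early to save processing time
    return results


def accepted(results: list[Result]) -> bool:
    """A package is acceptable only if no required check failed."""
    return all(r.passed for r in results if r.required)


# Hypothetical checks against a toy package description.
CHECKS = [
    Check("has_title", lambda p: bool(p.get("title"))),
    Check("has_data_url", lambda p: bool(p.get("data_url"))),
    Check("has_keywords", lambda p: bool(p.get("keywords")), required=False),
]

if __name__ == "__main__":
    package = {"title": "", "data_url": "", "keywords": ["lake"]}
    for mode in ("evaluate", "upload"):
        report = run_checks(package, CHECKS, mode)
        print(mode, [(r.name, r.passed) for r in report])
```

With the toy package above, evaluate mode reports on all three checks, while upload mode stops after the first failure. The sketch does not model checks that depend on one another, such as a data entity that cannot be examined without a URL.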
Process for adding checks
Checks are added periodically. New checks, if any are staged for deployment, are released at six-month intervals, typically in May and November. An EDI news item accompanies the implementation of any new checks.
- Please visit the ECC on GitHub for access to meeting notes and to outlines and assessments of potential new checks.
- An active committee exists, and new members are welcome.
- If you have a suggestion for a check, please enter it as an issue in the GitHub repository, which will allow for public discussion and links to tasks and outcomes.