The Environmental Data Initiative is currently funded through grants from the National Science Foundation’s (NSF) Division of Biological Infrastructure to the University of Wisconsin-Madison (Award #1931174) and the University of New Mexico (Award #1931143): “Collaborative Research: Environmental Data Initiative: Sustaining the Legacy of Scientific Data”. Previous NSF funding was awarded through grants #1565103 and #1629233.
- Brief History and Infrastructure Overview of EDI
- EDI Data Infrastructure
- Physical Management and Curation of Digital Products
Brief History and Infrastructure Overview of EDI
The Environmental Data Initiative (EDI) began in the summer of 2016 as a collaboration between two projects funded by the US National Science Foundation (NSF): NIMO, awarded to the University of Wisconsin (UW), and PASTA+, awarded to the University of New Mexico (UNM). Together, the two projects are known as EDI. Both groups originate from the Long Term Ecological Research (LTER) Network and consist of highly motivated and experienced data practitioners, software developers, and research scientists. In addition to the LTER Network, EDI now supports a broad community of environmental and ecological scientists funded through the Long Term Research in Environmental Biology (LTREB), the Organization of Biological Field Stations (OBFS), and the Macrosystem Biology (MSB) programs at NSF. The goal of the LTER-focused NIMO (National Information Management Office) project was to expand and enhance informatics support in the LTER program, while the goal of PASTA+ (Provenance Aware Synthesis Tracking Architecture – Plus) was to provide an open access data repository, built on the PASTA software stack, for communities other than LTER. To be more inclusive of all served communities, both goals are now part of EDI’s vision. As such, EDI combines informatics expertise with a production-level data repository (Figure 1) for use by all four communities (and others). EDI also works closely with the LTER National Communications Office (NCO) and DataONE to promote data management best practices and stewardship, and supports two separate DataONE member nodes, one for LTER and the other for all non-LTER data (the EDI Member Node).
Figure 1: Components of the EDI infrastructure.
EDI Data Infrastructure
Development of the Provenance Aware Synthesis Tracking Architecture (PASTA) software was begun in 2009 by LTER information managers and software developers, with the goal of serving as the LTER Network Information System data repository. A full production system was delivered to the LTER Network in January 2013 and quickly acquired a majority of LTER’s data products (> 5,900 as of January 2017). PASTA’s design was patterned on a Service Oriented Architecture to provide scalable data-repository functionality through a REST-based application programming interface (API), with primary operations to create, read, update, and delete (often termed CRUD) data packages in the repository. In addition, the PASTA development team delivered a browser-based web application for LTER called the Data Portal, which gives users a human-accessible interface to PASTA. This was followed by an LTER Member Node (MN) in the DataONE federation, which exposes LTER data packages through DataONE’s search and catalog service.
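The CRUD operations named above are conventionally mapped onto HTTP verbs in a REST-style interface. The following is a minimal sketch of that conventional mapping only; the base URL, the path layout, and the build_request helper are hypothetical and are not taken from the PASTA API itself. The request is constructed but never sent.

```python
from typing import Optional
from urllib.request import Request

# Conventional CRUD-to-HTTP mapping (an assumption about REST style in general,
# not a statement about the actual PASTA endpoints).
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def build_request(
    operation: str,
    base_url: str,
    package_id: Optional[str] = None,
    body: Optional[bytes] = None,
) -> Request:
    """Build (but do not send) an HTTP request for one CRUD operation.

    base_url and package_id are hypothetical placeholders.
    """
    method = CRUD_TO_HTTP[operation]  # raises KeyError for unknown operations
    url = base_url.rstrip("/")
    if package_id is not None:
        url = f"{url}/{package_id}"
    return Request(url, data=body, method=method)
```

For example, build_request("read", "https://repository.example.org/package", "scope.1.1") yields a GET request for that hypothetical package path.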
By design, PASTA was LTER-centric. With the advent of EDI, aspects of the PASTA software that were specific to LTER practices were generalized for broader use (or removed completely), producing a revised software stack called PASTA+ (https://github.com/PASTAplus), which provides the underlying services for the EDI data repository. In simple terms, the EDI data repository is a “re-branding” of the LTER Network Information System data repository, including the full archive of LTER data packages, and uses the revised PASTA+ software stack. Because PASTA+ is backwards compatible with the previous PASTA API, the LTER Data Portal seamlessly interacts with the PASTA+ API. To promote broader inclusivity, EDI software developers released a generalized version of the LTER Data Portal in late 2016, which also interacts directly with the PASTA+ API. The EDI Data Portal can be used in lieu of the LTER Data Portal to access both LTER and non-LTER data packages. In March 2017, EDI released a new DataONE member node that exposes non-LTER data packages to the DataONE federation. Collectively, the infrastructure of EDI includes the EDI data repository, which uses the PASTA+ software stack, the EDI Data Portal, the EDI DataONE Member Node, the LTER Data Portal, and the LTER DataONE Member Node, in addition to a suite of software tools for information and data management (https://github.com/EDIorg). Planned additions to the EDI infrastructure, including PASTA+, are an extended user identification system that allows authentication via OpenID Connect/OAuth 2.0 through providers such as Google, ORCID, and GitHub, and improved metadata creation and management tools.
Physical Management and Curation of Digital Products
The physical management and curation of EDI digital products, both submitted and internally generated, vary according to the type of product under review. EDI employs industry standard protocols to ensure that all scientific data are well documented, secure, and persist for future use. These protocols are described below.
Standards for Data and Metadata
EDI’s PASTA+ data repository software environment accepts data in any digital format. All science data curated by EDI in the PASTA+ environment, whether submitted by an external user or generated internally as a derived product, must be described and documented with the Ecological Metadata Language (EML) standard (see: https://knb.ecoinformatics.org/tools/eml). The PASTA+ data repository software supports versions 2.1.0, 2.1.1, and 2.2.0 of EML. EML is a semantically rich science metadata standard that is specified in the form of an XML schema. The EML standard is actively supported as an open source project.
Of utmost importance to EDI are the science data and metadata under our curatorial management. Curation of science data and metadata begins at the moment of upload, when a checksum is computed for every object and stored as a reference value for comparison during random monitoring, which ensures long-term digital integrity. All objects are then cataloged in our data package resource registry and written to physical storage. The science data and metadata are replicated daily to both a permanent mirrored storage device and a removable storage device using a combination of copy and checksum verification. Once on the mirrored storage device, the aggregate “data package” is compressed as a single digital file and written to Amazon Glacier, a high-latency cloud storage service designed for long-term preservation. The removable storage is rotated offsite on a weekly basis. We view both the mirrored storage and removable storage as near-line backup systems for quick recovery of science data and metadata. Data packages stored at Amazon Glacier are considered only for large-scale catastrophic recovery; these data package files are sufficiently complete to fully recover our entire data repository or to allow transfer to another data repository system. Recovery scenario testing is performed on a monthly basis. Science metadata is also replicated to the DataONE Federation (see: https://www.dataone.org/) on an hourly basis.
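The upload-time checksum and later comparison described above can be illustrated with a short sketch. It assumes SHA-256 as the digest algorithm and uses hypothetical function names (compute_checksum, verify_object); the source does not state which algorithm or tooling EDI actually uses.

```python
import hashlib
from pathlib import Path


def compute_checksum(path: Path, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, read in 1 MiB chunks to bound memory use.

    The choice of SHA-256 is an assumption made for illustration.
    """
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_object(path: Path, recorded: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the value recorded at upload."""
    return compute_checksum(path, algorithm) == recorded
```

In this sketch, the value returned by compute_checksum at upload would be stored alongside the object, and verify_object would be called during later monitoring or after a copy to backup storage; a False result signals that the stored bytes no longer match what was uploaded.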
Relational databases that are critical for the operation of the PASTA+ data repository environment or related services are exported as plaintext SQL “dump” files, compressed in volume, and then replicated to the same storage devices described above for science data and metadata, also on a daily basis. These files are not, however, written to Amazon Glacier. As with science data and metadata backups, the backed-up SQL “dump” files are tested for integrity and usability on a monthly basis.
System log files necessary to better understand the state of the PASTA+ data repository environment and related services are also compressed and copied to the mirrored and removable storage devices on a daily basis. These files are not, however, written to Amazon Glacier. No integrity testing is performed on these files.
Virtual machine images of the PASTA+ data repository environment and related services are copied to a dedicated VMware ESXi backup host on a daily basis. Due to storage constraints, old images are overwritten by new images as they are produced. Virtual machine images capture the entire system state of the server and can be immediately put into operation, if necessary. Integrity testing of backed-up virtual machine images is performed weekly.
EDI-developed software products fall into two categories: software for the PASTA+ data repository environment and software in support of general data management practices. All EDI software development follows principles utilized by open source projects, including frequent and incremental submittals of written code, documentation, and architectural diagrams to a recognized software repository. EDI uses two separate GitHub organizations for software management and control: one for PASTA+ software at https://github.com/PASTAplus and another for data management support tools at https://github.com/EDIorg.
Security, Access, and Confidentiality
EDI implements physical security for access to all repository infrastructure. EDI technical staff limits privileged access on PASTA+ and related systems to only certified personnel. System access logs and accounts are monitored for irregular or malicious activity and all systems operate firewall software that prohibits network intrusion from external sources.
Access to all science data and metadata through the PASTA+ data repository software REST API is controlled through access rules that are declared in the corresponding EML metadata document. In principle, EDI recommends that open access be granted to read science data and metadata, but restricted access be enforced to revise or modify science data or metadata. PASTA+ software supports conditional logic such that rules may be set to allow or deny user access to science data or metadata. In the absence of access rules, PASTA+ defaults to denying access to science data and metadata for all users except for the single user who performed the original data and metadata contribution.
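The allow/deny logic and the owner-only default described above can be sketched as follows. This is a simplified illustration, not the PASTA+ implementation: the Rule structure, the "public" principal name, and the choice that a matching deny rule overrides a matching allow rule are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence

PUBLIC = "public"  # hypothetical principal name that matches every user


@dataclass(frozen=True)
class Rule:
    effect: str      # "allow" or "deny"
    principal: str   # a user identifier, or PUBLIC
    permission: str  # e.g. "read" or "write"


def is_authorized(user: str, permission: str, owner: str, rules: Sequence[Rule]) -> bool:
    """Decide access under the simplified model described in the lead-in.

    The contributing user (owner) always has access. For everyone else,
    access is denied unless some rule allows it and no rule denies it, so an
    empty rule set leaves the owner as the only authorized user.
    """
    if user == owner:
        return True
    matching = [
        rule for rule in rules
        if rule.permission == permission and rule.principal in (user, PUBLIC)
    ]
    if any(rule.effect == "deny" for rule in matching):
        return False  # assumption: deny takes precedence over allow
    return any(rule.effect == "allow" for rule in matching)
```

Under the recommended policy of open read access and restricted write access, the rule set would contain an allow rule for PUBLIC with "read" permission and allow rules for "write" only for named users.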
EDI requires users to register in a locally managed LDAP directory before they can contribute science data and metadata to the repository. With the exception of a valid electronic mail address, EDI does not store personal user information. All contributors are vetted and instructed on data and metadata standards before registration occurs. Session authentication is a precondition of upload access. Non-contributors may authenticate through external OAuth/OpenID Connect providers (coming soon), but will not be allowed upload access unless mapped to an EDI LDAP registration. PASTA+ access logs contain only the identification of users who have performed session authentication; all other users are recorded as “public”. No Internet address information is recorded in access logs, although the HTTP request agent value is stored to filter potential Internet robots and crawlers. EDI believes that science data and metadata that are collected through public means should be openly available and unfettered where possible.
Reuse and Redistribution
EDI advocates for open and unfettered access to science data and metadata, including data and metadata that reside in the EDI data repository. Science data and metadata contributors have the option to provide a reuse policy of their choice, which is declared within the “intellectual rights” section of the EML metadata document. If a contributor reuse policy does not exist in the EML metadata, EDI will apply a default policy stating that data and metadata are released as “public domain” under the Creative Commons CC0 1.0 “No Rights Reserved” license.
Science data and metadata that are unrestricted under the access rules can be downloaded individually and used in accordance with the stated reuse policy. Similarly, complete data packages can be accessed and downloaded as a single archive in “zip” format, including a contents manifest. In the event that EDI ceases operations, all science data and metadata would be made available on portable storage for transfer to another repository; use of the embedded data package identifier and EML metadata would be possible with minimal effort. Updates to target URLs of PASTA+ assigned Digital Object Identifiers would allow continued resolution to preferred landing pages. If a cessation scenario occurred, the target repository would be requested to support the access rules declared in all EML metadata documents.
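As an illustration of the single-archive download mentioned above, the sketch below bundles a set of objects into a zip file together with a plain-text contents manifest. The manifest file name, its one-name-per-line layout, and the build_archive helper are hypothetical; the source does not describe the format PASTA+ actually produces.

```python
import io
import zipfile
from typing import Dict


def build_archive(objects: Dict[str, bytes]) -> bytes:
    """Pack named objects into an in-memory zip archive with a manifest.

    The manifest ("manifest.txt", a hypothetical name) lists one object
    name per line in sorted order.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(objects):
            archive.writestr(name, objects[name])
        manifest = "\n".join(sorted(objects)) + "\n"
        archive.writestr("manifest.txt", manifest)
    return buffer.getvalue()
```

A recipient can open the result with zipfile.ZipFile(io.BytesIO(data)) and read manifest.txt first to learn which objects the package contains.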
All PASTA+ data repository software is licensed under the Apache License, version 2.0 (AL2.0) (http://www.apache.org/licenses). AL2.0 is known as a “permissive” software license, which means that a user is free to download the software, to modify the software, and to use the software for any purpose without concern for royalties. PASTA+ data repository software is directly accessible through the PASTA+ GitHub repository. AL2.0 is compatible with the GNU General Public License v3 “copyleft” software license. All other digital products not related to the PASTA+ data repository environment or related services (e.g., webinar recordings, informational documents) will be licensed under Creative Commons CC0 1.0 “No Rights Reserved”.