Developing an Information Management Code Repository

Kristin Vanderbilt and Colin Smith led a discussion on creating an information management code repository for the earth sciences community at the Earth Science Information Partners meeting on July 25, 2017.

Below are notes from that discussion. Each section consists of the primary question asked during the session (bold text), proposed details (normal text), and comments collected during the session (green text).

 

Introduction

We at the Environmental Data Initiative are starting an information management (IM) code repository for data managers in the earth sciences community. This group includes people who may or may not have formal software development expertise, yet have written software that is useful to others or are seeking tools to improve their IM tasks. The content of this code repository should focus strictly on IM tasks (e.g. harmonization of data patterns, QA/QC procedures, metadata generation), be easy to integrate into users' workflows, and be agnostic to programming language. Here we propose a set of parameters for this repository, but we need your feedback to ensure they are appropriate for creating a resource that is useful to the community in both the near and long term.

We will discuss several topics including:

  • What are the minimum requirements for code to be accepted?
  • How should the code repository be governed?
  • What is an effective structure for the repository?
  • How do we facilitate discoverability?
  • Other considerations?
  • Laundry list of code to be created

Our next steps are to draft a blueprint of the IM code repository for an additional round of comment and revision prior to realization.

For questions and comments, please contact Kristin Vanderbilt (krvander@fiu.edu) and/or Colin Smith (colin.smith@wisc.edu).

 

Some post-introduction comments

  • The proposed scope may be too broad and may encompass existing efforts. Duplicated effort should be avoided. Survey the landscape of code repositories and identify a specific niche. Some possible overlap may be with the Bioconductor and OntoSoft projects.
  • If the niche isn’t large enough, consider integration among multiple groups.

 

What are the minimum requirements for code acceptance?

If the bar is too high, then contributions will be low. If the bar is too low, the repository may become difficult to navigate. We need to find a happy medium. Some general principles that may be important include:

  1. The code is refactored for general use by others (i.e. not specific to some idiosyncratic task).
  2. The code is well documented at the package level and should include:
    1. code metadata and tagging for discoverability
    2. install instructions
    3. example uses with example data
  3. The code is well documented at the script level, which means:
    1. line and section comments follow best practices
  4. The code works and does what it says it does, as verified by a reviewer.
  5. The user interface supplies meaningful error messages when the code breaks.
  6. The code is developed under version control, with periodic version releases accompanied by DOIs.
  7. The code project includes a road map so contributors and users know what to plan for.
  • Rather than enforce a single standard, it might be good to structure the repository into multiple levels with varying degrees of standards, thereby accommodating a broader range of contributions. For example, the base level could allow anything in, such as single scripts containing a single line of code. From there, code may converge into development-level packages (the mid-level of the repo) and eventually become production ready (the top level), archived with a DOI. These levels will engage the community at multiple levels.
  • Code testing should be current and relevant to the use case. Testing rigor should conform to the level of code maturity. Unit tests are a good metric.
  • How might Docker software containers be integrated with the repository?
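
As an illustration of requirements 4 and 5 and of the comment on unit tests, here is a minimal Python sketch of what a contributed function with a meaningful error message and an accompanying test might look like. The function name `read_site_table` and the column names are hypothetical and were not proposed during the session.

```python
import csv
import io


def read_site_table(text, required_columns=("site", "date", "value")):
    """Parse CSV text and return its rows as dictionaries.

    Raises ValueError with a message that names the missing columns,
    so a user can fix the input without reading the source code.
    The function name and column names are hypothetical examples.
    """
    reader = csv.DictReader(io.StringIO(text))
    found = reader.fieldnames or []
    missing = [c for c in required_columns if c not in found]
    if missing:
        raise ValueError(
            "Input table is missing required column(s): "
            + ", ".join(missing)
            + ". Columns found: "
            + (", ".join(found) if found else "none")
        )
    return list(reader)


def test_read_site_table():
    # A well-formed table is parsed into one dictionary per row.
    good = "site,date,value\nA,2017-07-25,1.5\n"
    assert read_site_table(good) == [
        {"site": "A", "date": "2017-07-25", "value": "1.5"}
    ]
    # A table without the 'date' column fails with a message naming it.
    bad = "site,value\nA,1.5\n"
    try:
        read_site_table(bad)
    except ValueError as err:
        assert "date" in str(err)
    else:
        raise AssertionError("expected a ValueError for the missing column")


test_read_site_table()
```

A reviewer could run a test like this to verify that the code does what it says; for base-level contributions, a much lighter check might be enough.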

 

How should the code repository be governed?

Governance is important to the longevity of the repository. How shall governance be structured? Perhaps a committee:

  1. Composed of specialists in different programming languages, along with information managers (and other stakeholders), to formulate and enforce rules and conduct basic maintenance.
  2. Each member serves a specific function
  3. The terms of service are set.
  • Governance is very important. A committee structure is what we recommend.
  • Other groups have had varying success in setting up code repositories. Reach out to these groups to learn from their successes and failures.
  • It is important to identify stakeholders and have adequate representation from the community. Governors should be practitioners.
  • The terms of service need to be clearly set and well coordinated.
  • Target people with time to spare. Assess commitment.

 

How should the repository be structured?

We see 3 options for repository structure:

  1. A decentralized approach, where contributors develop code in the environment of their choosing and publish where they wish. The code repository links to these through DOIs. This essentially serves as a catalog. Some issues for consideration are:
    1. How to search across different repositories?
    2. How to ensure persistence?
    3. No standardized metadata.
  2. A centralized approach, where the repository is set up in one environment. This provides persistence and facilitates discoverability. Some issues for consideration include:
    1. What venue best supports the intended functions of the repo?
    2. What is the set of rules to guide the governing body and ensure persistence?
  3. A hybrid approach, where local code is accompanied by a catalog linking to external code that, for whatever reason, can’t be imported into the local repository.
  • Can we archive software in data repositories? A couple of reasons not to do this include:
    • Inappropriate use of service.
    • Software development functionality gets lost.
  • However, this could be facilitated by having a catalog in the front end with a linked development environment in the back.
  • We should use the multi-leveled structure proposed earlier.
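
To make the hybrid option concrete, here is a minimal Python sketch of a catalog entry that points either to code held locally or to externally published code through its DOI. The `CatalogEntry` class, its field names, and the example values are hypothetical; the only outside fact relied on is that a DOI resolves through the `https://doi.org/` resolver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    """One catalog record in a hypothetical hybrid repository."""

    name: str
    local_path: Optional[str] = None  # set when the code lives in the repository
    doi: Optional[str] = None         # set when the code is published elsewhere

    def location(self) -> str:
        """Return where a user should go to obtain the code."""
        if self.local_path is not None:
            return self.local_path
        if self.doi is not None:
            # DOIs resolve through the doi.org resolver.
            return "https://doi.org/" + self.doi
        raise ValueError(
            f"Catalog entry '{self.name}' has neither a local path nor a DOI."
        )


# Placeholder values for illustration only; these are not real resources.
local = CatalogEntry(name="qc_checks", local_path="packages/qc_checks")
external = CatalogEntry(name="unit_converter", doi="10.1234/example")

print(local.location())
print(external.location())
```

A purely decentralized catalog would hold only entries of the second kind, which is why persistence and consistent metadata then depend on the external hosts.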

 

How do we facilitate discoverability?

To facilitate discoverability, a code metadata standard should be selected and a controlled vocabulary applied. Additionally, the repository should be tagged with keywords. It may be important for the governing body to catalog the metadata of code in the repository and to provide a search interface. Advertising will be important.

  • Discovery of code in the base level (i.e. everything and the kitchen sink) can be facilitated by good keyword tagging, activity, and other metrics. Voting by use is one method of evaluation.
  • Use keywords and metadata to delineate code maturity levels.
  • Think about how users search. Often they search with a description of a problem. Use that language in public-facing text so that it gets picked up by search engines.
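
The ideas above (keyword tagging, maturity levels recorded in metadata, and searching by problem description) can be sketched in a few lines of Python. The records, field names, and scoring rule below are assumptions made for illustration, not a proposed metadata standard; the maturity labels follow the base, development, and production levels suggested earlier in the session.

```python
# Hypothetical catalog records; names, descriptions, and keywords are made up.
CATALOG = [
    {
        "name": "dates_to_iso",
        "description": "Convert mixed date formats in a table to ISO 8601",
        "keywords": ["dates", "harmonization"],
        "maturity": "production",
    },
    {
        "name": "range_check",
        "description": "Flag values outside an expected range",
        "keywords": ["QA/QC", "outliers"],
        "maturity": "development",
    },
    {
        "name": "eml_stub",
        "description": "Generate a draft metadata record from a data table",
        "keywords": ["metadata", "generation"],
        "maturity": "base",
    },
]


def search(catalog, query, maturity=None):
    """Return names of records matching the query, best match first.

    A record scores one point for each query word found (as a substring)
    in its description or keywords. Records can optionally be restricted
    to a single maturity level.
    """
    terms = query.lower().split()
    results = []
    for record in catalog:
        if maturity is not None and record["maturity"] != maturity:
            continue
        text = " ".join([record["description"]] + record["keywords"]).lower()
        score = sum(1 for term in terms if term in text)
        if score:
            results.append((score, record["name"]))
    # Highest score first; ties are broken alphabetically by name.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in results]


print(search(CATALOG, "fix date formats"))
print(search(CATALOG, "metadata", maturity="production"))
```

Substring matching is deliberately crude. A controlled vocabulary would make matches more reliable, and usage metrics could be added to the score.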

 

Other considerations and thoughts?

Here are some other considerations:

  1. Incentives for participating in the repository face a chicken-and-egg problem: centralized open source code repositories offer benefits but require participation to facilitate adoption, reuse, and development in the community. Some incentives might be necessary in the short term to get this started. On the provider end, there are journals for code publication and, with some additional effort, the code contributor could publish a peer-reviewed paper documenting their work. Consumers need to use the code to provide feedback and help it mature, and to cite the code to facilitate discovery and recognition.
  2. Maintenance/code updates. What is the mechanism by which this will happen?
  3. User support by the code creator? We think requiring this would keep people from contributing.
  4. Licensing – what is the most open, accommodating license for contributors and users?
  5. DOI. Should a DOI be required for all code, or only mature software? Archival of code with complementary DOI can be obtained through Zenodo, figshare, and Mozilla Science Lab.
  • What does the end-of-project-life transition look like? This is something that should be considered.
  • Other incentives include the emerging use of altmetrics (e.g. downloads, visits; discoverable by services like depsy.org).
  • Think of how to encompass teachers and educators, as well as people learning on their own.
  • It might be good to link to proprietary software in the resource catalog of the repository.
  • Use registries to increase discoverability.
  • Licensing is a sticky issue. May want to accept a wide breadth of licenses.

 

What are the next steps?

  • The ESIP science software cluster may be a good place to develop this.
  • Reach out and learn strategies from other code repository groups.
  • A message board or help-wanted postings could facilitate collaborations.