Introduction
In a “wide”, or matrix-style, data model, each variable has a separate column. In an attribute-value (“long”, “narrow”, or “key-value”) data model, one column contains the names of the variables and another contains their values.
The attribute-value model is often used when a matrix-type model would result in many empty cells. This model treats each point observation as a single record, which contains fields for location, time, the variable (attribute) name, and the value. There may be additional columns for context, such as flags or methods. For ecologists, this data model is useful for (1) biodiversity data (the Darwin Core Archive model used by GBIF is key-value) and (2) sensor data. In matrix format, either might require a very wide table or contain many empty cells, for example when species are not regularly observed or sensors are not deployed consistently.
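
To make the contrast concrete, here is a minimal sketch in plain Python of the same observations held both ways. The site codes, dates, and species column names are invented for illustration and do not come from any real dataset.

```python
# Hypothetical survey: three visits, three species, most species absent on any one visit.
wide_rows = [
    {"site": "A", "date": "2018-06-01", "species_1": 4, "species_2": None, "species_3": None},
    {"site": "B", "date": "2018-06-01", "species_1": None, "species_2": 7, "species_3": None},
    {"site": "A", "date": "2018-07-01", "species_1": None, "species_2": None, "species_3": 2},
]
context = ("site", "date")  # columns that describe where and when, not what was measured

# Long form: one record per actual observation; empty cells simply do not appear.
long_rows = [
    {"site": row["site"], "date": row["date"], "variable": name, "value": val}
    for row in wide_rows
    for name, val in row.items()
    if name not in context and val is not None
]

# Count the empty cells that the wide table has to carry.
empty_cells = sum(val is None for row in wide_rows for val in row.values())
print(len(long_rows), "long records;", empty_cells, "empty cells in the wide table")
```

In this toy case the wide table holds nine measurement cells, six of them empty, while the long table holds three records and no empty cells.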
The advantages of the key-value model are its flexibility and efficiency. However, data tables in this format are not easily (or completely) described by EML (Version 2.x). EML includes descriptors for type (numeric, string), unit, and precision, which must apply to all values in the same column. So generally, any data entity in the key-value model will have both the key and value columns typed as “nominal text” to accommodate any content. When values are numeric, typing them as text makes the data considerably harder to interpret, unless attributes like unit or precision are included as additional context columns. The basic features of these two models are summarized below.
Features
Feature | Long | Wide |
---|---|---|
size | tend to be compact, with no empty cells | there may be many empty cells |
additions (ongoing datasets only) | multiple rows are added for each time & location addition | generally, one row per time & location addition |
data model maintenance, e.g., for an ongoing dataset | no columns will need to be added for new measurements; new measurements are simply new rows | you may need to add columns over time, e.g., when new measurements are added |
implications for EML typing | “value” must be typed and described generally enough to fit all values, e.g., “string” | structured attribute descriptions can be very detailed |
implications for EML enumeration | keys (variable names) should be enumerated; i.e., the metadata explicitly states what content is allowed in that column (update the enumerations if measurements are added) | variable names are in the element for each column |
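
The typing and enumeration rows above can be checked mechanically before writing metadata. The following is a rough sketch in plain Python, not part of any EML tooling: the function name `summarize_long_table` and the column names `variable`, `value`, and `unit` are assumptions made for this example, and the `unit` column stands in for the kind of context column mentioned in the introduction.

```python
def summarize_long_table(records, key="variable", value="value", unit="unit"):
    """Summarize what a single description of the value column would have to cover."""

    def is_number(x):
        # True when x can be read as a number, False for free text or missing values.
        try:
            float(x)
            return True
        except (TypeError, ValueError):
            return False

    return {
        "keys": sorted({r[key] for r in records}),    # candidate enumeration list
        "units": sorted({r[unit] for r in records}),  # more than one unit means generic typing
        "all_numeric": all(is_number(r[value]) for r in records),
    }


# Hypothetical records: two temperature variables sharing one unit.
records = [
    {"variable": "air_temp", "value": "12.5", "unit": "celsius"},
    {"variable": "water_temp", "value": "11.0", "unit": "celsius"},
]
summary = summarize_long_table(records)

# Adding a text-valued variable forces the value column back to generic text typing.
mixed = summarize_long_table(
    records + [{"variable": "sky", "value": "overcast", "unit": "none"}]
)
print(summary)
print(mixed)
```

When `all_numeric` is true and `units` has a single entry, the value column can be described in detail; otherwise it falls back to the general text typing described in the table.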
Manipulation
If you work in R, the “tidyverse” packages include tools for converting between wide and long tables. The tidyr package provides two fundamental functions for tidying data:
- gather(): takes multiple columns and gathers them into key-value pairs. It makes “wide” data longer.
- spread(): takes two columns (key & value) and spreads them into multiple columns. It makes “long” data wider.
tidyr also provides the separate() and extract() functions, which make it easier to pull apart a column that represents multiple variables. The complement to separate() is unite(). Since tidyr 1.0.0, gather() and spread() have been superseded by pivot_longer() and pivot_wider(); the older functions still work, but new code should prefer the pivot functions.
To get started, see the vignettes in the packages themselves, and the material on the tidyverse website: https://tidyr.tidyverse.org/
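
For readers who want to see what these reshaping operations do independent of R, here is a minimal sketch in plain Python. The functions `wide_to_long` and `long_to_wide` are hypothetical names written for this example; they only imitate the general behavior of gather() and spread() and are not a port of tidyr. One deliberate difference: this sketch drops empty cells when gathering, which corresponds to calling gather() with na.rm = TRUE rather than its default.

```python
def wide_to_long(rows, id_cols, key="variable", value="value"):
    """Wide -> long: emit one record per non-empty measurement cell."""
    out = []
    for row in rows:
        ids = {c: row[c] for c in id_cols}
        for col, val in row.items():
            if col in id_cols or val is None:
                continue  # skip context columns and empty cells
            out.append({**ids, key: col, value: val})
    return out


def long_to_wide(records, id_cols, key="variable", value="value"):
    """Long -> wide: one row per unique combination of the id columns."""
    keys = sorted({r[key] for r in records})
    wide = {}
    for r in records:
        ident = tuple(r[c] for c in id_cols)
        # Start each new row with every measurement column present but empty.
        row = wide.setdefault(
            ident, {**dict(zip(id_cols, ident)), **{k: None for k in keys}}
        )
        row[r[key]] = r[value]
    return list(wide.values())


# Hypothetical counts for two species; both columns share the unit "count".
wide = [
    {"site": "A", "date": "2018-06-01", "count_sp1": 4, "count_sp2": None},
    {"site": "B", "date": "2018-06-01", "count_sp1": None, "count_sp2": 7},
]
long = wide_to_long(wide, ["site", "date"])
round_trip = long_to_wide(long, ["site", "date"])
print(long)
print(round_trip == wide)
```

The round trip recovers the original table here only because every measurement column has at least one non-empty value; a column that is empty everywhere leaves no trace in the long form.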
Recommendations
We recommend a middle-of-the-road approach. Basic guidelines:
- For efficient maintenance, you should plan to add rows, not columns. For example, don’t arrange annual data in columns for each year; have one column for “date” and add rows for a new year’s data.
- Don’t collapse (gather) context columns; e.g., keep date, location separate.
- It’s OK to collapse (gather) columns that share a unit, precision, and type.
- Example 1: Biodiversity data may have all values in “percent cover” or “count”; if so, they can be represented in an attribute-value arrangement and described accurately in EML. If the values don’t share a unit, use a matrix format.
- Example 2: Sensor data: variations of the key-value format have been developed in communities handling large volumes of sensor data (e.g., the Observation Data Model, or ODM, from CUAHSI). For a single site, this model is useful when sensors are frequently changed or redeployed. For description in EML, we recommend that each data table contain only the key-value formatted values for a single sensor (tables for several sensors can be included as separate entities in one data package). Most of these data models describe the concept of a single “data stream”, which may be used to achieve this dynamically.
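
The single-sensor recommendation in Example 2 amounts to partitioning a key-value table by data stream. Below is a minimal sketch in plain Python, assuming each record carries a stream identifier; the column name `sensor_id` and the function `split_by_stream` are made up for this example and are not taken from ODM or any other schema.

```python
from collections import defaultdict


def split_by_stream(records, stream_col="sensor_id"):
    """Group key-value records into one table (list of records) per data stream."""
    tables = defaultdict(list)
    for record in records:
        tables[record[stream_col]].append(record)
    return dict(tables)


# Hypothetical readings from two sensors at one site.
readings = [
    {"sensor_id": "S1", "datetime": "2018-06-01T00:00", "variable": "water_temp", "value": 11.2},
    {"sensor_id": "S2", "datetime": "2018-06-01T00:00", "variable": "stage", "value": 0.43},
    {"sensor_id": "S1", "datetime": "2018-06-01T00:15", "variable": "water_temp", "value": 11.1},
]
tables = split_by_stream(readings)
for sensor, rows in tables.items():
    print(sensor, len(rows), "records")
```

Each resulting table then holds values with one unit and one type, so it can be described as its own entity with specific rather than generic attribute metadata.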
Attribution
This material was adapted by EDI from:
- Version 2 of the Ecological Metadata Best Practices document produced by the LTER Network Information Managers. https://lternet.edu/wp-content/uploads/2011/08/emlbestpractices-2.0-FINAL-20110802.pdf.
- Wikipedia: Wide and Narrow Data. https://en.wikipedia.org/wiki/Wide_and_narrow_data (accessed 2018-07-22)
- https://tidyr.tidyverse.org/
- Wickham, H. 2014. Tidy Data. Journal of Statistical Software. 59:1-23. DOI:10.18637/jss.v059.i10