Using Darwin Core with your existing data
Consider incremental and strategic incorporation of standards
It can pay to keep local structures aligned with standards that are useful downstream of collection.
However, you may already have systems in place for collecting and working with your data. Consider improving them by selectively working toward a standards-based system. Already standards-based? Consider mapping to Darwin Core and publishing a Darwin Core Archive so that the data can be found and reused by the biodiversity science community.
What does it look like to map existing data to Darwin Core? Below we share a hypothetical example of fisheries trawl data and another example of satellite telemetry data from the Animal Telemetry Network (ATN).
Example: Hypothetical Fisheries Trawl Data
Here we share data from a hypothetical fisheries trawl survey. In Darwin Core, these data would be coordinated over a couple of tables to describe the core data (what, when, where) and the additional data, like experimental and environmental covariates that are specific to this data type and study. Here, we’ve broken these up into a few different tables to make them easier to view.
Let’s start on the left, with ‘When’. This table demonstrates how variable names like ‘Date’ and ‘Time’ map to the Darwin Core terms eventDate and eventTime fairly directly. With trawl data, these might be captured as ‘Start’ and ‘Stop’ times, and that range can be captured within the ISO 8601 format that is used in Darwin Core.
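As a minimal sketch of the start/stop case, ISO 8601 writes a time range as two timestamps separated by a slash. The function name and the tow times below are invented for illustration; they are not part of the trawl example itself.

```python
from datetime import datetime, timezone


def to_event_date(start: datetime, stop: datetime) -> str:
    """Format a tow's start and stop as an ISO 8601 interval (start/stop)."""
    if stop < start:
        raise ValueError("stop must not be earlier than start")
    return f"{start.isoformat()}/{stop.isoformat()}"


# Hypothetical tow times, in UTC.
tow_start = datetime(2021, 6, 14, 8, 30, tzinfo=timezone.utc)
tow_stop = datetime(2021, 6, 14, 9, 15, tzinfo=timezone.utc)

event_date = to_event_date(tow_start, tow_stop)
print(event_date)
```

Keeping the time zone attached to both timestamps avoids ambiguity when the data are combined with surveys recorded elsewhere.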
What may not be obvious is that Darwin Core events can be structured hierarchically, so individual time points, or ranges, can exist within parent events.
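One way to picture the hierarchy: Darwin Core has a parentEventID term that lets a child event point at its parent. The sketch below assumes a hypothetical survey with two tows; the identifiers and dates are invented.

```python
# Hypothetical events: one survey (parent) and two tows (children).
events = [
    {"eventID": "survey-2021", "parentEventID": "",
     "eventDate": "2021-06-14/2021-06-15"},
    {"eventID": "survey-2021:tow-01", "parentEventID": "survey-2021",
     "eventDate": "2021-06-14T08:30:00Z/2021-06-14T09:15:00Z"},
    {"eventID": "survey-2021:tow-02", "parentEventID": "survey-2021",
     "eventDate": "2021-06-15T07:45:00Z/2021-06-15T08:30:00Z"},
]


def child_events(parent_id: str, rows: list[dict]) -> list[str]:
    """Return the eventIDs whose parentEventID matches parent_id."""
    return [row["eventID"] for row in rows if row["parentEventID"] == parent_id]


print(child_events("survey-2021", events))
```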
Like the When table, spatial coordinates map fairly directly to the Darwin Core standard. Many people can use decimal degrees taken directly from a GPS. Others may need to convert from degrees-minutes-seconds to decimal degrees.
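The degrees-minutes-seconds conversion can be sketched as follows. The function name and sample coordinates are illustrative; the results would go into the Darwin Core terms decimalLatitude and decimalLongitude, where south and west are negative.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float,
                   hemisphere: str) -> float:
    """Convert degrees-minutes-seconds plus a hemisphere letter to decimal degrees."""
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    sign = -1 if hemisphere in {"S", "W"} else 1
    # Six decimal places is roughly 0.1 m at the equator.
    return round(sign * value, 6)


# Hypothetical position: 36° 37' 30" N, 121° 52' 30" W
decimal_latitude = dms_to_decimal(36, 37, 30, "N")
decimal_longitude = dms_to_decimal(121, 52, 30, "W")
print(decimal_latitude, decimal_longitude)
```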
What do you do when you want to describe a transect, as is common in trawl data? Darwin Core offers the term footprintWKT, which allows description of 2-D and 3-D shapes as Well Known Text (WKT). This can include polygons, lines, circles, and other shapes.
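For a straight tow, a transect might be written as a WKT LINESTRING. This is a sketch with invented coordinates; note that WKT lists x before y, so longitude comes before latitude.

```python
def transect_wkt(start: tuple[float, float], stop: tuple[float, float]) -> str:
    """Build a WKT LINESTRING from (latitude, longitude) start and stop points."""
    (lat1, lon1), (lat2, lon2) = start, stop
    # WKT coordinate order is "x y", i.e. longitude then latitude.
    return f"LINESTRING ({lon1} {lat1}, {lon2} {lat2})"


# Hypothetical tow start and stop positions as (latitude, longitude).
footprint_wkt = transect_wkt((36.625, -121.875), (36.65, -121.9))
print(footprint_wkt)
```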
Lastly, Darwin Core allows both minimum and maximum depth to be captured, although you may need to convert your data to meters.
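If depths were logged in feet or fathoms, a small helper can convert them before filling minimumDepthInMeters and maximumDepthInMeters. The helper name, the unit labels, and the sample depths are assumptions for illustration; the conversion factors are the standard international definitions.

```python
# Meters per unit (international foot and fathom).
METERS_PER_UNIT = {"m": 1.0, "ft": 0.3048, "fathom": 1.8288}


def depth_range_in_meters(depth_a: float, depth_b: float, unit: str) -> dict:
    """Convert two depths to meters and return them as a min/max pair."""
    factor = METERS_PER_UNIT[unit]  # raises KeyError for an unknown unit
    shallow, deep = sorted((depth_a * factor, depth_b * factor))
    return {
        "minimumDepthInMeters": round(shallow, 2),
        "maximumDepthInMeters": round(deep, 2),
    }


# Hypothetical tow fished between 10 and 25 fathoms.
depths = depth_range_in_meters(25, 10, "fathom")
print(depths)
```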
Now we finally get to the biological part of our data. A trawl might return many species. Darwin Core allows these to be described via scientificName (any level of classification, not just species) and scientificNameID (corresponding to a database ID like a WoRMS AphiaID or ITIS TSN).
individualCount can be recorded, or, if you have non-count units, organismQuantity and organismQuantityType can be used to record the catch in another appropriate measure, such as Catch Per Unit Effort (CPUE).
Lastly, each of these descriptions would be given an eventID or occurrenceID to identify this specific observation. This is like a primary key in a SQL database.
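Putting the biological terms together, one tow's catch could be turned into occurrence rows like this. It is a sketch only: the catch values, the ID pattern, and the input field names are invented, and the WoRMS identifiers should be checked against WoRMS before reuse. Darwin Core pairs organismQuantity with the term organismQuantityType.

```python
def build_occurrences(event_id: str, catch: list[dict]) -> list[dict]:
    """Create one occurrence row per taxon caught during a single event."""
    rows = []
    for n, item in enumerate(catch, start=1):
        row = {
            "eventID": event_id,
            # Hypothetical ID pattern: unique within the dataset, like a primary key.
            "occurrenceID": f"{event_id}:occ-{n:03d}",
            "scientificName": item["name"],
            "scientificNameID": item["name_id"],
        }
        if "count" in item:
            row["individualCount"] = item["count"]
        else:
            row["organismQuantity"] = item["quantity"]
            row["organismQuantityType"] = item["quantity_type"]
        rows.append(row)
    return rows


# Invented catch for one tow; verify the AphiaIDs before reusing them.
catch = [
    {"name": "Gadus morhua",
     "name_id": "urn:lsid:marinespecies.org:taxname:126436", "count": 12},
    {"name": "Clupea harengus",
     "name_id": "urn:lsid:marinespecies.org:taxname:126417",
     "quantity": 3.4, "quantity_type": "kg per hour towed"},
]

occurrences = build_occurrences("survey-2021:tow-01", catch)
for occurrence in occurrences:
    print(occurrence)
```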
How does Darwin Core capture the information that might be specific to this study that might not be common across all of biology? It uses extensions, especially the Extended Measurement or Fact extension. In this way, any measurement or variable can be linked to any event or occurrence. Our example here shows temperature. Below we show a more unusual example from satellite telemetry.
This table reads differently than the others. In this case we have the Darwin Core terms in the left column and an example value in the right column. We name the variable in measurementType and use measurementTypeID to link it to a corresponding URL definition that can be read by a computer.
Similarly, we break unit into a name and ID so computers can efficiently and correctly coordinate these data with other data. If we have information on properties like accuracy, methodology, and who recorded this information, we can record that as well.
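A temperature record in this layout might look like the sketch below, printed as term and value pairs. The measurement values, method, and recorder are invented, and the NERC Vocabulary Server URLs are an assumption about which vocabulary a dataset would use; confirm the identifiers against the vocabulary you adopt.

```python
def temperature_emof(event_id: str, value_celsius: float) -> dict:
    """Build one Extended Measurement or Fact row for a temperature reading."""
    return {
        "eventID": event_id,
        "measurementType": "water temperature",
        # Assumed vocabulary entries; verify before publishing.
        "measurementTypeID": "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01/",
        "measurementValue": str(value_celsius),
        "measurementUnit": "degrees Celsius",
        "measurementUnitID": "http://vocab.nerc.ac.uk/collection/P06/current/UPAA/",
        # Optional context; all values here are hypothetical.
        "measurementAccuracy": "0.1",
        "measurementMethod": "CTD cast at tow start",
        "measurementDeterminedBy": "Survey technician",
    }


emof = temperature_emof("survey-2021:tow-01", 11.8)
for term, value in emof.items():
    print(f"{term:<26}{value}")
```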
Example: Satellite Telemetry Data
Here is another example: data from the Animal Telemetry Network (ATN), which deploys satellite tags on animals to track their movement.
We included this example to demonstrate how a parent event, like the entire deployment of a tag, can be described separately from child events and occurrences. Here, eventDate describes the full time range of all the satellite pings (individual occurrences) that happened during the deployment. minimumDepthInMeters and maximumDepthInMeters are appropriately recorded as 0 m because all pings happen at the surface.
Since the deployment is not a single point, it is described as a MULTIPOINT using footprintWKT.
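A deployment-level event could be summarized from its pings roughly as follows. The ping times and positions are invented, and the sketch assumes every timestamp uses the same UTC format so that sorting the strings also sorts them in time.

```python
def deployment_event(event_id: str, pings: list[tuple[str, float, float]]) -> dict:
    """Summarize pings (timestamp, latitude, longitude) as one parent event."""
    # Lexical sort is only safe because all timestamps share one UTC format.
    times = sorted(timestamp for timestamp, _, _ in pings)
    # WKT lists longitude before latitude.
    points = ", ".join(f"({lon} {lat})" for _, lat, lon in pings)
    return {
        "eventID": event_id,
        "eventDate": f"{times[0]}/{times[-1]}",
        "footprintWKT": f"MULTIPOINT ({points})",
        # Pings are transmitted at the surface.
        "minimumDepthInMeters": 0,
        "maximumDepthInMeters": 0,
    }


# Hypothetical pings for one tagged animal.
pings = [
    ("2019-05-01T10:00:00Z", 36.6, -121.9),
    ("2019-05-02T14:20:00Z", 36.8, -122.1),
    ("2019-05-03T16:30:00Z", 37.0, -122.4),
]

deployment = deployment_event("deployment-001", pings)
print(deployment)
```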
Lastly, samplingProtocol is a generic description here, but it could just as easily reference a DOI from a published protocol or paper.
Here is an example of an occurrence, or individual ping from the tracked animal. Some noteworthy terms we haven’t seen yet include organismID to identify a specific animal that is tracked for a long time. associatedReferences points the user to the corresponding publication. bibliographicCitation describes how this individual data point should be cited if, for example, it is mixed with data from another dataset. Finally, occurrenceRemarks gives us a place to note that this is a representative point from the full dataset, which can be found at NOAA’s NCEI.
Let’s look at how the Extended Measurement or Fact extension looks here, as a table. Whereas in the previous example we were looking at how a single variable is broken down, here we see how multiple variables are linked to a single occurrence. A new facet to these data is that they are referencing another standard, the Movebank Attribute Dictionary. The telemetry and tracking community have already agreed to use these terms to describe specific attributes about variables like tag make and model, or tag placement. Rather than reinvent the wheel, we simply map to them so machines can be leveraged to coordinate these data with other data referencing Movebank attributes.
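The "many variables, one occurrence" layout amounts to reshaping a wide record into long rows that share an occurrenceID. In this sketch the attribute names, values, and example.org URLs are placeholders, not real Movebank Attribute Dictionary entries; a real mapping would use the dictionary's own identifiers.

```python
def to_emof_rows(occurrence_id: str, attributes: dict, type_ids: dict) -> list[dict]:
    """Turn one wide record of tag attributes into long-format eMoF rows."""
    return [
        {
            "occurrenceID": occurrence_id,
            "measurementType": name,
            "measurementTypeID": type_ids.get(name, ""),
            "measurementValue": str(value),
        }
        for name, value in attributes.items()
    ]


# Placeholder attributes and identifiers, for illustration only.
tag_attributes = {
    "tag manufacturer name": "Example Tags Inc.",
    "tag model": "Model X",
    "attachment type": "glue",
}
attribute_ids = {
    "tag manufacturer name": "https://example.org/vocab/tag-manufacturer-name",
    "tag model": "https://example.org/vocab/tag-model",
    "attachment type": "https://example.org/vocab/attachment-type",
}

emof_rows = to_emof_rows("deployment-001:ping-0001", tag_attributes, attribute_ids)
for row in emof_rows:
    print(row)
```

Because every row carries the same occurrenceID, any number of attributes can be attached to a ping without adding columns to the occurrence table.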
There are many paths!
There are many paths to the top of the mountain. See the resources page to view how others have shared their data with Darwin Core.