To GBIF and back again, using the Event Core standard.
Making your data publicly available is quickly becoming a standard task for researchers. It is increasingly demanded by journals when publishing your research findings, or even by funding agencies when applying for grants. Journals have traditionally accepted data as files, which can be reached through their websites along with the paper. But these data have idiosyncratic formats and are typically stripped down to a minimal set suited for a specific analysis in the paper. In addition, the data can be hard to find and cumbersome to download, since they are stored (or should I say hidden) behind paywalls on various publishers’ websites.
Wouldn’t it be nice if we could store our ecological data using a common format, in a common place, freely available for everyone?
GBIF could be your solution. It doesn’t work for every sort of biological data, but it probably works for more cases than you would think.
“But my data have a complex structure with repeated measurements and zero counts that GBIF just can’t accept,” you might say.
Well, let’s see about that. With the new(ish) “Event Core” of the Darwin Core standard, odds are that GBIF works for your data as well. Below is the story of how I formatted and published on GBIF a multi-year observation data set of ~80 species with a hierarchical survey scheme, including all collected environmental covariates and metadata.
GBIF, the Global Biodiversity Information Facility, is a vast store of biodiversity data, with records spanning hundreds of years and containing hundreds of millions of occurrence records. Up until recently, though, it has been limited to presence-only data, neglecting information about possible zero counts. The addition of Events to the Darwin Core standard seeks to remedy that. With this addition, GBIF now accepts most survey data. It does this by organizing the data around individual sampling events, i.e. a distinct measurement of one or more species, together with related records, at a specific time and place.
See here for info on the Darwin Core in general: https://www.gbif.org/darwin-core. For event data specifically, see here: https://www.gbif.org/data-quality-requirements-sampling-events.
Most of our complex data has a structure to it, whether it is spelled out clearly through naming schemes, or implicit through relationships within a database. A botanist, for example, might have mapped the flora of a forest stand by placing a 1 m\(^2\) survey square randomly 10 times, and recorded the percentage cover of each species in each square. The individual observations of the plants are then linked together through the specific 1 m\(^2\) survey square they were found in, and further linked together through the 10 survey squares in the same forest stand. This could be recorded in a single table by using a column for the name of the forest stand, a column for the survey square, a column for the plant species, and a column for the percentage cover of the species (Table 1). Or, the data could be split up into separate tables in a database, with keys that link the tables together. Alas, there are (almost) as many potential structures as there are surveys. But there is only one Darwin Core Event Core. The task is then to standardise each individual data set so that it fits into GBIF, while providing the necessary information so that the original format can later be retrieved. It’s a little bit like zipping and unzipping a file.
The key idea behind the Event Core format is to link related information in a data set together using a hierarchy of recording events. The lowest level event in our example is the recording of plants within one survey square. That’s the smallest observational unit that links individual records together. Each such event has a number of occurrence records linked to it, in this example the percentage cover of each plant in that survey square. This can include values of 0% for species that we looked for but didn’t find in a particular survey square. Zero counts of individuals can be included in the same way. The hierarchical structure of the survey scheme is recorded by encoding the next recording level, the one above the observation event, as a parent event. In our case this is the survey of the entire forest stand. This hierarchy is open-ended, meaning there could be additional higher level events. Thus a parent event can have its own parent event, and so on.
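To make the open-ended hierarchy concrete, here is a minimal Python sketch of how a chain of parent events can be followed upwards from a survey square. The event names (`square_1`, `stand_1`, `region_1`) and the `ancestors` helper are made up for illustration; only the idea of an event pointing to its parent comes from the Event Core format.

```python
# Hypothetical mapping of eventID -> parentEventID.
# None marks a top-level event that has no parent.
parent_of = {
    "region_1": None,
    "stand_1": "region_1",
    "square_1": "stand_1",
    "square_2": "stand_1",
}


def ancestors(event_id, parent_of):
    """Return the chain of parent events, nearest parent first."""
    chain = []
    seen = {event_id}
    parent = parent_of[event_id]
    while parent is not None:
        if parent in seen:
            # Guard against a malformed table where events point at each other.
            raise ValueError(f"cycle in event hierarchy at {parent!r}")
        chain.append(parent)
        seen.add(parent)
        parent = parent_of[parent]
    return chain


print(ancestors("square_1", parent_of))  # ['stand_1', 'region_1']
```

Because every event only stores a pointer to its parent, the same single table can hold two, three, or more levels without any change to its columns.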
The point of coding the hierarchical relationships this way is to be able to store any (?) structure and related recorded data in a standardised way, in only one event table and one occurrence table. This obviously has some major benefits. We could store a wide range of biodiversity data in a unified way at a single location, facilitating the combination of data sources and streamlining data gathering routines.
But nothing so great comes for free, and the cost here is that you might need to rearrange your data before you upload it to GBIF. This particular formatting can seem confusing at first, and along the way you might wonder if it’s really worth it. But there is no way around it. There is inherently work involved in organizing complex data into a unified format and getting it back again. But once it’s done, your data is securely stored and publicly available, free of charge, in the world’s largest repository of biodiversity information.
Let’s look at our made-up example data of plant surveys in forests (Table 1).
date | observer | forest | plot | pH | species | percCover |
---|---|---|---|---|---|---|
2019-04-01 | Mary | Sherwood | plot_1 | 6.23 | Quercus robur | 34.44 |
2019-04-01 | Mary | Sherwood | plot_1 | 6.23 | Anemone nemorosa | 25.61 |
2019-04-01 | Mary | Sherwood | plot_1 | 6.23 | Athyrium filix-femina | 0.38 |
2019-04-01 | Mary | Sherwood | plot_2 | 7.24 | Quercus robur | 9.30 |
2019-04-01 | Mary | Sherwood | plot_2 | 7.24 | Anemone nemorosa | 26.64 |
2019-04-01 | Mary | Sherwood | plot_2 | 7.24 | Athyrium filix-femina | 20.57 |
2019-04-15 | John | Nottingham | plot_1 | 7.22 | Quercus robur | 27.74 |
2019-04-15 | John | Nottingham | plot_1 | 7.22 | Anemone nemorosa | 21.80 |
2019-04-15 | John | Nottingham | plot_1 | 7.22 | Athyrium filix-femina | 11.31 |
2019-04-15 | John | Nottingham | plot_2 | 7.25 | Quercus robur | 36.94 |
2019-04-15 | John | Nottingham | plot_2 | 7.25 | Anemone nemorosa | 11.69 |
2019-04-15 | John | Nottingham | plot_2 | 7.25 | Athyrium filix-femina | 33.49 |
Figure 1: Rare footage of early botanists.
Using the Event Core format, this could be transformed into something like these two tables, which GBIF can swallow.
eventDate | eventID | parentEventID | locationID | dynamicProperties |
---|---|---|---|---|
2019-04-01 | parentEvent_1 | NA | Sherwood | {"observer" : "Mary"} |
2019-04-01 | event_1 | parentEvent_1 | plot_1 | {"pH" : 6.23} |
2019-04-01 | event_2 | parentEvent_1 | plot_2 | {"pH" : 7.24} |
2019-04-15 | parentEvent_2 | NA | Nottingham | {"observer" : "John"} |
2019-04-15 | event_3 | parentEvent_2 | plot_1 | {"pH" : 7.22} |
2019-04-15 | event_4 | parentEvent_2 | plot_2 | {"pH" : 7.25} |
eventID | scientificName | organismQuantity | organismQuantityType |
---|---|---|---|
event_1 | Quercus robur | 34.44 | percentageCover |
event_1 | Anemone nemorosa | 25.61 | percentageCover |
event_1 | Athyrium filix-femina | 0.38 | percentageCover |
event_2 | Quercus robur | 9.30 | percentageCover |
event_2 | Anemone nemorosa | 26.64 | percentageCover |
event_2 | Athyrium filix-femina | 20.57 | percentageCover |
event_3 | Quercus robur | 27.74 | percentageCover |
event_3 | Anemone nemorosa | 21.80 | percentageCover |
event_3 | Athyrium filix-femina | 11.31 | percentageCover |
event_4 | Quercus robur | 36.94 | percentageCover |
event_4 | Anemone nemorosa | 11.69 | percentageCover |
event_4 | Athyrium filix-femina | 33.49 | percentageCover |
Notice that the individual survey squares are tied to a common forest survey through the parentEventID column in the event table. Additional data (here pH and observer) are stored in a column named “dynamicProperties”, which is formatted as JSON and can hold any collection of such extra information. The result is perhaps less readable for a human, but the structure is generalizable to a lot of situations.
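The "zipping and unzipping" of the made-up example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `to_event_core` and `to_flat` are hypothetical, the data are kept as plain lists of dictionaries rather than files, and the observer is stored under an `"observer"` key in dynamicProperties. It is not the workflow used for the real data set.

```python
import json

# A few rows of the made-up flat survey table (values copied from the example).
flat = [
    {"date": "2019-04-01", "observer": "Mary", "forest": "Sherwood", "plot": "plot_1",
     "pH": 6.23, "species": "Quercus robur", "percCover": 34.44},
    {"date": "2019-04-01", "observer": "Mary", "forest": "Sherwood", "plot": "plot_1",
     "pH": 6.23, "species": "Anemone nemorosa", "percCover": 25.61},
    {"date": "2019-04-01", "observer": "Mary", "forest": "Sherwood", "plot": "plot_2",
     "pH": 7.24, "species": "Quercus robur", "percCover": 9.30},
    {"date": "2019-04-15", "observer": "John", "forest": "Nottingham", "plot": "plot_1",
     "pH": 7.22, "species": "Quercus robur", "percCover": 27.74},
]


def to_event_core(rows):
    """Split flat rows into an event table and an occurrence table."""
    events, occurrences = [], []
    parent_ids, event_ids = {}, {}
    for row in rows:
        # One parent event per forest visit (date + forest).
        pkey = (row["date"], row["forest"])
        if pkey not in parent_ids:
            parent_ids[pkey] = f"parentEvent_{len(parent_ids) + 1}"
            events.append({
                "eventDate": row["date"], "eventID": parent_ids[pkey],
                "parentEventID": None, "locationID": row["forest"],
                "dynamicProperties": json.dumps({"observer": row["observer"]}),
            })
        # One child event per survey square within that visit.
        ekey = pkey + (row["plot"],)
        if ekey not in event_ids:
            event_ids[ekey] = f"event_{len(event_ids) + 1}"
            events.append({
                "eventDate": row["date"], "eventID": event_ids[ekey],
                "parentEventID": parent_ids[pkey], "locationID": row["plot"],
                "dynamicProperties": json.dumps({"pH": row["pH"]}),
            })
        occurrences.append({
            "eventID": event_ids[ekey], "scientificName": row["species"],
            "organismQuantity": row["percCover"],
            "organismQuantityType": "percentageCover",
        })
    return events, occurrences


def to_flat(events, occurrences):
    """Rebuild the original flat rows by joining occurrences to their events."""
    by_id = {e["eventID"]: e for e in events}
    rows = []
    for occ in occurrences:
        event = by_id[occ["eventID"]]
        parent = by_id[event["parentEventID"]]
        rows.append({
            "date": event["eventDate"],
            "observer": json.loads(parent["dynamicProperties"])["observer"],
            "forest": parent["locationID"],
            "plot": event["locationID"],
            "pH": json.loads(event["dynamicProperties"])["pH"],
            "species": occ["scientificName"],
            "percCover": occ["organismQuantity"],
        })
    return rows


events, occurrences = to_event_core(flat)
assert to_flat(events, occurrences) == flat  # the round trip loses nothing
```

The final assertion is the point of the exercise: as long as the event table keeps the parent links and the extra properties, the original flat layout can be recovered from the two standardised tables.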
I will now show how we formatted a real-world data set for publication on GBIF through the Event Core standard. The data come from a yearly survey of butterflies and bumblebees at approximately 60 fixed locations in Norway. We employ citizen scientists to record the abundances of butterflies and bumblebees along 20 pre-defined transects, each 50 meters long, at each survey location. Each location is visited 3 times each year, and rudimentary environmental data are recorded for each transect at each visit. We train each surveyor to be able to identify all occurring species, and can thus record non-observed species as zero counts.
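Recording non-observed species as zero counts amounts to expanding the observations to every combination of sampling event and species on the checklist. A minimal Python sketch of that step, with invented transect IDs, species and counts, could look like the following. The function name `with_zero_counts` is hypothetical, and the use of `"individuals"` as the quantity type and of `occurrenceStatus` values `"present"` and `"absent"` are assumptions about how one might fill those Darwin Core fields, not a description of the published data set.

```python
from itertools import product

# Hypothetical checklist, events and counts, invented for illustration only.
species_list = ["Bombus terrestris", "Bombus pascuorum", "Pieris napi"]
event_ids = ["transect_1", "transect_2"]
observed = {
    ("transect_1", "Bombus terrestris"): 3,
    ("transect_2", "Pieris napi"): 1,
}


def with_zero_counts(event_ids, species_list, observed):
    """Return one occurrence row per event and species, filling gaps with 0."""
    rows = []
    for event_id, species in product(event_ids, species_list):
        count = observed.get((event_id, species), 0)
        rows.append({
            "eventID": event_id,
            "scientificName": species,
            "organismQuantity": count,
            "organismQuantityType": "individuals",
            "occurrenceStatus": "present" if count > 0 else "absent",
        })
    return rows


rows = with_zero_counts(event_ids, species_list, observed)
print(len(rows))  # 6 rows: 2 transects x 3 species
```

This only makes sense when the surveyors really did look for every species on the checklist, which is why the training mentioned above matters.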