Exporting complex ecological data into GBIF

To GBIF and back again, using the Event core standard.

Jens Åström https://github.com/jenast (The Norwegian Institute for Nature Research) https://www.nina.no

Where should I put my ecological data set?

Making your data publicly available is quickly becoming a standard task for researchers. Journals increasingly demand it when you publish your research findings, and some funding agencies even demand it when you apply for grants. Journals have traditionally accepted data as files, which can be reached through their websites along with the paper. But these data have idiosyncratic formats and are typically stripped down to a minimal set suited for a specific analysis in the paper. In addition, the data can be hard to find and cumbersome to download, since they are stored (or should I say hidden) behind paywalls on various publishers’ websites.

Wouldn’t it be nice if we could store our ecological data using a common format, in a common place, freely available for everyone?

GBIF could be your solution. It doesn’t work for every sort of biological data, but it probably works for more cases than you would think.

-But my data have a complex structure with repeated measurements and zero-counts that GBIF just can’t accept, you might say.

Well, let’s see about that. With the new(ish) “Event Core” of the Darwin Core standard, odds are that GBIF works for your data as well. Below is the story of how I formatted and published a multi-year observation data set of ~80 species with a hierarchical survey scheme, while incorporating all collected environmental covariates and metadata into GBIF.

The Event Core standard

GBIF, the Global Biodiversity Information Facility, is a vast store of biodiversity data, with records spanning hundreds of years and containing hundreds of millions of occurrence records. Until recently, though, it has been limited to presence-only data, neglecting information on possible zero counts. The addition of Events to the Darwin Core standard seeks to remedy that. With this addition, GBIF now accepts most survey data. It does this by organizing the data around individual sampling events, i.e. a distinct measurement of one or more species, together with related records, at a specific time and place.

See here for info on the Darwin Core in general: https://www.gbif.org/darwin-core. For event data specifically, see here: https://www.gbif.org/data-quality-requirements-sampling-events.

From a custom sample structure to a standardized hierarchy of events

Most of our complex data has a structure to it, whether it is spelled out clearly through naming schemes or implicit through relationships within a database. A botanist, for example, might have mapped the flora of a forest stand by placing a 1 m² survey square randomly 10 times and recording the percentage cover of each species in each square. The individual observations of the plants are then linked together through the specific 1 m² survey square they were found in, and further linked together through the 10 survey squares in the same forest stand. This could be recorded in a single table by using a column for the name of the forest stand, a column for the survey square, a column for the plant species, and a column for the percentage cover of the species (Table 1). Or, the data could be split up into separate tables in a database, with keys that link the tables together. Alas, there are (almost) as many potential structures as there are surveys. But there is only one Darwin Event Core. The task is then to standardise individual data sets so that they fit into GBIF, while providing the necessary information so that the original format can later be retrieved. It’s a little bit like zipping and unzipping a file.

The key idea behind the Event Core format is to link related information in a data set together using a hierarchy of recording events. The lowest-level event in our example is the counting of plants within one survey square. That’s the smallest observational unit that links individual records together. Each such event has a number of occurrence records linked to it, in this example the percentage cover of each plant in that survey square. This can include values of 0% for species that we looked for but didn’t find in a particular survey square. We can also include zero counts the same way. The hierarchical structure of the survey scheme is recorded by encoding the next recording level above the observation event as a parent event. In our case this is the survey of the entire forest stand. This hierarchy is open-ended, meaning there could be additional higher-level events. Thus a parent event can have its own parent event, and so on.
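The parent/child idea can be made concrete with a minimal Python sketch. The identifiers (stand_A, square_1, ...), the cover values and the helper functions are hypothetical and only illustrate the linking; they are not taken from any published data set.

```python
# Hypothetical events: two survey squares that share one forest-stand parent.
events = {
    "stand_A": {"parentEventID": None},
    "square_1": {"parentEventID": "stand_A"},
    "square_2": {"parentEventID": "stand_A"},
}

# Occurrences attach to the lowest-level event. A cover of 0.0 records a
# species that was looked for but not found in that square.
occurrences = [
    {"eventID": "square_1", "scientificName": "Quercus robur", "percentCover": 34.4},
    {"eventID": "square_1", "scientificName": "Anemone nemorosa", "percentCover": 0.0},
    {"eventID": "square_2", "scientificName": "Quercus robur", "percentCover": 9.3},
]


def ancestry(event_id, events):
    """Return the chain of event IDs from event_id up to the top-level event."""
    chain = []
    while event_id is not None:
        if event_id in chain:
            raise ValueError(f"cycle in event hierarchy at {event_id!r}")
        chain.append(event_id)
        event_id = events[event_id]["parentEventID"]
    return chain


def occurrences_under(top_event_id, events, occurrences):
    """Collect every occurrence recorded in top_event_id or any event below it."""
    return [
        occ for occ in occurrences
        if top_event_id in ancestry(occ["eventID"], events)
    ]


stand_records = occurrences_under("stand_A", events, occurrences)
```

Because each event only stores a pointer to its parent, the same walk works unchanged if a further level (say, a region containing several stands) is added above the stand.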

The point of coding the hierarchical relationships this way is to be able to store any (?) structure and related recorded data in a standardised way, in only one event table and one occurrence table. This obviously has some major benefits. We could store a wide range of biodiversity data in a unified way at a single location, facilitating the combination of data sources and streamlining data gathering routines.

But nothing so great comes for free, and the cost here is that you might need to rearrange your data before you upload it to GBIF. This particular formatting can seem confusing at first, and along the way you might wonder if it’s really worth it. But there is no way around it. There is inherently work involved in organizing complex data into a unified format and getting it back again. But once it’s done, your data is securely stored and publicly available, free of charge, in the world’s largest repository of biodiversity information.

A simple example

Let’s look at our made up example data of plant surveys in forests.

Table 1: Made up simple example data.
date        observer  forest      plot    pH    species                percCover
2019-04-01  Mary      Sherwood    plot_1  6.23  Quercus robur          34.44
2019-04-01  Mary      Sherwood    plot_1  6.23  Anemone nemorosa       25.61
2019-04-01  Mary      Sherwood    plot_1  6.23  Athyrium filix-femina  0.38
2019-04-01  Mary      Sherwood    plot_2  7.24  Quercus robur          9.30
2019-04-01  Mary      Sherwood    plot_2  7.24  Anemone nemorosa       26.64
2019-04-01  Mary      Sherwood    plot_2  7.24  Athyrium filix-femina  20.57
2019-04-15  John      Nottingham  plot_1  7.22  Quercus robur          27.74
2019-04-15  John      Nottingham  plot_1  7.22  Anemone nemorosa       21.80
2019-04-15  John      Nottingham  plot_1  7.22  Athyrium filix-femina  11.31
2019-04-15  John      Nottingham  plot_2  7.25  Quercus robur          36.94
2019-04-15  John      Nottingham  plot_2  7.25  Anemone nemorosa       11.69
2019-04-15  John      Nottingham  plot_2  7.25  Athyrium filix-femina  33.49

Figure 1: Rare footage of early botanists.

Using the Event Core format, this could be transformed into something like these two tables, which GBIF can swallow.

Table 2: Minimal example of Event table for made up data.
eventDate   eventID        parentEventID  locationID  dynamicProperties
2019-04-01  parentEvent_1  NA             Sherwood    {"observer" : "Mary"}
2019-04-01  event_1        parentEvent_1  plot_1      {"pH" : 6.23}
2019-04-01  event_2        parentEvent_1  plot_2      {"pH" : 7.24}
2019-04-15  parentEvent_2  NA             Nottingham  {"observer" : "John"}
2019-04-15  event_3        parentEvent_2  plot_1      {"pH" : 7.22}
2019-04-15  event_4        parentEvent_2  plot_2      {"pH" : 7.25}

Table 3: Minimal example of Occurrence table for made up data.
eventID  scientificName         organismQuantity  organismQuantityType
event_1  Quercus robur          34.44             percentageCover
event_1  Anemone nemorosa       25.61             percentageCover
event_1  Athyrium filix-femina  0.38              percentageCover
event_2  Quercus robur          9.30              percentageCover
event_2  Anemone nemorosa       26.64             percentageCover
event_2  Athyrium filix-femina  20.57             percentageCover
event_3  Quercus robur          27.74             percentageCover
event_3  Anemone nemorosa       21.80             percentageCover
event_3  Athyrium filix-femina  11.31             percentageCover
event_4  Quercus robur          36.94             percentageCover
event_4  Anemone nemorosa       11.69             percentageCover
event_4  Athyrium filix-femina  33.49             percentageCover

Notice that the individual survey squares are tied to a common forest survey through the parentEventID column in the event table. Additional information (here pH and observer) is stored in a column named “dynamicProperties”, which is formatted as JSON and can hold any collection of such extra information. The result is perhaps less readable for a human, but the structure is generalizable to a lot of situations.
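The reshaping from Table 1 into Tables 2 and 3, and back again, can be sketched in a few lines of Python. This is only an illustration on four of the made up rows, not the code used for the real data set. The function names are hypothetical, and the JSON key "observer" is an assumption about how the observer name would be stored in dynamicProperties.

```python
import json

# Four rows of Table 1: (date, observer, forest, plot, pH, species, percCover).
rows = [
    ("2019-04-01", "Mary", "Sherwood", "plot_1", 6.23, "Quercus robur", 34.44),
    ("2019-04-01", "Mary", "Sherwood", "plot_1", 6.23, "Anemone nemorosa", 25.61),
    ("2019-04-01", "Mary", "Sherwood", "plot_2", 7.24, "Quercus robur", 9.30),
    ("2019-04-15", "John", "Nottingham", "plot_1", 7.22, "Quercus robur", 27.74),
]


def to_event_core(rows):
    """Split flat rows into an event table and an occurrence table."""
    events, occurrences = [], []
    parent_ids, event_ids = {}, {}
    for date, observer, forest, plot, ph, species, cover in rows:
        parent_key = (date, forest)
        if parent_key not in parent_ids:  # first row of a new forest survey
            parent_ids[parent_key] = f"parentEvent_{len(parent_ids) + 1}"
            events.append({
                "eventDate": date,
                "eventID": parent_ids[parent_key],
                "parentEventID": None,
                "locationID": forest,
                # Assumed key name; the observer belongs to the whole survey.
                "dynamicProperties": json.dumps({"observer": observer}),
            })
        event_key = (date, forest, plot)
        if event_key not in event_ids:  # first row of a new survey square
            event_ids[event_key] = f"event_{len(event_ids) + 1}"
            events.append({
                "eventDate": date,
                "eventID": event_ids[event_key],
                "parentEventID": parent_ids[parent_key],
                "locationID": plot,
                "dynamicProperties": json.dumps({"pH": ph}),
            })
        occurrences.append({
            "eventID": event_ids[event_key],
            "scientificName": species,
            "organismQuantity": cover,
            "organismQuantityType": "percentageCover",
        })
    return events, occurrences


def to_flat(events, occurrences):
    """Rebuild the flat rows by joining each occurrence to its event and parent."""
    by_id = {event["eventID"]: event for event in events}
    flat = []
    for occ in occurrences:
        event = by_id[occ["eventID"]]
        parent = by_id[event["parentEventID"]]
        flat.append((
            event["eventDate"],
            json.loads(parent["dynamicProperties"])["observer"],
            parent["locationID"],
            event["locationID"],
            json.loads(event["dynamicProperties"])["pH"],
            occ["scientificName"],
            occ["organismQuantity"],
        ))
    return flat


events, occurrences = to_event_core(rows)
restored = to_flat(events, occurrences)
```

In this sketch the join back only works because every survey square has exactly one parent and the field names inside dynamicProperties are known in advance, which is the kind of information that has to accompany the data set for the original format to be retrievable.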

Butterfly and bumblebee example

I will now show how we formatted a real-world data set for publication on GBIF through the Event Core standard. The data come from a yearly survey of butterflies and bumblebees at approximately 60 fixed locations in Norway. We employ citizen scientists to record the abundances of butterflies and bumblebees along 20 pre-defined transects, each 50 meters long, at each survey location. Each location is visited 3 times each year, and rudimentary environmental data is recorded for each transect at each visit. We train each surveyor to identify all occurring species, and can thus record non-observed species as zero counts.
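To get a feel for what recording zero counts means in practice, here is a small Python sketch. The event IDs, the three-species list and the sightings are hypothetical, and the use of the Darwin Core term occurrenceStatus to flag absences is an assumption, not necessarily how the published data set is coded. The survey dimensions are the approximate figures given above.

```python
# Approximate survey design, taken from the description above.
N_LOCATIONS, N_TRANSECTS, N_VISITS = 60, 20, 3
events_per_year = N_LOCATIONS * N_TRANSECTS * N_VISITS  # transect visits per year

# Hypothetical species list and transect-visit events, for illustration only.
SPECIES = ["Bombus terrestris", "Bombus lapidarius", "Aglais urticae"]
event_ids = ["loc01_visit1_transect01", "loc01_visit1_transect02"]

# Field sheets typically hold only what was seen: (event, species) -> count.
sightings = {
    ("loc01_visit1_transect01", "Bombus terrestris"): 4,
    ("loc01_visit1_transect02", "Aglais urticae"): 1,
}


def zero_fill(event_ids, species, sightings):
    """Return one occurrence row per (event, species); unseen species get 0."""
    rows = []
    for event_id in event_ids:
        for name in species:
            count = sightings.get((event_id, name), 0)
            rows.append({
                "eventID": event_id,
                "scientificName": name,
                "organismQuantity": count,
                "organismQuantityType": "individuals",
                # Assumed convention for flagging explicit absences.
                "occurrenceStatus": "present" if count > 0 else "absent",
            })
    return rows


occurrences = zero_fill(event_ids, SPECIES, sightings)
```

With roughly 3600 transect visits a year and ~80 species, writing out every zero gives on the order of 288,000 occurrence rows per year, most of them absences, so the occurrence table grows much faster than the field sheets suggest.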