Chapter 3 Data cleaning

Biodiversity data repositories work hard to maintain the accuracy of their holdings, but when multiple sources are combined several problems may arise. Here we briefly outline these problems and some possible solutions.

3.1 Resources

There are a number of libraries, workflows and online resources for automating the downloading and cleaning of data. These include:

3.2 Taxonomies

It is important to be aware of likely taxonomic anomalies before working within a region. Checklists are very important, especially when working across several regions or countries. While there are many tools that automatically check the validity of a name, they do not check the validity of that species occurrence. For example, Sphagnum auriculatum Schimp. and Sphagnum denticulatum Bridel, 1826 are both valid names. S. auriculatum is the currently accepted species in Europe, but in Britain, Ireland and the Netherlands S. denticulatum is the most recorded taxon. Both are legitimate names but they are synonyms, and both have been used in Europe, albeit in different countries. The current European checklist (Hodgetts et al. 2020) has Sphagnum auriculatum Schimp. as the accepted taxon occurring in Europe. Naive downloading would therefore result in two taxa being present when in fact they are two interpretations of the same taxon. Using data from across the European region without acknowledging this disagreement would affect the results of any research undertaken. For taxa capable of dispersing great distances (e.g. birds) this becomes even more difficult, especially when using community-sourced data.

The example above uses data downloaded from GBIF.

  • Data for Sphagnum denticulatum doi: 10.15468/dl.rrp4p4
  • Data for Sphagnum auriculatum doi: 10.15468/dl.3yrtw7
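
A minimal sketch of harmonising this synonymy before analysis is given below. The data frame name sphagnum and its scientificName column are assumed names for the combined downloads; records of S. denticulatum are simply mapped to the name accepted by Hodgetts et al. (2020).

library(dplyr)

# Map the synonym onto the accepted name before any analysis
sphagnum <- sphagnum |>
  mutate(acceptedName = if_else(startsWith(scientificName, "Sphagnum denticulatum"),
                                "Sphagnum auriculatum Schimp.",
                                scientificName))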

Within Sweden there is an agreed taxonomy for all extant taxa, accessible through Dyntaxa. Species lists can be checked by copying species names directly into a dialogue box or by uploading an Excel spreadsheet in the correct format.

3.3 Location data

3.3.1 Locality information

Many records have locality information attached to them. Where no coordinates are attached, this information can be used to place the record within the area it most likely came from. Geocoding, i.e. finding the coordinates of a known locality, can be done automatically within R using functions in the following libraries:

  • ggmap (requires a Google API key)
  • tidygeocoder

Many localities may not be included in the gazetteers associated with these libraries but may be located using online or printed maps. This takes time but may be useful. Unique localities may be extracted, geocoded and then merged back with the data set.
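
A minimal sketch of that workflow is given below. The data frame names occ and Localities, the locality column and the decimalLatitude column are assumptions for illustration; tidygeocoder's default "osm" method uses the free Nominatim service, which needs no API key but is rate limited.

library(dplyr)
library(tidygeocoder)

# Geocode each distinct locality string that lacks coordinates ...
Localities <- occ |>
  filter(is.na(decimalLatitude)) |>
  distinct(locality) |>
  geocode(address = locality, method = "osm")  # adds lat and long columns

# ... and merge the coordinates back onto the occurrence data
occ <- occ |>
  left_join(Localities, by = "locality")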

This becomes especially important when extracting records across country boundaries, as countries differ greatly in the proportion of their observations that are georeferenced.
Table 3.1: The number of Bryophyta records per European country and the percentage with and without coordinates
Code Country Total records % Records with coordinates % Records without coordinates
AL Albania 385 6.23 6.23
AD Andorra 2408 6.85 6.85
AM Armenia 503 54.67 54.67
AT Austria 68000 20.78 20.78
BY Belarus 2102 20.65 20.65
BE Belgium 470192 6.73 6.73
BA Bosnia & Herzegovina 415 71.33 71.33
BG Bulgaria 1349 37.21 37.21
HR Croatia 1060 49.81 49.81
CY Cyprus 595 61.18 61.18
CZ Czechia 19910 48.49 48.49
DK Denmark 949565 1.24 1.24
EE Estonia 109502 34.48 34.48
FO Faroe Islands 1618 47.10 47.10
FI Finland 256141 5.35 5.35
FR France 1298606 6.27 6.27
GE Georgia 9680 36.94 36.94
DE Germany 213246 18.91 18.91
GI Gibraltar 22 100.00 100.00
GR Greece 2183 65.23 65.23
HU Hungary 4115 58.13 58.13
IS Iceland 43033 15.07 15.07
IE Ireland 132611 2.77 2.77
IT Italy 22635 44.66 44.66
LV Latvia 7596 42.96 42.96
LI Liechtenstein 546 3.48 3.48
LT Lithuania 5431 6.52 6.52
LU Luxembourg 23640 1.61 1.61
MT Malta 95 40.00 40.00
MC Monaco 16 87.50 87.50
NL Netherlands 886075 4.02 4.02
MK North Macedonia 284 24.65 24.65
NO Norway 526305 29.36 29.36
PL Poland 52850 31.81 31.81
PT Portugal 256568 9.93 9.93
RO Romania 1501 58.69 58.69
RU Russia 217991 15.96 15.96
SM San Marino 1 0.00 0.00
SK Slovakia 5047 51.46 51.46
SI Slovenia 3991 37.31 37.31
ES Spain 130037 20.33 20.33
SE Sweden 1205171 10.79 10.79
CH Switzerland 267715 4.51 4.51
TR Turkey 9274 85.00 85.00
UA Ukraine 13225 8.81 8.81
GB United Kingdom 2976217 1.44 1.44
VA Vatican City 3 100.00 100.00

As can be seen from the above example for European Bryophyta (hornworts, liverworts and mosses), the proportion of records with coordinates varies greatly between countries. For example, of 959444 records in Sweden 88.24 % have coordinates, whereas of 117132 records in Switzerland only 2.29 % have coordinates.
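
A summary like the table above can be computed directly from a download. Below is a sketch assuming a data frame occ with the Darwin Core columns countryCode and decimalLatitude.

library(dplyr)

# Proportion of georeferenced records per country
occ |>
  group_by(countryCode) |>
  summarise(total = n(),
            pct_with_coords = 100 * mean(!is.na(decimalLatitude)),
            pct_without_coords = 100 * mean(is.na(decimalLatitude)))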

3.3.2 Coordinate uncertainty

In many cases there is now an abundance of biodiversity data with coordinates. As can be seen from a summary of the Sphagnum data above, coordinate uncertainty varies from less than 1 metre to more than 50 kilometres.

Coordinate uncertainty <1 m 1-10 m 10-100 m 100-500 m 500-1000 m 1-5 km 5-10 km 10-50 km >50 km None
Frequency 335 586 5320 376 4619 4980 4147 41 2 7506

In the above example it can be seen that two records have uncertainties greater than 50 km and several thousand records have no known error margin on their location. It is important to consider how large an error is acceptable and to remove those records for which the uncertainty is too high. Where this threshold lies will depend on the scale of the research.
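
A sketch of such a filter is given below, assuming a data frame occ with the Darwin Core column coordinateUncertaintyInMeters and a 1 km threshold chosen purely for illustration.

library(dplyr)

# Keep only records with a known and acceptably small coordinate uncertainty
occ_precise <- occ |>
  filter(!is.na(coordinateUncertaintyInMeters),
         coordinateUncertaintyInMeters <= 1000)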

3.3.3 Coordinate errors

Besides the declared uncertainty of the coordinates, coordinate errors may occur for a variety of reasons. The library CoordinateCleaner is very useful for detecting and removing some of the most common errors (a brief sketch follows the list below). These include:

  • Coordinates at 0 latitude, 0 longitude
  • Swapped latitude and longitude
  • The location of the institution holding a preserved sample rather than the location of origin of that sample
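
A minimal sketch of flagging (rather than removing) such records is given below. The data frame name occ and its longitude and latitude columns are assumptions; a fuller example appears in section 3.4.

library(CoordinateCleaner)

occ_flagged <- clean_coordinates(occ,
                                 lon   = "longitude",
                                 lat   = "latitude",
                                 tests = c("zeros",         # 0/0 coordinates
                                           "institutions")) # institution locations
# The default output keeps every row and adds logical flag columns
# (.zer, .inst, ...) together with an overall .summary column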

3.4 Examples

3.4.1 Data cleaning Cyperaceae in Sweden

In this example I have downloaded the occurrences of the family Cyperaceae in Sweden since 2000 from SBDI. This is a single data set from an area with an agreed taxonomy (Dyntaxa). Our aim here is to download a group of records and clean them so that they reflect the question the researcher has in mind.

First we give each record in the assembled data set a unique ID. In this case it would be possible to use the GBIF identifier (gbifID). In data sets assembled from multiple sources it is a good idea to create a unique ID related to the data source.
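
A minimal sketch, assuming the download has been read into a data frame named Carex as used below; the gbifID column name and the "SBDI" prefix are illustrative assumptions.

# Either reuse the identifier supplied by the aggregator ...
# Carex$id <- Carex$gbifID
# ... or build an ID that encodes the data source, so IDs stay unique
# when several data sets are combined
Carex$id <- paste0("SBDI_", seq_len(nrow(Carex)))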

In cleaning the data there are a number of dimensions of uncertainty that we shall check in turn.

3.4.2 Taxonomy

There are records with no species given in the species column. We check the scientificName column to see whether there is useful information there:

kable(Carex |>
  filter(species == "") |>
  distinct(scientificName)) |>
   kable_styling(latex_options = c("striped", "hold_position"),
                full_width = F)
scientificName
Carex divulsa subsp. leersii (F.W.Schultz) W.Koch
Carex L.
Carex saxatilis var. saxatilis
Carex fuliginosa subsp. misandra (R.Br.) Nyman
Carex bigelowii subsp. rigida (Raf.) W.Schultze-Motel
Carex buxbaumii subsp. mutica (Hartm.) Isov.
Carex buxbaumii f. buxbaumii
Carex norvegica subsp. inferalpina (Wahlenb.) Hultén
Carex diandra var. major (W.D.J.Koch) Boott
Carex buxbaumii var. mutica Hartm.
Carex cognata var. congolensis (Turrill) Lye
Carex saxatilis subsp. laxa (Trautv.) Kalela

In this case these records are identified either only to genus (Carex L.) or to an infraspecific rank.

We then check the locality information of the records that have no species data in the species column, in case species information has been recorded there. Here there are few enough unique values that they can be checked easily within R; in large data sets there may be very many records for which this field holds useful information.
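
One way to do this, using the same Carex data frame, is to list the distinct locality strings of the records lacking a species name:

Carex |>
  filter(species == "") |>
  distinct(locality)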

There are no species data in the locality field, so we begin a vector of the IDs of records that are not at the required resolution.

unUseful <- Carex |>
            filter(scientificName == "") |>
            pull(id)  # pull() returns a vector of IDs, unlike select()

We now need to check that the species names in the records we are using are valid for the area we are looking at, i.e. Sweden. There are several resources that can do this. As we are looking specifically and solely at Sweden we can use Dyntaxa.se; there is also an R package, part of the Swedish Biodiversity Data Infrastructure, with examples and documentation. For wider applicability we will export the unique names from the Carex data frame. These are then copied to the Dyntaxa portal for matching multiple names, the results are checked against the Swedish lists, and unknown species or uncertainties are flagged. Names can be copied directly into a box in Dyntaxa or imported as an xlsx file. Here we shall export just the names; there are multiple options.


SpeciesNames <- Carex |>
  filter(!id %in% unUseful) |>
  select("scientificName") |>
  distinct()

# library(openxlsx)
#write.xlsx(SpeciesNames,"./data/BDcleaner_Scripts/Example/UniqueCarexTaxa.xlsx")

Dyntaxa provides options for resolving any taxonomic uncertainties and lists the species for which no match can be made. Having made selections, it exports the results as an .xlsx file. We can then check the species for which there is no information about what they are. In this case there are a number of species which are likely to be horticultural, a few taxa of hybrid origin which are in Dyntaxa with an '×' between the genus and species epithet, and Carex utriculata, a North American species which could be a misidentification of C. rostrata. We will exclude the horticultural and uncertain species. We do this by adding a SpeciesDyn column to the Dyntaxa file and then loading that file into R.

library(openxlsx)
SpeciesNames <- read.xlsx("./data/matchCarex.xlsx") # From Dyntaxa

Carex <- merge(Carex,
               SpeciesNames[, c("Provided.string", "SpeciesDyn")],
               by.x = "scientificName", by.y = "Provided.string",  # match on the names that were sent to Dyntaxa
               all.x = TRUE)

We then add the IDs of the rows with taxa that are to be excluded to the unUseful vector. It is possible that there is overlap between these vectors; rather than overwriting the vector we join them together and use unique() to remove any duplicates.

unUseful <- unique(
  c(unUseful, 
    Carex |>
      filter(is.na(SpeciesDyn)) |>
      pull(id)
    )
  )

3.4.3 Coordinate cleaning

We need to extract the locality data for rows with no latitude/longitude information and geocode it, producing a latitude and longitude for each locality. This can be carried out in a similar way as for species, i.e. by merging data frames, or automatically in R using the geocode functions of the libraries tidygeocoder or ggmap. Some of these services require an API key (e.g. Google); see the documentation of each library for more information.

Localities <- Carex |>
              filter(is.na(longitude)) |>
              select(locality) |>
              distinct()
write.csv(Localities,"./data/Localities.csv",row.names = FALSE)

For brevity we will simply include the records with no lat/lon info in the unUseful vector. These will ultimately not be used.

unUseful <- unique(c(unUseful,
                     Carex |>
                       filter(is.na(latitude)) |>
                       pull(id)
                     )
                   )

Before cleaning the coordinates we simplify the data, retaining the columns that hold information directly related to the collection of the data.

Carex <- Carex |>
          select(id, 
                 SpeciesDyn,
                 longitude,
                 latitude,
                 coordinateUncertaintyInMetres,
                 locality,
                 # recordedBy,
                 countryCode,  # retained: needed for the country test below
                 # eventDate,
                 year,
                 month,
                 day,
                 institutionCode,
                 collectionCode,
                 catalogueNumber)

We then filter out the rows that we cannot use, i.e. those with too great a taxonomic uncertainty. The data may then be cleaned.

There are a number of things to consider when cleaning data:

  • How precise are the locations? Coordinate uncertainty in the Cyperaceae of Sweden ranges from 1 m to 30.5 km.
  • Are interpreted coordinates sufficient? Interpreted coordinates may imply a greater precision than is really the case.
  • Are the locations likely to be errors? Errors can relate to where a sample is held (e.g. a herbarium location) rather than where it came from.
  • Are the locations in the country claimed? Problems can arise through swapping of latitude and longitude, duplication of the latitude value in the longitude, or simply incorrect coordinates being given.
# kable( t(table( round( Carex$coordinateUncertaintyInMetres, -2))))

We shall filter out all records with unknown coordinate uncertainty and all with a coordinate uncertainty more than 12.5 km.


unUseful <- unique( 
  c(unUseful, 
    Carex |>
      filter(is.na(coordinateUncertaintyInMetres)) |>
      pull(id)
    )
  )
unUseful <- unique( 
  c(unUseful, 
    Carex |>
      filter(coordinateUncertaintyInMetres > 12500) |>
      pull(id)
    )
  )

We now have an index of records we can’t use owing to incomplete taxonomy or incomplete location information. We save these before cleaning the coordinates.

save(unUseful, file = "./data/Cyperacea_SWE_unsuseful.rdata")

We first remove the records already flagged in the unUseful vector.

Carex <- Carex |>
          filter(!id %in% unUseful)  # drop the flagged records

We then use the library CoordinateCleaner to automatically flag coordinates that may be errors. The output can be either a cleaned data.frame or the original data with additional columns flagging doubtful records.

Carex$countryCode <- countrycode::countrycode(Carex$countryCode, "iso2c", "iso3c")  # convert ISO2 country codes to the ISO3 codes used by the country test below

Carex <- Carex |>
          filter(!is.na(latitude))  # safety check: no records without coordinates should remain

library(CoordinateCleaner)

Carex <- clean_coordinates(Carex,
                  lon = "longitude",
                  lat = "latitude",
                  species = "SpeciesDyn",
                  countries = "countryCode",
                  tests = c("capitals", "centroids", "equal", "gbif", "institutions",
                            "outliers", "seas", "zeros", "countries")
                  )

kable( Carex |>
  summarise(`Invalid coords` = sum(.val==FALSE),
            `Equal coords` = sum(.equ==FALSE),
            `0 coords` = sum(.zer==FALSE),
            `capitals` = sum(.cap==FALSE),
            `country centre` = sum(.cen==FALSE),
            `Country Border` = sum(.con==FALSE),
            `outlier` = sum(.otl==FALSE),
            `GBIF HQ` = sum(.gbf==FALSE),
            `Institution` = sum(.inst==FALSE),
            `Summary` = sum(.summary==FALSE)), 
  col.names = c("Invalid coords","Equal coords","0 coords","capitals", 
                "country centre","Country Border","outlier","GBIF HQ",
                "Institution", "Summary"))

We then check the spatial distribution of the observations.


library(sf)
library(ggplot2)
library(cowplot)  # theme_cowplot(), plot_grid()
library(BIRDS)    # grid and visit-summary functions used below

Sweden <- raster::getData("GADM", 
                  country = "SWE",
                  path = "./data/",
                  level = 0)
Sweden_sf <- st_as_sf(Sweden)

# Point maps of two example species over the Sweden outline
BolMar <- ggplot(Sweden_sf) +
  geom_sf() +
  geom_point(data = Carex |>
                filter(SpeciesDyn == "Carex capillaris"),# |>
                # filter(.con == FALSE),
             aes(x = longitude,
                 y = latitude)) +
  theme_cowplot() +
  ggtitle("Carex capillaris")

CarAct <- ggplot(Sweden_sf) +
  geom_sf() +
  geom_point(data = Carex |>
              filter(SpeciesDyn == "Carex acuta"), # |>
                 # filter(.otl == TRUE),
             aes(x = longitude,
                 y = latitude)) + #,
                 # colour = .otl))+
  theme_cowplot() +
  ggtitle("Carex acuta")

# Make a grid over Sweden; the records are then organised into visits and summarised per grid cell
grid_Sweden <- makeGrid(Sweden, 25)

OB <- organiseBirds(Carex, 
                    idCols = c("locality"),
                    xyCols = c("longitude", "latitude"),
                    sppCol = "SpeciesDyn")

SB <- summariseBirds(OB, grid_Sweden)  # per-cell summaries (nObs, nVis, nSpp)
grid_Sweden <- st_as_sf(grid_Sweden)
library(colorRamps)
SB_SPat <- st_as_sf(SB$spatial)

# Maps of species richness (nSpp), visits (nVis) and observations (nObs) per grid cell
nSpecies <- ggplot(data = SB_SPat, aes(fill = nSpp)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

nVisits <- ggplot(data = SB_SPat,aes(fill = nVis)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

nObs <- ggplot(data = SB_SPat,aes(fill = nObs)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

plot_grid(nSpecies, nVisits, nObs, ncol = 1)

References

Hodgetts, N. G., Lars Söderström, T. L. Blockeel, S. Caspari, Michael S. Ignatov, N. A. Konstantinova, N. Lockhart, Beáta Papp, Christian Schröck, and Manuela Sim-Sim. 2020. “An Annotated Checklist of Bryophytes of Europe, Macaronesia and Cyprus.” Journal of Bryology 42 (1): 1–116.