Chapter 3 Data cleaning

Biodiversity data repositories work hard to maintain the accuracy of their holdings, but when multiple sources are combined several problems may arise. Here we briefly outline these problems and some possible solutions.

3.1 Resources

There are a number of libraries, workflows and online resources for automating the downloading and cleaning of data. These include:

3.2 Taxonomies

It is important to be aware of likely taxonomic anomalies before working within a region. Checklists are very important, especially when working across several regions or countries. While there are many tools that automatically check the validity of a name, they do not check the validity of that species occurrence. For example, Sphagnum auriculatum Schimp. and Sphagnum denticulatum Bridel, 1826 are both valid names. S. auriculatum is the currently accepted species in Europe, but in Britain, Ireland and the Netherlands S. denticulatum is the most recorded taxon. Both are legitimate names but they are synonyms, and both have been used in Europe, albeit in different countries. The current European checklist (Hodgetts et al. 2020) has Sphagnum auriculatum Schimp. as the accepted taxon occurring in Europe. Naive downloading would therefore result in two taxa being present when in fact they are two interpretations of the same taxon. Using data from across the European region without acknowledging this disagreement would affect the results of any research undertaken. For taxa capable of dispersing great distances (e.g. birds) this becomes even more difficult, especially when using community-sourced data.

The example above uses data downloaded from GBIF.

  • Data for Sphagnum denticulatum doi: 10.15468/dl.rrp4p4
  • Data for Sphagnum auriculatum doi: 10.15468/dl.3yrtw7
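
A minimal sketch of harmonising this synonymy before analysis is given below. The data frame name sphagnum and its scientificName column are assumed names for the combined downloads; records of S. denticulatum are simply mapped to the name accepted by Hodgetts et al. (2020).

library(dplyr)

# Map the synonym onto the accepted name before any analysis
sphagnum <- sphagnum |>
  mutate(acceptedName = if_else(startsWith(scientificName, "Sphagnum denticulatum"),
                                "Sphagnum auriculatum Schimp.",
                                scientificName))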

Within Sweden there is an agreed taxonomy for all extant taxa, accessible through Dyntaxa. Species lists can be checked by copying species names directly into a dialogue box or by uploading an Excel spreadsheet in the correct format.

3.3 Location data

3.3.1 Locality information

Many records have locality information attached to them. Where no coordinates are attached, this information can be used to place the record within the area it most likely came from. Geocoding, i.e. finding the coordinates of a known locality, can be done automatically within R using functions in the following libraries:

  • ggmap (requires a Google API key)
  • tidygeocoder

Many localities may not be included in the gazetteers associated with these libraries but may be located using online or printed maps. This takes time but may be useful. Unique localities may be extracted, geocoded and then merged back with the data set.
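
A minimal sketch of that workflow is given below. The data frame names occ and Localities, the locality column and the decimalLatitude column are assumptions for illustration; tidygeocoder's default "osm" method uses the free Nominatim service, which needs no API key but is rate limited.

library(dplyr)
library(tidygeocoder)

# Geocode each distinct locality string that lacks coordinates ...
Localities <- occ |>
  filter(is.na(decimalLatitude)) |>
  distinct(locality) |>
  geocode(address = locality, method = "osm")  # adds lat and long columns

# ... and merge the coordinates back onto the occurrence data
occ <- occ |>
  left_join(Localities, by = "locality")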

This becomes especially important when extracting records across country boundaries, as countries differ greatly in the proportion of their observations that are georeferenced.
Table 3.1: The number of Bryophyta records per European country and the percentage with and without coordinates
Code Country Total records % Records with coordinates % Records without coordinates
AL Albania 385 6.23 6.23
AD Andorra 2408 6.85 6.85
AM Armenia 503 54.67 54.67
AT Austria 68000 20.78 20.78
BY Belarus 2102 20.65 20.65
BE Belgium 470192 6.73 6.73
BA Bosnia & Herzegovina 415 71.33 71.33
BG Bulgaria 1349 37.21 37.21
HR Croatia 1060 49.81 49.81
CY Cyprus 595 61.18 61.18
CZ Czechia 19910 48.49 48.49
DK Denmark 949565 1.24 1.24
EE Estonia 109502 34.48 34.48
FO Faroe Islands 1618 47.10 47.10
FI Finland 256141 5.35 5.35
FR France 1298606 6.27 6.27
GE Georgia 9680 36.94 36.94
DE Germany 213246 18.91 18.91
GI Gibraltar 22 100.00 100.00
GR Greece 2183 65.23 65.23
HU Hungary 4115 58.13 58.13
IS Iceland 43033 15.07 15.07
IE Ireland 132611 2.77 2.77
IT Italy 22635 44.66 44.66
LV Latvia 7596 42.96 42.96
LI Liechtenstein 546 3.48 3.48
LT Lithuania 5431 6.52 6.52
LU Luxembourg 23640 1.61 1.61
MT Malta 95 40.00 40.00
MC Monaco 16 87.50 87.50
NL Netherlands 886075 4.02 4.02
MK North Macedonia 284 24.65 24.65
NO Norway 526305 29.36 29.36
PL Poland 52850 31.81 31.81
PT Portugal 256568 9.93 9.93
RO Romania 1501 58.69 58.69
RU Russia 217991 15.96 15.96
SM San Marino 1 0.00 0.00
SK Slovakia 5047 51.46 51.46
SI Slovenia 3991 37.31 37.31
ES Spain 130037 20.33 20.33
SE Sweden 1205171 10.79 10.79
CH Switzerland 267715 4.51 4.51
TR Turkey 9274 85.00 85.00
UA Ukraine 13225 8.81 8.81
GB United Kingdom 2976217 1.44 1.44
VA Vatican City 3 100.00 100.00

As can be seen from the above example for European Bryophyta (hornworts, liverworts and mosses), the proportion of records with coordinates varies greatly between countries. For example, of 959444 records in Sweden 88.24 % have coordinates, whereas of 117132 records in Switzerland only 2.29 % have coordinates.
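
A summary like the table above can be computed directly from a download. Below is a sketch assuming a data frame occ with the Darwin Core columns countryCode and decimalLatitude.

library(dplyr)

# Proportion of georeferenced records per country
occ |>
  group_by(countryCode) |>
  summarise(total = n(),
            pct_with_coords = 100 * mean(!is.na(decimalLatitude)),
            pct_without_coords = 100 * mean(is.na(decimalLatitude)))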

3.3.2 Coordinate uncertainty

In many cases there is now an abundance of biodiversity data with coordinates. As can be seen from a summary of the Sphagnum data above, coordinate uncertainty varies from less than 1 metre to more than 50 kilometres.

Coordinate uncertainty <1 m 1-10 m 10-100 m 100-500 m 500-1000 m 1-5 km 5-10 km 10-50 km >50 km None
Frequency 335 586 5320 376 4619 4980 4147 41 2 7506

In the above example it can be seen that two records have uncertainties greater than 50 km and several thousand records have no known error margin on their location. It is important to consider how large an error is acceptable and to remove those records for which the uncertainty is too high. Where this threshold lies will depend on the scale of the research.
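
A sketch of such a filter is given below, assuming a data frame occ with the Darwin Core column coordinateUncertaintyInMeters and a 1 km threshold chosen purely for illustration.

library(dplyr)

# Keep only records with a known and acceptably small coordinate uncertainty
occ_precise <- occ |>
  filter(!is.na(coordinateUncertaintyInMeters),
         coordinateUncertaintyInMeters <= 1000)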

3.3.3 Coordinate errors

Besides the declared uncertainty of the coordinates, coordinate errors may occur for a variety of reasons. The library CoordinateCleaner is very useful for detecting and removing some of the most common errors (a brief sketch follows the list below). These include:

  • Coordinates at 0 latitude, 0 longitude
  • Swapped latitude and longitude
  • The location of the institution holding a preserved sample rather than the location of origin of that sample
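
A minimal sketch of flagging (rather than removing) such records is given below. The data frame name occ and its longitude and latitude columns are assumptions; a fuller example appears in section 3.4.

library(CoordinateCleaner)

occ_flagged <- clean_coordinates(occ,
                                 lon   = "longitude",
                                 lat   = "latitude",
                                 tests = c("zeros",         # 0/0 coordinates
                                           "institutions")) # institution locations
# The default output keeps every row and adds logical flag columns
# (.zer, .inst, ...) together with an overall .summary column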

3.4 Examples

3.4.1 Data cleaning Cyperaceae in Sweden

In this example I have downloaded the occurrences of the family Cyperaceae in Sweden since 2000 from SBDI. This is a single data set from an area with an agreed taxonomy (Dyntaxa). Our aim here is to download a group of records and clean them so that they reflect the question the researcher has in mind.

First we give each record in the assembled data set a unique ID. In this case it would be possible to use the GBIF identifier (gbifID). In data sets assembled from multiple sources it is a good idea to create a unique ID related to the data source.
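
A minimal sketch, assuming the download has been read into a data frame named Carex as used below; the gbifID column name and the "SBDI" prefix are illustrative assumptions.

# Either reuse the identifier supplied by the aggregator ...
# Carex$id <- Carex$gbifID
# ... or build an ID that encodes the data source, so IDs stay unique
# when several data sets are combined
Carex$id <- paste0("SBDI_", seq_len(nrow(Carex)))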

In cleaning the data there are a number of dimensions of uncertainty that we shall check in turn.

3.4.2 Taxonomy

There are records with no species given in the species column. We check the scientificName column to see whether there is useful information there:

kable(Carex |>
  filter(species == "") |>
  distinct(scientificName)) |>
   kable_styling(latex_options = c("striped", "hold_position"),
                full_width = F)
scientificName
Carex divulsa subsp. leersii (F.W.Schultz) W.Koch
Carex L.
Carex saxatilis var. saxatilis
Carex fuliginosa subsp. misandra (R.Br.) Nyman
Carex bigelowii subsp. rigida (Raf.) W.Schultze-Motel
Carex buxbaumii subsp. mutica (Hartm.) Isov.
Carex buxbaumii f. buxbaumii
Carex norvegica subsp. inferalpina (Wahlenb.) Hultén
Carex diandra var. major (W.D.J.Koch) Boott
Carex buxbaumii var. mutica Hartm.
Carex cognata var. congolensis (Turrill) Lye
Carex saxatilis subsp. laxa (Trautv.) Kalela

In this case these records are identified either only to genus (Carex L.) or to an infraspecific rank.

We then check the locality information of the records that have no species data in the species column, in case species information has been recorded there. Here there are few enough unique values that they can be checked easily within R; in large data sets there may be very many records for which this field holds useful information.
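
One way to do this, using the same Carex data frame, is to list the distinct locality strings of the records lacking a species name:

Carex |>
  filter(species == "") |>
  distinct(locality)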

There are no species data in the locality field, so we begin a vector of the IDs of records that are not at the required resolution.

unUseful <- Carex |>
            filter(scientificName == "") |>
            pull(id)  # pull() returns a vector of IDs, unlike select()

We now need to check that the species names in the records we are using are valid for the area we are looking at, i.e. Sweden. There are several resources that can do this. As we are looking specifically and solely at Sweden we can use Dyntaxa.se; there is also an R package, part of the Swedish Biodiversity Data Infrastructure, with examples and documentation. For wider applicability we will export the unique names from the Carex data frame. These are then copied to the Dyntaxa portal for matching multiple names, the results are checked against the Swedish lists, and unknown species or uncertainties are flagged. Names can be copied directly into a box in Dyntaxa or imported as an xlsx file. Here we shall export just the names; there are multiple options.


SpeciesNames <- Carex |>
  filter(!id %in% unUseful) |>
  select("scientificName") |>
  distinct()

# library(openxlsx)
#write.xlsx(SpeciesNames,"./data/BDcleaner_Scripts/Example/UniqueCarexTaxa.xlsx")

Dyntaxa provides options for resolving any taxonomic uncertainties and lists the species for which no match can be made. Having made selections, it exports the results as an .xlsx file. We can then check the species for which there is no information about what they are. In this case there are a number of species which are likely to be horticultural, a few taxa of hybrid origin which are in Dyntaxa with an '×' between the genus and species epithet, and Carex utriculata, a North American species which could be a misidentification of C. rostrata. We will exclude the horticultural and uncertain species. We do this by adding a SpeciesDyn column to the Dyntaxa file and then loading that file into R.

library(openxlsx)
SpeciesNames <- read.xlsx("./data/matchCarex.xlsx") # From Dyntaxa

Carex <- merge(Carex,
               SpeciesNames[, c("Provided.string", "SpeciesDyn")],
               by.x = "scientificName", by.y = "Provided.string",  # match on the names that were sent to Dyntaxa
               all.x = TRUE)

We then add the IDs of the rows with taxa that are to be excluded to the unUseful vector. It is possible that there is overlap between these vectors; rather than overwriting the vector we join them together and use unique() to remove any duplicates.

unUseful <- unique(
  c(unUseful, 
    Carex |>
      filter(is.na(SpeciesDyn)) |>
      pull(id)
    )
  )

3.4.3 Coordinate cleaning

We need to extract the locality data for rows with no latitude/longitude information and geocode it, producing a latitude and longitude for each locality. This can be carried out in a similar way as for species, i.e. by merging data frames, or automatically in R using the geocode functions of the libraries tidygeocoder or ggmap. Some of these services require an API key (e.g. Google); see the documentation of each library for more information.

Localities <- Carex |>
              filter(is.na(longitude)) |>
              select(locality) |>
              distinct()
write.csv(Localities,"./data/Localities.csv",row.names = FALSE)

For brevity we will simply include the records with no lat/lon info in the unUseful vector. These will ultimately not be used.

unUseful <- unique(c(unUseful,
                     Carex |>
                       filter(is.na(latitude)) |>
                       pull(id)
                     )
                   )

Before cleaning the coordinates we simplify the data, retaining the columns that hold information directly related to the collection of the data.

Carex <- Carex |>
          select(id, 
                 SpeciesDyn,
                 longitude,
                 latitude,
                 coordinateUncertaintyInMetres,
                 locality,
                 # recordedBy,
                 countryCode,  # retained: needed for the country test below
                 # eventDate,
                 year,
                 month,
                 day,
                 institutionCode,
                 collectionCode,
                 catalogueNumber)

We then filter out the rows that we cannot use, i.e. those with too great a taxonomic uncertainty. The data may then be cleaned.

There are a number of things to consider when cleaning data:

  • How precise are the locations? Coordinate uncertainty in the Cyperaceae of Sweden ranges from 1 m to 30.5 km.
  • Are interpreted coordinates sufficient? Interpreted coordinates may imply a greater precision than is really the case.
  • Are the locations likely to be errors? Errors can relate to where a sample is held (e.g. a herbarium location) rather than where it came from.
  • Are the locations in the country claimed? Problems can arise through swapping of latitude and longitude, duplication of the latitude value in the longitude, or simply incorrect coordinates being given.
# kable( t(table( round( Carex$coordinateUncertaintyInMetres, -2))))

We shall filter out all records with unknown coordinate uncertainty and all with a coordinate uncertainty more than 12.5 km.


unUseful <- unique( 
  c(unUseful, 
    Carex |>
      filter(is.na(coordinateUncertaintyInMetres)) |>
      pull(id)
    )
  )
unUseful <- unique( 
  c(unUseful, 
    Carex |>
      filter(coordinateUncertaintyInMetres > 12500) |>
      pull(id)
    )
  )

We now have an index of records we can’t use owing to incomplete taxonomy or incomplete location information. We save these before cleaning the coordinates.

save(unUseful, file = "./data/Cyperacea_SWE_unsuseful.rdata")

We first remove the records already flagged in the unUseful vector.

Carex <- Carex |>
          filter(!id %in% unUseful)  # drop the flagged records

We then use the library CoordinateCleaner to automatically flag coordinates that may be errors. The output can be either a cleaned data.frame or the original data with additional columns flagging doubtful records.

Carex$countryCode <- countrycode::countrycode(Carex$countryCode, "iso2c", "iso3c")  # convert ISO2 country codes to the ISO3 codes used by the country test below

Carex <- Carex |>
          filter(!is.na(latitude))  # safety check: no records without coordinates should remain

library(CoordinateCleaner)

Carex <- clean_coordinates(Carex,
                  lon = "longitude",
                  lat = "latitude",
                  species = "SpeciesDyn",
                  countries = "countryCode",
                  tests = c("capitals", "centroids", "equal", "gbif", "institutions",
                            "outliers", "seas", "zeros", "countries")
                  )

kable( Carex |>
  summarise(`Invalid coords` = sum(.val==FALSE),
            `Equal coords` = sum(.equ==FALSE),
            `0 coords` = sum(.zer==FALSE),
            `capitals` = sum(.cap==FALSE),
            `country centre` = sum(.cen==FALSE),
            `Country Border` = sum(.con==FALSE),
            `outlier` = sum(.otl==FALSE),
            `GBIF HQ` = sum(.gbf==FALSE),
            `Institution` = sum(.inst==FALSE),
            `Summary` = sum(.summary==FALSE)), 
  col.names = c("Invalid coords","Equal coords","0 coords","capitals", 
                "country centre","Country Border","outlier","GBIF HQ",
                "Institution", "Summary"))

We then check the spatial distribution of the observations.


library(sf)
library(ggplot2)
library(cowplot)  # theme_cowplot(), plot_grid()
library(BIRDS)    # grid and visit-summary functions used below

Sweden <- raster::getData("GADM", 
                  country = "SWE",
                  path = "./data/",
                  level = 0)
Sweden_sf <- st_as_sf(Sweden)

# Point maps of two example species over the Sweden outline
BolMar <- ggplot(Sweden_sf) +
  geom_sf() +
  geom_point(data = Carex |>
                filter(SpeciesDyn == "Carex capillaris"),# |>
                # filter(.con == FALSE),
             aes(x = longitude,
                 y = latitude)) +
  theme_cowplot() +
  ggtitle("Carex capillaris")

CarAct <- ggplot(Sweden_sf) +
  geom_sf() +
  geom_point(data = Carex |>
              filter(SpeciesDyn == "Carex acuta"), # |>
                 # filter(.otl == TRUE),
             aes(x = longitude,
                 y = latitude)) + #,
                 # colour = .otl))+
  theme_cowplot() +
  ggtitle("Carex acuta")

# Make a grid over Sweden; the records are then organised into visits and summarised per grid cell
grid_Sweden <- makeGrid(Sweden, 25)

OB <- organiseBirds(Carex, 
                    idCols = c("locality"),
                    xyCols = c("longitude", "latitude"),
                    sppCol = "SpeciesDyn")

SB <- summariseBirds(OB, grid_Sweden)  # per-cell summaries (nObs, nVis, nSpp)
grid_Sweden <- st_as_sf(grid_Sweden)
library(colorRamps)
SB_SPat <- st_as_sf(SB$spatial)

# Maps of species richness (nSpp), visits (nVis) and observations (nObs) per grid cell
nSpecies <- ggplot(data = SB_SPat, aes(fill = nSpp)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

nVisits <- ggplot(data = SB_SPat,aes(fill = nVis)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

nObs <- ggplot(data = SB_SPat,aes(fill = nObs)) +
  geom_sf() +
  theme_cowplot() +
  scale_fill_gradientn(colors = matlab.like2(100))

plot_grid(nSpecies, nVisits, nObs, ncol = 1)

References

Hodgetts, N. G., Lars Söderström, T. L. Blockeel, S. Caspari, Michael S. Ignatov, N. A. Konstantinova, N. Lockhart, Beáta Papp, Christian Schröck, and Manuela Sim-Sim. 2020. “An Annotated Checklist of Bryophytes of Europe, Macaronesia and Cyprus.” Journal of Bryology 42 (1): 1–116.