2. Prepare data

Load, merge and aggregate ASV portal data into a format suitable for downstream ecological analyses.

Load the datasets

Load the two 16S datasets using the data_path object defined in the previous section:

loaded <- load_data(data_path)

This reads the datasets into the loaded object, which you can inspect by clicking the magnifying glass next to it in the Environment pane. The object is a list with one sub-list per data type, and each sub-list contains one table or matrix per dataset:

counts: ASV read counts
asvs: ASV sequences and taxonomic annotations
events: core sample-level metadata (where, when, how, who)
datasets: dataset-level metadata
emof: dataset-specific extended measurement or fact contextual data

Sparse matrices (dgCMatrix) are used to reduce memory usage for count data, while tabular structures (data.table) are used for metadata. To see dataset sizes, inspect the elements in counts, where sizes are shown as matrix dimensions (e.g. KTH-2013-Baltic-16S: dgCMatrix [3074 x 21], corresponding to the number of ASVs × samples).
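To get a feel for the sparse format before inspecting the real data, here is a minimal sketch that builds a toy dgCMatrix with the Matrix package. The ASV and sample names are made up for illustration and do not come from the loaded datasets.

```r
library(Matrix)

# Toy count table: 4 hypothetical ASVs x 3 hypothetical samples, mostly zeros.
# Only the non-zero cells (row i, column j, value x) are stored.
toy_counts <- sparseMatrix(
  i = c(1, 2, 4, 4),
  j = c(1, 1, 2, 3),
  x = c(12, 3, 7, 1),
  dims = c(4, 3),
  dimnames = list(paste0("asv", 1:4), paste0("sample", 1:3))
)

class(toy_counts)  # sparse, column-compressed numeric matrix
dim(toy_counts)    # rows = ASVs, columns = samples

# Fraction of cells that actually hold a count
nnzero(toy_counts) / prod(dim(toy_counts))
```

The same dim() call should work on any element of loaded$counts, since those elements are described above as dgCMatrix objects.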

To see the full function documentation for load_data:

?asvoccur::load_data

Merge the datasets

Next, we merge the datasets so that they can be analysed jointly:

merged <- merge_data(loaded)

The merged object is a list with one combined object per data type (counts, events, asvs, datasets, emof), represented as a sparse matrix for count data and data tables for metadata, as in loaded.

Convert merged data to data frames

For some analyses, it is convenient to work with regular data frames instead of sparse matrices. Converting large count matrices to data frames can be very memory-intensive, because every zero is then stored explicitly, but it should be feasible for the datasets used here:

merged_df <- convert_to_df(merged, convert_counts = TRUE, max_cells = 1e9)
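As a rough guide to what such a conversion costs, the sketch below estimates the size of a dense table from its number of cells. It assumes 8 bytes per cell (R's double type) and ignores overhead; how convert_to_df actually stores counts, and the exact meaning of max_cells, should be checked in the function documentation. The helper dense_size_gb is hypothetical and not part of asvoccur.

```r
# Hypothetical helper: approximate size in GB of a dense n_rows x n_cols table,
# assuming every cell takes bytes_per_cell bytes
dense_size_gb <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell / 1e9
}

# The 3074 x 21 example dataset mentioned earlier: about 0.0005 GB (0.5 MB)
dense_size_gb(3074, 21)

# A table with 1e9 cells: about 8 GB under the same assumption
dense_size_gb(1e9, 1)
```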

In merged_df$counts, each row corresponds to an ASV and each column to a sample. To inspect the sample names:

colnames(merged_df$counts)

When converting, ASVs shared across datasets are merged using SBDI ASV IDs. These IDs correspond to MD5 hashes of the sequences, meaning that ASVs with identical sequences are represented only once.
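The effect of sharing IDs across datasets can be illustrated with a toy example in base R. This is only a sketch of the idea, not the package's implementation, and the IDs below are short placeholders rather than real MD5 hashes.

```r
# Two toy count tables with hypothetical ASV IDs; "id_b" occurs in both
counts_a <- matrix(c(5, 2), nrow = 2,
                   dimnames = list(c("id_a", "id_b"), "sample1"))
counts_b <- matrix(c(4, 9), nrow = 2,
                   dimnames = list(c("id_b", "id_c"), "sample2"))

# Union of ASV IDs: an ID shared by both datasets appears only once
all_ids <- union(rownames(counts_a), rownames(counts_b))

# Start from zeros, then fill in each dataset's counts by row and column name
combined <- matrix(0, nrow = length(all_ids), ncol = 2,
                   dimnames = list(all_ids, c("sample1", "sample2")))
combined[rownames(counts_a), colnames(counts_a)] <- counts_a
combined[rownames(counts_b), colnames(counts_b)] <- counts_b

combined
```

Here id_b ends up as a single row with counts in both samples, while id_a and id_c get a zero in the sample where they were not observed.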

Aggregate counts by taxonomy

We now aggregate ASV read counts at different taxonomic levels based on the taxonomic annotation:

cladecounts <- sum_by_clade(merged$counts, merged$asvs)

This produces aggregated counts at the following levels:

kingdom (or domain)
phylum
class
order
family
genus
species
otu (if available)

We convert the result to data frame format:

cladecounts_df <- convert_to_df(cladecounts)

Both raw counts and normalised counts (relative abundances per sample) are available. For example:

cladecounts_df$raw$phylum[1:3, 1:3]  # raw counts at phylum level
cladecounts_df$norm$class[1:3, 1:3]  # relative abundances at class level
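To make the relationship between raw and normalised counts concrete, the following base R sketch aggregates a toy ASV table by phylum and then converts it to per-sample proportions. The data are invented, and dividing by each sample's column total is an assumption about how relative abundances are defined; sum_by_clade may handle details such as unclassified ASVs differently.

```r
# Toy ASV table: 3 hypothetical ASVs x 2 hypothetical samples
counts <- matrix(
  c(10,  0,
     5,  5,
     0, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("asv1", "asv2", "asv3"), c("sample1", "sample2"))
)

# Toy taxonomy: phylum assigned to each ASV
phylum <- c(asv1 = "Proteobacteria", asv2 = "Proteobacteria", asv3 = "Bacteroidota")

# Raw counts per phylum: sum the ASV rows that share a phylum
raw_phylum <- rowsum(counts, group = phylum[rownames(counts)])

# Relative abundances: divide each sample (column) by its total
norm_phylum <- sweep(raw_phylum, 2, colSums(raw_phylum), "/")

raw_phylum
norm_phylum
```

Each column of norm_phylum sums to 1, which is a quick sanity check that can also be applied to the normalised tables in cladecounts_df.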

At this point, the data are ready for exploration and analysis.
