2. Prepare data
Load, merge and aggregate ASV portal data into a format suitable for downstream ecological analyses.
Load the datasets
Load the two 16S datasets using the data_path object defined in the previous section:
loaded <- load_data(data_path)
This reads datasets into the loaded object, which you can inspect using the magnifying glass next to it in the Environment pane. The object is a list organised into data type sub-lists, each containing one table or matrix element per dataset:
- counts: ASV read counts
- asvs: ASV sequences and taxonomic annotations
- events: core sample-level metadata (where, when, how, who)
- datasets: dataset-level metadata
- emof: dataset-specific extended measurement or fact contextual data
Sparse matrices (dgCMatrix) are used to reduce memory usage for count data, while tabular structures (data.table) are used for metadata. To see dataset sizes, inspect the elements in counts, where sizes are shown as matrix dimensions (e.g. KTH-2013-Baltic-16S: dgCMatrix [3074 x 21], corresponding to the number of ASVs × samples).
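As a minimal sketch of that inspection, the snippet below builds a toy list shaped like loaded$counts and tabulates the dimensions of each sparse matrix. The dataset names and values are made up, and only the Matrix package is assumed. With the real object, toy_loaded$counts would be replaced by loaded$counts.

```r
library(Matrix)

# Toy stand-in for loaded$counts: one sparse ASV x sample matrix per dataset.
# Dataset names and values are invented for illustration.
toy_loaded <- list(
  counts = list(
    "dataset-A" = sparseMatrix(i = c(1, 3), j = c(1, 2), x = c(10, 4), dims = c(3, 2)),
    "dataset-B" = sparseMatrix(i = 2, j = 1, x = 7, dims = c(5, 4))
  )
)

# One row per dataset: number of ASVs and number of samples
sizes <- t(sapply(toy_loaded$counts, dim))
colnames(sizes) <- c("n_asvs", "n_samples")
sizes
```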
To see the full function documentation for load_data:
?asvoccur::load_data
Merge the datasets
Next, we merge the datasets in order to analyse them jointly:
merged <- merge_data(loaded)
The merged object is a list with one combined object per data type (counts, events, asvs, datasets, emof), represented as a sparse matrix for count data and data tables for metadata, as in loaded.
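To illustrate what combining count matrices involves, here is a rough sketch on toy data. It is not the merge_data implementation: align_rows is a hypothetical helper, and the ASV and sample names are invented. Each matrix is expanded to the union of ASV IDs, with zeros where an ASV was not observed, and the sample columns are then bound side by side.

```r
library(Matrix)

# Two toy ASV x sample count matrices that share one ASV ("asv2")
a <- sparseMatrix(
  i = c(1, 3, 1, 2), j = c(1, 1, 2, 2), x = c(5, 2, 1, 3),
  dims = c(3, 2),
  dimnames = list(c("asv1", "asv2", "asv3"), c("A_s1", "A_s2"))
)
b <- sparseMatrix(
  i = c(1, 1, 2), j = c(1, 2, 2), x = c(7, 4, 9),
  dims = c(2, 2),
  dimnames = list(c("asv2", "asv4"), c("B_s1", "B_s2"))
)

# Hypothetical helper: re-index the rows of m onto a common set of ASV IDs,
# leaving zeros for ASVs that are absent from m
align_rows <- function(m, ids) {
  trip <- summary(m)  # non-zero entries as (i, j, x) triplets
  sparseMatrix(
    i = match(rownames(m)[trip$i], ids),
    j = trip$j,
    x = trip$x,
    dims = c(length(ids), ncol(m)),
    dimnames = list(ids, colnames(m))
  )
}

# Each ASV ID appears once in the combined matrix
ids <- union(rownames(a), rownames(b))
merged_toy <- cbind(align_rows(a, ids), align_rows(b, ids))
merged_toy
```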
Convert merged data to data frames
For some analyses, it is convenient to work with regular data frames instead of sparse matrices. Note that converting large count matrices to data frames can be very memory-intensive, but should be feasible for the datasets used here:
merged_df <- convert_to_df(merged, convert_counts = TRUE, max_cells = 1e9)
In merged_df$counts, each row corresponds to an ASV and each column to a sample. To inspect the sample names:
colnames(merged_df$counts)
When converting, ASVs shared across datasets are merged using SBDI ASV IDs. These IDs correspond to MD5 hashes of the sequences, meaning that ASVs with identical sequences are represented only once.
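The memory cost of converting a count matrix to a dense format can be estimated before running the conversion. The sketch below assumes each cell is stored as an 8-byte double (integer storage would need about half that) and uses a hypothetical helper, dense_size_mib. The dimensions are those of the KTH-2013-Baltic-16S example mentioned earlier.

```r
# Hypothetical helper: approximate size of a dense numeric table in MiB,
# assuming 8 bytes per cell and ignoring object overhead
dense_size_mib <- function(n_rows, n_cols, bytes_per_cell = 8) {
  n_rows * n_cols * bytes_per_cell / 1024^2
}

# Example dimensions: 3074 ASVs x 21 samples
n_cells <- 3074 * 21
n_cells                    # 64554 cells
dense_size_mib(3074, 21)   # roughly 0.49 MiB
```

By the same arithmetic, 1e9 cells of doubles would need roughly 8 GB, so max_cells presumably acts as a guard against accidentally large conversions. The convert_to_df documentation is the place to check its exact behaviour.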
Aggregate counts by taxonomy
We now aggregate ASV read counts at different taxonomic levels based on the taxonomic annotation:
cladecounts <- sum_by_clade(merged$counts, merged$asvs)
This produces aggregated counts at the following levels:
- kingdom (or domain)
- phylum
- class
- order
- family
- genus
- species
- otu (if available)
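The idea behind this kind of aggregation can be sketched in base R for a single level. This is an illustration on invented data, not the sum_by_clade implementation. A dense matrix is used to keep the sketch dependency-free, and the taxonomy table and its column names are assumptions.

```r
# Toy ASV x sample counts (values invented)
counts <- matrix(
  c(10, 0, 5, 2,   # sample s1
    3, 8, 0, 1),   # sample s2
  nrow = 4,
  dimnames = list(c("asv1", "asv2", "asv3", "asv4"), c("s1", "s2"))
)

# Toy taxonomic annotation: one phylum per ASV
taxonomy <- data.frame(
  asv = c("asv1", "asv2", "asv3", "asv4"),
  phylum = c("Proteobacteria", "Proteobacteria", "Bacteroidota", "Cyanobacteria")
)

# Look up the phylum for each row of counts, then sum rows within each phylum
phylum_of_row <- taxonomy$phylum[match(rownames(counts), taxonomy$asv)]
phylum_counts <- rowsum(counts, group = phylum_of_row)
phylum_counts
```

Repeating the same step with a different grouping column (class, order, and so on) would give the other levels.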
We convert the cladecounts object to data frame format:
cladecounts_df <- convert_to_df(cladecounts)
Both raw counts and normalised counts (relative abundances per sample) are available. For example:
cladecounts_df$raw$phylum[1:3, 1:3] # raw counts at phylum level
cladecounts_df$norm$class[1:3, 1:3] # relative abundances at class level
At this point, the data are ready for exploration and analysis.
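For reference, relative abundance per sample is conventionally each count divided by that sample's total. The sketch below shows this on invented phylum-level counts. Whether the package computes its normalised tables in exactly this way (for example, how unclassified reads are treated) is an assumption.

```r
# Toy clade x sample raw counts (values invented)
raw <- matrix(
  c(10, 5, 5,   # sample s1, total 20
    3, 8, 1),   # sample s2, total 12
  nrow = 3,
  dimnames = list(c("Bacteroidota", "Cyanobacteria", "Proteobacteria"), c("s1", "s2"))
)

# Divide every column by its column total so each sample sums to 1
norm <- sweep(raw, 2, colSums(raw), "/")
norm
colSums(norm)
```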