Please send bug reports and/or suggestions to SBDI support.

Guide to ENA submission (webin)

This is a guide on how to submit sequence reads from environmental samples to the European Nucleotide Archive (ENA), provided by the Swedish Biodiversity Data Infrastructure (SBDI). Our guide is largely a summary of ENA’s own extensive instructions, with added pointers on issues specific to submission of metarbarcoding data, as well as on more general matters that may confuse first-time contributors. While ENA provides three different routes for submission, we describe interactive submission via the Webin portal here.

Preparation for submission

Step 1: Prepare data and metadata

In ENA, raw sequencing output from a next generation platform, including e.g. base calls and per-base quality scores, is accepted in FASTQ, CRAM or BAM format. Before submission, make sure that sequencing adapters have been removed (trimmed), and that reads have been assigned to their sample of origin (demultiplexed). In addition, gather all the information (metadata) you have about how, when and where you acquired the samples and generated the reads, as well as any contextual (environmental or clinical) data that was collected during sampling (see ENA’s metadata model).

Step 2: Register with ENA

To be able to submit data to ENA, you need to register an account. Go to the Webin submission Portal, select Register to fill out the form, and save your account details. You will receive a confirmation email with your account name.

Interactive submission

Step 1: Log in to submission portal

ENA provides two submission services: one for test submission and one for (real) production submission. Make sure that your data pass validation in the test procedure, before submitting anything to the production environment. All test submissions are removed the following day, but you will save all metadata as tab-separated text (*.tsv) files that you can reuse later.

After login, you are directed to the Dashboard, which gives you an overview of all available submission activities and reports (i.e. lists of successfully submitted samples etc.). Via the top-left hamburger icon (hamburger-icon) you can either go back to this Dashboard, or jump directly to some activity or report.

Step 2: Register study

The study (project) object is linked to samples and sequence reads via experiments, and is typically what you cite in publications. Click Studies (Projects) | Register Study to start filling out the Study form. Note that a single release date is set for all data within a study, and that you thus may want to split sequenced batches of samples into multiple ENA studies. After entering the required metadata for a study, click Submit | OK. If successful, you will receive a confirmation message, and should be able to see your study listed in the Studies (Projects) | Study Report page.

Step 3: Register samples

Samples are the source material from which your sequences derive, and the searchability and usability of your submitted data will depend on how well you document these samples. Go to Samples | Register Samples and click Download spreadsheet to register samples to start the process.

Step 3a: Select sample checklist

The ENA sample checklists are partly overlapping sets of attributes (or data fields) that can be used to describe samples, and by selecting one of these you enable your sample metadata to be validated for correctness during submission. For environmental and organismal (host-associated) samples, alike, we recommend using one of the Environmental Checklists and, among these, to select the alternative from the Genomic Standards Consortium (GSC) MIxS checklists that provides the most specific match to your sampled environment, for example:

Sampled environment

Recommended checklist

Air or general, above-ground, terrestrial

GSC MixS air

Epi- or endophytic (e.g. leaf, root)

GSC MIxS plant associated

Epi- or endozoic (e.g. spider gut, animal skin)

GSC MIxS host associated

Fresh- or seawater

GSC MixS water

Human gut / oral / skin / vaginal

GSC MIxS human gut / oral / skin / vaginal

Human non- gut / oral / skin / vaginal

GSC MIxS human associated

Sediment

GSC MixS sediment

Soil

GSC MixS soil

Note that most GSC MIxS checklists have similar setups of mandatory and recommended attributes, i.e. differ mainly in terms of which optional attributes can be added and validated during submission. The environmental attributes altitude, elevation and depth, are mandatory only for some lists, however. Furthermore, the GSC MIxS built environment list (not mentioned above) has several unique mandatories.

Step 3b: Add sample attributes

Once you have selected a checklist, you can inspect the attributes available in that list. All Mandatory Fields are preselected, and you can add Optional Fields by ticking their boxes. We suggest that you, at least, tick the following optional attributes for metabarcoding data*:

Checklist

Optional attribute

[ All ]

target gene

[ All ]

target subfragment

[ All ]

pcr primers

[ All ]

sequencing method

GSC MIxS host / plant associated

host taxid

* Disclaimer: While target gene/subfragment and primers are typically associated with the (sequencing) experiment, ENA only lists these fields as optional at sample level. When queried about this, ENA suggests adding the fields to experiments through post-submission XML editing, but we consider this process too complex. Therefore, we still recommend adding them to your samples as outlined above. Please note, however, that if you’ve sequenced multiple targets (e.g., both 16S and 18S) using the same physical samples, you will then need to submit two sets of digital samples. If you choose this approach, please make it easy for data users to identify sequences originating from the same physical sample by adding sample aliases, such as [physical_sample_id]_16S and [physical_sample_id]_18S.

To ensure that metadata are validated and searchable, you should use existing attributes whenever possible, but you can also add custom attributes to describe your data, if needed (see Add custom field in top left corner of page. Write your field name before hitting the plus button). Adding more than a few of these is easier to do in a spreadsheet, though, and we explain how to do that in the next step.

Step 3c: Download spreadsheet template

After selecting optional attributes, click Next, and then the Download TSV Template button to download a tab-separated values (*.tsv) file. Open the file in your spreadsheet application of choice (In MS Excel, click Data | Text to Columns | Delimited | Delimiter: Tab, to separate text into columns, if needed). With the added optional attributes from 3b, a template created from the GSC MIxS water checklist should look like this:

Checklist

ERC000024

GSC MIxS water

tax_id

scientific_name

sample_alias

sample_title

sample_description

project name

target gene

target subfragment

pcr primers

sequencing method

collection date

geographic location (country and/or sea)

geographic location (latitude)

geographic location (longitude)

depth

broad-scale environmental context

local environmental context

environmental medium

#units

DD

DD

m

Step 3d: Edit spreadsheet structure

In the downloaded spreadsheet, you can add custom attributes (with units, where applicable) to the right of existing columns, and samples below the row starting with #units. For instance, if you add two custom attributes (salinity [psu] and sampling_station_id), and three samples (xyz:1:01, xyz:1:02 and xyz:1:03) to the template from Step 3c, your spreadsheet structure should look like this:

Checklist

ERC000024

GSC MIxS water

tax_id

scientific_name

sample_alias

sample_title

sample_description

project name

target gene

target subfragment

pcr primers

sequencing method

collection date

geographic location (country and/or sea)

geographic location (latitude)

geographic location (longitude)

depth

broad-scale environmental context

local environmental context

environmental medium

salinity

sampling_station_id

#units

DD

DD

m

psu

xyz:1:01

xyz:1:02

xyz:1:03

Remember to enter sample aliases that correspond to what you use in related publications. This will enable readers to find sample-specific metadata and read files, even if you only state a Project accession number in your paper. Sample aliases can, furthermore, be optionally displayed as Unique name in the Sample report.

Step 3e: Add sample metadata

Before adding actual sample metadata to your spreadsheet, take a close look at ENA’s explanations of selected attributes and lists of permitted values. These are available in the Sample Checklists browser. Also note the following:

  • Taxon attributes for metabarcoding samples may be confusing. In this context, the tax_id & scientific_name attributes do not typically refer to sequenced organisms, but rather describe sampled organisms or environments. The scientific_name value spider metagenome is, for example, used to describe samples from a spider or spider body part, i.e. not samples from which you have derived spider sequences. The attributes tax_id & scientific_name should thus be selected from the list of environmental and organismal metagenomes in NCBI’s taxonomy browser. For host-associated samples, also differentiate between these generic attributes (i.e. tax_id & scientific_name) and host taxid, which you can also search for in NCBI’s taxonomy browser, and should be as specific as possible.

  • Some attributes should be selected from ontologies. To increase searchability, some attribute values should be selected from designated ontologies, which are formal specifications of terms used in certain contexts, and of how these terms relate to each other. You can browse or search the latest versions of ontologies used in ENA submission using the EMBL-EBI Ontology Lookup Service (OLS). You can also use the following direct links as starting points for finding valid terms for some mandatory or recommended attributes in a GSC MixS checklists:

    Checklist

    Ontology-linked attribute

    ENA description

    [ All ]

    broad-scale environmental context

    Report the major environmental system the sample or specimen came from. The system(s) identified should have a coarse spatial grain, to provide the general environmental context of where the sampling was done (e.g. in the desert or a rainforest). We recommend using subclasses of EnvO’s biome class: http://purl.obolibrary.org/obo/ENVO_00000428. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS.

    [ All ]

    local environmental context

    Report the entity or entities which are in the sample or specimen’s local vicinity and which you believe have significant causal influences on your sample or specimen. We recommend using EnvO terms which are of smaller spatial grain than your entry for ‘broad-scale environmental context’. Terms, such as anatomical sites, from other OBO Library ontologies which interoperate with EnvO (e.g. UBERON) are accepted in this field. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS.

    [ GSC MIxS non host / plant associated ]

    environmental medium

    Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of ‘environmental material’ (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS . Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top).

    [ GSC MIxS host associated ]

    environmental medium (host)

    Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of ‘environmental material’ (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS . Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top).

    [ GSC MIxS plant associated ]

    environmental medium (plant)

    Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of ‘environmental material’ (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS . Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top).

    [ All ]

    sequencing method

    Sequencing machine used. Where possible the term should be taken from the OBI list of DNA sequencers (http://purl.obolibrary.org/obo/OBI_0400103).

    In the linked ontology tree views, click the plus sign next to a highlighted branch to show all instances of that term, and continue downwards until you find the most specific term that accurately describes your data. It is good practice to then register the term together with ontology acronym and accession, e.g: marine pelagic biome (ENVO:01000023).

  • Environmental attributes of host-associated samples are ambiguous. For instance, a spider may, in some sense, be the environment from which a host-associated sample derives. But as the external environment also may be of interest here, we tentatively suggest that you interpret broad-scale and local environmental context the same way as for non-host-associated samples, and use the most specific instance of material anatomical entity (UBERON:0000465) or plant anatomical entity (PO:0025131) for the environmental medium attribute of host- and plant associated samples, respectively.

  • Geographic positions should be given in decimal degrees (DD). If conversion is needed, use an online tool, such as the PGC Coordinate Converter, or the following formula:

    Decimal Degrees

    =

    Degrees + Minutes/60 + Seconds/3600

    Ex: 58°11’12.34’’ = 58 + 11/60 + 12.34/3600 = 58.1868°

    Ex: 58°11.21’ = 58 + 11.21/60 = 58.1868°

    The related Geographic location (country and/or sea) attribute should be selected from the INSDC list for countries and seas.

Step 3f: Upload spreadsheet

With data added, your spreadsheet should look similar to this (admittedly, we have added a manufacturer name to the sequencing method ontology term, as it seemed to be missing from this specific device):

Checklist

ERC000024

GSC MIxS water

tax_id

scientific_name

sample_alias

sample_title

sample_description

project name

target gene

target subfragment

pcr primers

sequencing method

collection date

geographic location (country and/or sea)

geographic location (latitude)

geographic location (longitude)

depth

broad-scale environmental context

local environmental context

environmental medium

salinity

sampling_station_id

#units

DD

DD

m

psu

1874687

marine plankton metagenome

xyz:1:01

Microbial plankton sample 1

Microbial plankton diversity along Baltic salinity gradient July 2018

16S rRNA

V3-V4

341F (CCTACGGGNGGCWGCAG) and 805R (GACTACHVGGGTATCTAATCC)

Illumina MiSeq

2018-07-23

Baltic Sea

61.407611

19.472809

3

marine pelagic biome (ENVO:01000023)

brackish water body (ENVO:01001321)

brackish water (ENVO:00002019)

3.13

A13

1874687

marine plankton metagenome

xyz:1:02

Microbial plankton sample 2

Microbial plankton diversity along Baltic salinity gradient July 2018

16S rRNA

V3-V4

341F (CCTACGGGNGGCWGCAG) and 805R (GACTACHVGGGTATCTAATCC)

Illumina MiSeq

2018-07-24

Baltic Sea

62.805337

19.912262

3

marine pelagic biome (ENVO:01000023)

brackish water body (ENVO:01001321)

brackish water (ENVO:00002019)

5.64

A14

1874687

marine plankton metagenome

xyz:1:03

Microbial plankton sample 3

Microbial plankton diversity along Baltic salinity gradient July 2018

16S rRNA

V3-V4

341F (CCTACGGGNGGCWGCAG) and 805R (GACTACHVGGGTATCTAATCC)

Illumina MiSeq

2018-07-25

Baltic Sea

64.970396

23.076324

3

marine pelagic biome (ENVO:01000023)

brackish water body (ENVO:01001321)

brackish water (ENVO:00002019)

6.1

A15

If so, go back to Samples | Register Samples, click Upload filled spreadsheet to register samples, select your edited *.tsv file, and click Submit Completed Spreadsheet. Given that your spreadsheet structure is correct, you will receive a confirmation message, and your samples should now be listed under Samples | Samples Report. Note that you need to click Show unique name to display sample aliases there.

Step 4: Prepare and upload read files

Before starting submission of reads and experiment data, you need to prepare and upload read files to your own directory in the Webin file upload area. We assume that you have paired-end reads in FASTQ format, like the example files from the mothur MiSeq SOP that we use below, but please refer to ENA’s guidelines on file formats for other options:

Step 4a: Compress files and calculate checksums

On a Mac, open the Terminal app (in Applications | Utilities), and do the following:

# Go to read file directory
# Tip: type 'cd ' and then drag/drop folder from Finder into Terminal
cd ~/your-read-file-dir

# Compress read files, if they are uncompressed (edit file extension as needed)
gzip -k *.fastq

To enable verification of file integrity after upload, calculate the md5 checksum of each (compressed) read file:

# Calculate and print md5 sums to tab-separated file (for easy cut-and-paste later)
for f in *.gz; do md5 $f | awk '{ gsub(/\(|\)/,""); print $2"\t" $4 }'; done > md5sums.tsv

The resulting file should look similar to this:

F3D148_S214_L001_R1_001.fastq.gz

bb8f57f8a94050de3d8ad47d03477e68

F3D148_S214_L001_R2_001.fastq.gz

d71d368d056b1d4bcf0ea74628f4d63a

F3D149_S215_L001_R1_001.fastq.gz

f76e0d9366998f7509a7aaca640cc842

F3D149_S215_L001_R2_001.fastq.gz

6c9acb1f743d984e07a9ce6b35b0fb6a

F3D150_S216_L001_R1_001.fastq.gz

e869e2b2bff6cd6c5dbe8f1b1d22d5bc

F3D150_S216_L001_R2_001.fastq.gz

f3c6d5a3dae027187ff5f80100d37482

The md5sum command should work similarly on a Linux machine, but Windows users may need to install some application to compress files to *.gz format, and do a quick web search for how to generate MD5 checksum for file on Windows.

Step 4b: Upload read files to ENA

You can now upload your sequence reads to ENA’s transit area, but remember to keep local copies of your data. There are several FTP (File Transfer Protocol) clients available for file transfer. We describe one alternative (lftp) for Mac users, but please refer to ENA’s guidelines on file upload for other options:

# Connect to FTP server [replace X:s, and provide password when prompted]
lftp webin2.ebi.ac.uk -u Webin-XXXXX
# Expected response: lftp Webin-XXXXX@webin2.ebi.ac.uk:~>

# Transfer your read files
mput ~/your-read-file-dir/*.fastq.gz
# Expected response: ... Total x files transferred

# Disconnect from server
bye

Step 5: Submit sequence reads

Once you have successfully uploaded your sequence reads, you need to associate them with already submitted sample metadata, as well as describe the sequencing experiment that produced the read files. In the ENA metadata model, an experiment refers to a sequencing event, and contains information on e.g. library construction and instruments, whereas runs represent the read files resulting from an experiment.

Step 5a: Download the experiment/run template

Go to Raw Reads (Experiments and Runs) | Submit Reads, click Download spreadsheet template for Read submission, and select your file format. We assume you want to Submit paired reads using two Fastq files, here. Again, Mandatory Fields have been pre-selected for you, but you can also add Optional Fields. Have a look at permitted values before downloading your template (or simply return to this page later).

Your downloaded spreadsheet template should look something this:

FileType

fastq

Read submission file type

sample

study

instrument_model

library_name

library_source

library_selection

library_strategy

library_layout

forward_file_name

forward_file_md5

reverse_file_name

reverse_file_md5

Step 5b: Describe experiment and runs

In your downloaded template, you need to link each experiment/run row to an already submitted sample and study by adding the correct Sample and Study Accession values under the sample and study headers. Then, for metabarcoding data, you would typically use the following values to describe sequenced libraries (but see ENA’s guidelines on experiment metadata for all available options):

  • library source: METAGENOMIC

  • library selection: PCR

  • library strategy: AMPLICON

Also add the following metadata for read files:

  • forward [reverse]_file_name: If you put your read files directly into your account (in step 4b), you add filenames only here. Otherwise, enter paths including all subdirectories.

  • forward [reverse]_file_md5: Paste the md5 checksums from Step 4a.

Your spreadsheet should now look similar to this (but add as much optional metadata as possible):

FileType

fastq

Read submission file type

sample

study

instrument_model

library_name

library_source

library_selection

library_strategy

library_layout

forward_file_name

forward_file_md5

reverse_file_name

reverse_file_md5

ERS8208349

PRJEB48461

Illumina MiSeq

16S rRNA amplicon of Sample 1

METAGENOMIC

PCR

AMPLICON

PAIRED

F3D148_S214_L001_R1_001.fastq.gz

bb8f57f8a94050de3d8ad47d03477e68

F3D148_S214_L001_R2_001.fastq.gz

d71d368d056b1d4bcf0ea74628f4d63a

ERS8208348

PRJEB48461

Illumina MiSeq

16S rRNA amplicon of Sample 2

METAGENOMIC

PCR

AMPLICON

PAIRED

F3D149_S215_L001_R1_001.fastq.gz

f76e0d9366998f7509a7aaca640cc842

F3D149_S215_L001_R2_001.fastq.gz

6c9acb1f743d984e07a9ce6b35b0fb6a

ERS8208347

PRJEB48461

Illumina MiSeq

16S rRNA amplicon of Sample 3

METAGENOMIC

PCR

AMPLICON

PAIRED

F3D150_S216_L001_R1_001.fastq.gz

e869e2b2bff6cd6c5dbe8f1b1d22d5bc

F3D150_S216_L001_R2_001.fastq.gz

f3c6d5a3dae027187ff5f80100d37482

If so, go back to Raw Reads (Experiments and Runs) | Submit reads, click Upload filled spreadsheet template for Read submission, select your edited *.tsv file, and click Submit Completed Spreadsheet. If your submission is confirmed, you should now be able to se your submitted runs in the Raw Reads (Experiments and Runs) | Run Files Report.

Step 6: Submit to production service

After careful inspection of metadata, you can now repeat the process in the production version of the Webin interface.

Post-submission editing

In each row of the above mentioned Reports, you can click the box-arrow icon action-icon in the Action column to either show related items, e.g. see runs linked to a certain sample, or inspect/edit the underlying xml of a submitted item. But for more complex edits you likely need to contact ENA and ask them to help you cancel and resubmit data.