Taxonomy assignment

Overview

Teaching: 30 min
Exercises: 60 min

Questions

What reference databases are available for metabarcoding

How do I make a custom database?

What are the main strategies for assigning taxonomy to OTUs?

What is the difference between Sintax and Naive Bayes?

Objectives

Create your own local reference database

Assign taxonomy using SINTAX

Assign taxonomy using local BLAST

Use less to explore a long help output

Write a script to assign taxonomy

Taxonomy assignment

1. Introduction

As we have previously discussed, eDNA metabarcoding is the process whereby we amplify a specific gene region of the taxonomic group of interest from a DNA extract obtained from an environmental sample. Once the DNA is amplified, a library is prepared for sequencing. Up to this point, we have prepared the data for the bioinformatic process and executed a bioinformatic pipeline, which ended with a count table (zotutable.txt) and a ZOTU list (zotus.fasta) containing all the biologically relevant sequences. The next step in the bioinformatic process will be to assign a taxonomic ID to each sequence to determine what species were detected through our eDNA metabarcoding analysis.

There are four basic strategies to taxonomy classification based on (Hleap et al. (2021)) (and endless variations of each of these):

Sequence Similarity (SS)
Sequence Composition (SC)
Phylogenetic (Ph)
Probabilistic (Pr)

alt text

Of the four abovementioned strategies to taxonomy classification, the Sequence Similarity and Sequence Composition methods are currently the most frequently used for taxonomy assignment in the metabarcoding research community. Multiple comparative experiments have been conducted to determine the most optimal approach to assign a taxonomic ID to metabarcoding data, though it seems to be that the main conclusion of those papers is that their own method is best. One example of a comparative study is the one from Bokulich et al., 2018.

alt text

While there are multiple ways to assign a taxonomic ID to a sequence, all methods require a reference database to match your ZOTUs to a set of sequences with known taxonomic IDs. So, before we start with taxonomy assignment, let’s create our own local curated reference database using (CRABS), which stands for Creating Reference databases for Amplicon-Based Sequencing.

2. Reference databases

Several reference databases are available online. The most notable ones being NCBI, EMBL, BOLD, SILVA, and RDP. However, recent research has indicated a need to use custom curated reference databases to increase the accuracy of taxonomy assignment. While certain gene-specific or primer-specific reference databases are available (RDP: 16S microbial; MIDORI: COI eukaryotes; etc.), this essential data source is missing in most instances. We will, therefore, show you how to build your own custom curated reference database using CRABS.

For time-saving purposes, we will only build a small reference database that includes sharks (Chondrichthyes) for the primer set used during this experiment, rather than the reference database we will be using for taxonomy assignment later on that includes bony fish and sharks. However, the code provided here can be changed to build a reference database for any primer set or taxonomic group by changing some of the parameters. The final reference database that we will be using later on for taxonomy assignment is already placed in the refdb folder on NeSI.

2.1 CRABS workflow

The CRABS workflow consists of seven parts, including:

(1) download data from online repositories;
(2) import downloaded data into CRABS format;
(3) extract amplicons through in silico PCR analysis;
(4) retrieving amplicons without primer-binding regions through pairwise global alignments;
(5) curate and subset the local database;
(6) export the local database in various taxonomic classifier formats; and
(7) basic visualizations to explore the local reference database.

alt text

2.2 Help documentation

To bring up the help documentation for CRABS, we will first have to source the moduleload script to load the program.

source moduleload
crabs -h

alt text

2.3 Step 1: download data from online repositories

As a first step, we will download the 16S shark sequences from the NCBI online repository. While there are multiple repositories supported by CRABS, we’ll only focus on this one today, as NCBI is the most-widely used online repository in metabarcoding research. To download data from the NCBI servers, we need to provide a few parameters to CRABS. Maybe one of the trickiest to understand is the --query parameter. However, we can get the input string for this parameter from the NCBI website, as shown on the screenshot below.

alt text

cd ../refdb/
crabs --download-ncbi --query '(16S[All Fields] AND ("Chondrichthyes"[Organism] OR Chondrichthyes[All Fields])) AND mitochondrion[filter]' --database nucleotide --email gjeunen@gmail.com --output ncbi_16S_chondrichthyes.fasta

/// CRABS | v1.0.6

|            Function | Download NCBI database
|Retrieving NCBI info | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:01
|         Downloading | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:23
|             Results | Number of sequences downloaded: 3764/3764 (100.0%)

Let’s look at the top 10 lines of the downloaded fasta file to see what CRABS created:

head -n 10 ncbi_16S_chondrichthyes.fasta

>PP329120.1 Mobula hypostoma isolate MHBRsp_Gua mitochondrion, complete genome
GCTAGTGTAGCTTAATTTAAAGCATGGCACTGAAGATGCTAAGATAAAAATTAACCTTTTTCACAAGCAT
GAAGGTTTGGTCCCAGCCTCAATATTAGTTTTAACTTGACTTACACATGCAAGTCTCAGCATTCCGGTGA
GAACGCCCTAATCACCCCTAAAATTTGATTAGGAGCTGGTATCAGGCACATTCCAAGAACAGCCCATGAC
ACCTCGCTCAGCCACACCCACAAGGGAACTCAGCAGTGATAGATATTGTTCTATAAGCGCAAGCTTGAGA
CAATCAAAGTTATAAAGAGTTGGTTAATCTCGTGCCAGCCACCGCGGTTATACGAGTGACACAAATTAAT
ATTTCACGGCGTTAAGGGTGATTAGAAATAATCTCATCAGTATAAAGTTAAGACCTCATTAAGCTGTCAT
ACGCACTTATGATCGAAAACATCATTCACGAAAGTAACTTTACACAAACAGAATTTTTGACCTCACGACA
GTTAAGGCCCAAACTAGGATTAGATACCCTACTATGCTTAACCGCAAACATTGTTATTAATTAATACTAC
CTTAATACCCGCCTGAGTACTACAAGCGCTAGCTTAAAACCCAAAGGACTTGGCGGTGCTCCAAACCCAC

Here, we can see that the CRABS has downloaded the original format from the NCBI servers, which is a multi-line fasta document.

2.4 Step 2: import downloaded data into CRABS format

Since CRABS can download data from multiple online repositories, each with their own format, we need to first reformat the data to CRABS format before we can continue building our reference database. This reformatting can be done using the --import function, which will create a single line text document for each sequence, that contains a unique ID, the taxonomic lineage for that sequence, plus the sequence itself. The CRABS taxonomic lineage is based on the NCBI taxonomy database and requires 3 files to be downloaded using the --download-taxonomy function. However, to speed things up, these 3 files have already been provided to you.

crabs --import --import-format ncbi --names names.dmp --nodes nodes.dmp --acc2tax nucl_gb.accession2taxid --input ncbi_16S_chondrichthyes.fasta --output ncbi_16S_chondrichthyes.txt --ranks 'superkingdom;phylum;class;order;family;genus;species'

/// CRABS | v1.0.6

|            Function | Import sequence data into CRABS format
| Read data to memory | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:05:29
|Phylogenetic lineage | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|Fill missing lineage | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Imported 3764 out of 3764 sequences into CRABS format (100.0%)

When we now look at the first line of the imported document, we can see how the structure has changed:

head -n 1 ncbi_16S_chondrichthyes.txt

PP329120	Mobula hypostoma	723540	Eukaryota	Chordata	Chondrichthyes	Myliobatiformes	Myliobatidae	Mobula	Mobula hypostoma	GCTAGTGTAGCTTAATTTAAAGCATGGCACTGAAGATGCTAAGATAAAAATTAACCTTTTTCACAAGCATGAAGGTTT...

2.5 Step 3: extract amplicons through in silico PCR

Once we have formatted the data, we can start with a first curation step of the reference database, i.e., extracting the amplicon sequences. We will be extracting the amplicon region from each sequence through an in silico PCR analysis. With this analysis, we will locate the forward and reverse primer binding regions and extract the sequence in between. This will significantly reduce file sizes when larger data sets are downloaded, while also ensuring that only the necessary information is kept.

crabs --in-silico-pcr --input ncbi_16S_chondrichthyes.txt --output ncbi_16S_chondrichthyes_amplicon.txt --forward GACCCTATGGAGCTTTAGAC --reverse CGCTGTTATCCCTADRGTAACT

/// CRABS | v1.0.6

|            Function | Extract amplicons through in silico PCR
|         Import data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|  Transform to fasta | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|       In silico PCR | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|      Exporting data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Extracted 3427 amplicons from 3764 sequences (91.05%)

2.6 Step 4: extract amplicons through pairwise global alignments

Amplicons in the originally downloaded sequences might be missed during the in silico PCR analysis when one or both primer-binding regions are not incorporated in the online deposited sequence. This can happen when the reference barcode is generated with the same primer set or if the deposited sequence is incomplete in the primer-binding regions (denoted as “N” in the sequence). To retrieve those missed amplicons, we can use the already-retrieved amplicons as seed sequences in a Pairwise Global Alignment (PGA) analysis.

crabs --pairwise-global-alignment --input ncbi_16S_chondrichthyes.txt --amplicons ncbi_16S_chondrichthyes_amplicon.txt --output ncbi_16S_chondrichthyes_pga.txt --forward GACCCTATGGAGCTTTAGAC --reverse CGCTGTTATCCCTADRGTAACT --percent-identity 0.9 --coverage 90

/// CRABS | v1.0.6

|            Function | Retrieve amplicons without primer-binding regions
|         Import data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|  Transform to fasta | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|  Pairwise alignment | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:01
|Parse alignment data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|      Exporting data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Retrieved 42 amplicons without primer-binding regions from 337 sequences

2.7 Step 5: dereplicating the database

Now that we have retrieved all amplicons from the downloaded sequences, we can remove duplicate sequences for each species, i.e., dereplicating the reference database. This step helps reduce the file size even more, plus is essential for most taxonomic classifiers that restrict the output to N number of best hits.

crabs --dereplicate --input ncbi_16S_chondrichthyes_pga.txt --output ncbi_16S_chondrichthyes_derep.txt --dereplication-method unique_species

/// CRABS | v1.0.6

|            Function | Dereplicate CRABS database
|         Import data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|      Exporting data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Written 723 unique sequences to ncbi_16S_chondrichthyes_derep.txt out of 3469 initial sequences (20.84%)

2.8 Step 6: filter the reference database

The final curation step for our custom reference database is to clean up the database using a variety of parameters, including minimum length, maximum length, maximum number of ambiguous base calls, environmental sequences, sequences for which the species name is not provided, and sequences with unspecified taxonomic levels.

crabs --filter --input ncbi_16S_chondrichthyes_derep.txt --output ncbi_16S_chondrichthyes_clean.txt --minimum-length 150 --maximum-length 250 --environmental --no-species-id --maximum-n 1 --rank-na 2

/// CRABS | v1.0.6

|            Function | Filter CRABS database
| Included parameters | "--minimum-length", "--maximum-length", "--maximum-n", "--environmental", "--no-species-id", "--rank-na"
|         Import data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|      Exporting data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Written 706 filtered sequences to ncbi_16S_chondrichthyes_clean.txt out of 723 initial sequences (97.65%)
|                     | Maximum ambiguous bases filter: 17 sequences not passing filter (2.35%)

2.9 Step 7: export the reference database to a format supported by the taxonomic classifier software

Now that we have created a custom curated reference database containing 706 shark sequences for the 16S rRNA gene, we can export the reference database to be used by different taxonomic classifiers. Multiple formats are implemented in CRABS, but today we will be exporting the reference database to SINTAX format, which is a kmer based classifier implemented in USEARCH and VSEARCH.

crabs --export --input ncbi_16S_chondrichthyes_clean.txt --output ncbi_16S_chondrichthyes_sintax.fasta --export-format sintax

/// CRABS | v1.0.6

|            Function | Export CRABS database to SINTAX format
|         Import data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|      Exporting data | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
|             Results | Written 706 sequences to ncbi_16S_chondrichthyes_sintax.fasta out of 706 initial sequences (100.0%)

Let’s open the reference database to see what the final database looks like!

head -n 6 ncbi_16S_chondrichthyes_sintax.fasta

>PP329120;tax=d:Eukaryota,p:Chordata,c:Chondrichthyes,o:Myliobatiformes,f:Myliobatidae,g:Mobula,s:Mobula_hypostoma
ACTTAAGTCACTTTTAAGATATTAAATTTTTTCCTCAGGGTATAAAATAACAAACTAATTTTCTGACTTAACCTTGTTTTTGGTTGGGGCGACCGAGGGGTAAAACAAAACCCCCTTATCGAATGTGTGAAATATCACTTAAAAATTAGAGTTACATCTCTAGTTAATAGAAAATCTAACGAACAATGACCCAGGAAACTTATCCTGATCAATGAACCA
>PQ469261;tax=d:Eukaryota,p:Chordata,c:Chondrichthyes,o:Carcharhiniformes,f:Carcharhinidae,g:Prionace,s:Prionace_glauca
ACTTAAATTAATTATGTAACCATCCATTTCCCAGGAAATAAACAAAATATACAACACTTCTAATTTAACTGTTTTTGGTTGGGGTGACCAAGGGGAAAAACAAATCCCCCTCATCGATTGAGTACTAAGTACTTAAAAATTAGAATGACAATTCTAATTAATAAAACATTTACCGAAAAATGATCCAGGATTTCCTGATCAATGAACCA
>PQ182775;tax=d:Eukaryota,p:Chordata,c:Chondrichthyes,o:Carcharhiniformes,f:Triakidae,g:Mustelus,s:Mustelus_schmitti
ACTTAAGTTAATTATGTAACCATTTATTCCTTAGGGTATAAACAAAATATACAATACTTCTAACTTAACTGTTTTTGGTTGGGGTGACCGAGGGGGAAAACAAATCCCCCTCATCGATTGAGTACTCAGTACTTGAAAATCAGAATAACAATTCTGATTAATAAAATATTTAGCGAAAAATGACCCAGGATTTCCTGATCAATGAACCA

That’s all it takes to generate your own reference database. We are now ready to assign a taxonomy to our OTUs that were generated in the previous section!

3. The 16S Vertebrate reference database

For our tutorial data set, we will be using the sintax_vertebrata_16S.fasta and vertebrata_16S databases, with the first one being in SINTAX format and the second one being in BLAST format. The reference databases were created using CRABS, as shown above, and contain data from the MitoFish online repository, NCBI online repository, and local New Zealand barcodes that we generated at the University of Otago. These reference databases hold 57,092 high quality, unique, 16S rRNA reference barcodes for vertebrate taxa.

4. Taxonomic classification

4.1 SINTAX

The first method we will explore is the SINTAX algorithm. The SINTAX algorithm predicts taxonomy by using k-mer similarity to identify the top hit in a reference database and provides bootstrap confidence for all ranks in the prediction. Unlike the RDP classifier, SINTAX does not rely on Bayesian posteriors. Hence, training the dat aset is not required and results can be obtained very quickly. SINTAX was developed for USEARCH and is also incorporated into VSEARCH through the --sintax command. More information about the SINTAX algorithm can be found on the USEARCH website. When thinking about the four basic strategies for taxonomy assignment, SINTAX belongs to the Sequence Composition (SC) category.

cd ../scripts/
nano sintax

#!/bin/bash

source moduleload

cd ../final/

vsearch --sintax zotus.fasta --db ../refdb/sintax_vertebrata_16S.fasta --sintax_cutoff 0.8 --tabbedout sintax_taxonomy.txt

chmod +x sintax
./sintax

vsearch v2.21.1_linux_x86_64, 503.8GB RAM, 72 cores
https://github.com/torognes/vsearch

Reading file ../refdb/sintax_vertebrata_16S.fasta 100%  
WARNING: 1 invalid characters stripped from FASTA file: ?(1)
REMINDER: vsearch does not support amino acid sequences
10986499 nt in 57092 seqs, min 150, max 249, avg 192
Counting k-mers 100% 
Creating k-mer index 100% 
Classifying sequences 100%  
Classified 26 of 32 sequences (81.25%)

Taxonomy assignment through SINTAX was able to classify 27 out of 32 OTUs. Let’s use the head command to inspect the output file.

head ../final/sintax_taxonomy.txt

zotu.3	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Perciformes(1.00),f:Bovichtidae(1.00),g:Bovichtus(1.00),s:Bovichtus_variegatus(1.00)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Perciformes,f:Bovichtidae,g:Bovichtus,s:Bovichtus_variegatus
zotu.1	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Scombriformes(1.00),f:Gempylidae(1.00),g:Thyrsites(1.00),s:Thyrsites_atun(1.00)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Scombriformes,f:Gempylidae,g:Thyrsites,s:Thyrsites_atun
zotu.5	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Labriformes(1.00),f:Labridae(1.00),g:Notolabrus(0.89),s:Notolabrus_fucicola(0.86)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Labriformes,f:Labridae,g:Notolabrus,s:Notolabrus_fucicola
zotu.4	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Blenniiformes(1.00),f:Tripterygiidae(1.00),g:Forsterygion(1.00),s:Forsterygion_lapillum(0.99)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Blenniiformes,f:Tripterygiidae,g:Forsterygion,s:Forsterygion_lapillum
zotu.6	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Blenniiformes(1.00),f:Tripterygiidae(1.00),g:Forsterygion(1.00),s:Forsterygion_lapillum(0.91)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Blenniiformes,f:Tripterygiidae,g:Forsterygion,s:Forsterygion_lapillum
zotu.2	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Mugiliformes(1.00),f:Mugilidae(1.00),g:Aldrichetta(1.00),s:Aldrichetta_forsteri(1.00)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Mugiliformes,f:Mugilidae,g:Aldrichetta,s:Aldrichetta_forsteri
zotu.12	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Myctophiformes(1.00),f:Myctophidae(1.00),g:Hygophum(1.00),s:Hygophum_hanseni(1.00)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Myctophiformes,f:Myctophidae,g:Hygophum,s:Hygophum_hanseni
zotu.11	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Osmeriformes(1.00),f:Retropinnidae(1.00),g:Retropinna(1.00),s:Retropinna_retropinna(1.00)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Osmeriformes,f:Retropinnidae,g:Retropinna,s:Retropinna_retropinna
zotu.9				
zotu.13	d:Eukaryota(1.00),p:Chordata(1.00),c:Actinopteri(1.00),o:Scombriformes(1.00),f:Centrolophidae(1.00),g:Seriolella(0.99),s:Seriolella_brama(0.99)	+	d:Eukaryota,p:Chordata,c:Actinopteri,o:Scombriformes,f:Centrolophidae,g:Seriolella,s:Seriolella_brama

4.2 BLAST

The second and last method we will explore is BLAST (Basic Local Alignment Search Tool), probably the most well known and most widely used method to assign a taxonomic ID to sequences. As the name suggests, BLAST assigns a taxonomic ID to a sequence through best hits from local alignments against a reference database. Hence, when thinking about the four basic strategies for taxonomy assignment, BLAST belongs to the Sequence Similarity (SS) category. While BLAST can be run online and remotely against the full NCBI database, BLAST can also be run locally on a curated reference database. Since CRABS can download data from multiple online repositories, running BLAST locally will aid in increased taxonomic assignment accuracy by incorporating a larger number of barcodes.

nano blast

#!/bin/bash

source moduleload

cd ../refdb/

blastn -query ../final/zotus.fasta -db vertebrata_16S -max_target_seqs 1 -perc_identity 50 -qcov_hsp_perc 50 -outfmt "6 qaccver saccver staxid sscinames length pident mismatch qcovs evalue bitscore qstart qend sstart send gapopen" -out ../final/blast_taxonomy.txt

chmod +x blast
./blast

Running BLAST locally doesn’t produce any text in the Terminal Window, but we can use the head command to inspect the output file.

head -n 32 ../final/blast_taxonomy.txt

zotu.1	OP168897	499858	Thyrsites atun	205	100.000	0	100	9.00e-106	379	1	205	1	205	0
zotu.2	JQ060654	443748	Aldrichetta forsteri	202	100.000	0	100	4.12e-104	374	1	202	1	202	0
zotu.3	MF991327	36205	Bovichtus variegatus	203	100.000	0	100	1.15e-104	375	1	203	1	203	0
zotu.4	AY141416	206115	Forsterygion lapillum	203	98.030	4	100	5.40e-98	353	1	203	1	203	0
zotu.5	EU848435	548164	Notolabrus fucicola	205	99.512	1	100	4.19e-104	374	1	205	1	205	0
zotu.6	AY141416	206115	Forsterygion lapillum	203	99.507	1	100	5.36e-103	370	1	203	1	203	0
zotu.7	AY520117	8210	Notothenia angustata	202	100.000	0	100	4.12e-104	374	1	202	1	202	0
zotu.8	NC_016669	691800	Sprattus muelleri	202	100.000	0	100	4.12e-104	374	1	202	1	202	0
zotu.9	MK335855	1311454	Istigobius ornatus	159	87.421	18	80	2.62e-46	182	40	197	39	196	2
zotu.10	GQ365289	159454	Hyporhamphus melanochir	203	100.000	0	100	1.15e-104	375	1	203	1	203	0
zotu.11	MZ301925	170203	Retropinna retropinna	205	100.000	0	100	9.00e-106	379	1	205	1	205	0
zotu.12	AB024914	89989	Hygophum hanseni	202	100.000	0	100	4.12e-104	374	1	202	1	202	0
zotu.13	AB205416	316134	Seriolella caerulea	205	100.000	0	100	9.00e-106	379	1	205	1	205	0
zotu.14	EU848442	443792	Lotella rhacina	200	96.500	7	100	2.49e-91	331	1	200	1	200	0
zotu.15	NC_080276	2972643	Salmo pallaryi	205	100.000	0	100	9.00e-106	379	1	205	1	205	0
zotu.16	KP411798	879484	Enneanectes reticulatus	149	85.906	20	73	4.57e-39	158	56	203	47	195	1
zotu.17	AY141416	206115	Forsterygion lapillum	205	85.854	27	100	7.47e-57	217	1	205	1	203	1
zotu.18	AY141416	206115	Forsterygion lapillum	203	97.537	5	100	2.51e-96	348	1	203	1	203	0
zotu.19	LC627613	107245	Lepidopus caudatus	204	100.000	0	99	3.24e-105	377	1	204	1	204	0
zotu.20	AY141416	206115	Forsterygion lapillum	205	86.341	26	100	1.61e-58	222	1	205	1	203	1
zotu.21	NC_016673	691801	Sprattus antipodum	202	99.505	1	100	1.92e-102	368	1	202	1	202	0
zotu.22	AY279764	241364	Odax pullus	201	100.000	0	100	1.47e-103	372	1	201	1	201	0
zotu.23	PP661061	2791166	Pseudophycis palmata	198	91.919	16	100	3.25e-75	278	1	198	1	198	0
zotu.24	AY141416	206115	Forsterygion lapillum	205	95.610	7	100	3.31e-90	327	1	205	1	203	1
zotu.25	NC_088030	1241940	Umbrina canosai	154	84.416	12	71	4.83e-34	141	62	213	59	202	6
zotu.26	KY020202	879906	Xiphasia setifer	132	78.030	26	68	9.44e-16	80.5	56	184	66	197	3
zotu.27	OM470922	2740763	Lepadogaster candolii	147	83.673	21	72	2.11e-32	135	56	200	54	199	3
zotu.28	PP661061	2791166	Pseudophycis palmata	198	91.414	17	100	1.51e-73	272	1	198	1	198	0
zotu.29	NC_016669	691800	Sprattus muelleri	202	98.515	3	100	4.15e-99	357	1	202	1	202	0
zotu.30	KY020202	879906	Xiphasia setifer	132	78.030	26	68	9.44e-16	80.5	56	184	66	197	3
zotu.31	NC_037961	2605477	Dipturus nasutus	210	100.000	0	100	1.54e-108	388	1	210	1	210	0
zotu.32	CRABS:Acanthoclinus_fuscus	2696578	Acanthoclinus fuscus	203	97.537	4	100	2.51e-96	348	1	203	1	202	1

Using the head command shows us that BLAST managed to provide a taxonomic ID to all 32 ZOTU sequences in our data, unlike SINTAX, which managed to provide a taxonomic ID to only 27 ZOTUs. However, if we look at the output, the BLAST results are “low” in similarity and coverage for the ZOTUs for which SINTAX does not provide any taxonomic ID.

Please note that I do not recommend working on solely top BLAST hits, as we did in our example. This choice was done as a time-saving approach. I recommend setting the parameter --max-target-seqs to 50 or 100 and calculate the Most Recent Common Ancestor (MRCA) based on those BLAST results. While this is outside the scope of this workshop, there are some useful programs that are publicly available that can do this for you, e.g., ALEX (Ancestor Link EXplorer).

Key Points

The main taxonomy assignment strategies are global alignments, local alignments, and machine learning approaches

Local curated reference databases increase taxonomic assignment accuracy

previous episode

eDNA

next episode

Taxonomy assignment

Overview

Taxonomy assignment

1. Introduction

2. Reference databases

2.1 CRABS workflow

2.2 Help documentation

2.3 Step 1: download data from online repositories

2.4 Step 2: import downloaded data into CRABS format

2.5 Step 3: extract amplicons through in silico PCR

2.6 Step 4: extract amplicons through pairwise global alignments

2.7 Step 5: dereplicating the database

2.8 Step 6: filter the reference database

2.9 Step 7: export the reference database to a format supported by the taxonomic classifier software

3. The 16S Vertebrate reference database

4. Taxonomic classification

4.1 SINTAX

4.2 BLAST

Key Points

previous episode

next episode