Nanopore: quality assessment
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can I perform quality assessment of ONT data?
Objectives
Assess the quality of reads.
Generate summary metrics.
Software for quality assessment
In this part of the module, we’ll look at software applications that can be used to assess the quality of data from ONT sequencing runs. We’ll be focussing on two pieces of software: FastQC and NanoPlot.
- FastQC (you’ve already seen this): generic tool for assessing the quality of next generation sequencing data.
- NanoPlot: quality assessment software specifically built for ONT data.
NanoPlot
Let’s use the NanoPlot software to assess the quality of our .fastq
data. We can run NanoPlot interactively (i.e., we don’t need to submit it as a slurm job), and it wil produce an HTML-based report that can be viewed in a browser.
Load the NanoPlot module:
module load NanoPlot
Run NanoPlot on passed fastq data (i.e., just the files in the pass
folder):
NanoPlot -o nanoplot_fastmodel \
-p ecoli_fastmodel_subset_ \
--fastq fastq_fastmodel/pass/fastq_runid_*.gz
Command syntax:
NanoPlot
: run the NanoPlot command (note that the capital N and P are required).-o nanoplot_fastmodel
: folder for output (-o
) to be written.-p ecoli_fastmodel_subset_
: prefix (-p
) to be appended to start of each filename in the output folder.--fastq fastq_fastmodel/pass/fastq_runid_*.gz
: the.fastq.gz
files to process with NanoPlot.
Let’s check what output was generated:
ls -1 nanoplot_fastmodel/
ecoli_fastmodel_subset_LengthvsQualityScatterPlot_dot.html
ecoli_fastmodel_subset_LengthvsQualityScatterPlot_kde.html
ecoli_fastmodel_subset_NanoPlot_20221120_2127.log
ecoli_fastmodel_subset_NanoPlot-report.html
ecoli_fastmodel_subset_NanoStats.txt
ecoli_fastmodel_subset_Non_weightedHistogramReadlength.html
ecoli_fastmodel_subset_Non_weightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_WeightedHistogramReadlength.html
ecoli_fastmodel_subset_WeightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_Yield_By_Length.html
The file that we are interested in is the HTML report: ecoli_fastmodel_subset_NanoPlot-report.html
To view this in your browser:
- Use the file browser (left panel of Jupyter window) to navigate to the
nanoplot_fastmodel
folder. - Control-click (Mac) or right-click (Windows/Linux) on the “ecoli_fastmodel_subset_NanoPlot-report.html” file and choose “Open in New Browser Tab”.
- The report should now be displayed in a new tab in your browser.
FastQC
To run the FastQC application on our FASTQ data, we need to create a single .fastq.gz
file containing all our data (remember the data are currently spread across multiple .fastq.gz
files in the fastq_fastmodel
folder).
# Make a single .fastq.gz file:
cat fastq_fastmodel/pass/*.gz > ecoli-pass.fastq.gz
# Use zcat to decompress the data on-the-fly and stream it to wc to count the number of lines in the file
zcat ecoli-pass.fastq.gz | wc -l
# Use bc to calculate the number of reads (remember: 4 lines per read):
echo 56448/4 | bc
Load the FastQC module:
module load FastQC
Make a folder for the FastQC output to written to:
mkdir fastqc_fastmodel
Run FastQC on our data:
# NB: I get a java memory error if I don't use two threads
# FastQC allocates 250Mb per thread. Can't remember how
# to specify more memory for java...
fastqc -t 2 -o fastqc_fastmodel ecoli-pass.fastq.gz
Command syntax:
fastqc
: run thefastqc
command-t 2
: use two cpus (see my note about memory usage above) - the-t
is for “threads”.-o fastqc_fastmodel
: specify output folder.ecoli-pass.fastq.gz
: FASTQ data to analyse.
Check the output:
ls -1 fastqc_fastmodel
ecoli-pass_fastqc.html
ecoli-pass_fastqc.zip
As we did with the NanoPlot output, we can view the HTML report in the browser:
- Use the file browser (left panel of Jupyter window) to navigate to the
fastqc_fastmodel
folder. - Control-click (Mac) or right-click (Windows/Linux) on the “ecoli-pass_fastqc.html” file and choose “Open in New Browser Tab”.
- The report should now be displayed in a new tab in your browser.
Bioawk
FastQC and NanoPlot are great for producing pre-formatted reports that summarise read/run quality.
But what if you want to ask some more specific questions about your data?
Bioawk is an extension to the Unix awk
command which allows use to query specific biological data types - here we will use it to ask some specific questions about our FASTQ data.
Load the bioawk module.
module load bioawk
Extract read name and length:
bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | head
04670db1-f5f2-41f6-a6bd-92f43f6d3968 4403
0d91f59f-9fcd-4935-b062-8f05258816c5 12428
0c6a1ddc-1098-472e-a219-6d5c336e9e1b 2996
0e1306e1-785a-4307-8bbb-a5095f022306 4249
101a659f-fb10-4a22-8174-3f659c6ce462 772
106a9dc6-5e05-4c93-8ff1-569dc97426ca 6377
0f90ec4c-6fc9-4592-8754-bf4fb800d115 2452
11fffe54-669b-4fbf-be13-2236507aa8c9 8762
13f7a628-acda-4f6e-927d-651cb62adf51 14868
14a422e7-c54b-4a1a-9a64-a68ce69e4c5c 7185
Number of reads:
bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | wc -l
14112
NB: this SHOULD match what we calculated above.
Total number of bases in first ten reads (I’ve stepped through the process to illustrate each component of the command):
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head
4403
12428
2996
4249
772
6377
2452
8762
14868
7185
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+
4403+12428+2996+4249+772+6377+2452+8762+14868+7185
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+ | bc
64492
So the first 10 reads comprise 64,492 bases.
Total number of bases in all reads
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | paste -sd+ | bc
144479297
In total, the 14112 reads in our data set comprise 144,479,297 bases (144Mbp).
Number of reads longer than 10000 bases
First create “yes/no” (1/0) information about whether each read has a length greater than 10,000 bases:
bioawk -c fastx '{print (length($seq) > 10000)}' ecoli-pass.fastq.gz | head
0
1
0
0
0
0
0
0
1
0
Now sum this up to find the total number of reads satisfying this condition:
bioawk -c fastx '{print (length($seq) > 5000)}' ecoli-pass.fastq.gz | paste -sd+ | bc
8700
So, of our 14,112 reads, 8700 of them are longer than 10,000 bases.
Key Points
FastQC and NanoPlot provide pre-formatted reports on run characteristics and read quality.
Bioawk can be used to generate summary metrics and ask specific questions about the read data.