Nanopore: quality assessment

Overview

Teaching: 10 min
Exercises: 10 min

Questions

How can I perform quality assessment of ONT data?

Objectives

Assess the quality of reads.

Generate summary metrics.

Software for quality assessment

In this part of the module, we’ll look at software applications that can be used to assess the quality of data from ONT sequencing runs. We’ll be focussing on two pieces of software: FastQC and NanoPlot.

FastQC (you’ve already seen this): generic tool for assessing the quality of next generation sequencing data.
NanoPlot: quality assessment software specifically built for ONT data.

NanoPlot

Let’s use the NanoPlot software to assess the quality of our .fastq data. We can run NanoPlot interactively (i.e., we don’t need to submit it as a slurm job), and it wil produce an HTML-based report that can be viewed in a browser.

Load the NanoPlot module:

module load NanoPlot

Run NanoPlot on passed fastq data (i.e., just the files in the pass folder):

NanoPlot -o nanoplot_fastmodel \
         -p ecoli_fastmodel_subset_ \
         --fastq fastq_fastmodel/pass/fastq_runid_*.gz

Command syntax:

NanoPlot: run the NanoPlot command (note that the capital N and P are required).
-o nanoplot_fastmodel: folder for output (-o) to be written.
-p ecoli_fastmodel_subset_: prefix (-p) to be appended to start of each filename in the output folder.
--fastq fastq_fastmodel/pass/fastq_runid_*.gz: the .fastq.gz files to process with NanoPlot.

Let’s check what output was generated:

ls -1 nanoplot_fastmodel/

ecoli_fastmodel_subset_LengthvsQualityScatterPlot_dot.html
ecoli_fastmodel_subset_LengthvsQualityScatterPlot_kde.html
ecoli_fastmodel_subset_NanoPlot_20221120_2127.log
ecoli_fastmodel_subset_NanoPlot-report.html
ecoli_fastmodel_subset_NanoStats.txt
ecoli_fastmodel_subset_Non_weightedHistogramReadlength.html
ecoli_fastmodel_subset_Non_weightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_WeightedHistogramReadlength.html
ecoli_fastmodel_subset_WeightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_Yield_By_Length.html

The file that we are interested in is the HTML report: ecoli_fastmodel_subset_NanoPlot-report.html

To view this in your browser:

Use the file browser (left panel of Jupyter window) to navigate to the nanoplot_fastmodel folder.
Control-click (Mac) or right-click (Windows/Linux) on the “ecoli_fastmodel_subset_NanoPlot-report.html” file and choose “Open in New Browser Tab”.
The report should now be displayed in a new tab in your browser.

FastQC

To run the FastQC application on our FASTQ data, we need to create a single .fastq.gz file containing all our data (remember the data are currently spread across multiple .fastq.gz files in the fastq_fastmodel folder).

# Make a single .fastq.gz file:
cat fastq_fastmodel/pass/*.gz > ecoli-pass.fastq.gz

# Use zcat to decompress the data on-the-fly and stream it to wc to count the number of lines in the file
zcat ecoli-pass.fastq.gz | wc -l 

# Use bc to calculate the number of reads (remember: 4 lines per read):
echo 56448/4 | bc

Load the FastQC module:

module load FastQC

Make a folder for the FastQC output to written to:

mkdir fastqc_fastmodel

Run FastQC on our data:

# NB: I get a java memory error if I don't use two threads
#     FastQC allocates 250Mb per thread. Can't remember how 
#     to specify more memory for java...

fastqc -t 2 -o fastqc_fastmodel ecoli-pass.fastq.gz

Command syntax:

fastqc: run the fastqc command
-t 2: use two cpus (see my note about memory usage above) - the -t is for “threads”.
-o fastqc_fastmodel: specify output folder.
ecoli-pass.fastq.gz: FASTQ data to analyse.

Check the output:

ls -1 fastqc_fastmodel

ecoli-pass_fastqc.html
ecoli-pass_fastqc.zip

As we did with the NanoPlot output, we can view the HTML report in the browser:

Use the file browser (left panel of Jupyter window) to navigate to the fastqc_fastmodel folder.
Control-click (Mac) or right-click (Windows/Linux) on the “ecoli-pass_fastqc.html” file and choose “Open in New Browser Tab”.
The report should now be displayed in a new tab in your browser.

Bioawk

FastQC and NanoPlot are great for producing pre-formatted reports that summarise read/run quality.

But what if you want to ask some more specific questions about your data?

Bioawk is an extension to the Unix awk command which allows use to query specific biological data types - here we will use it to ask some specific questions about our FASTQ data.

Load the bioawk module.

module load bioawk

Extract read name and length:

bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | head

04670db1-f5f2-41f6-a6bd-92f43f6d3968    4403
0d91f59f-9fcd-4935-b062-8f05258816c5    12428
0c6a1ddc-1098-472e-a219-6d5c336e9e1b    2996
0e1306e1-785a-4307-8bbb-a5095f022306    4249
101a659f-fb10-4a22-8174-3f659c6ce462    772
106a9dc6-5e05-4c93-8ff1-569dc97426ca    6377
0f90ec4c-6fc9-4592-8754-bf4fb800d115    2452
11fffe54-669b-4fbf-be13-2236507aa8c9    8762
13f7a628-acda-4f6e-927d-651cb62adf51    14868
14a422e7-c54b-4a1a-9a64-a68ce69e4c5c    7185

Number of reads:

bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | wc -l

NB: this SHOULD match what we calculated above.

Total number of bases in first ten reads (I’ve stepped through the process to illustrate each component of the command):

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+

4403+12428+2996+4249+772+6377+2452+8762+14868+7185

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+ | bc

So the first 10 reads comprise 64,492 bases.

Total number of bases in all reads

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | paste -sd+ | bc

144479297

In total, the 14112 reads in our data set comprise 144,479,297 bases (144Mbp).

Number of reads longer than 10000 bases

First create “yes/no” (1/0) information about whether each read has a length greater than 10,000 bases:

bioawk -c fastx '{print (length($seq) > 10000)}' ecoli-pass.fastq.gz | head

Now sum this up to find the total number of reads satisfying this condition:

bioawk -c fastx '{print (length($seq) > 5000)}' ecoli-pass.fastq.gz | paste -sd+ | bc

So, of our 14,112 reads, 8700 of them are longer than 10,000 bases.

Key Points

FastQC and NanoPlot provide pre-formatted reports on run characteristics and read quality.

Bioawk can be used to generate summary metrics and ask specific questions about the read data.

previous episode

Nanopore Sequencing

next episode