This lesson is still being designed and assembled (Pre-Alpha version)

Nanopore: quality assessment

Overview

Teaching: 10 min
Exercises: 10 min
Questions
  • How can I perform quality assessment of ONT data?

Objectives
  • Assess the quality of reads.

  • Generate summary metrics.

Software for quality assessment

In this part of the module, we’ll look at software applications that can be used to assess the quality of data from ONT sequencing runs. We’ll be focussing on two pieces of software: FastQC and NanoPlot.

NanoPlot

Let’s use the NanoPlot software to assess the quality of our .fastq data. We can run NanoPlot interactively (i.e., we don’t need to submit it as a slurm job), and it wil produce an HTML-based report that can be viewed in a browser.

Load the NanoPlot module:

module load NanoPlot

Run NanoPlot on passed fastq data (i.e., just the files in the pass folder):

NanoPlot -o nanoplot_fastmodel \
         -p ecoli_fastmodel_subset_ \
         --fastq fastq_fastmodel/pass/fastq_runid_*.gz

Command syntax:

Let’s check what output was generated:

ls -1 nanoplot_fastmodel/
ecoli_fastmodel_subset_LengthvsQualityScatterPlot_dot.html
ecoli_fastmodel_subset_LengthvsQualityScatterPlot_kde.html
ecoli_fastmodel_subset_NanoPlot_20221120_2127.log
ecoli_fastmodel_subset_NanoPlot-report.html
ecoli_fastmodel_subset_NanoStats.txt
ecoli_fastmodel_subset_Non_weightedHistogramReadlength.html
ecoli_fastmodel_subset_Non_weightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_WeightedHistogramReadlength.html
ecoli_fastmodel_subset_WeightedLogTransformed_HistogramReadlength.html
ecoli_fastmodel_subset_Yield_By_Length.html

The file that we are interested in is the HTML report: ecoli_fastmodel_subset_NanoPlot-report.html

To view this in your browser:

  1. Use the file browser (left panel of Jupyter window) to navigate to the nanoplot_fastmodel folder.
  2. Control-click (Mac) or right-click (Windows/Linux) on the “ecoli_fastmodel_subset_NanoPlot-report.html” file and choose “Open in New Browser Tab”.
  3. The report should now be displayed in a new tab in your browser.

FastQC

To run the FastQC application on our FASTQ data, we need to create a single .fastq.gz file containing all our data (remember the data are currently spread across multiple .fastq.gz files in the fastq_fastmodel folder).

# Make a single .fastq.gz file:
cat fastq_fastmodel/pass/*.gz > ecoli-pass.fastq.gz

# Use zcat to decompress the data on-the-fly and stream it to wc to count the number of lines in the file
zcat ecoli-pass.fastq.gz | wc -l 

# Use bc to calculate the number of reads (remember: 4 lines per read):
echo 56448/4 | bc

Load the FastQC module:

module load FastQC

Make a folder for the FastQC output to written to:

mkdir fastqc_fastmodel

Run FastQC on our data:

# NB: I get a java memory error if I don't use two threads
#     FastQC allocates 250Mb per thread. Can't remember how 
#     to specify more memory for java...

fastqc -t 2 -o fastqc_fastmodel ecoli-pass.fastq.gz

Command syntax:

Check the output:

ls -1 fastqc_fastmodel
ecoli-pass_fastqc.html
ecoli-pass_fastqc.zip

As we did with the NanoPlot output, we can view the HTML report in the browser:

  1. Use the file browser (left panel of Jupyter window) to navigate to the fastqc_fastmodel folder.
  2. Control-click (Mac) or right-click (Windows/Linux) on the “ecoli-pass_fastqc.html” file and choose “Open in New Browser Tab”.
  3. The report should now be displayed in a new tab in your browser.

Bioawk

FastQC and NanoPlot are great for producing pre-formatted reports that summarise read/run quality.

But what if you want to ask some more specific questions about your data?

Bioawk is an extension to the Unix awk command which allows use to query specific biological data types - here we will use it to ask some specific questions about our FASTQ data.

Load the bioawk module.

module load bioawk

Extract read name and length:

bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | head
04670db1-f5f2-41f6-a6bd-92f43f6d3968    4403
0d91f59f-9fcd-4935-b062-8f05258816c5    12428
0c6a1ddc-1098-472e-a219-6d5c336e9e1b    2996
0e1306e1-785a-4307-8bbb-a5095f022306    4249
101a659f-fb10-4a22-8174-3f659c6ce462    772
106a9dc6-5e05-4c93-8ff1-569dc97426ca    6377
0f90ec4c-6fc9-4592-8754-bf4fb800d115    2452
11fffe54-669b-4fbf-be13-2236507aa8c9    8762
13f7a628-acda-4f6e-927d-651cb62adf51    14868
14a422e7-c54b-4a1a-9a64-a68ce69e4c5c    7185

Number of reads:

bioawk -c fastx '{print $name,length($seq)}' ecoli-pass.fastq.gz | wc -l
14112

NB: this SHOULD match what we calculated above.

Total number of bases in first ten reads (I’ve stepped through the process to illustrate each component of the command):

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head 
4403
12428
2996
4249
772
6377
2452
8762
14868
7185
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+ 
4403+12428+2996+4249+772+6377+2452+8762+14868+7185
bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | head | paste -sd+ | bc
64492

So the first 10 reads comprise 64,492 bases.

Total number of bases in all reads

bioawk -c fastx '{print length($seq)}' ecoli-pass.fastq.gz | paste -sd+ | bc
144479297

In total, the 14112 reads in our data set comprise 144,479,297 bases (144Mbp).

Number of reads longer than 10000 bases

First create “yes/no” (1/0) information about whether each read has a length greater than 10,000 bases:

bioawk -c fastx '{print (length($seq) > 10000)}' ecoli-pass.fastq.gz | head
0
1
0
0
0
0
0
0
1
0

Now sum this up to find the total number of reads satisfying this condition:

bioawk -c fastx '{print (length($seq) > 5000)}' ecoli-pass.fastq.gz | paste -sd+ | bc
8700

So, of our 14,112 reads, 8700 of them are longer than 10,000 bases.

Key Points

  • FastQC and NanoPlot provide pre-formatted reports on run characteristics and read quality.

  • Bioawk can be used to generate summary metrics and ask specific questions about the read data.