Nanopore - Variant Calling
Overview
Teaching: 30 min
Exercises: 30 minQuestions
How can I use nanopore data for variant calling?
Objectives
Align reads from long-read data.
Perform variant calling using nanopore data.
Sequence alignment
- If a “reference” genome exists for the organism you are sequencing, reads can be “aligned” to the reference.
- This involves finding the place in the reference genome that each read matches to.
- Due to high sequence similarity within members of the same species, most reads should map to the reference.
Tools for generating alignments
- There are MANY software packages available for aligning data from next generation sequencing experiments.
- For short-read data e.g., Illumina sequencing), the two “original” short read aligners were:
- BWA: http://bio-bwa.sourceforge.net
- Bowtie: http://bowtie-bio.sourceforge.net
- Over the years, MANY more “aligners” have been developed.
- For long-read data, ONT provides the Minimap2 aligner (but there are also a number of other options).
Alignment with dorado
and minimap2
on NeSI
Check that we’ve got a reference genome:
ls genome
e-coli-k12-MG1655.fasta
Load the SAMtools module:
module load SAMtools
Load the Dorado module:
module load Dorado
Make a folder for our aligned data:
mkdir bam-aligned
We can use dorado
again to align the reads to our reference genome (note, this can also be done as part of the basecalling process, but that means you’re performing alignemnt prior to having done any QC checking).
dorado
uses the minimap2
aligner (another software tool from ONT). The syntax for performing alignemnt on basecalled data is (-t 4
specifies four threads: faster):
dorado aligner -t 4 \
genome/e-coli-k12-MG1655.fasta \
bam-unaligned/ecoli-pod5-pass-basecalls.bam | \
samtools sort -o bam-aligned/ecoli-pod5-pass-aligned-sort.bam
Index the bam file:
samtools index bam-aligned/ecoli-pod5-pass-aligned-sort.bam
Generate basic mapping information using samtools:
samtools coverage bam-aligned/ecoli-pod5-pass-aligned-sort.bam
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
U00096.3 1 4641652 12139 4641652 100 24.8486 15.4 59.5
Variant calling
NB - code below is taken from the “Variant Calling Workflow” section of the Genomic Data Carpentry course:
https://datacarpentry.org/wrangling-genomics/04-variant_calling/index.html
Load the BCFtools module:
module load BCFtools
Make a folder for the output data:
mkdir bcf-files
Generate a “pileup” file to enable variant calling (might take a little bit of time):
bcftools mpileup --threads 4 \
-B -Q5 --max-BQ 30 -I \
-o bcf-files/ecoli-pass.bcf \
-f genome/e-coli-k12-MG1655.fasta \
bam-aligned/ecoli-pod5-pass-aligned-sort.bam
Make a folder to hold vcf (variant call format) files:
mkdir vcf-files
Use bcftools
to call variants in the e. coli data:
bcftools call --ploidy 1 -m -v \
-o vcf-files/ecoli-variants.vcf \
bcf-files/ecoli-pass.bcf
Use vcfutils.pl
to filter the variants (in this case, everything gets kept):
vcfutils.pl varFilter vcf-files/ecoli-variants.vcf > vcf-files/ecoli-final-variants.vcf
more vcf-files/ecoli-final-variants.vcf
Key Points
Long-read based algorithms (e.g., minimap2) need to be used when aligning long read data.
Post-alignment, standard software tools (e.g., samtools, bcftools, etc) can be used to perform variant calling.