Nanopore - Variant Calling
Overview
Teaching: 10 min
Exercises: 10 minQuestions
How can I use nanopore data for variant calling?
Objectives
Align reads from long-read data.
Perform variant calling using nanopore data.
Sequence alignment
- If a “reference” genome exists for the organism you are sequencing, reads can be “aligned” to the reference.
- This involves finding the place in the reference genome that each read matches to.
- Due to high sequence similarity within members of the same species, most reads should map to the reference.
Tools for generating alignments
- There are MANY software packages available for aligning data from next generation sequencing experiments.
- For short-read data e.g., Illumina sequencing), the two “original” short read aligners were:
- BWA: http://bio-bwa.sourceforge.net
- Bowtie: http://bowtie-bio.sourceforge.net
- Over the years, MANY more “aligners” have been developed.
- For long-read data, ONT provides the Minimap2 aligner (but there are also a number of other options).
Alignment with minimap2
on NeSI
Load the minimap2 module:
module load minimap2
Check that we’ve got a reference genome:
ls genome
e-coli-k12-MG1655.fasta
Load the SAMtools module:
module load SAMtools
Make a folder for our aligned data:
mkdir bam-files
Use minimap2 followed by samtools to generate aligned data in .bam
format:
minimap2 -ax map-ont genome/e-coli-k12-MG1655.fasta \
fastq_fastmodel/pass/*.gz | \
samtools sort -o bam-files/ecoli-pass-aligned-sort.bam
Index the bam file:
samtools index bam-files/ecoli-pass-aligned-sort.bam
Generate basic mapping information using samtools:
samtools coverage bam-files/ecoli-pass-aligned-sort.bam
#rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
U00096.3 1 4641652 14273 4641652 100 28.93 17.3 59.8
Variant calling
NB - code below is taken from the “Variant Calling Workflow” section of the Genomic Data Carpentry course:
https://datacarpentry.org/wrangling-genomics/04-variant_calling/index.html
Load the BCFtools module:
module load BCFtools
Make a folder for the output data:
mkdir bcf-files
Generate a “pileup” file to enable variant calling (might take a little bit of time):
bcftools mpileup -B -Q5 --max-BQ 30 -I \
-o bcf-files/ecoli-pass.bcf \
-f genome/e-coli-k12-MG1655.fasta \
bam-files/ecoli-pass-aligned-sort.bam
Make a folder to hold vcf (variant call format) files:
mkdir vcf-files
Use bcftools
to call variants in the e. coli data:
bcftools call --ploidy 1 -m -v \
-o vcf-files/ecoli-variants.vcf \
bcf-files/ecoli-pass.bcf
Use vcfutils.pl
to filter the variants (in this case, everything gets kept):
vcfutils.pl varFilter vcf-files/ecoli-variants.vcf > vcf-files/ecoli-final-variants.vcf
more vcf-files/ecoli-final-variants.vcf
Key Points
Long-read based algorithms (e.g., minimap2) need to be used when aligning long read data.
Post-alignment, standard software tools (e.g., samtools, bcftools, etc) can be used to perform variant calling.