04 - Leveling up your workflow!
Catching up
From section 03, you should have the following Snakefile:
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/sam/{sample}.sam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"
4.1 Use a profile for HPC
In section 3.16, we saw that a Snakemake workflow can be run on an HPC cluster. To reduce the boilerplate, we can use a configuration profile to set default options. In this case, we use it to set the --cluster and --jobs options.
Write the following to slurm/config.yaml:
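The profile content is not reproduced in this extract; reconstructing it from the line removed in the edit further below, a minimal starting profile would be:

jobs: 20
cluster: "sbatch --time 00:10:00 --mem 512MB --cpus-per-task 8"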
Then run the Snakemake workflow using the slurm profile:
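snakemake --profile slurm --use-envmodules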
If you interrupt the execution of a Snakemake workflow using CTRL+C, already submitted Slurm jobs won't be cancelled. We tell Snakemake how to cancel Slurm jobs using scancel via the --cluster-cancel option, and we add --parsable to the sbatch command to make it return the job ID.
Edit slurm/config.yaml
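The edit itself is not reproduced in this extract; reconstructing it from the profile state shown in the next step, it would be:

jobs: 20
- cluster: "sbatch --time 00:10:00 --mem 512MB --cpus-per-task 8"
+ cluster: "sbatch --parsable --time 00:10:00 --mem 512MB --cpus-per-task 8"
+ cluster-cancel: scancel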
You can specify different resources (memory, cpus, gpus, etc.) for each rule in the workflow and refer to them in the cluster option using placeholders. Default resources for all rules can also be set using the default-resources option. Update the profile slurm/config.yaml file as follows (using a multiline option to improve readability).
Edit slurm/config.yaml

jobs: 20
- cluster: "sbatch --parsable --time 00:10:00 --mem 512MB --cpus-per-task 8"
+ cluster:
+     sbatch
+         --parsable
+         --time {resources.time_min}
+         --mem {resources.mem_mb}
+         --cpus-per-task {resources.cpus}
+         --account nesi02659
+ default-resources: [cpus=2, mem_mb=512, time_min=10]
cluster-cancel: scancel
and add resources definitions in the workflow. Here we give more CPU resources to bwa_align to make it run faster.
Edit snakefile
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/sam/{sample}.sam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
+   resources:
+       cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"
Current slurm profile:
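The profile is not reproduced in this extract; applying the edit above, it should now read:

jobs: 20
cluster:
    sbatch
        --parsable
        --time {resources.time_min}
        --mem {resources.mem_mb}
        --cpus-per-task {resources.cpus}
        --account nesi02659
default-resources: [cpus=2, mem_mb=512, time_min=10]
cluster-cancel: scancel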
Current snakefile:
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/sam/{sample}.sam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"
Run the workflow again:
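snakemake --profile slurm --use-envmodules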
- If you monitor the progress of your jobs using squeue, you will notice that some jobs now request 2 or 8 CPUs.
output
JOBID USER ACCOUNT NAME CPUS MIN_MEM PARTITI START_TIME TIME_LEFT STATE NODELIST(REASON)
26763281 lkemp nesi02659 spawner-jupy 4 4G interac 2022-05-11T1 7:21:18 RUNNING wbn003
26763492 lkemp nesi02659 snakejob.fas 2 512M large 2022-05-11T1 9:59 RUNNING wbn144
26763493 lkemp nesi02659 snakejob.tri 8 512M large 2022-05-11T1 9:59 RUNNING wbn212
26763494 lkemp nesi02659 snakejob.fas 2 512M large 2022-05-11T1 9:59 RUNNING wbn145
26763495 lkemp nesi02659 snakejob.fas 2 512M large 2022-05-11T1 9:59 RUNNING wbn146
26763496 lkemp nesi02659 snakejob.tri 8 512M large 2022-05-11T1 9:59 RUNNING wbn217
26763497 lkemp nesi02659 snakejob.tri 8 512M large 2022-05-11T1 9:59 RUNNING wbn229
Now looking at the content of our workflow folder, it is getting cluttered with Slurm log files:
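The listing command is not shown in this extract; judging by the human-readable sizes in the output below, it would be something like:

ls -lh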
output
total 1.8M
-rw-rw----+ 1 lkemp nesi02659 4.2K May 11 12:10 dag_1.png
-rw-rw----+ 1 lkemp nesi02659 3.8K May 11 12:13 dag_2.png
-rw-rw----+ 1 lkemp nesi02659 12K May 11 12:16 dag_3.png
-rw-rw----+ 1 lkemp nesi02659 20K May 11 12:19 dag_4.png
-rw-rw----+ 1 lkemp nesi02659 15K May 11 12:21 dag_5.png
-rw-rw----+ 1 lkemp nesi02659 12K May 11 12:23 dag_6.png
-rw-rw----+ 1 lkemp nesi02659 26K May 11 12:24 dag_7.png
drwxrws---+ 5 lkemp nesi02659 4.0K May 11 12:25 logs
-rw-rw----+ 1 lkemp nesi02659 11K May 11 12:24 rulegraph_1.png
drwxrws---+ 3 lkemp nesi02659 4.0K May 11 12:34 slurm
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:27 slurm-26763403.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:27 slurm-26763404.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:27 slurm-26763405.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:27 slurm-26763406.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:27 slurm-26763407.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:27 slurm-26763408.out
-rw-rw----+ 1 lkemp nesi02659 865 May 11 12:29 slurm-26763409.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:30 slurm-26763418.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:30 slurm-26763419.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:30 slurm-26763420.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:30 slurm-26763421.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:30 slurm-26763422.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:30 slurm-26763423.out
-rw-rw----+ 1 lkemp nesi02659 865 May 11 12:32 slurm-26763431.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:33 slurm-26763435.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:34 slurm-26763436.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:33 slurm-26763437.out
-rw-rw----+ 1 lkemp nesi02659 837 May 11 12:34 slurm-26763438.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:34 slurm-26763439.out
-rw-rw----+ 1 lkemp nesi02659 809 May 11 12:34 slurm-26763440.out
-rw-rw----+ 1 lkemp nesi02659 865 May 11 12:35 slurm-26763444.out
-rw-rw----+ 1 lkemp nesi02659 857 May 11 12:36 slurm-26763447.out
-rw-rw----+ 1 lkemp nesi02659 829 May 11 12:36 slurm-26763448.out
-rw-rw----+ 1 lkemp nesi02659 857 May 11 12:36 slurm-26763449.out
-rw-rw----+ 1 lkemp nesi02659 857 May 11 12:36 slurm-26763450.out
-rw-rw----+ 1 lkemp nesi02659 829 May 11 12:36 slurm-26763451.out
-rw-rw----+ 1 lkemp nesi02659 829 May 11 12:36 slurm-26763452.out
-rw-rw----+ 1 lkemp nesi02659 885 May 11 12:37 slurm-26763454.out
-rw-rw----+ 1 lkemp nesi02659 1.8K May 11 12:34 Snakefile
Let's clean this up and create a dedicated folder logs/slurm for future log files:
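The commands are not shown in this extract; a minimal version, run from the workflow directory:

rm slurm-*.out
mkdir -p logs/slurm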
then instruct Slurm to save its log files in it, in the profile slurm/config.yaml file.
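The edit is not reproduced in this extract; based on the placeholder described in the note below, it adds an --output flag to the sbatch command:

cluster:
    sbatch
        --parsable
        --time {resources.time_min}
        --mem {resources.mem_mb}
        --cpus-per-task {resources.cpus}
        --account nesi02659
+       --output logs/slurm/slurm-%j-{rule}.out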
Note that logs/slurm/slurm-%j-{rule}.out contains a placeholder {rule}, which will be replaced by the name of the rule during the execution of the workflow (%j is replaced by Slurm with the job ID).
Finally, to improve the communication between Snakemake and Slurm, we need an additional script that translates Slurm job statuses for Snakemake. The --cluster-status option is used to tell Snakemake which script to use.
Create an executable status.py file and copy the following content into it.
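The commands are not shown in this extract; one way to create the executable file:

touch status.py
chmod +x status.py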
#!/usr/bin/env python
import subprocess
import sys

jobid = sys.argv[1]

output = str(subprocess.check_output("sacct -j %s --format State --noheader | head -1 | awk '{print $1}'" % jobid, shell=True).strip())

running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]
if "COMPLETED" in output:
    print("success")
elif any(r in output for r in running_status):
    print("running")
else:
    print("failed")
Then modify the profile slurm/config.yaml file.
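The edit is not reproduced in this extract; presumably it registers the status script via the cluster-status entry:

+ cluster-status: ./status.py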
Current slurm profile:
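Reconstructed from the edits above, the full profile should now be:

jobs: 20
cluster:
    sbatch
        --parsable
        --time {resources.time_min}
        --mem {resources.mem_mb}
        --cpus-per-task {resources.cpus}
        --account nesi02659
        --output logs/slurm/slurm-%j-{rule}.out
default-resources: [cpus=2, mem_mb=512, time_min=10]
cluster-cancel: scancel
cluster-status: ./status.py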
Once all of this is in place, we can:
- submit Slurm jobs with the right resources per Snakemake rule,
- cancel the workflow and its Slurm jobs using CTRL-C,
- keep all Slurm job log files in a dedicated folder,
- and make sure Snakemake reports Slurm job failures.
Exercise
Run the Snakemake workflow with Slurm jobs, then use scancel JOBID to cancel some of the Slurm jobs. See how Snakemake reacts with and without the status.py script.
4.2 Pull out parameters
Instead of hard-coding command-line flags inside the shell directive, we can pull them out into a params: directive and reference them with {params}. Let's add a new rule, sam_bam, that converts our SAM files to BAM with samtools view, keeping its flags (-S -b) in params, and update rule all to target the new BAM files.
Edit snakefile
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
-       expand("../results/sam/{sample}.sam", sample = SAMPLES)
+       expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

+ rule sam_bam:
+     input:
+         SAM = "../results/sam/{sample}.sam"
+     output:
+         BAM = "../results/bam/{sample}.bam"
+     params: "-S -b"
+     log: "logs/samtools/{sample}.sam.log"
+     threads: 2
+     envmodules:
+         "SAMtools/1.9-GCC-7.4.0"
+     shell: "samtools view {params} {input} > {output}"
Current snakefile:
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    shell: "samtools view {params} {input} > {output}"
Run a dryrun to check that it works:
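snakemake --dryrun --profile slurm --use-envmodules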
4.3 Pull out user configurable options
We can separate the user-configurable options from the core workflow. This supports reproducibility by minimising the chance of the user making changes to the core workflow.
Create a configuration file in a new directory config/
File structure:
demo_workflow/
|_______results/
|_______workflow/
| |_______logs/
| |_______slurm/
| |_______Snakefile
| |_______status.py
|_______config
|_______config.yaml
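The commands are not shown in this extract; one way to create the file, from the workflow directory:

mkdir -p ../config
touch ../config/config.yaml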
Now we need to pull out the parameters the user would likely need to configure. Let's give the user the option to pass any parameters they like to fastqc and multiqc. In our ../config/config.yaml file, add the configuration options and a couple of flags to be passed to fastqc and multiqc:

# set software parameters for...
PARAMS:
    # ... fastqc
    FASTQC: "--kmers 5"
    # ... multiqc
    MULTIQC: "--flat"
Edit snakefile: In the Snakefile, tell Snakemake to grab the variables PARAMS from ../config/config.yaml
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
+   params:
+       fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
-       "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} &> {log}"
+       "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
+   params:
+       multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
-       "multiqc {input} -o ../results/ &> {log}"
+       "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    shell: "samtools view {params} {input} > {output}"
Current snakefile:
# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    shell: "samtools view {params} {input} > {output}"
Let's use our configuration file! Run the workflow again:
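snakemake --profile slurm --use-envmodules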
Didn't work? My error:
KeyError in line 19 of /scale_wlg_persistent/filesets/project/nesi02659/snakemake20220512/lkemp/snakemake_workshop/demo_workflow/workflow/Snakefile:
'PARAMS'
File "/scale_wlg_persistent/filesets/project/nesi02659/snakemake20220512/lkemp/snakemake_workshop/demo_workflow/workflow/Snakefile", line 19, in <module>
Snakemake can't find our 'Key': we haven't told Snakemake where our config file is, so it can't find our config variables. We can fix this by passing the location of our config file to the --configfile flag.
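The command is not reproduced in this extract; based on the run command used elsewhere in this lesson, it would be:

snakemake --profile slurm --use-envmodules --configfile ../config/config.yaml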
Alternatively, we can define our config file in our Snakefile. This suits situations where the configuration file will always have the same name and location (../config/config.yaml) and you don't need the flexibility for the user to specify their own configuration file:
Edit snakefile
# define our configuration file
+ configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    shell: "samtools view {params} {input} > {output}"
Current snakefile:
# define our configuration file
configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    shell: "samtools view {params} {input} > {output}"
Then we don't need to specify where the configuration file is on the command line:
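snakemake --profile slurm --use-envmodules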
4.4 Leave messages for the user
We can give the user of our workflow more information on what is happening at each stage/rule of our workflow via the message: directive. We are able to reference many variables, such as:
- Input and output files: {input} and {output}
- Specific input and output files, such as {input.R1}
- Our {params}, {log} and {threads} directives
Edit snakefile
# define our configuration file
configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
+   message:
+       "Undertaking quality control checks {input}"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
+   message:
+       "Compiling a HTML report for quality control checks. Writing to {output}."
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
+   message:
+       "Aligning reads to the reference for {wildcards.sample}"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
+   message:
+       "Converting sam: {input} to bam: {output}"
    shell: "samtools view {params} {input} > {output}"
Current snakefile:
# define our configuration file
configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    message:
        "Undertaking quality control checks {input}"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    message:
        "Compiling a HTML report for quality control checks. Writing to {output}."
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = "../results/sam/{sample}.sam"
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    message:
        "Aligning reads to the reference for {wildcards.sample}"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    message:
        "Converting sam: {input} to bam: {output}"
    shell: "samtools view {params} {input} > {output}"
# run dryrun/run again
snakemake --dryrun --profile slurm --use-envmodules
snakemake --profile slurm --use-envmodules

- Now our messages are printed to the screen as our workflow runs
4.5 Create temporary files
Our workflow creates intermediate files: files that are used or produced along the way but that we don't need to keep once the workflow is done. We can mark such files as temporary so Snakemake will remove them once it doesn't need them anymore.
For example, we might not want to keep our SAM files, since the sam_bam rule converts them to BAM files holding the same information. Let's mark the SAM output of bwa_align as temporary in our Snakefile by wrapping it in the temp() function.
Edit snakefile
# define our configuration file
configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    message:
        "Undertaking quality control checks {input}"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    message:
        "Compiling a HTML report for quality control checks. Writing to {output}."
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
-       SAM = "../results/sam/{sample}.sam"
+       SAM = temp("../results/sam/{sample}.sam")
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    message:
        "Aligning reads to the reference for {wildcards.sample}"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    message:
        "Converting sam: {input} to bam: {output}"
    shell: "samtools view {params} {input} > {output}"
Current snakefile:
# define our configuration file
configfile: "../config/config.yaml"

# define samples from data directory using wildcards
SAMPLES, = glob_wildcards("../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq")

# target OUTPUT files for the whole workflow
rule all:
    input:
        "../results/multiqc_report.html",
        expand("../results/bam/{sample}.bam", sample = SAMPLES)

# workflow
rule fastqc:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq"
    output:
        html = ["../results/fastqc/{sample}_1.trim.sub_fastqc.html", "../results/fastqc/{sample}_2.trim.sub_fastqc.html"],
        zip = ["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"]
    params:
        fastqc_params = config['PARAMS']['FASTQC']
    log:
        "logs/fastqc/{sample}.log"
    threads: 2
    envmodules:
        "FastQC/0.11.9"
    message:
        "Undertaking quality control checks {input}"
    shell:
        "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"

rule multiqc:
    input:
        expand(["../results/fastqc/{sample}_1.trim.sub_fastqc.zip", "../results/fastqc/{sample}_2.trim.sub_fastqc.zip"], sample = SAMPLES)
    output:
        "../results/multiqc_report.html"
    params:
        multiqc_params = config['PARAMS']['MULTIQC']
    log:
        "logs/multiqc/multiqc.log"
    envmodules:
        "MultiQC/1.9-gimkl-2020a-Python-3.8.2"
    message:
        "Compiling a HTML report for quality control checks. Writing to {output}."
    shell:
        "multiqc {input} -o ../results/ {params.multiqc_params} &> {log}"

rule bwa_align:
    input:
        R1 = "../../data/trimmed_fastq_small/{sample}_1.trim.sub.fastq",
        R2 = "../../data/trimmed_fastq_small/{sample}_2.trim.sub.fastq",
        genome = "../../data/ref_genome/ecoli_rel606.fasta"
    output:
        SAM = temp("../results/sam/{sample}.sam")
    log: "logs/bwa/{sample}.bwa.log"
    threads: 2
    resources:
        cpus = 8
    envmodules:
        "BWA/0.7.17-gimkl-2017a"
    message:
        "Aligning reads to the reference for {wildcards.sample}"
    shell: "bwa mem -t {threads} {input.genome} {input.R1} {input.R2} > {output} 2> {log}"

rule sam_bam:
    input:
        SAM = "../results/sam/{sample}.sam"
    output:
        BAM = "../results/bam/{sample}.bam"
    params: "-S -b"
    log: "logs/samtools/{sample}.sam.log"
    threads: 2
    envmodules:
        "SAMtools/1.9-GCC-7.4.0"
    message:
        "Converting sam: {input} to bam: {output}"
    shell: "samtools view {params} {input} > {output}"
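Run the workflow again:

snakemake --profile slurm --use-envmodules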
Now when we have a look at the ../results/ directory:
- The SAM files have been removed once Snakemake no longer needed them for another rule/operation (here, once sam_bam had produced the BAM files), and we've saved some space on our computer.
This becomes particularly important when our data becomes big data, since we don't want to keep any massive intermediate output files that we don't need. Otherwise these files can start to clog up the storage on our computer. Marking them as temporary ensures our workflow remains scalable as our data grows.
4.6 Generating a snakemake report
With Snakemake, we can automatically generate detailed self-contained HTML reports after we run our workflow with the following command:
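The command is not shown in this extract; reports are generated with Snakemake's --report flag, for example (the output path here is an assumption):

snakemake --report ../results/report.html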
Note
You won't be able to view a rendered version of this html file while it is on the remote server; however, after you transfer it to your local computer, you should be able to view it in your web browser.
In our report:
- We get an interactive version of our directed acyclic graph (DAG).
- When you click on a node in the DAG, the input and output files are fully outlined, along with the exact software used and the exact shell command that was run.
- You are also provided with runtime information under the Statistics tab, outlining how long each rule/sample ran for and the date/time each file was created.
These reports are highly configurable; have a look at an example of what can be done with a report here.
See more information on creating Snakemake reports in the Snakemake documentation.
4.7 Linting your workflow
Snakemake has a built-in linter to support you in writing best practice workflows; let's try it out:
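The command is not shown in this extract; the linter is invoked with:

snakemake --lint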
We have a few things we could improve in our workflow!
Writing a best practice workflow is about more than Marie Kondo-level tidiness: it increases the chance your workflow will continue to be used and maintained by others (and by ourselves), making the code we write genuinely useful (it's exciting seeing someone else using your code!). If your workflow was used in scientific research, following best practices makes it accessible for people trying to reproduce your research findings: it won't be a nightmare for them to run, so they are more likely to try, and more likely to succeed.
Read more about the best practices for Snakemake
Takeaways
- Pull out your parameters and put them in a params: directive
- Pulling the user-configurable options away from the core workflow supports reproducibility by reducing the chance of changes to the core workflow
- Leaving messages for the user of your workflow helps them understand what is happening at each stage and follow the workflow's progress
- Mark files you won't need once the workflow completes as temporary to reduce disk usage - particularly when dealing with big data
- Generate a snakemake report to get a summary of the workflow run - these are highly configurable
- Lint your workflow and check it complies with best practices - this supports reproducibility and portability
- There is so much more to explore, such as creating modular workflows, automatically grabbing remote files from places like Google Cloud Storage and Dropbox, and running various types of scripts such as Python, R, RMarkdown and Jupyter Notebooks
Summary commands
- Use the parameter directive (params) to keep the parameters and flags of your programs separate from your shell command, for example:
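From the fastqc rule in this workflow:

params:
    fastqc_params = config['PARAMS']['FASTQC']
shell:
    "fastqc {input.R1} {input.R2} -o ../results/fastqc/ -t {threads} {params.fastqc_params} &> {log}"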
- Run your snakemake workflow (using environment modules to load your software AND with a configuration file) with:
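Based on the commands used in this lesson:

snakemake --profile slurm --use-envmodules --configfile ../config/config.yaml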
- Alternatively, define your config file in the Snakefile:
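configfile: "../config/config.yaml"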
- Use the message directive to provide information to the user on what is happening in real time, for example:
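message:
    "Aligning reads to the reference for {wildcards.sample}"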
- Mark temporary files to remove (once they are no longer needed by the workflow) with temp(), for example:
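output:
    SAM = temp("../results/sam/{sample}.sam")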
- Create a basic interactive Snakemake report after running your workflow with:
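The report command (output path assumed):

snakemake --report ../results/report.html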
Our final snakemake workflow!
See leveled_up_demo_workflow for the final Snakemake workflow we've created up to this point.