Summary and Schedule
This lesson has been adapted from the original Data Carpentry - Wrangling Genomics lesson to run on the NeSI infrastructure, rather than on AWS, as part of the Otago Bioinformatics Spring School.
A lot of genomics analysis is done using command-line tools for three reasons:
- you will often be working with a large number of files, and working through the command line rather than through a graphical user interface (GUI) allows you to automate repetitive tasks,
- you will often need more compute power than is available on your personal computer, and connecting to and interacting with remote computers requires a command-line interface, and
- you will often need to customize your analyses, and command-line tools often enable more customization than the corresponding GUI tools (if in fact a GUI tool even exists).
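As a small illustration of the first point, here is a minimal sketch of automating a repetitive task with a bash loop. The helper `count_reads`, the scratch directory, and the two tiny FASTQ files are all made up for this example and are not part of the lesson data. The sketch assumes the usual four-lines-per-read FASTQ layout.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: count the reads in one FASTQ file.
# Assumes each read occupies exactly four lines.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Build two tiny example files in a scratch directory (made-up data).
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo_dir/sample_A.fastq"
printf '@r1\nGGCA\n+\nIIII\n' > "$demo_dir/sample_B.fastq"

# One loop handles every file, whether there are two or two thousand.
for fq in "$demo_dir"/*.fastq; do
    printf '%s\treads=%s\n' "$(basename "$fq")" "$(count_reads "$fq")"
done
```

The same loop works unchanged as the number of files grows, which is hard to match by clicking through a GUI one file at a time.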
In a previous lesson, you learned how to use the bash shell to interact with your computer through a command-line interface. In this lesson, you will be applying this new knowledge to carry out a common genomics workflow: identifying variants among sequencing samples taken from multiple individuals within a population. We will be starting with a set of sequenced reads (.fastq files), performing some quality control steps, aligning those reads to a reference genome, and ending by identifying and visualizing variations among these samples.
As you progress through this lesson, keep in mind that, even if you aren’t going to be doing this same workflow in your research, you will be learning some very important lessons about using command-line bioinformatic tools. What you learn here will enable you to use a variety of bioinformatic tools with confidence and greatly enhance your research efficiency and productivity.
Prerequisites
This lesson assumes a working understanding of the bash shell. If you haven’t already completed the Shell Genomics lesson and aren’t familiar with the bash shell, please review those materials before starting this lesson.
This lesson also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the concept of genomic variation within a population.
This lesson uses data hosted on NeSI. Workshop participants will be given information on how to log in to NeSI during the workshop. Learners using these materials for self-directed study will need to set up their own Amazon Machine Image (AMI). Information on setting up an AMI and accessing the required data is provided on the original Genomics Workshop setup page.
Setup Instructions | Download files required for the lesson |
Duration: 00h 00m | 1. Background and Metadata | What data are we using? Why is this experiment important? |
Duration: 00h 15m | 2. Assessing Read Quality | How can I describe the quality of my data? |
Duration: 01h 05m | 3. Trimming and Filtering | How can I get rid of sequence data that does not meet my quality standards? |
Duration: 02h 00m | 4. Variant Calling Workflow | How do I find sequence variants between my sample and a reference genome? |
Duration: 03h 40m | 5. Automating a Variant Calling Workflow | How can I make my workflow more efficient and less error-prone? |
Duration: 04h 25m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
This workshop has been adapted from the original Data Carpentry - Genomics workshop to be conducted on the NeSI compute infrastructure. All software and data are already set up for you to use during the workshop.
The original workshop is designed to be run on pre-imaged Amazon Web Services (AWS) instances. For information about how to use the original workshop materials, see the setup instructions on the main workshop page.