RStudio Project and Experimental Data

Last updated on 2024-11-15 | Edit this page

Overview

Questions

  • How do you use RStudio project to manage your analysis project?
  • What is the most effective way to organize directories for an analysis project?
  • How to download a dataset from the internet and save it as a file.

Objectives

  • Create an RStudio project and the directories required for storing the files pertinent to an analysis project.
  • Download the data set that will be used for the subsequent episodes.

Introduction


Typically, an analysis project begins with dataset files, a handful of R scripts and output files in a directory. As the project advances, complexity inevitably rises with the addition of more scripts, output files, and possibly new datasets. The complexity is further amplified when dealing with multiple versions of scripts and output files, necessitating efficient organisation. If these are not well-managed from the beginning, resuming the project after a break, or sharing the project with someone else becomes challenging and time-consuming, as we struggle to recall the project’s status and navigate the directory tree. Additionally, without proper organisation, the project’s complexity could lead to frequent use of the setwd function to switch between different working directories, resulting in a disorganised workspace.

In this lesson, we will first focus on an effective strategy for managing the files, both used and generated by our data analysis project, within a working directory.

What is a working directory?

A working directory in R is the default location on your computer where R looks for files to load or store any data you wish to save. More information can be found in our Introduction to data analysis with R and Bioconductor lesson.

Secondly, we will also learn how to leverage RStudio projects, a feature built-in to RStudio for managing our analysis project.

What is RStudio?

RStudio is a freely available integrated development environment (IDE), widely used by scientists and software developers for developing software or analysing datasets in R. If you require assistance with RStudio or its general usage, please refer to our Introduction to data analysis with R and Bioconductor lesson.

Finally, in this lesson, we will also learn to use the R function download.file to download the data for the subsequent episodes.

Structuring your working directory


For a more streamlined workflow, we suggest storing all files associated with an analysis in a specific directory, which will serve as your project’s working directory. Initially, this working directory should contain four distinct directories:

  • data: dedicated to storing raw data. This folder should ideally only house raw data and not be modified unless you receive a new dataset (even then, if you have the storage capacity, we suggest you retain the previous dataset as well in case you need it again in the future). For RNA-seq data analysis, this directory will typically contain *.fastq files and any related metadata files for the experiment.
  • scripts: for storing the R scripts you’ve written and utilised for analysing the data.
  • documents: for storing documents related to your analysis, such as a manuscript outline or meeting notes with your team.
  • output: for storing intermediate or final results generated by the R scripts in the scripts directory. Importantly, if you carry out data cleaning or pre-processing, the output should ideally be stored in this directory, as these no longer represent raw data.

As your project grows in complexity, you might find it necessary to create more directories or sub-directories. Nevertheless, the aforementioned four directories should serve as the foundation of your working directory.

Create the directories for subsequent episodes

Create a directory on your computer to serve as the working directory for the rest of this episode and lesson (the workshop example uses a directory called bio_rnaseq). Then, within this chosen directory, create the four fundamental directories previously discussed (data, scripts, documents, and output).

Your working directory should look like this
Your working directory should look like this

Using RStudio project to manage your working directory


As previously highlighted, RStudio project is a feature built-into RStudio for managing your analysis project. It does so by storing project-specific settings in an .Rproj file stored in your project’s working directory. Loading these settings up into RStudio by either opening the .Rproj file directly or through RStudio’s open project option (from the menu bar, select File > Open Project...) will automatically set your working directory in R to the location of the .Rproj file, essentially your project’s working directory.

To create an RStudio project:

  1. Start RStudio.
  2. Navigate to the menu bar and select File > New Project....
  3. Choose Existing Directory.
  4. Click Browse... button, and select the directory you have previously chosen as the working directory for the analysis (i.e., the directory where the 4 essential directories reside).
  5. Click Create Project at the bottom right of the window.

Upon completion of the steps above, you will find a .Rproj file within your project’s working directory.

A new .Rproj file should be created in your chosen working directory.
A new .Rproj file should be created in your chosen working directory.

Moreover, the heading of your RStudio console should now also display the absolute path of your project’s working directory, i.e., where the .Rproj file resides, indicating that RStudio has set this directory as your working directory in R.

Your R working directory should now be set to where the .Rproj file resides.
Your R working directory should now be set to where the .Rproj file resides.

From this point forward, any R code that you execute that involves reading data from a file or saving data in a file will, by default, be directed to a path relative to your project’s working directory.

If you wish to close the project, perhaps to open another project, create a new one, or take a break from the project, you can do so by using to File > Close Project option located in the menu bar. To open the project back up, either double-click on the .Rproj file in the working directory, or open up RStudio and using the File > Open Project option in the menu bar.

Download the RNA-seq data for subsequent episodes


Finally, we will learn how to use R to download the RNA-seq data required for the subsequent episodes of the lesson. The dataset we will be using was generated to investigate the impact upper-respiratory infection have on changes in RNA transcription in the cerebellum and spinal cord of mice. This dataset was produced as part of the following study:

Blackmore, Stephen, et al. “Influenza infection triggers disease in a genetic model of experimental autoimmune encephalomyelitis.” Proceedings of the National Academy of Sciences 114.30 (2017): E6107-E6116.

The dataset is available at Gene Expression Omnibus (GEO), under the accession number GSE96870. Downloading data from GEO is not straightforward (and won’t be covered in this lesson). Hence, we have made the data available on our GitHub repository for easier access.

To download the files, we will use the R function download.file, which necessitates at least two parameters: url and destfile. The url parameter is used to specify the address on the internet to download the data from. The destfile parameter indicates where to save the downloaded file and what the downloaded file should be named.

Let’s download one of the four data files needed for the remainder of this lesson. The data file is located at https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_counts_cerebellum.csv. We shall save the downloaded file in the data folder of our working directory with the name GSE96870_counts_cerebellum.csv.

R

download.file(
    url = "https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_counts_cerebellum.csv", 
    destfile = "data/GSE96870_counts_cerebellum.csv"
)

If you navigate to the data folder in your working directory, you should now find a file named GSE96870_counts_cerebellum.csv.

A file named GSE96870_counts_cerebellum.csv should now reside in the data folder.
A file named GSE96870_counts_cerebellum.csv should now reside in the data folder.

Download the remaining data set files

There are three more data set files we need to download for the remainder of this lesson.

URL Filename
https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_coldata_cerebellum.csv GSE96870_coldata_cerebellum.csv
https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_coldata_all.csv GSE96870_coldata_all.csv
https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_rowranges.tsv GSE96870_rowranges.tsv

Use the download.file function to download the files into the data folder in your working directory.

R

download.file(
    url = "https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_coldata_cerebellum.csv", 
    destfile = "data/GSE96870_coldata_cerebellum.csv"
)

download.file(
    url = "https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_coldata_all.csv", 
    destfile = "data/GSE96870_coldata_all.csv"
)

download.file(
    url = "https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_rowranges.tsv", 
    destfile = "data/GSE96870_rowranges.tsv"
)

Key Points

  • Proper organisation of the files required for your project in a working directory is crucial for maintaining order and ensuring easy access in the future.
  • RStudio project serves as a valuable tool for managing your project’s working directory and facilitating analysis.
  • The download.file function in R can be used for downloading datasets from the internet.