library(DESeq2)
library(dplyr)
load("counts.RData")
conds <- c("WT","WT","WT","MT","MT","MT") # set the conditions, note this is in order Identify Differentially Expressed Genes

Statistical approach
In order to identify which genes are differentially expressed between two groups (in this case, we have 3 samples that are in the wild type group and 3 samples in the mutant group), when need to test if the two groups are significantly different from one another, which can be a challenging question to answer. We can start by specifically asking our question as a statistical question: Are the differences we observe between the two groups greater than the differences we would expect to see by chance?
In the interest of time, in this workshop we will not get into T-tests, P-values and types of errors. You can read more about that here:
TL;DR:
- The “standard” P-value of 0.05 is not sufficient – 5% is too many genes that will be false positives (i.e., genes that look like they are DE, but they are not).
- We need to use a correction - usually FDR, also called Benjamini and Hochberg correction, but if you want to be more strict use FWER.
- We must correct for heteroskedasticity, i.e., cases where variance is dependent on the mean.
Identifying differentially expressed genes using DESeq2
DESeq2 is a highly-regarded R package for analysing RNA-seq data. DESeq2 uses a negative binomial method to model the count data, and combines this with a generalised linear model (GLM) to identify differentially expressed genes. For more about the DESeq2 package, you can read the original article.
The DESeq2 object
The DESeq2 package requires a specific data storage object called a “DESeq Data Set” or DDS object. The DDS object contains not just the count data, but also the design matrix we created earlier (which is used to specify groups for comparison). This object can be created using a function from the DESeq2 package and requires that we specify a source of count data (as a matrix), column data (column names, mutant and wildtype), and a design matrix (the same one we used for Limma above). Once we have created the dds object, we can view the data stored within using the counts function.
Note that if your data is stored in another format the DESeq2 function we will use to import data can be modified to work on other data formats e.g., a summarisedExperimentObject.
Some set up to run first:
Create the DESeq data set object:
dds <- DESeqDataSetFromMatrix(countData = as.matrix(counts),
colData = data.frame(conds=factor(conds)),
design = formula(~conds))
counts(dds) %>% head() WT1 WT2 WT3 MT1 MT2 MT3
YDL248W 52 46 36 65 70 78
YDL247W-A 0 0 0 0 1 0
YDL247W 2 4 2 6 8 5
YDL246C 0 0 1 1 2 0
YDL245C 0 3 0 5 7 4
YDL244W 6 6 5 20 30 19
Now fit the DESeq2 generalised linear model to the data.
dds <- DESeq(dds)estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
Note the outputs from running this function. What are these outputs?
Estimating size factors: assesses library size and calculates factors to adjust for differences between samples. This also adjusts for compositional differences (e.g., if gene X in sample 1 takes up a very large proportion of all available reads, other genes will have correspondingly fewer genes. If this effect is not uniform across samples, it can be corrected for during this stage).
Dispersion: adjusting for heteroskedasticity. DESeq2 makes use of variability estimates from not just one gene, but from all genes to make estimates about overall levels of variance. By bringing in (or “borrowing”) information from other genes, DESeq2 compensates for a small number of samples (which can lead to artificially small variance estimates otherwise).
Fitting model and testing: fitting the GLM and identifying differentially expressed genes.
We will now create a new object, res, to store the results in. We can access those results using either the head function or the summary function, which will give us slightly different information - both are valid and useful ways of familiarising yourself with the data.
res <- DESeq2::results(dds)
res %>% head()log2 fold change (MLE): conds WT vs MT
Wald test p-value: conds WT vs MT
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
YDL248W 56.223032 -0.233947 0.227842 -1.026796 0.30451648 0.36636385
YDL247W-A 0.139771 -0.525577 4.080473 -0.128803 0.89751357 0.91739265
YDL247W 4.242821 -0.810055 0.833254 -0.972158 0.33097175 0.39485364
YDL246C 0.618241 -1.032636 2.156090 -0.478939 0.63198173 0.69151478
YDL245C 2.848658 -1.978175 1.175885 -1.682287 0.09251323 0.12300279
YDL244W 13.088335 -1.582335 0.512879 -3.085204 0.00203413 0.00328597
res %>% summary()
out of 6830 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2520, 37%
LFC < 0 (down) : 2521, 37%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
The res object contains information for all genes tested. It is practical to create a new object that contains only the genes we consider differentially expressed based on the thresholds (p-value, logFC) and methods (e.g., multiple testing adjustment) that suit our situation.
We will remove any rows that have NAs in the results object, then pull out only those with an adjusted p-value less than 0.05.
res <- na.omit(res)
# Keep all rows in the res object if the adjusted p-value < 0.05
resPadj <- res[res$padj <= 0.05 , ]
resPadjLogFC <- res[res$padj <= 0.05 & abs(res$log2FoldChange) > log2(2),]
# Get dimensions
resPadj %>% dim()[1] 4811 6
resPadjLogFC %>% dim()[1] 1830 6
# 1830 x 6Bioinformatics can often be something of a “black box” - we execute a function by feeding in some data, and get new data as output. It’s very easy to fall into the trap of believing your data without asking all the necessary questions. For example, we’ve just used an adjusted p-value for our threshold for significance. Did you question which method was used to adjust for multiple testing? See if you can identify, using the help menu as we’ve done above, which method was used here. Try and specify a different method for multiple testing corrections and note how this alters your output.
Identifying differentially expressed genes using LIMMA
Limma involves data transformation and log scales to account for the data being in a non-normal distribution.
Limma will create one of the special objects mentioned earlier in this workshop to store data. For Limma, this will be called a Digital Gene Expression object (DGE object). We will use the edgeR package to create this DGE object.
Load the libraries required for running limma, create a dge object where we specify counts (here you specify the matrix that contains your counts, in our case our object is already called counts). Then calculate normalisation factors, which will be used to adjust for library size differences, and calculate the log CPM. logCPM is, for each gene, how many counts there were per million reads. Because we are taking the log of this value, we will use the prior.counts=3 argument to add 3 to all read counts. This prevents any errors due to attempting to take the log of 0.
# BiocManager::install("limma")
# BiocManager::install("edgeR")
# BiocManager::install("DESeq2")
library(limma)
library(edgeR)
library(dplyr)load("counts.RData")
dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=1)
head(logCPM, 3) WT1 WT2 WT3 MT1 MT2 MT3
YDL248W 3.6735494 3.5040374 3.1893385 3.5957070 3.6691045 3.8750997
YDL247W-A -2.2615402 -2.2615402 -2.2615402 -2.2615402 -1.3684989 -2.2615402
YDL247W -0.5326267 0.2350803 -0.5068921 0.3843496 0.7123824 0.1836614
The design matrix
We use the design matrix to specify our different groups. Initially we will just look at a simple design, where we will specify wt and mutant. However, you can also include other information e.g., if your two sample groups were comprised of male and females and you had reason to suspect there might be sex-dependent factors involved.
conds <- c("WT","WT","WT","MT","MT","MT")
design <- model.matrix(~conds)
design (Intercept) condsWT
1 1 1
2 1 1
3 1 1
4 1 0
5 1 0
6 1 0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$conds
[1] "contr.treatment"
Addressing heteroskedasticity with Limma
Limma has a function called voom which we can use to address heteroskedasticity (a case where mean and variance are not independent). Voom will estimate the strength of the relationship between mean and variance and calculate “precision weights” for each gene. These are then used to normalise the data during the identification of differentially expressed genes.
Create an object, v, by calling the voom function on the Digital Gene Expression (dge) object. The v object will be used in later analyses. We will add information about our experimental setup by specifying the design object, and we will also call plot = TRUE to generate a plot showing the heteroskedasticity.
v <- voom(dge, design, plot = TRUE)
This plot highlights an earlier point about heteroskedasticity. For genes with low expression, the variance (here the square root of the standard deviation) is highly variable. Because the variance is dependent on the mean, we describe this data as heteroskedastic. Our v (voom) object contains information relating to the curve line (red) and makes an adjustment factor, so that the data will have a uniform relationship between the mean and variance.
Importantly, other than changing the mean-variance relationship, voom does not cause significant changes to the underlying data.
The impacts of voom
boxplot(v$E ~ col(v$E), ylab = "Counts", xlab = "Samples", main = "Log counts after voom")
We can see that log counts in the voom object (that is, counts after voom has been applied to create a stable mean-variance relationship) have not drastically changed from when we plotted them earlier.
We can also visualise the data in the form of density plots:
lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour[1] "blue" "blue" "blue" "red" "red" "red"
plot(density(v$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(v$E[,i]), col=lineColour[i])
We can see that voom has not removed the differences between our sample groups.
What if we wanted to remove these differences? Remember, we (probably) cannot be certain as to whether these differences are biological or technical. In cases like this it can be good to run your analysis multiple ways - first without removing the differences, and then after removing the differences. If a decision like this causes your results to change drastically then you need to be aware of the impact your choices are having so that you can make an informed decision about how to investigate further.
Detecting differentially expressed genes
Now that we have reached this point in the analysis, identifying differentially expressed genes is relatively simple in terms of the code we run. We will use the lmFit() command to fit a linear model for every gene using our v (voom) object and the design matrix, then use the eBayes() function on the linear model to perform Empirical Bayes shrinkage estimation and return moderated test statistics.
fit <- lmFit(v, design)
fit <- (eBayes(fit))The fit object is complex to look at. We will use the topTable function to retrieve the useful information (gene name, logFC, adjusted p values), and then filter for significant genes only. We will also save the topTable object (tt) for use in the next episode.
tt <- topTable(fit, coef=ncol(design), n=nrow(counts))
head(tt) logFC AveExpr t P.Value adj.P.Val B
YAL038W 2.313985 10.80214 319.5950 3.725452e-13 8.850433e-10 21.08951
YOR161C 2.568389 10.80811 321.9628 3.574510e-13 8.850433e-10 21.08187
YML128C 1.640664 11.40819 286.6167 6.857932e-13 9.775297e-10 20.84520
YMR105C 2.772539 9.65092 331.8249 3.018547e-13 8.850433e-10 20.16815
YHL021C 2.034496 10.17510 269.4034 9.702963e-13 1.152550e-09 20.07857
YDR516C 2.085424 10.05426 260.8061 1.163655e-12 1.184767e-09 19.87217
sum(tt$adj.P.Val < 0.05)[1] 5140
5140 genes with an adjusted p-value less than 0.05.
sum(tt$adj.P.Val < 0.01)[1] 4566
4566 genes with an adjusted p-value less than 0.01.
save(tt, file="tt.RData")Our output includes an adjusted p-value. Use the help function (type a ? in front of any function name) to learn what method was used to adjust for multiple testing. Using information from the help menu, change the correction method to something different. How does this impact your results?
Volcano plots
Volcano plots are a useful and commonly used method to visualise differentially expressed genes within your data set.
In the following code we will first produce the volcano plot itself, which shows the distribution of all genes according to p-value and log2 fold change.
Statistical significance is often not the only measure we will use to call genes differentially expressed. It is common to apply an additional threshold of gene expression needing to double or halve to be considered differentially expressed. On the log2FC scale, this is log2(2) (i.e., a value > 1 or < -1).
Genes will be coloured based on whether they are up (red) or down (blue) regulated.
This plot is an important “sanity check”. We can logically check that our significant genes (in red) are those with both a p-value < 0.05 and a log2 fold change greater than 1 (which is equivalent to a doubling or halving of gene expression). If we observe red marked dots in the center or bottom of the plot, we could recognise an error has occurred.
library(dplyr)
library(limma)
library(edgeR)
library(dplyr)
library(ggplot2)
library(ggiraph)
library(plotly)
# Define significance thresholds
pval_threshold <- 0.05
fc_threshold <- 1 # log2 fold change threshold
# Create a data frame for plotting
plot_data <- data.frame(
logFC = tt$logFC,
negLogPval = -log10(tt$P.Value),
adj.P.Val = tt$adj.P.Val,
ID = rownames(tt) # Assuming row names are gene IDs
)
# Add a column to categorize genes
plot_data$category <- ifelse(plot_data$adj.P.Val <= pval_threshold,
ifelse(plot_data$logFC >= fc_threshold, "Upregulated",
ifelse(plot_data$logFC <= -fc_threshold, "Downregulated", "Passes P-value cut off")),
"Not Significant")
# Create the volcano plot using ggplot2
ggplot(plot_data, aes(x = logFC, y = negLogPval, color = category)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "grey20", "Passes P-value cut off" = "grey")) +
geom_vline(xintercept = c(-fc_threshold, fc_threshold), linetype = "dashed") +
geom_hline(yintercept = -log10(pval_threshold), linetype = "dashed") +
labs(
title = "Volcano Plot of Differential Gene Expression",
subtitle = paste("Thresholds: |log2FC| >", fc_threshold, "and adjusted p-value <", pval_threshold),
x = "log2 Fold Change",
y = "-log10(p-value)",
color = "Differential Expression"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(hjust = 0.5, size = 16),
plot.subtitle = element_text(hjust = 0.5, size = 12)
)
# Save the plot (optional)
ggsave("volcano_plot.png", width = 10, height = 8, dpi = 300)We can also produce an interactive volcano plot with the ggigraph package.
# Create the ggplot object
p <- ggplot(plot_data, aes(x = logFC, y = negLogPval, color = category, text = ID)) +
geom_point(alpha = 0.6, size = 2) +
scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "grey20", "Passes P-value cut off" = "grey")) +
geom_vline(xintercept = c(-fc_threshold, fc_threshold), linetype = "dashed") +
geom_hline(yintercept = -log10(pval_threshold), linetype = "dashed") +
labs(
title = "Interactive Volcano Plot of Differential Gene Expression",
subtitle = paste("Thresholds: |log2FC| >", fc_threshold, "and adjusted p-value <", pval_threshold),
x = "log2 Fold Change",
y = "-log10(p-value)",
color = "Differential Expression"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(hjust = 0.5, size = 16),
plot.subtitle = element_text(hjust = 0.5, size = 12)
)
# Convert ggplot to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = c("text", "x", "y", "color"))
# Customize hover text
interactive_plot <- interactive_plot %>%
layout(hoverlabel = list(bgcolor = "white"),
hovermode = "closest")
# Display the interactive plot
interactive_plotDifferentially expressed genes stored
Let’s make an object to store all the information for our list of significantly differentially expressed genes. We will use a threshold of an adjusted p-value < 0.05 and a logFC > 1.
sigGenesLimma <- which(tt$adj.P.Val <= 0.05 & (abs(tt$logFC) > log2(2)))
length(sigGenesLimma)[1] 1891
# 1891Keep only gene IDs in tt that match vector index positions in sigGenesLimma, and reassign sigGenesLimma with these gene IDs:
sigGenesLimma <- tt[sigGenesLimma, ]Differentially expressed gene lists
To recap, we’ve now produced two lists of differentially expressed genes - one produced with the Limma method (called sigGenesLimma), one with DESeq2 (called resPadjLogFC). In both cases we used the same threshold for statistical significance (adjusted p-value < 0.05, using the FDR correction) and biological significance (logFC more than 1 or less than negative 1, which is equivalent to a doubling or halving gene expression). In our run-through, Limma identified 1,891 differentially expressed genes, while DESeq2 identified 1,830 genes.
These numbers are very similar, but what’s the actual overlap between these two groups? A venn diagram is an easy way to visualise the relationship between the two groups. First, load the gplots library which we will use for making venn diagrams. Create an object, which we will call setlist, that lists the two gene sets (in our case, this will be the rownames of the objects we created with Limma and DESeq2). Finally, use the venn() function to create a venn diagram.
# BiocManager::install("gplots")
library(gplots)
setlist <- list(Limma = rownames(sigGenesLimma),
DESeq2 = rownames(resPadjLogFC))
venn(setlist)
This is very strong concordance between our two methods and gives us high confidence in our approach. At this point you could take the overlap (n = 1799 genes), take all genes (n = 1799 + 31 + 92), or take genes from just one method (more relevant if one method was conservative compared to the other). Exactly how you resolve this step will depend on the exact shape of your venn diagram and your context.
For simplicity, we will take the list of genes identified by limma, which we saved earlier as a tt RData object, and then see in the next episode how we can derive meaningful biological information from this long list of genes.
