Identify Differentially Expressed Genes

Statistical approach

In order to identify which genes are differentially expressed between two groups (in this case, we have 3 samples that are in the wild type group and 3 samples in the mutant group), when need to test if the two groups are significantly different from one another, which can be a challenging question to answer. We can start by specifically asking our question as a statistical question: Are the differences we observe between the two groups greater than the differences we would expect to see by chance?

In the interest of time, in this workshop we will not get into T-tests, P-values and types of errors. You can read more about that here:

Statistics 📊

T-tests and p-values

The statistical approach to this question is to begin with the null hypothesis (that there is no difference between the two groups) and test whether or not you can reject the null.

To test whether or not we can reject the null hypothesis we can calculate a test statistic:

Here we are taking the difference between the means of the two groups and then dividing that difference by some measure of variability - in this case, dividing by the standard error.

If the difference in means is LARGE relative to the variance, the test statistic will be large (indicating significance). If the difference between the means is small relative to the variance, the test statistic will be small, indicating the difference is probably not significant (i.e., the difference we observe is in-line with the variance we observe).

Exercise: With the above formula in mind, what impact would heteroskedasticity have when you are attempting to identify differentially expressed genes?

Once we have a test statistic we will calculate a p-value. The p-value is an indication of how likely we were to observe the given difference in means (or a more extreme difference) if there is truly no difference between the means. That is, how likely are we to see this difference due to random chance?

Interpreting p-values

How do we interpret the p-value? We will specify a threshold (usually 0.05), and say that if a p-value is less than this threshold we will consider it a significant result. If the p-value is lower than the threshold we set, we will reject the null hypothesis (that the groups are identical) and accept the alternative (that there is a difference between the groups). The threshold we set is our level of “risk” that this event happened by chance alone.

When we declare that a result with a p-value of less than 0.05 is significant, we are saying that we believe the difference to be true since, if there was truly no difference, such a result would happen less than 5% of the time.

A useful mental analogy is to consider flipping a coin. We know that for a fair coin, the odds of getting heads is 50:50. Still, if we get four heads in a row it doesn’t worry us - it’s entirely plausible given the variation we expect. But if we get 50 heads in a row, while we know it’s statistically possible, we know it’s very unlikely to see such an extreme result. If we got to 100 heads in a row, we might instead start to question the fairness of the coin - we would reject the null hypothesis that the odds of heads and tails is identical.

Types of errors

It’s important to think about the two possible ways in which we could be wrong when testing a hypothesis like this: we could generate a false positive or a false negative.

A false positive or Type I error is when we reject the null hypothesis when there is truly no significant difference.
A false negative or Type II error is when we fail to reject the null hypothesis when there truly is a significant difference.

When we select a p-value threshold of 0.05, we are accepting the fact that 5% of the time that the null hypothesis is true, we will reject it. This becomes hugely problematic when you are testing thousands of genes! In order to avoid a large number of false positives we must correct for multiple testing. The more tests we are doing, the more stringent we need to be. We will not cover multiple testing corrections in depth but will briefly mention two types:

Family-wise Error Rate (FWER), also called the Bonferroni and Holm corrections, is a highly stringent procedure. This approach will give the minimal possible number of false positives, but will miss some true positives. Good if false positives are particularly costly (e.g., if you are providing someone with a severe medical diagnosis).
False Discovery Rate control (FDR), also called the Benjamini and Hochberg correction, is less conservative than FWER. This approach will identify more significant events, but expect a greater number of false positives. Use this approach if you are more concerned about missing something valuable and can afford a few false positives.

Exercise: Note down some key features of your experiment. Are you more inclined to use FWER or FDR? Which is more appropriate for your data and your experimental situation?

Modifying the t-test for RNA-seq

Many biological experiments struggle with getting enough samples for statistical significance. In RNA-seq experiments it is common to see groups of three samples or replicates. This is especially problematic when using the t-test (or similar procedures that involve variance). When testing for differences in gene expression it is possible to encounter genes with a small difference in the mean between the two groups and, due to the small sample size, a very small level of variation (a small standard error). A small difference in the means divided by a very small standard error translates to a large test statistic, which is then translated to a small p-value and what looks like a highly significant result.

Since this issue is caused by an artificially low standard error due to low sample numbers, a number of methods have proposed artificially increasing the standard error in some way. One way to implement this is through Shrinkage Estimation, which involves using Empirical Bayes methods to adjust individual test statistics based on the overall distribution of variances. During shrinkage estimation, small standard errors are made larger while large standard errors are made smaller.

TL;DR:

The “standard” P-value of 0.05 is not sufficient – 5% is too many genes that will be false positives (i.e., genes that look like they are DE, but they are not).
We need to use a correction - usually FDR, also called Benjamini and Hochberg correction, but if you want to be more strict use FWER.
We must correct for heteroskedasticity, i.e., cases where variance is dependent on the mean.

Identifying differentially expressed genes using DESeq2

DESeq2 is a highly-regarded R package for analysing RNA-seq data. DESeq2 uses a negative binomial method to model the count data, and combines this with a generalised linear model (GLM) to identify differentially expressed genes. For more about the DESeq2 package, you can read the original article.

The DESeq2 object

The DESeq2 package requires a specific data storage object called a “DESeq Data Set” or DDS object. The DDS object contains not just the count data, but also the design matrix we created earlier (which is used to specify groups for comparison). This object can be created using a function from the DESeq2 package and requires that we specify a source of count data (as a matrix), column data (column names, mutant and wildtype), and a design matrix (the same one we used for Limma above). Once we have created the dds object, we can view the data stored within using the counts function.

Note that if your data is stored in another format the DESeq2 function we will use to import data can be modified to work on other data formats e.g., a summarisedExperimentObject.

Some set up to run first:

library(DESeq2)
library(dplyr)
load("counts.RData")
conds <- c("WT","WT","WT","MT","MT","MT") # set the conditions, note this is in order

Create the DESeq data set object:

dds <- DESeqDataSetFromMatrix(countData = as.matrix(counts), 
                              colData = data.frame(conds=factor(conds)), 
                              design = formula(~conds))

counts(dds) %>% head()

          WT1 WT2 WT3 MT1 MT2 MT3
YDL248W    52  46  36  65  70  78
YDL247W-A   0   0   0   0   1   0
YDL247W     2   4   2   6   8   5
YDL246C     0   0   1   1   2   0
YDL245C     0   3   0   5   7   4
YDL244W     6   6   5  20  30  19

Now fit the DESeq2 generalised linear model to the data.

dds <- DESeq(dds)

estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

Note the outputs from running this function. What are these outputs?

Estimating size factors: assesses library size and calculates factors to adjust for differences between samples. This also adjusts for compositional differences (e.g., if gene X in sample 1 takes up a very large proportion of all available reads, other genes will have correspondingly fewer genes. If this effect is not uniform across samples, it can be corrected for during this stage).
Dispersion: adjusting for heteroskedasticity. DESeq2 makes use of variability estimates from not just one gene, but from all genes to make estimates about overall levels of variance. By bringing in (or “borrowing”) information from other genes, DESeq2 compensates for a small number of samples (which can lead to artificially small variance estimates otherwise).
Fitting model and testing: fitting the GLM and identifying differentially expressed genes.

We will now create a new object, res, to store the results in. We can access those results using either the head function or the summary function, which will give us slightly different information - both are valid and useful ways of familiarising yourself with the data.

res <- DESeq2::results(dds)

res %>% head()

log2 fold change (MLE): conds WT vs MT 
Wald test p-value: conds WT vs MT 
DataFrame with 6 rows and 6 columns
           baseMean log2FoldChange     lfcSE      stat     pvalue       padj
          <numeric>      <numeric> <numeric> <numeric>  <numeric>  <numeric>
YDL248W   56.223032      -0.233947  0.227842 -1.026796 0.30451648 0.36636385
YDL247W-A  0.139771      -0.525577  4.080473 -0.128803 0.89751357 0.91739265
YDL247W    4.242821      -0.810055  0.833254 -0.972158 0.33097175 0.39485364
YDL246C    0.618241      -1.032636  2.156090 -0.478939 0.63198173 0.69151478
YDL245C    2.848658      -1.978175  1.175885 -1.682287 0.09251323 0.12300279
YDL244W   13.088335      -1.582335  0.512879 -3.085204 0.00203413 0.00328597

res %>% summary()


out of 6830 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 2520, 37%
LFC < 0 (down)     : 2521, 37%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

The res object contains information for all genes tested. It is practical to create a new object that contains only the genes we consider differentially expressed based on the thresholds (p-value, logFC) and methods (e.g., multiple testing adjustment) that suit our situation.

We will remove any rows that have NAs in the results object, then pull out only those with an adjusted p-value less than 0.05.

res <- na.omit(res)

# Keep all rows in the res object if the adjusted p-value < 0.05
resPadj <- res[res$padj <= 0.05 , ]

resPadjLogFC <- res[res$padj <= 0.05 & abs(res$log2FoldChange) > log2(2),]

# Get dimensions
resPadj %>% dim()

[1] 4811    6

resPadjLogFC %>% dim()

[1] 1830    6

# 1830 x 6

Exercise: Identify method for multiple testing corrections

Bioinformatics can often be something of a “black box” - we execute a function by feeding in some data, and get new data as output. It’s very easy to fall into the trap of believing your data without asking all the necessary questions. For example, we’ve just used an adjusted p-value for our threshold for significance. Did you question which method was used to adjust for multiple testing? See if you can identify, using the help menu as we’ve done above, which method was used here. Try and specify a different method for multiple testing corrections and note how this alters your output.

Identifying differentially expressed genes using LIMMA

Limma involves data transformation and log scales to account for the data being in a non-normal distribution.

Limma will create one of the special objects mentioned earlier in this workshop to store data. For Limma, this will be called a Digital Gene Expression object (DGE object). We will use the edgeR package to create this DGE object.

Load the libraries required for running limma, create a dge object where we specify counts (here you specify the matrix that contains your counts, in our case our object is already called counts). Then calculate normalisation factors, which will be used to adjust for library size differences, and calculate the log CPM. logCPM is, for each gene, how many counts there were per million reads. Because we are taking the log of this value, we will use the prior.counts=3 argument to add 3 to all read counts. This prevents any errors due to attempting to take the log of 0.

# BiocManager::install("limma")
# BiocManager::install("edgeR")
# BiocManager::install("DESeq2")

library(limma)
library(edgeR)
library(dplyr)

load("counts.RData")

dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=1)

head(logCPM, 3)

                 WT1        WT2        WT3        MT1        MT2        MT3
YDL248W    3.6735494  3.5040374  3.1893385  3.5957070  3.6691045  3.8750997
YDL247W-A -2.2615402 -2.2615402 -2.2615402 -2.2615402 -1.3684989 -2.2615402
YDL247W   -0.5326267  0.2350803 -0.5068921  0.3843496  0.7123824  0.1836614

The design matrix

We use the design matrix to specify our different groups. Initially we will just look at a simple design, where we will specify wt and mutant. However, you can also include other information e.g., if your two sample groups were comprised of male and females and you had reason to suspect there might be sex-dependent factors involved.

conds <- c("WT","WT","WT","MT","MT","MT")

design <- model.matrix(~conds)

design

  (Intercept) condsWT
1           1       1
2           1       1
3           1       1
4           1       0
5           1       0
6           1       0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$conds
[1] "contr.treatment"

Addressing heteroskedasticity with Limma

Limma has a function called voom which we can use to address heteroskedasticity (a case where mean and variance are not independent). Voom will estimate the strength of the relationship between mean and variance and calculate “precision weights” for each gene. These are then used to normalise the data during the identification of differentially expressed genes.

Create an object, v, by calling the voom function on the Digital Gene Expression (dge) object. The v object will be used in later analyses. We will add information about our experimental setup by specifying the design object, and we will also call plot = TRUE to generate a plot showing the heteroskedasticity.

v <- voom(dge, design, plot = TRUE)

This plot highlights an earlier point about heteroskedasticity. For genes with low expression, the variance (here the square root of the standard deviation) is highly variable. Because the variance is dependent on the mean, we describe this data as heteroskedastic. Our v (voom) object contains information relating to the curve line (red) and makes an adjustment factor, so that the data will have a uniform relationship between the mean and variance.

Importantly, other than changing the mean-variance relationship, voom does not cause significant changes to the underlying data.

The impacts of voom

boxplot(v$E ~ col(v$E), ylab = "Counts", xlab = "Samples", main = "Log counts after voom")

We can see that log counts in the voom object (that is, counts after voom has been applied to create a stable mean-variance relationship) have not drastically changed from when we plotted them earlier.

We can also visualise the data in the form of density plots:

lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour

[1] "blue" "blue" "blue" "red"  "red"  "red"

plot(density(v$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(v$E[,i]), col=lineColour[i])

We can see that voom has not removed the differences between our sample groups.

What if we wanted to remove these differences? Remember, we (probably) cannot be certain as to whether these differences are biological or technical. In cases like this it can be good to run your analysis multiple ways - first without removing the differences, and then after removing the differences. If a decision like this causes your results to change drastically then you need to be aware of the impact your choices are having so that you can make an informed decision about how to investigate further.

Detecting differentially expressed genes

Now that we have reached this point in the analysis, identifying differentially expressed genes is relatively simple in terms of the code we run. We will use the lmFit() command to fit a linear model for every gene using our v (voom) object and the design matrix, then use the eBayes() function on the linear model to perform Empirical Bayes shrinkage estimation and return moderated test statistics.

fit <- lmFit(v, design)

fit <- (eBayes(fit))

The fit object is complex to look at. We will use the topTable function to retrieve the useful information (gene name, logFC, adjusted p values), and then filter for significant genes only. We will also save the topTable object (tt) for use in the next episode.

tt <- topTable(fit, coef=ncol(design), n=nrow(counts))

head(tt)

           logFC  AveExpr        t      P.Value    adj.P.Val        B
YAL038W 2.313985 10.80214 319.5950 3.725452e-13 8.850433e-10 21.08951
YOR161C 2.568389 10.80811 321.9628 3.574510e-13 8.850433e-10 21.08187
YML128C 1.640664 11.40819 286.6167 6.857932e-13 9.775297e-10 20.84520
YMR105C 2.772539  9.65092 331.8249 3.018547e-13 8.850433e-10 20.16815
YHL021C 2.034496 10.17510 269.4034 9.702963e-13 1.152550e-09 20.07857
YDR516C 2.085424 10.05426 260.8061 1.163655e-12 1.184767e-09 19.87217

sum(tt$adj.P.Val < 0.05)

[1] 5140

5140 genes with an adjusted p-value less than 0.05.

sum(tt$adj.P.Val < 0.01)

[1] 4566

4566 genes with an adjusted p-value less than 0.01.

save(tt, file="tt.RData")

Exercise: Find the method for adjusting for multiple testing

Our output includes an adjusted p-value. Use the help function (type a ? in front of any function name) to learn what method was used to adjust for multiple testing. Using information from the help menu, change the correction method to something different. How does this impact your results?

Volcano plots

Volcano plots are a useful and commonly used method to visualise differentially expressed genes within your data set.

In the following code we will first produce the volcano plot itself, which shows the distribution of all genes according to p-value and log2 fold change.

Statistical significance is often not the only measure we will use to call genes differentially expressed. It is common to apply an additional threshold of gene expression needing to double or halve to be considered differentially expressed. On the log2FC scale, this is log2(2) (i.e., a value > 1 or < -1).

Genes will be coloured based on whether they are up (red) or down (blue) regulated.

This plot is an important “sanity check”. We can logically check that our significant genes (in red) are those with both a p-value < 0.05 and a log2 fold change greater than 1 (which is equivalent to a doubling or halving of gene expression). If we observe red marked dots in the center or bottom of the plot, we could recognise an error has occurred.

library(dplyr)
library(limma)
library(edgeR)
library(dplyr)
library(ggplot2)
library(ggiraph)
library(plotly)

# Define significance thresholds
pval_threshold <- 0.05
fc_threshold <- 1  # log2 fold change threshold

# Create a data frame for plotting
plot_data <- data.frame(
  logFC = tt$logFC,
  negLogPval = -log10(tt$P.Value),
  adj.P.Val = tt$adj.P.Val,
  ID = rownames(tt)  # Assuming row names are gene IDs
)

# Add a column to categorize genes
plot_data$category <- ifelse(plot_data$adj.P.Val <= pval_threshold,
                             ifelse(plot_data$logFC >= fc_threshold, "Upregulated",
                                    ifelse(plot_data$logFC <= -fc_threshold, "Downregulated", "Passes P-value cut off")),
                             "Not Significant")


# Create the volcano plot using ggplot2
ggplot(plot_data, aes(x = logFC, y = negLogPval, color = category)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "grey20", "Passes P-value cut off" = "grey")) +
  geom_vline(xintercept = c(-fc_threshold, fc_threshold), linetype = "dashed") +
  geom_hline(yintercept = -log10(pval_threshold), linetype = "dashed") +
  labs(
    title = "Volcano Plot of Differential Gene Expression",
    subtitle = paste("Thresholds: |log2FC| >", fc_threshold, "and adjusted p-value <", pval_threshold),
    x = "log2 Fold Change",
    y = "-log10(p-value)",
    color = "Differential Expression"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12)
  )

# Save the plot (optional)
ggsave("volcano_plot.png", width = 10, height = 8, dpi = 300)

We can also produce an interactive volcano plot with the ggigraph package.

# Create the ggplot object
p <- ggplot(plot_data, aes(x = logFC, y = negLogPval, color = category, text = ID)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_manual(values = c("Upregulated" = "red", "Downregulated" = "blue", "Not Significant" = "grey20", "Passes P-value cut off" = "grey")) +
  geom_vline(xintercept = c(-fc_threshold, fc_threshold), linetype = "dashed") +
  geom_hline(yintercept = -log10(pval_threshold), linetype = "dashed") +
  labs(
    title = "Interactive Volcano Plot of Differential Gene Expression",
    subtitle = paste("Thresholds: |log2FC| >", fc_threshold, "and adjusted p-value <", pval_threshold),
    x = "log2 Fold Change",
    y = "-log10(p-value)",
    color = "Differential Expression"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(hjust = 0.5, size = 16),
    plot.subtitle = element_text(hjust = 0.5, size = 12)
  )

# Convert ggplot to an interactive plotly object
interactive_plot <- ggplotly(p, tooltip = c("text", "x", "y", "color"))

# Customize hover text
interactive_plot <- interactive_plot %>% 
  layout(hoverlabel = list(bgcolor = "white"),
         hovermode = "closest")

# Display the interactive plot
interactive_plot

Differentially expressed genes stored

Let’s make an object to store all the information for our list of significantly differentially expressed genes. We will use a threshold of an adjusted p-value < 0.05 and a logFC > 1.

sigGenesLimma <- which(tt$adj.P.Val <= 0.05 & (abs(tt$logFC) > log2(2)))
length(sigGenesLimma)

[1] 1891

# 1891

Keep only gene IDs in tt that match vector index positions in sigGenesLimma, and reassign sigGenesLimma with these gene IDs:

sigGenesLimma <- tt[sigGenesLimma, ]

Differentially expressed gene lists

To recap, we’ve now produced two lists of differentially expressed genes - one produced with the Limma method (called sigGenesLimma), one with DESeq2 (called resPadjLogFC). In both cases we used the same threshold for statistical significance (adjusted p-value < 0.05, using the FDR correction) and biological significance (logFC more than 1 or less than negative 1, which is equivalent to a doubling or halving gene expression). In our run-through, Limma identified 1,891 differentially expressed genes, while DESeq2 identified 1,830 genes.

These numbers are very similar, but what’s the actual overlap between these two groups? A venn diagram is an easy way to visualise the relationship between the two groups. First, load the gplots library which we will use for making venn diagrams. Create an object, which we will call setlist, that lists the two gene sets (in our case, this will be the rownames of the objects we created with Limma and DESeq2). Finally, use the venn() function to create a venn diagram.

# BiocManager::install("gplots")
library(gplots)

setlist <- list(Limma = rownames(sigGenesLimma), 
                DESeq2 = rownames(resPadjLogFC))

venn(setlist)

This is very strong concordance between our two methods and gives us high confidence in our approach. At this point you could take the overlap (n = 1799 genes), take all genes (n = 1799 + 31 + 92), or take genes from just one method (more relevant if one method was conservative compared to the other). Exactly how you resolve this step will depend on the exact shape of your venn diagram and your context.

For simplicity, we will take the list of genes identified by limma, which we saved earlier as a tt RData object, and then see in the next episode how we can derive meaningful biological information from this long list of genes.