author: Lumi author_gh_user: BHAAA-ZLM read_time: 30min publish_date: 2022.8.3


DESeq2 is a very special R package made for performing differential expression analysis on your sequence, especially when you are tring to define differences between multiple biological conditions (e.g. treated, untreated).

Most of my overall knowledge comes from surfing the internet, and I found:

Getting Ready for DESeq2

In our RNA-seq workflow, we used cuffdiff to get a general differential expression result, from it, you can easily see which genes are expressed more, which are suppressed, in other words, which genes have a significant count difference from the other genes.

But when we come to the real biological experiments and analysis, we usually want to know more about our gene and about our test subjects. We might even want to draw fancy pictures to convey our ideas. That’s when our rather old tool cuffdiff doesn’t shine.

DESeq2 basically work nearly the same as cuffdiff (although the algorithms might be a bit different), but the final result we are looking for are bascially the same. The log2(foldchanges) and p_values and all those things. So how does it work?

Getting the Matrix

First we need the count matrix from our data. We need an old friend from our RNA-seq workflow - featureCounts.

featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

By typing this command into the terminal, feature counts will automatically map all the counts to the genes in your annotation file. Thus providing you with a matrix of counts for specific genes. We only need the counts for each matrix, so we can cut the matrix in the command line or in R. We might also want to remove some data with very low counts, since they are not very significant, and we can do that also in the command line or in R.

Input the Matrix into R

The next step is to get your matrix into R and prepare it for DESeq2. The code I used was this, and I found it pretty elegant.

# input data 
input_data <- read.table("deseq2_input.txt",header = T, row.names = 1)


# preperation 
input_data <- as.matrix(input_data)
condition <- factor(c(rep("ctrl",4), rep("DDX6KO",4)))
coldata <- data.frame(row.names = colnames(input_data),condition)

First we input the data into the variable input_data by the R function read.table(). You can always get help in R by just typing ?functionName() in the R console. So in the read.table() function, we set the first row as our header with header = T. row.names = 1 indicates that the first column of each row gives the name of the first row.

Then, we prepare the raw data so that it fits the requirements of DESeq2. Apart from the original data matrix from your sequencing results, DESeq2 also requires another matrix containing informations on the columns of our data (in other words, the names and types of your contol and experiment groups). So the as.matrix turns the data into a matrix fit for calculation. condition stores a factor of two (it can be more than two). coldata <- data.frame() gives DESeq2 the column data it needs.

Construct the DESeq Dataset

Then we use our data to construct a DESeq Dataset from our matrix (in this case, input_data).

library(DESeq2)
# construct a deseq input data matrix
dds <- DESeqDataSetFromMatrix(countData = input_data, colData = coldata, design = ~condition)

dim(dds)
dim(dds[rowSums(counts(dds)) > 5,])
dds <- dds[rowSums(counts(dds)) > 5,]

We first need to import the DESeq2 library (package) (of course).

Then, we use the function DESeqDataSetFromMatrix() from DESeq2 package to input our raw data. colData is the matrix we made earlier containing the data for row names and types and all those stuff. design = ~condtion means that the data is designed by the factor condition in the colData variable. In designed, I mean that the DESeq process will perform the differential expression analysis based on the different conditions of the experiments. The conditions are also set manually. You can add another column in the colData matrix called time, and set the design to design = ~condition + time, DESeq will restrain the effects of condition and focus on the time factor. For more information, see this blog on BioStars.

We can check the dimensions of dds using dim(dds). We could remove the results when the gene’s counts are less than 5 in total, by reassigning the value of dds to dds[rowSums(counts(dds)) > 5,]. Thus we got our DESeq Dataset, named dds, and we can happily perform differential expression analysis on it!~

Finally, the DESeq analysis

The actual analysis part is really easy to understand, and doesn’t require a lot of code of brain.

# perform the differential expression analysis
dds <- DESeq(dds)
res <- results(dds)

Just type these words and everything will be done in the background thanks to our lovely DESeq2 developers. The deeper truth behind all the analysis is much more complex, and I don’t think I fully understand it. I will leave a link here to the lovely Indian girl’s video, and I might make another whole chapter just talking about how it works. But for us who are just using it and only cares about the final output data, these two lines of code is enough.

Plot Your Data

After we obtained our data from DESeq, we need to know how to show everyone else in a clear and fasionable way, what is happening in our experiments. That’s where our beautiful plots come. A wide variaty of plots can be used to express our data, and it’s best if we know how to draw each one of them, and more importantly, the essence behind every plot.

The knowledge of the plotting part mainly comes from the Griffith lab website, which is here. There might be some websites helpful for individual plotting, I will list each of the websites in their own part.

The MA plot

MAplots are plots that can display the difference of measurements between two sets of data.

DESeq2 have it’s own MAplotting function. Which is simply:

plotMA(res, ylim = c(-3,3))

But this MA plot only have two colours, and is rather lame. With ggplot’s function geom_point, we could make a similar MA plot.

library(ggplot2)
deseq2ResDF <- as.data.frame(res) ## convert the result file into a data frame
deseq2ResDF$significant <- ifelse(deseq2ResDF$padj < .1, "Significant", NA) ## a new column for significance

ggplot(deseq2ResDF, aes(baseMean, log2FoldChange, colour = padj)) + 
  geom_point(size=0.5) + scale_y_continuous(limits=c(-3, 3)) + 
  scale_x_log10() + geom_hline(yintercept = 0, colour="darkorchid4", size=1, linetype="longdash") + 
  labs(x="mean of normalized counts", y="log fold change") + 
  theme_bw() + geom_density_2d(colour="black", size=0.25) +
  scale_colour_gradient( high = "#f8ff2b", low = "#ff2b2b", space = "Lab", na.value = "grey50", guide = "colourbar", aesthetics = "colour")

By adding a series of constraints, we can make the geom_point look just like our MAplot. But first, we need to input our DESeq Data as a dataframe, and than add a new column called significant to the end of the dataframe. Remember to add scale_x_log10 or else it will display the original counts, which are massive.