author: Lumi author_gh_user: BHAAA-ZLM read_time: 30min publish_date: 2022.8.3
DESeq2 is a very special R package made for performing differential expression analysis on your sequence, especially when you are tring to define differences between multiple biological conditions (e.g. treated, untreated).
Most of my overall knowledge comes from surfing the internet, and I found:
A wonderful girl teaching DESeq2. Her channel is so rich in knowledge, everyone, subscribe now!!!
The original manual for DESeq2.
Another DESeq2 tutorial made very thoroughly and clear, I love it!
The DESeq2 tutorial made by Griffith lab of Washington University. Which is surprisingly detailed when it comes to plotting.
In our RNA-seq workflow, we used cuffdiff to get a general differential expression result, from it, you can easily see which genes are expressed more, which are suppressed, in other words, which genes have a significant count difference from the other genes.
But when we come to the real biological experiments and analysis, we usually want to know more about our gene and about our test subjects. We might even want to draw fancy pictures to convey our ideas. That’s when our rather old tool cuffdiff doesn’t shine.
DESeq2 basically work nearly the same as cuffdiff (although the algorithms might be a bit different), but the final result we are looking for are bascially the same. The log2(foldchanges) and p_values and all those things. So how does it work?
First we need the count matrix from our data. We need an old friend from our RNA-seq workflow - featureCounts.
featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
By typing this command into the terminal, feature counts will automatically map all the counts to the genes in your annotation file. Thus providing you with a matrix of counts for specific genes. We only need the counts for each matrix, so we can cut the matrix in the command line or in R. We might also want to remove some data with very low counts, since they are not very significant, and we can do that also in the command line or in R.
The next step is to get your matrix into R and prepare it for DESeq2. The code I used was this, and I found it pretty elegant.
# input data
input_data <- read.table("deseq2_input.txt",header = T, row.names = 1)
# preperation
input_data <- as.matrix(input_data)
condition <- factor(c(rep("ctrl",4), rep("DDX6KO",4)))
coldata <- data.frame(row.names = colnames(input_data),condition)
First we input the data into the
variable input_data by the R function read.table(). You can always get
help in R by just typing ?functionName()
in the R console.
So in the read.table() function, we set the first row as our header with
header = T
. row.names = 1
indicates that the
first column of each row gives the name of the first row.
Then, we prepare the raw data so
that it fits the requirements of DESeq2. Apart from the original data
matrix from your sequencing results, DESeq2 also requires another matrix
containing informations on the columns of our data (in other words, the
names and types of your contol and experiment groups). So the
as.matrix
turns the data into a matrix fit for calculation.
condition
stores a factor of two (it can be more than two).
coldata <- data.frame()
gives DESeq2 the column data it
needs.
Then we use our data to construct
a DESeq Dataset from our matrix (in this case,
input_data
).
library(DESeq2)
# construct a deseq input data matrix
dds <- DESeqDataSetFromMatrix(countData = input_data, colData = coldata, design = ~condition)
dim(dds)
dim(dds[rowSums(counts(dds)) > 5,])
dds <- dds[rowSums(counts(dds)) > 5,]
We first need to import the DESeq2 library (package) (of course).
Then, we use the function
DESeqDataSetFromMatrix()
from DESeq2 package to input our
raw data. colData
is the matrix we made earlier containing
the data for row names and types and all those stuff.
design = ~condtion
means that the data is designed by the
factor condition
in the colData
variable. In
designed, I mean that the DESeq process will perform the differential
expression analysis based on the different conditions of the
experiments. The conditions are also set manually. You can add another
column in the colData
matrix called time
, and
set the design to design = ~condition + time
, DESeq will
restrain the effects of condition and focus on the time factor. For more
information, see this
blog on BioStars.
We can check the dimensions of
dds using dim(dds)
. We could remove the results when the
gene’s counts are less than 5 in total, by reassigning the value of dds
to dds[rowSums(counts(dds)) > 5,]
. Thus we got our DESeq
Dataset, named dds
, and we can happily perform differential
expression analysis on it!~
The actual analysis part is really easy to understand, and doesn’t require a lot of code of brain.
# perform the differential expression analysis
dds <- DESeq(dds)
res <- results(dds)
Just type these words and everything will be done in the background thanks to our lovely DESeq2 developers. The deeper truth behind all the analysis is much more complex, and I don’t think I fully understand it. I will leave a link here to the lovely Indian girl’s video, and I might make another whole chapter just talking about how it works. But for us who are just using it and only cares about the final output data, these two lines of code is enough.
After we obtained our data from DESeq, we need to know how to show everyone else in a clear and fasionable way, what is happening in our experiments. That’s where our beautiful plots come. A wide variaty of plots can be used to express our data, and it’s best if we know how to draw each one of them, and more importantly, the essence behind every plot.
The knowledge of the plotting part mainly comes from the Griffith lab website, which is here. There might be some websites helpful for individual plotting, I will list each of the websites in their own part.
MAplots are plots that can display the difference of measurements between two sets of data.
DESeq2 have it’s own MAplotting function. Which is simply:
plotMA(res, ylim = c(-3,3))
But this MA plot only have two colours, and is rather lame. With ggplot’s function geom_point, we could make a similar MA plot.
library(ggplot2)
deseq2ResDF <- as.data.frame(res) ## convert the result file into a data frame
deseq2ResDF$significant <- ifelse(deseq2ResDF$padj < .1, "Significant", NA) ## a new column for significance
ggplot(deseq2ResDF, aes(baseMean, log2FoldChange, colour = padj)) +
geom_point(size=0.5) + scale_y_continuous(limits=c(-3, 3)) +
scale_x_log10() + geom_hline(yintercept = 0, colour="darkorchid4", size=1, linetype="longdash") +
labs(x="mean of normalized counts", y="log fold change") +
theme_bw() + geom_density_2d(colour="black", size=0.25) +
scale_colour_gradient( high = "#f8ff2b", low = "#ff2b2b", space = "Lab", na.value = "grey50", guide = "colourbar", aesthetics = "colour")
By adding a series of
constraints, we can make the geom_point look just like our MAplot. But
first, we need to input our DESeq Data as a dataframe, and than add a
new column called significant to the end of the dataframe. Remember to
add scale_x_log10
or else it will display the original
counts, which are massive.