Skip to content

The Tools of RNA-seq

In this Article, you can learn about the tools in RNA-Seq. The articles come from these websites:

General (with huge thanks to Wen Yi):

FASTQC


FastQC

FastQC is for the quality control of the raw data from sequencing. It's best to check that the data are good before you do anything further.

You can use the user interface friendly FastQC java swing application. But when you are processing multiple files, you can use the terminal command fastqc in such a way that you cna process multiple files at the same time.

 fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN

Remember to switch your shell to bash when you execute this command. I don't actually know the alternative command for zsh, but there must be some kind of command.

MultiQC

However, sometimes we have a lot of test data, and we don't actually have the time to check the HTML output of the FastQC for each and every file. Thus, we can use MultiQC.

multiqc /raw_data_dir

Will auto generate a 4 in 1 html report.

Trimmomatic

Installing trimmomatic is a very tricky thing to do on a M1 chip mac. I followed this instruction to get it finally done.

http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

fastp

Basically fastp is just a all in one quality check machine. It can perform quality check and trim the data at the same time.

fastp -i Spm_R1.fastq.gz -I Spm_R2.fastq.gz -o Spm_R1.fastp.fastq.gz -O Spm_R2.fastp.fastq.gz -w 8 -q 20 –u 20 -j Spm.fastp.json -h Spm.fastp.html

Here is the code for fastp.

  • -i, --in1 is the input file name of read1
  • -I, --in2 is the input file name of read2
  • -o, --out1 is the output file name of read1
  • -O, --out2 is the output file name of read2

There are some other commands regarding whether to discard one of the pair end read or not, which can bee found by typing fastp -help.

  • -w, --thread means the thread number, default thread # is 2
  • -q, --qualified_quality_phred the quality phred value that a base is qualified. Default is 15, meaning that >=Q15 is qualified.
  • -u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40%.
  • -g force polyG tail trimmingmby default trimming is automatically enabled for Illumina data.

And their are the commands that controls the outputs.

  • -j, --json the json format report file name (string [=fastp.json])
  • -h, --html the html format report file name (string [=fastp.html])

To download files from the server, use the scp command which works like this:

scp username@sth.sth.sth.sth:/file/path/on/server.html /Local/path/to/store/file

Remember not to add any extra space between.

HISAT2

The HISAT2 official website can be found here. It is a fast and sensitive alignment program for mapping next-generation sequencing reads.

hisat2 -p 8 -x tair10_index --summary-file Spm.summary --dta-cufflinks -1 Spm_R1.fastp.cutadapt.fastq.gz -2 Spm_R2.fastp.cutadapt.fastq.gz -S output.sam

The basic usage of HISAT2 is:

hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r> | --sra-acc <SRA accession number>} [-S <sam>]

  <ht2-idx>  Index filename prefix (minus trailing .X.ht2).
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <SRA accession number>        Comma-separated list of SRA accession numbers, e.g. --sra-acc SRR353653,SRR353654.
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

The interesting commands for HISAT2 are the following:

  • -p, --threads (int) means the number of threads you launch. Default is 1.
  • --summary-file print the alignment summary to this file.
  • --dta-cufflinks reprots alignments tailored specifically for cufflinks.

samtools

Samtools is a tool to handle sam, bam and cram files.

samtools view -ShuF 4 -q 30 -f 2 -@ 2 file.name.sam | samtools sort -@ 2 -o output/file.bam 

samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

samtools viewis for viewing the bam/sam/cram file in bam/sam/cram type.

  • -S Ignored (the input format is auto-detected).
  • -h,--with-header Include header in the SAM output.
  • -u,--uncompressed Uncompressed BAM output (and default to --bam).
  • -F FLAG Have none of the FLAGs.
  • -q int Have mapping quality >= int.
  • -f FLAG Have all of the flags present.
  • -@ int The number of threads.
  • -b output bam file

samtools sort is sorting the data by your need.

- -u Set the compression level to 0 (uncompressed).

When you want to see your gene inside for insatance igv, you will also need to index your bam files.

featureCounts

FeatureCounts is a tool to quantify the counts on genes based on the bam file. A link to a desctiption of featureCounts is here.

featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

After using featureCounts, you can use your command line tools:

cut -f1,7,...

To cut the columns you need.

bedtools

genomeCoverageBed -ibam Spm.sorted.bam -bg -split > Spm.bdg
  • -ibam input bam files for coverage. Use stdin or - for a Unix pipe.

cuffdiff

Cuffdiff is a program included in cufflinks. You can use this program to find significant changes in transcript expression. Use cuffdiff -h to look for the manual.

When cuffdiff give you all 0 counts, there might be something wierd with your sam file. The name of the chromosome might be named wrong. Instead of naming them 'chrN' they are just named with 'N'.

At Last ... DEseq2

I think DEseq2 might need a seperate article.

Others

To communicate with the server using scp command, see here for more info.