Files involved in RNA-seq

Here, I learnt about all the different files that may come up during an NGS analysis. All the knowledge comes from:

General:

https://learn.gencore.bio.nyu.edu A very useful website teaching about NGS analysis.

FASTA & FASTQ:

SAM/BAM/CRAM:

https://samtools.github.io/hts-specs/SAMv1.pdf

GFF/GTF:

Unix:

http://korflab.ucdavis.edu/Unix_and_Perl/current.html

FASTQ and FASTA

The first section is about FASTQ and FASTA files.

FASTQ

A FASTQ file stores each read as a record of four lines:

The sequence identifier, which starts with '@' and carries information about the sequencing run and the cluster.
The sequence read.
A separator line, which starts with a plus (+) sign and may optionally repeat the sequence identifier.
The quality scores, encoded as ASCII characters, one per base of the read.

sample fastq file
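As a rough illustration of the four-line layout, here is a minimal Python sketch that reads FASTQ records from a text stream. The function name `read_fastq` and the example record are made up for this note, and the sketch assumes every record is exactly four lines (no wrapped sequences).

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # an empty line or end of file stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line, not needed afterwards
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"identifier line must start with '@': {header!r}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        yield header[1:], sequence, quality


# Invented record for illustration only, not from a real sequencing run.
example = io.StringIO("@read1 example\nGATTACA\n+\nIIIIHHG\n")
records = list(read_fastq(example))
print(records)
```

For real data, an established parser (for instance from Biopython) is probably the safer choice; this only shows the structure.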

Quality Scores

Quality scores assign a confidence to each base within a read. Each score is encoded as a single ASCII character, so each character corresponds to a different Q score. In the widely used Phred+33 encoding, Q is the character's ASCII code minus 33.

The quality score Q is related to the probability P that the corresponding base call is incorrect: Q = -10 log10(P).

Quality value Q

By matching ASCII characters to their Q scores under a given encoding scheme, we can read off the quality of each base in the read.

ASCII to Q to P
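The conversion from ASCII character to Q to P can be sketched in a few lines of Python. This assumes the Phred+33 encoding (Q = ASCII code minus 33) and the standard relation P = 10^(-Q/10). The function names are my own, and older Phred+64 files would need `offset=64`.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Q scores (Phred+33 by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Convert a Q score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


scores = decode_quality("I5!")
print(scores)  # [40, 20, 0]
print([error_probability(q) for q in scores])
```

So 'I' means Q40 (about 1 error in 10,000 calls), '5' means Q20 (1 in 100), and '!' means Q0.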

FASTA

FASTA is a basic format for reporting a sequence, whether nucleic acid or amino acid. A record is made up of two parts:

A sequence header, which starts with a '>' and contains the sequence description.

The sequence itself (containing no spaces), which may be wrapped over several lines.

>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
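A minimal Python sketch of reading this layout: each header starts a new record, and the lines that follow are joined into one sequence. The name `parse_fasta` and the tiny example records are invented for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without '>') to its sequence, joining wrapped lines."""
    sequences: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            sequences[header] += line
    return sequences


# Invented records for illustration only.
fasta_text = ">seq1 demo\nACGT\nTTGA\n>seq2\nMKV\n"
parsed = parse_fasta(fasta_text.splitlines())
print(parsed)
```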

SAM/BAM/CRAM Format

SAM

The Sequence Alignment/Map (SAM) format is the most basic and most human-readable format of the three. It consists of a header followed by one row for every read in your dataset, and each row has 11 tab-delimited fields describing that read.

SAM Header

The SAM header varies in size, but header lines start with '@', while alignment lines do not. You can customize the header when you generate the SAM file and add whatever information you choose. The full list of header fields is found below. '*' means that the tag is required; e.g., every @SQ header line must contain SN and LN fields.

SAM header fields

Fields Descriptions

Every row contains 11 mandatory fields, which are:

Fields Descriptions
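To make the 11 mandatory fields concrete, here is a small Python sketch that splits one alignment line into named fields. The field names (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) follow the SAM specification linked at the top; the function name and the example line are invented.

```python
from typing import Any, Dict

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 fields")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields, if any
    return record


# Invented alignment line for illustration only.
sam_line = "read1\t99\tchr1\t100\t60\t4M\t=\t150\t54\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(sam_line)
print(record["QNAME"], record["FLAG"], record["POS"], record["CIGAR"])
```

Header lines (starting with '@') should be skipped before calling this function.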

The Bitwise Flag is a lookup code that explains certain features of the particular read (exactly the same concept as Linux permission codes!). It tells you whether the read aligned, whether it is marked as a PCR duplicate, whether its mate aligned, and so on. Any combination of the available flags, seen below, can be set at once:

Bitwise Flag

The integer value of the Bitwise Flag can be hard to interpret by hand. The Picard website offers a tool that gives a quick breakdown.
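Decoding a flag by hand is just a matter of testing each bit, which a short Python sketch can show. The bit values are the ones defined in the SAM specification; the description strings and the function name `explain_flag` are my own wording.

```python
from typing import List

SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag: int) -> List[str]:
    """Return the description of every bit that is set in the flag."""
    return [text for bit, text in SAM_FLAGS.items() if flag & bit]


print(explain_flag(99))
```

For example, 99 = 64 + 32 + 2 + 1, so the read is paired, mapped in a proper pair, is the first in the pair, and its mate is on the reverse strand.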

MAPQ (Mapping Quality) indicates how confidently the read is aligned to the reference. Different aligners compute it differently, but generally, the greater the number, the better the alignment.

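The CIGAR string is a compact string that describes how the read aligns to the reference, as a series of lengths each followed by an operation code (for example, M for an alignment match, I for an insertion, and D for a deletion).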

CIGAR

CIGAR example
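A CIGAR string can be taken apart with a regular expression. The Python sketch below parses it into (length, operation) pairs and works out how many reference and read bases the alignment covers. The operation letters come from the SAM specification; the function names and the example string are invented.

```python
import re
from typing import List, Tuple

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means the CIGAR is unavailable
    operations = [(int(n), op) for n, op in CIGAR_PATTERN.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return operations


def reference_length(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


example_cigar = "5M1I4M2D3M"
print(parse_cigar(example_cigar))
print(reference_length(example_cigar), query_length(example_cigar))  # 14 13
```

Here the insertion adds one base to the read only, and the deletion adds two bases to the reference only.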

BAM

BAM format has exactly the same content as the SAM file, except that it is stored in compressed binary form. It is therefore not human-readable, but it is smaller and faster for a computer to read.

Tools like Samtools, Picard Tools and IGV are required to make sense of BAM files.

CRAM

This is a relatively new format that is very similar to BAM: it also retains the same information as SAM and is compressed, but it is much smarter in the way it stores the information. It’s very interesting and up and coming, but it is a bit beyond my level and not so relevant to RNA-seq. To learn more about it, you can read this.

VCF Format

Variant Call Format (VCF, not to be confused with Virtual Card Format) is a tab-delimited text format used to describe single nucleotide variants (SNVs) as well as insertions, deletions and other sequence variations. It only records the variations, not other genetic features. Each variant line starts with the following fields:

Chromosome Name
Chromosome Position
ID: Generally used to reference an annotated variant in dbSNP or another curated variant database.
Reference base(s)
Alternate base(s)
Variant quality: Phred-scaled quality for the alternate base. Usually, a score above 20 is acceptable.
Filter: Whether or not the variant has passed all filters, generally a QC measure in variant calling algorithms.
Additional Information: Generally describes the nature of the position/variants with respect to other data.
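The eight fields above correspond to the VCF columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. Below is a minimal Python sketch that splits one variant line into those columns. The function name and the example line (including the DP and DB entries in INFO) are invented for illustration, and per-sample genotype columns are ignored.

```python
from typing import Any, Dict

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 columns")
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, columns[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are allowed
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    info: Dict[str, Any] = {}
    if record["INFO"] != ".":
        for entry in record["INFO"].split(";"):
            key, separator, value = entry.partition("=")
            info[key] = value if separator else True  # keys without '=' are flags
    record["INFO"] = info
    return record


# Invented variant line for illustration only.
vcf_line = "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=14;DB"
variant = parse_vcf_line(vcf_line)
print(variant)
```

Lines starting with '#' are header lines and should be skipped before parsing.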

For example VCF files, look at your own files from the class BIOS201, Genome, why are we different.

GFF and GTF

Gene transfer format (GTF) and General feature format (GFF) are tab-delimited file formats used to hold information about gene structure.

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
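According to the GFF3 specification linked above, each feature line has nine tab-delimited columns: seqid, source, type, start, end, score, strand, phase and attributes. The Python sketch below parses one such line. It only handles GFF3-style `key=value` attributes (GTF writes attributes differently), it does not decode URL-escaped characters, and the function name and example line are invented.

```python
from typing import Any, Dict

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Parse one GFF3 feature line into a dictionary keyed by column name."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(GFF3_COLUMNS):
        raise ValueError("a GFF3 feature line needs exactly 9 columns")
    feature: Dict[str, Any] = dict(zip(GFF3_COLUMNS, columns))
    feature["start"] = int(feature["start"])  # 1-based, inclusive
    feature["end"] = int(feature["end"])      # 1-based, inclusive
    attributes: Dict[str, str] = {}
    if feature["attributes"] != ".":
        for pair in feature["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    feature["attributes"] = attributes
    return feature


# Invented feature line for illustration only.
gff_line = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=demo"
gene = parse_gff3_line(gff_line)
print(gene["type"], gene["start"], gene["end"], gene["attributes"])
```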

BED

The formal bed format description can be found here.

In short, BED files provide an easy way to define the regions you want to show in an annotation track (for example, on a reference genome).
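A BED line needs at least three columns: chromosome, start and end. Coordinates are 0-based and the end is exclusive, so the region length is simply end minus start. Here is a small Python sketch, assuming tab-separated columns; the function name and example line are invented.

```python
from typing import Any, Dict


def parse_bed_line(line: str) -> Dict[str, Any]:
    """Parse the three required BED columns and keep any optional ones."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    start, end = int(columns[1]), int(columns[2])
    return {
        "chrom": columns[0],
        "start": start,         # 0-based, inclusive
        "end": end,             # 0-based, exclusive
        "length": end - start,  # half-open interval, so no +1 is needed
        "optional": columns[3:],
    }


# Invented region for illustration only.
region = parse_bed_line("chr1\t100\t250\tdemo_region")
print(region)
```

This differs from GFF/GTF and SAM, which use 1-based coordinates, so it is easy to be off by one when converting between formats.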

Bed Graph

The detailed description of the bedGraph format is here. I think this file can be used to store a value, such as the number of reads, for each small interval of nucleotides on your chromosome after sequencing.

BigWig

BigWig files indicate the read signal at each position. You can view them with IGV and other applications, and turn them into very beautiful graphs. Requires further inquiry. The official BigWig file explanation can be found here.