RNA-velocity Analysis for Bioinfo Project
This is my notes-and-plan markdown for my bioinformatics project on RNA velocity analysis. The project is about reproducing some of the published results, and I think this markdown will be helpful for further analysis, so I'm putting it here on my blog in case I need it in the future. All the other code is stored in the GitHub repository for this project.
To Do List
- [x] Perform velocyto analysis on the already-single Bam files.
  - [x] Find out what happened to the Python error output.
  - [ ] Check the mask GTF file to see what problems it is causing.
  - [x] Inspect the GTF file to see if it might be the cause of all the problems.
- [x] Perform velocyto analysis on the squashed and then separated Bam files.
- [ ] Try the `cellranger` pipeline for other datasets.
- [x] Prepare files required for velocyto analysis.
  - [x] Turn the squashed Bam into multiple Bam files.
  - [x] Figure out a way to run `velocyto` somewhere.
Starting from GEO files
Environment Setup
# A new conda environment on taiyi
conda create -n pre python=3.8
# install required packages
conda install numpy scipy cython numba matplotlib scikit-learn h5py click
pip install velocyto
Chromaffin Cells Analysis
In the directory `/scratch/2024-05-04/bio-zhanglm/bioinfo/chromaffin`.
Acquiring Data from GEO
#!/bin/sh
#BSUB -J N_F ##job name
#BSUB -q ser ##queue name
#BSUB -n 10 ##number of total cores
#BSUB -R "span[ptile=40]" ##40 cores per node
#BSUB -W 12:00 ##walltime in hh:mm
#BSUB -R "select[hname!='r13n18']" ##exclusive r13n18
#BSUB -e log/err.log ##error log
#BSUB -o log/H.log ##output log
cd /scratch/2024-05-04/bio-zhanglm/bioinfo/chromaffin
geofetch -i GSE99933
mkdir fastq
# convert each downloaded SRA run to FASTQ (note: ls also lists non-SRA entries such as the fastq/ directory)
ls | while read line
do
echo $line
fastq-dump --outdir fastq $line
done
Making mm10 index for STAR alignment
The mm10 reference directory is `/data/bio-zhanglm/genome/mm10`. STAR was set up in the `(base)` environment.
cd /data/bio-zhanglm/genome/mm10
wget https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz
# 'dna_sm' - soft-masked genomic DNA.
# All repeats and low complexity regions have been replaced
# with lowercased versions of their nucleic base
gunzip Mus_musculus.GRCm39.111.gtf.gz
gunzip Mus_musculus.GRCm39.dna_sm.toplevel.fa.gz
STAR \
--runThreadN 25 \
--runMode genomeGenerate \
--genomeDir /data/bio-zhanglm/genome/mm10 \
--genomeFastaFiles Mus_musculus.GRCm39.dna_sm.toplevel.fa \
--sjdbGTFfile Mus_musculus.GRCm39.111.gtf
Output:
STAR --runThreadN 25 --runMode genomeGenerate --genomeDir /data/bio-zhanglm/genome/mm10 --genomeFastaFiles Mus_musculus.GRCm39.dna_sm.toplevel.fa --sjdbGTFfile Mus_musculus.GRCm39.111.gtf
STAR version: 2.7.10b compiled: 2022-11-01T09:53:26-04:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
May 08 15:43:29 ..... started STAR run
May 08 15:43:29 ... starting to generate Genome files
May 08 15:44:14 ..... processing annotations GTF
May 08 15:44:35 ... starting to sort Suffix Array. This may take a long time...
May 08 15:44:47 ... sorting Suffix Array chunks and saving them to disk...
May 08 15:50:58 ... loading chunks from disk, packing SA...
May 08 15:52:04 ... finished generating suffix array
May 08 15:52:04 ... generating Suffix Array index
May 08 15:56:09 ... completed Suffix Array index
May 08 15:56:09 ..... inserting junctions into the genome indices
May 08 15:59:04 ... writing Genome to disk ...
May 08 15:59:04 ... writing Suffix Array to disk ...
May 08 15:59:21 ... writing SAindex to disk
May 08 15:59:24 ..... finished successfully
STAR alignment
Aligning Together and Splitting
First we construct a tab-delimited file matching each sequence file to its cell barcode (in our case the GEO/SRA accession number).
ls fastq/*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.' | while read line
do
# STAR requires 3 columns, second should be "-" in single-end sequencing.
echo -e "fastq/${line}.fastq\t-\t${line}"
done > manifest.tsv
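As a quick sanity check (my own addition, not part of the original pipeline), the manifest can be validated in Python: STAR expects exactly three tab-separated columns per line, with "-" in the second column for single-end data.

```python
# Minimal manifest.tsv sanity check (assumption: run from the project
# directory so that the relative fastq/ paths resolve).
import csv
import os

with open("manifest.tsv") as fh:
    for i, row in enumerate(csv.reader(fh, delimiter="\t"), start=1):
        # STAR manifest format: <read1> <read2 or "-"> <cell id>
        assert len(row) == 3, f"line {i}: expected 3 columns, got {len(row)}"
        assert row[1] == "-", f"line {i}: second column must be '-' for single-end reads"
        assert os.path.exists(row[0]), f"line {i}: missing fastq file {row[0]}"

print("manifest.tsv looks well-formed")
```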
Then we align the sequences to the genome using the `--readFilesManifest` option in STAR. We also specify the sequencing method with `--soloType SmartSeq`.
# Submitted as ll_star_02.txt
STAR --runThreadN 25 \
--genomeDir /data/bio-zhanglm/genome/mm10 \
--readFilesManifest manifest.tsv \
--outFileNamePrefix ./mapping/split/chromaffin. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--soloType SmartSeq \
--soloUMIdedup Exact NoDedup \
--outSAMattributes NH HI AS nM RG
STARsolo's GitHub page says:

> Cell-id can be any string without spaces. Cell-id will be added as ReadGroup tag (RG:Z:) for each read in the SAM/BAM output. If Cell-id starts with ID:, it can contain several fields separated by tab, and all the fields will be copied verbatim into SAM @RG header line.
Thus we might try to split the BAM file into multiple BAM files based on the cell barcode with `samtools split`, which by default separates reads according to the `@RG` header.
# at /scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin/mapping/split/split_out
samtools split -f %#_%*.%. ../chromaffin.Aligned.sortedByCoord.out.bam
In order for the Bam file output by STAR to carry the `@RG` information, we need to add one more option when doing the alignment: the last line, `--outSAMattributes NH HI AS nM RG`, is required for each read in the Bam file to carry the RG tag.
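To double-check that the `@RG` header lines and per-read RG tags actually made it into the merged BAM before splitting, something like the following sketch could be used (pysam is my own assumption here, not part of the original commands; the path is the one produced by the STAR run above):

```python
# Inspect @RG header lines and per-read RG tags in the STAR output BAM
# (assumes pysam is installed, e.g. pip install pysam).
import pysam

bam_path = "mapping/split/chromaffin.Aligned.sortedByCoord.out.bam"
with pysam.AlignmentFile(bam_path, "rb") as bam:
    read_groups = bam.header.to_dict().get("RG", [])
    print(f"{len(read_groups)} @RG header lines found")
    # print the RG tag of the first few reads
    for i, read in enumerate(bam.fetch(until_eof=True)):
        print(read.query_name, read.get_tag("RG"))
        if i >= 4:
            break
```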
It worked, and the overall times for the jobs were:
- Alignment: 30 minutes.
- Splitting: 30 minutes.
The output files are stored in `/scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin/mapping/split/split_out`.
Aligning One-by-one
This will probably work just fine, but it takes a long time. With this method you can't use the `--soloType` option, since we are not using STARsolo but only regular STAR.
# Submitted as ll_star_01.txt
ls fastq/*.fastq | cut -f 2 -d '/' | cut -f 1 -d '.' | while read line
do
STAR --runThreadN 25 \
--genomeDir /data/bio-zhanglm/genome/mm10 \
--readFilesIn fastq/${line}.fastq \
--outFileNamePrefix mapping/${line} \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
done
The job process took around 6 hours.
Velocyto Analysis
We first performed the velocyto analysis on the one-by-one aligned BAM files.
Preparation
Before we start the analysis, note that `velocyto` analysis of SmartSeq2 data requires two additional files:
1. The gene annotation file in GTF format, `Mus_musculus.GRCm39.111.gtf`.
2. The repeat masker file, `mm10_rmsk.gtf`.
Both can be found in the `/data/bio-zhanglm/genome/mm10/` directory.
Analysis
One-by-one output BAMs
mask=/data/bio-zhanglm/genome/mm10/mm10_rmsk.gtf
annotation=/data/bio-zhanglm/genome/mm10/Mus_musculus.GRCm39.111.gtf
velocyto run-smartseq2 -o velocyto -m $mask -e chromaffin mapping/*.bam $annotation
This sadly resulted in the following error:
Traceback (most recent call last):
File "/work/bio-zhanglm/conda/envs/pre/bin/velocyto", line 8, in <module>
sys.exit(cli())
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/velocyto/commands/run_smartseq2.py", line 70, in run_smartseq2
return _run(bamfile=bamfiles, gtffile=gtffile, bcfile=None, outputfolder=outputfolder,
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/velocyto/commands/_run.py", line 196, in _run
mask_ivls_by_chromstrand = exincounter.read_repeats(repmask)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/velocyto/counter.py", line 347, in read_repeats
gtf_lines = sorted(gtf_lines, key=sorting_key)
File "/work/bio-zhanglm/conda/envs/pre/lib/python3.8/site-packages/velocyto/counter.py", line 345, in sorting_key
return (x[0], x[6] == "+", int(x[3]), entry) # The last element of the touple corresponds to the `last resort comparison`
IndexError: list index out of range
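The `IndexError` is raised while `read_repeats` sorts the mask GTF lines by columns 1, 7, and 4, so a likely culprit is a line in mm10_rmsk.gtf with fewer tab-separated fields than expected (a standard GTF has nine). A quick diagnostic along these lines (my own sketch, not part of the original notes) could confirm that:

```python
# Count mask GTF lines that do not have at least 7 tab-separated fields;
# such lines would make x[6] in velocyto's counter.py raise IndexError.
mask_path = "/data/bio-zhanglm/genome/mm10/mm10_rmsk.gtf"

bad = []
with open(mask_path) as fh:
    for lineno, line in enumerate(fh, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            bad.append((lineno, line.rstrip("\n")[:80]))

print(f"{len(bad)} lines with fewer than 7 tab-separated fields")
for lineno, preview in bad[:10]:
    print(lineno, preview)
```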
Squashed BAMs
cd /scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin
mask=/data/bio-zhanglm/genome/mm10/mm10_rmsk.gtf
annotation=/data/bio-zhanglm/genome/mm10/Mus_musculus.GRCm39.111.gtf
bam=/scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin/mapping/split/split_out
velocyto run-smartseq2 -o velocyto -m $mask -e chromaffin ${bam}/*.bam $annotation
This produced the same error as above.
Hippocampus
Since the SmartSeq2 chromaffin data didn't work, we wanted to try the 10x Genomics hippocampus data to see if we could reproduce their results.
SmartSeq2 might not be that popular, so there aren't many solutions on the GitHub issues page, but I found a lot of people asking and answering 10x-related questions.
In the directory `/scratch/2024-05-11/bio-zhanglm/bioinfo/hippocampus`.
However, this dataset seems to be too large: I've been downloading it via `geofetch -i GSE104323` for over 3 days now and it is still not finished, so I'm giving up.
Starting from Aligned Bam files
We found some pre-aligned Bam files from P. Kharchenko's lab, deposited at http://pklab.med.harvard.edu/velocyto/notebooks/R/chromaffin2.nb.html, which is great news. We can now compare the difference between our aligned Bam files and theirs.
Downloading new Bam files
We just download and extract them with the following commands:
wget http://pklab.med.harvard.edu/velocyto/chromaffin/bams.tar
tar xvf bams.tar
Resulting in the following files:
e12.5.bams/
|-- A1
| `-- A1_unique.bam
|-- A10
| `-- A10_unique.bam
|-- A11
| `-- A11_unique.bam
|-- A12
| `-- A12_unique.bam
|-- A13
| `-- A13_unique.bam
.
.
.
Analysis of Velocyto
Obtaining the `.loom` Files
We then tried to run velocyto on this public dataset to see if it works. This time, we followed the instructions from Kharchenko's lab and removed the mask option `-m` from the command.
velocyto run-smartseq2 -o velocyto ${bam}/*/*.bam $annotation
Surprisingly, it worked! The output file is stored as `/scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin/public_bam/velocyto/onefilepercell_A10_unique_and_others_1N5XB.loom`.
Thus, we removed the `-m` option from the command for our own Bam files as well, and it also worked. It seems there is a problem with the mask data, which is plausible given that the Bam files have a similar structure. The commands are shown below.
cd /scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin
mask=/data/bio-zhanglm/genome/mm10/mm10_rmsk.gtf # something wrong
annotation=/data/bio-zhanglm/genome/mm10/Mus_musculus.GRCm39.111.gtf
bam=/scratch/2024-05-11/bio-zhanglm/bioinfo/chromaffin/mapping/split/split_out
velocyto run-smartseq2 -o velocyto -e chromaffin ${bam}/*.bam $annotation
The output from our own Bam files is stored as `.../bioinfo/chromaffin/velocyto/chromaffin_split.loom`.
The `.loom` files are binary, so we can't see their structure directly, but we know they contain the information linking UMIs to splicing information. This can be accessed with the `velocyto` package.
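As a small example (assuming `loompy`, which velocyto depends on, is available in the same environment; the path is the one from this note), the contents of the output `.loom` file can be inspected like this:

```python
# Peek inside the velocyto output .loom file: it stores count layers
# (typically 'spliced', 'unspliced', 'ambiguous') plus gene (row) and
# cell (column) attributes.
import loompy

loom_path = "velocyto/chromaffin_split.loom"
with loompy.connect(loom_path) as ds:
    print("matrix shape (genes x cells):", ds.shape)
    print("layers:", list(ds.layers.keys()))
    print("gene attributes:", list(ds.ra.keys()))
    print("cell attributes:", list(ds.ca.keys()))
```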
Analysis with R and Python
R
First, we need to install the required package from the GitHub `velocyto.R` repository using the following command:
devtools::install_github("velocyto-team/velocyto.R")
R gave the following error:
ERROR: dependency ‘pcaMethods’ is not available for package ‘velocyto.R’
* removing ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library/velocyto.R’
Warning message:
In i.p(...) :
installation of package ‘/var/folders/ck/9dvycsw50dz1r928cgvks0tr0000gn/T//RtmpqeuAwC/file22d973751c84/velocyto.R_0.6.tar.gz’ had non-zero exit status
Huge Problem
`velocyto.R` doesn't support the arm64 architecture, and sadly I'm using an M1 Pro chip, so this will not work at all.
Analysis with scVelo
When researching follow-up work on velocyto, I found a newer analysis pipeline named scVelo, the counterpart of velocyto in the scanpy ecosystem. Its model allows different splicing rates for different genes, so it is slightly different from the original velocyto model. The analysis pipeline and results can be found in the GitHub repository for this project.
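For reference, here is a minimal sketch of a standard scVelo workflow on the `.loom` file produced above; the preprocessing parameters are illustrative assumptions, not the project's final settings.

```python
# Minimal scVelo sketch on the velocyto output (parameters are illustrative).
import scvelo as scv

adata = scv.read("velocyto/chromaffin_split.loom", cache=True)

# Basic preprocessing: gene filtering, normalization, and first/second moments
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Velocity estimation (the dynamical model, which fits gene-specific
# kinetic rates as mentioned above, would additionally require
# scv.tl.recover_dynamics and mode="dynamical").
scv.tl.velocity(adata)
scv.tl.velocity_graph(adata)

# Compute an embedding and project the velocities onto it
scv.tl.umap(adata)
scv.pl.velocity_embedding_stream(adata, basis="umap")
```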