Chapter 12 Steps to make your own custom reference files for RIMA pipeline

12.0.1 Reference fasta

The human hg38 reference FASTA file is downloaded from the GDC (Genomic Data Commons) website.

12.0.2 Gene annotation file (gtf)

The human GTF annotation file is downloaded from the GENCODE website. We currently use release v27.
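Because the FASTA and the GTF come from different sources, it can be worth confirming that the sequence names in the GTF also appear in the FASTA before building any index from the pair. The following is a minimal sketch using standard Unix tools; the function name `gtf_contigs_missing_from_fasta` is hypothetical and not part of RIMA.

```shell
# Hypothetical helper (not part of RIMA): print sequence names that appear
# in column 1 of the GTF but have no matching header in the FASTA.
gtf_contigs_missing_from_fasta() {
    fasta="$1"
    gtf="$2"
    fa_names=$(mktemp)
    # FASTA sequence name = first whitespace-delimited token after '>'
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$fa_names"
    # GTF sequence name = column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | sort -u | comm -23 - "$fa_names"
    rm -f "$fa_names"
}
```

Calling `gtf_contigs_missing_from_fasta GRCh38.d1.vd1.CIDC.fa gencode.v27.annotation.gtf` prints nothing when every GTF sequence name is present in the FASTA.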

12.0.3 Build STAR index

conda activate rna

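## STAR Version: STAR_2.6.1d
## The genome directory must exist before STAR writes the index into it
mkdir -p ./ref_files/v27_index
STAR --runThreadN 16 --runMode genomeGenerate \
     --genomeDir ./ref_files/v27_index \
     --genomeFastaFiles GRCh38.d1.vd1.CIDC.fa \
     --sjdbGTFfile gencode.v27.annotation.gtf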
...
00:04:54 ..... started STAR run
00:04:54 ... starting to generate Genome files
00:05:57 ... starting to sort Suffix Array. This may take a long time...
00:06:11 ... sorting Suffix Array chunks and saving them to disk...
00:17:43 ... loading chunks from disk, packing SA...
00:19:31 ... finished generating suffix array
00:19:31 ... generating Suffix Array index
00:23:20 ... completed Suffix Array index
00:23:20 ..... processing annotations GTF
00:23:35 ..... inserting junctions into the genome indices
00:26:49 ... writing Genome to disk ...
00:27:06 ... writing Suffix Array to disk ...
00:28:53 ... writing SAindex to disk
00:29:05 ..... finished successfully
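
Once STAR reports that it finished successfully, a quick check that the genome directory holds the core index files can catch a truncated run early. This is a minimal sketch: `check_star_index` is a hypothetical helper, and its file list covers only a few of the files STAR writes (`Genome`, `SA`, `SAindex`, `genomeParameters.txt`, `chrName.txt`), not the complete index.

```shell
# Hypothetical helper (not part of STAR or RIMA): return non-zero if any of
# a small set of core STAR index files is missing or empty.
check_star_index() {
    dir="$1"
    missing=0
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the index built above, this could be run as `check_star_index ./ref_files/v27_index && echo "index looks complete"`.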

12.0.4 RSeQC reference files

We download the human annotation BED files, including the whole-genome BED file and the housekeeping-gene BED file, from the RSeQC page on the SourceForge website.

./ref_files/refseqGenes.bed
./ref_files/housekeeping_refseqGenes.bed
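
RSeQC expects its gene models in standard 12-column BED format, so a quick column count is a cheap sanity check after downloading. This is a minimal sketch; `count_non_bed12_lines` is a hypothetical helper name, and skipping `#`, `track`, and `browser` lines is an assumption about which header lines may be present.

```shell
# Hypothetical helper (not part of RSeQC): count non-empty data lines that
# do not have exactly 12 tab-separated fields.
count_non_bed12_lines() {
    awk -F'\t' '!/^(#|track|browser)/ && NF > 0 && NF != 12 { n++ } END { print n + 0 }' "$1"
}
```

Running `count_non_bed12_lines ./ref_files/refseqGenes.bed` prints `0` when every data line has 12 columns.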

12.0.5 Build salmon index

conda activate rna

## salmon Version: salmon 1.1.0
salmon index -t GRCh38.d1.vd1.CIDC.fa -i salmon_index

...
index ["salmon_index"] did not previously exist  . . . creating it
[jLog] [info] building index 
[jointLog] [info] [Step 1 of 4] : counting k-mers
[jointLog] [info] Replaced 164,553,847 non-ATCG nucleotides
[jointLog] [info] Clipped poly-A tails from 0 transcripts
[jointLog] [info] Building rank-select dictionary and saving to disk
[jointLog] [info] done
Elapsed time: 0.191866s
[jointLog] [info] Writing sequence data to file . . .
[jointLog] [info] done
Elapsed time: 1.91244s
[jointLog] [info] Building 64-bit suffix array (length of generalized text is 3,088,286,426)
[jointLog] [info] Building suffix array . . .
success
saving to disk . . . done
Elapsed time: 18.3072s
done
Elapsed time: 703.843s

12.0.6 GMT file for gene set analysis

The GMT file is downloaded from the Broad Institute release page. We currently use “c2.cp.kegg.v6.1.symbols.gmt”.
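
A GMT file is tab-delimited, with one gene set per line: the set name, a description, and then the member genes. To inspect a downloaded file, the sizes of its gene sets can be listed with a short `awk` sketch like the one below; `gmt_set_sizes` is a hypothetical helper name, not a tool shipped with RIMA.

```shell
# Hypothetical helper: print "<set name><TAB><number of genes>" for each
# gene set in a GMT file (column 1 = name, column 2 = description,
# remaining columns = genes).
gmt_set_sizes() {
    awk -F'\t' 'NF >= 2 { print $1 "\t" (NF - 2) }' "$1"
}
```

For example, `gmt_set_sizes c2.cp.kegg.v6.1.symbols.gmt | sort -k2,2n | head` lists the smallest gene sets first.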

12.0.7 STAR-Fusion genome resource lib

The genome resource lib is downloaded from the Broad Institute release page. We currently use GRCh38_v22_CTAT_lib.

You can also prepare the library yourself for use with STAR-Fusion. For more details, read:

12.0.8 Centrifuge index

The human Centrifuge index is downloaded from the Centrifuge website. We currently use p_compressed+h+v, which includes the human genome, prokaryotic genomes, and viral genomes.

You can also build your own custom Centrifuge index. For more details, read:

12.0.9 TRUST4 reference files

The TRUST4 reference files include (1) a FASTA file of TCR and BCR genomic sequences and (2) a reference database FASTA file containing annotation information.

hg38_bcrtcr.fa
human_IMGT+C.fa

These reference files can be downloaded directly from the TRUST4 GitHub repository.