Chapter 2 How to run RIMA

2.1 Install RIMA and Set Up the Running Environment

To run RIMA, you will create a working directory. In the working directory, you will install the pipeline code and other required software.

Running machine requirements
  • Cores and memory – We usually run RIMA on a machine with at least 64 GB of memory to ensure a smooth read alignment run using STAR.
  • Storage – You will need to reserve enough storage for the following:
    1. Reference files and pipeline code – 65 GB
    2. Raw data – we recommend reserving five times the size of your data files. For example, if your fastq files are 50GB each, then you should allot 250GB for each fastq file. Zipped fastq files are recommended.
    3. Required software – reserve ~20GB for installation of miniconda, mamba and snakemake.

Prepare a working directory

mkdir RIMA

Get the pipeline code

Download the RIMA_pipeline folder to your working directory.

git clone  https://github.com/kateyliu/RIMA_pipeline.git

Install miniconda

Download and configure your conda environment. Follow the prompts on the installer screens.
#Note: You will need to install miniconda3 into the RIMA directory. This means you need to override the default directory when miniconda3 is installed. #If you have conda installed elsewhere, you will still need a local copy in your RIMA directory in order for the environment installation shell program to work. (see below)

#go to your RIMA directory if you are not there already
cd RIMA

#obtain the Miniconda3 code and install in the RIMA directory

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

bash Miniconda3-latest-Linux-x86_64.sh  
# !Don't forget to override the default directory for miniconda3!
# Install miniconda3 into your RIMA directory.

conda list  #to see whether you have successfully installed conda

Set up the running environment

The RIMA environment can be set up by running a shell command. In the command, you need to specify the platform you are using – Amazon Web Services (AWS) or Google Cloud Platform (GCP). This process may take ten to fifteen minutes. The user will need to answer two questions during the process – one at the beginning of the program and one near the end – to give permissions to install miniconda and snakemake.

#NOTE: before running the environment shell command, make sure to set the path for CONDA_ROOT

export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH

Run the shell program:

#shell command for 
bash ./RIMA_environment.sh -p {platform --AWS, GCP}

#example for AWS
bash ./RIMA_environment.sh -p AWS

Activate the RIMA environment

Based on the conda environments information, the main environment “RIMA” can be activated as below.

#set the path if you this is a new login to your shell
export CONDA_ROOT=/{path_to_your_RIMA_directory}/RIMA/miniconda3
export PATH=/{path_to_your_RIMA_directory}/RIMA/miniconda3/bin:$PATH

# activate the RIMA conda environment
source activate RIMA

2.2 Download pre-built references

A pre-prepared RIMA reference folder can be downloaded using the code below.

If you want to prepare a customized reference, you can follow this tutorial to build your own reference.

Currently, RIMA supports the hg38 reference downloaded from Genomic Data Commons .

cd RIMA
wget http://cistrome.org/~lyang/ref.tar.gz  

# unzip the reference  
tar -zxvf ref.tar.gz

# remove the reference zip file to save some space (optional)
rm ref.tar.gz 

Optional step

If you download your RIMA reference folder into a directory other than RIMA, you have to change the symbolic link for the ref_files under the rnaseq_pipeline folder: For example: if you downloaded the RIMA reference folder into /mnt/tutorial/ref_files, use the ln command to change the symbolic link. Please refer to here if you are building your own reference.

cd RIMA

# remove the current link of ref_files
rm ref_files

# create a new symoblic link to your reference folder
ln -s /mnt/tutorial/ref_files

2.3 Prepare input files

Metasheet.csv

Metasheet.csv is a comma-delimited file that resides in the RIMA_pipeline folder. The Metasheet.csv should contain phenotypic information about your samples that can be used for downstream analysis. Ensure your metasheet contains Two Required Columns (SampleName, Group) in comma-delimited format. RIMA uses demo data from Immune and genomic correlates of response to anti-PD-1 immunotherapy in glioblastoma . We selected 12 samples from this trial and included 6 treatment naive samples (3 responder and 3 non-responders) for cohort level comparison.

  • SampleName: Sequencing sample id.
  • PatName: Patient id
  • Group: Sample included for cohort level comparison. ‘NA’ indicates samples that should not be included in the comparison. In the example below, we compare responders(R) and non-responders(NR) using only treatment naive (pre) patients.

You can add columns to the metasheet in order to compare other phenotypes e.g. columns for Timing Age, Sex etc.

SampleName,PatName,Group,Age,Tissue,Gender,Timing,Responder,TumorLocation,OngoingTreatment,PFS,Survival,OS,SampleId,syn_batch
SRR8281218,P20,NR,63,Brain,male,Pre,NR,temporal,no,110,1,278,4790-NL-AS,1
SRR8281219,P20,NA,63,Brain,male,Post,NR,temporal,no,110,1,278,4975-NL-AS,1
SRR8281226,P53,NR,70,Brain,male,Pre,NR,frontal,Yes,83,1,337,3981-NL-AS,2
SRR8281236,P53,NA,70,Brain,male,Post,NR,frontal,Yes,83,1,337,4760-D1,2
SRR8281233,P56,NR,54,Brain,female,Pre,NR,Parietal,no,83,1,83,4341-D3,3
SRR8281230,P56,NA,54,Brain,female,Post,NR,Parietal,no,83,1,83,4680-A1,3
SRR8281244,P100,NA,31,Brain,male,Post,R,Frontal-crossesmidline,yes,151,0,615,4956-NL-AS,2
SRR8281245,P100,R,31,Brain,male,Pre,R,Frontal-crossesmidline,yes,151,0,615,4595-G1,2
SRR8281243,P101,R,32,Brain,male,Pre,R,frontal,no,0,1,414,3542-NL-AS,2
SRR8281238,P102,NA,64,Brain,female,Post,R,Temporal-anterior,yes,519,0,539,5094-NL-AS,1
SRR8281251,P101,NA,32,Brain,male,Post,R,frontal,no,0,1,414,4943-A1,2
SRR8281250,P102,R,64,Brain,female,Pre,R,Temporal-anterior,yes,519,0,539,4443-NL-AS,1

Config.yaml

In the RIMA_pipeline folder, we have prepared a config.yaml as a template to run the pipeline. The config file is divided into three sections:
  • Fixed and user-defined parameters.
  • Cohort level analysis parameters.
  • Samples list.

You should provide these parameters with column names that match the columns in metasheet.csv.

Below is an example of a config.yaml file. You can set the patient name as the sample name. Note: Please make sure that sample names match the metasheet. Currently, only fastq files are accepted as input (including fastq.gz).

#########Fixed and user-defined parameters################
metasheet: metasheet.csv  # Meta info 
ref: ref.yaml             # Reference config 
assembly: hg38
cancer_type: GBM          #TCGA cancer type abbreviations
rseqc_ref: house_keeping  #Option: 'house_keeping' or 'false'. 
                          #By default, a subset of housekeeping genes is used by RSeQC to assess alignment quality.  
                          #This reduces the amount of time needed to run RSeQC.  
mate: [1,2]               #paired-end fastq format, we recommend naming paired-end reads with _1.fq.gz and _2.fq.gz


#########Cohort level analysis parameters################
design: Group             # Condition on which to do comparsion (as set up in metasheet.csv)
Treatment: R              # Treatment use in DESeq2, corresponding to positive log fold change
Control: NR               # Control use in DESeq2, corresponding to negative log fold change
batch: syb_batch          # Options: 'false' or a column name from the metasheet.csv.  
                          # If set to a column name in the metasheet.csv, the column name will be used for batch effect analysis (limma).  
                          # It will also be used as a covariate for differential analysis (DESeq2) to account for batch effect.  

pre_treated: false        # Option: true or false. 
                          # If set to false, patients are treatment naive.  
                          # If set to true, patients have received some form of therapy prior to the current study.


#########Samples list##############
samples:
  SRR8281218:
    - /mnt/zhao_trial/data/SRR8281218_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281218_2.fastq.gz
  SRR8281219:
    - /mnt/zhao_trial/data/SRR8281219_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281219_2.fastq.gz
  SRR8281226:
    - /mnt/zhao_trial/data/SRR8281226_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281226_2.fastq.gz
  SRR8281236:
    - /mnt/zhao_trial/data/SRR8281236_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281236_2.fastq.gz
  SRR8281230:
    - /mnt/zhao_trial/data/SRR8281230_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281230_2.fastq.gz
  SRR8281233:
    - /mnt/zhao_trial/data/SRR8281233_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281233_2.fastq.gz
  SRR8281244:
    - /mnt/zhao_trial/data/SRR8281244_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281244_2.fastq.gz
  SRR8281245:
    - /mnt/zhao_trial/data/SRR8281245_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281245_2.fastq.gz
  SRR8281243:
    - /mnt/zhao_trial/data/SRR8281243_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281243_2.fastq.gz
  SRR8281251:
    - /mnt/zhao_trial/data/SRR8281251_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281251_2.fastq.gz
  SRR8281238:
    - /mnt/zhao_trial/data/SRR8281238_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281238_2.fastq.gz
  SRR8281250:
    - /mnt/zhao_trial/data/SRR8281250_1.fastq.gz
    - /mnt/zhao_trial/data/SRR8281250_2.fastq.gz

execution.yaml

Use execution.yaml to control which modules to run in RIMA. The preprocess module outputs are required for optional downstream analysis modules. Details of each module will be introduced in the next chapters.

##Note: The preprocess individual and cohort modules are necessary to obtain alignment and quality results. 
##Run the remaining modules only after these two modules.
preprocess_individual: true
preprocess_cohort: true

##Optional modules
##Note: The below modules are specialized modules, each dealing with specific targets. 
##Make sure to run both the individual and cohort parts of each module 
##to get all results.

##Individual runs
mutation_individual: false
immune_response_individual: false
immune_repertoire_individual: false
microbiome_individual: false
neoantigen_individual: false



##Cohort runs
differential_expression_cohort: false
immune_infiltration_cohort: false
mutation_cohort: false
immune_response_cohort: false
immune_repertoire_cohort: false
microbiome_cohort: false
neoantigen_cohort: false

ref.yaml The ref.yaml file provides the paths for all reference files. If you downloaded the pre-built reference files, you should not need to change the ref.yaml file.

hg38:
###annotation path
  fasta_path: ./ref_files/GRCh38.d1.vd1.fa 
  gtf_path: ./ref_files/gencode.v22.annotation.gtf
  housekeeping_bed_path: ./ref_files/housekeeping_refseqGenes.bed
  trust4_reper_path: ./ref_files/TRUST4/hg38_bcrtcr.fa
  trust4_IMGT_path: ./ref_files/TRUST4/human_IMGT+C.fa
  gmt_path: ./static/ssgsea/c2.cp.kegg.v6.1.symbols.gmt

###index path
  star_index: ./ref_files/star_gdc_index/hg38/v22
  star_fusion_index: ./ref_files/fusion_gdc_index/GRCh38_v22_CTAT_lib_GDC_Mar162019/ctat_genome_lib_build_dir
  salmon_index:  ./ref_files/salmon_index/gencode.v22.ts.fa
  msisensor_index: ./ref_files/msisensor_gdc_index/hg38/microsatellite.list
  centrifuge_index: ./ref_files/centrifuge_index/p_compressed+h+v

###tool path
  msisensor2_path: ./ref_files/msisensor2
  trust4_path: ./ref_files/TRUST4
  arcasHLA_path: ./ref_files/arcasHLA
  prada_path: ./ref_files/pyPRADA/pyPRADA_1.2

2.4 Running RIMA

Check the path of your conda env

conda env list 
# conda environments:
#
base                  *  /home/ubuntu/miniconda3
centrifuge_env           /home/ubuntu/miniconda3/envs/centrifuge_env
fusion_env               /home/ubuntu/miniconda3/envs/fusion_env
rnaseq                   /home/ubuntu/miniconda3/envs/rnaseq
rseqc_env                /home/ubuntu/miniconda3/envs/rseqc_env
stat_perl_r              /home/ubuntu/miniconda3/envs/stat_perl_r

Check the pipeline with a dry run to ensure correct script and data usage.

snakemake -s RIMA.snakefile -np 

Submit the job.

Alignment and some of the other modules of RIMA will take several hours to run. It is recommended that you run RIMA in the background using a command such as nohup as below.

nohup time snakemake -p -s RIMA.snakefile -j 4 > RIMA.out &

note: Argument -j sets the cores for parallel runs (e.g. ‘-j 4’ can run 4 jobs in parallel at the same time.). Argument -p prints the command in each rule. Note: Here, output log records the run information. A user may run one module at a time to obtain a record of each module’s output log.

2.5 Output

Output folders are generated in a folder called “analysis”. The output structure of each module will be introduced in the next chapters.