RNA-seq Pipeline Specification

Pipeline Details

  • Name: RNA-seq Pipeline
  • Pipeline UUID: 5ef44138e2c2418ebabbc8e2789671a2
  • Version: 2.8.2

Overview

The RNA-seq Pipeline performs end-to-end RNA-sequencing data analysis: quality control, rRNA filtering, genome alignment with HISAT2 and STAR, and estimation of gene and isoform expression levels with RSEM, featureCounts, and Salmon. Alternatively, Kallisto or Salmon can quantify transcript abundances from pseudoalignments, with no genome alignment step.

Key Use Cases:

  • Differential Gene Expression Analysis: Comprehensive RNA-seq data processing with DESeq2 and Limma Voom for identifying differentially expressed genes between conditions.
  • Transcript Quantification: Accurate estimation of gene and isoform expression levels using multiple quantification methods including RSEM, Salmon, and Kallisto.
  • Quality Control and Preprocessing: Automated quality assessment, read trimming, adapter removal, and rRNA filtering for reliable downstream analysis.

Features

  • Multiple Alignment Options: Supports STAR and HISAT2 aligners for genome alignment, plus RSEM for transcriptome alignment.
  • Flexible Quantification Methods: Includes RSEM, featureCounts, Salmon, and Kallisto for expression quantification with both alignment-based and pseudoalignment approaches.
  • Comprehensive Quality Control: Implements FastQC, Picard, and RSeQC for thorough quality assessment and genome-wide BAM analysis.
  • Differential Expression Analysis: Built-in DE module supporting both DESeq2 and Limma Voom with customizable statistical parameters and batch correction.
  • rRNA and Contaminant Filtering: Uses Bowtie2/Bowtie/STAR to filter out common RNAs (rRNA, miRNA, tRNA, piRNA).
  • Visualization Support: Generates IGV and Genome Browser files (TDF and BigWig) for interactive data exploration.
  • UMI Support: Includes UMI extraction capabilities for single-cell and other UMI-based protocols.
  • GSEA Integration: Performs Gene Set Enrichment Analysis on differential expression results.
  • Scalable Processing: Can handle thousands of samples in parallel with containerized processes.

Input/Output Specification

Inputs

Required Inputs

Raw Sequencing Reads

  • Description: FASTQ files containing raw RNA-seq reads from sequencing platforms
  • Format: .fastq.gz (compressed FASTQ)
  • Example File Path: /path/to/input/sample_R1.fastq.gz, /path/to/input/sample_R2.fastq.gz
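
As a rough illustration of how input files might be organized before a run, the sketch below groups FASTQ paths by sample and infers the mate setting ("single" or "pair"). It assumes the `_R1`/`_R2` suffix convention seen in the example paths; the function names are hypothetical and not part of the pipeline.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def group_fastq_by_sample(paths):
    """Map sample name -> {"R1": path, "R2": path} for recognised files."""
    samples = defaultdict(dict)
    for path in paths:
        filename = path.rsplit("/", 1)[-1]
        match = FASTQ_PATTERN.match(filename)
        if match is None:
            raise ValueError(f"Unrecognised FASTQ name: {path}")
        samples[match["sample"]]["R" + match["read"]] = path
    return dict(samples)


def infer_mate(files):
    """Return "pair" if both mates are present, "single" if only R1 is."""
    if set(files) == {"R1", "R2"}:
        return "pair"
    if set(files) == {"R1"}:
        return "single"
    raise ValueError("Found R2 without a matching R1")


grouped = group_fastq_by_sample([
    "/path/to/input/sample_R1.fastq.gz",
    "/path/to/input/sample_R2.fastq.gz",
])
print(infer_mate(grouped["sample"]))  # prints pair
```

Files that follow a different naming scheme would need a different pattern; the mate setting itself is still supplied to the pipeline as a parameter.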

Mate Information

  • Description: Specifies whether reads are single-end or paired-end
  • Format: String parameter ("single" or "pair")

Reference Genome Index

  • Description: Pre-built genome indices for selected aligners (STAR, HISAT2)
  • Format: Directory containing index files

Optional Inputs

Metadata File (Groups File)

  • Description: Tab-separated file containing sample information for differential expression analysis
  • Required Columns: sample_name, group
  • Format: Tab-separated values (.tsv)
  • Example:
    sample_name    group    batch
    control_1      ctrl     Day1
    control_2      ctrl     Day1
    treat_1        treat    Day1
    treat_2        treat    Day1
    

Comparison File

  • Description: Specifies which groups to compare in differential expression analysis
  • Required Columns: controls, treats, names
  • Format: Tab-separated values (.tsv)
  • Example:
    controls    treats    names
    ctrl        treat     treat_v_ctrl
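
Before submitting a run, it can help to confirm that both files have their required columns and that every comparison refers to a group defined in the groups file. The sketch below is one way to do that with the Python standard library. The helper names are hypothetical, and it assumes each `controls` and `treats` cell holds a single group label.

```python
import csv
import io

GROUPS_REQUIRED = {"sample_name", "group"}
COMPARISON_REQUIRED = {"controls", "treats", "names"}


def read_tsv(handle, required):
    """Parse a TSV with a header row and check the required columns exist."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return list(reader)


def check_comparisons(groups_rows, comparison_rows):
    """Return names of comparisons that use a group absent from the groups file."""
    known = {row["group"] for row in groups_rows}
    return [
        row["names"]
        for row in comparison_rows
        if row["controls"] not in known or row["treats"] not in known
    ]


# The example tables from this section, written with real tab separators.
groups_text = (
    "sample_name\tgroup\tbatch\n"
    "control_1\tctrl\tDay1\n"
    "control_2\tctrl\tDay1\n"
    "treat_1\ttreat\tDay1\n"
    "treat_2\ttreat\tDay1\n"
)
comparison_text = "controls\ttreats\tnames\nctrl\ttreat\ttreat_v_ctrl\n"

groups = read_tsv(io.StringIO(groups_text), GROUPS_REQUIRED)
comparisons = read_tsv(io.StringIO(comparison_text), COMPARISON_REQUIRED)
problems = check_comparisons(groups, comparisons)
print(problems)  # prints []
```

An empty list means every comparison maps onto known groups. In practice the two `io.StringIO` objects would be replaced by opened file handles.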
    

Custom Reference Sequences

  • Description: Additional FASTA sequences to add to reference genome
  • Format: .fasta
  • Example File Path: /path/to/custom_sequences.fasta

Outputs

Reported Outputs

  • Gene Expression Matrix:
      • Description: Normalized gene expression counts suitable for downstream analysis
      • Format: .tsv
      • Example File Path: /output/gene_featureCounts.tsv
      • Visualization App: DE Browser, R/Bioconductor
      • Location: Results Folder

  • Differential Expression Results:
      • Description: Statistical results from DESeq2 or Limma Voom analysis with fold changes and p-values
      • Format: .tsv
      • Example File Path: /output/DE_reports/treat_v_ctrl_DESeq2.tsv
      • Visualization App: DE Browser, IGV
      • Location: DE_reports Folder

  • Quality Control Reports:
      • Description: Comprehensive QC metrics including FastQC, Picard, and MultiQC reports
      • Format: .html, .pdf
      • Example File Path: /output/multiqc_report.html
      • Visualization App: Web browser
      • Location: QC Folder
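
A differential expression table like the one described above is often filtered by adjusted p-value and fold change. The following is a minimal sketch of that step. It assumes the TSV uses the standard DESeq2 column names `padj` and `log2FoldChange`; the headers this pipeline actually writes are not documented here and may differ. The `gene` column and all values in the toy table are made up for illustration.

```python
import csv
import io
import math


def significant_rows(handle, padj_max=0.05, abs_lfc_min=1.0):
    """Return rows passing adjusted p-value and fold-change cutoffs.

    Assumes DESeq2-style column names "padj" and "log2FoldChange";
    the pipeline's actual headers may differ.
    """
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        try:
            padj = float(row["padj"])
            lfc = float(row["log2FoldChange"])
        except ValueError:
            continue  # non-numeric entries such as "NA" are skipped
        if math.isnan(padj) or math.isnan(lfc):
            continue
        if padj <= padj_max and abs(lfc) >= abs_lfc_min:
            hits.append(row)
    return hits


# Toy table with made-up values; "gene" is a hypothetical identifier column.
toy_table = (
    "gene\tlog2FoldChange\tpadj\n"
    "geneA\t2.5\t0.001\n"
    "geneB\t-0.2\tNA\n"
    "geneC\t3.0\t0.2\n"
)
hits = significant_rows(io.StringIO(toy_table))
print([row["gene"] for row in hits])  # prints ['geneA']
```

For limma output the equivalent columns are conventionally `adj.P.Val` and `logFC`, so the lookups would need adjusting. The default cutoffs here are common choices, not pipeline settings.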

Supporting Outputs

  • Aligned BAM Files:
      • Description: Genome-aligned reads in BAM format for visualization and further analysis
      • Format: .bam
      • Example File Path: /intermediate/sample_aligned.bam
      • Visualization App: IGV, UCSC Genome Browser

  • Transcript Abundance Files:
      • Description: Transcript-level expression estimates from Salmon/Kallisto
      • Format: .tsv
      • Example File Path: /intermediate/salmon_quant/quant.sf
      • Visualization App: none listed

  • GSEA Results:
      • Description: Gene Set Enrichment Analysis results and visualizations
      • Format: .tsv, .html
      • Example File Path: /output/GSEA_reports/
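
The transcript abundance files listed above can be read with ordinary TSV tooling. The sketch below loads a Salmon `quant.sf` table, which is tab-separated with the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`, and ranks transcripts by TPM. The function names are hypothetical, and the transcript IDs and values are toy data, not pipeline output.

```python
import csv
import io


def read_salmon_tpm(handle):
    """Map transcript name -> TPM from a Salmon quant.sf table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def top_transcripts(tpm_by_name, n=2):
    """Return the n (name, TPM) pairs with the highest TPM, highest first."""
    ranked = sorted(tpm_by_name.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Toy quant.sf content with made-up transcript IDs and values.
toy_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.0\t250000.0\t410.0\n"
    "tx2\t900\t720.0\t750000.0\t670.0\n"
    "tx3\t2100\t1920.0\t0.0\t0.0\n"
)
tpm = read_salmon_tpm(io.StringIO(toy_quant))
print(top_transcripts(tpm))  # prints [('tx2', 750000.0), ('tx1', 250000.0)]
```

Kallisto's `abundance.tsv` uses different header names (`target_id` and `tpm`, among others), so the two lookups would change for that format.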

Associated Processes

References & Additional Documentation