Somatic Variant Calling Pipeline from BAM for RNA-seq Pipeline Specification
Pipeline Details
- Name: Somatic Variant Calling Pipeline from BAM for RNA-seq
- Pipeline UUID: f931g3zfpy4n03236h09w0j9temytr
- Version: 1.1.1
- View Pipeline:
Overview
The Somatic Variant Calling Pipeline from BAM for RNA-seq identifies somatic short variants (SNVs and indels) in one or more tumor samples from a single individual, with or without a matched normal sample. It prepares aligned RNA-seq BAM files for calling through read group management, duplicate marking, splitting of reads that span splice junctions, and base quality score recalibration (BQSR), then calls variants with GATK Mutect2 and annotates them with Ensembl VEP.
Key use cases:
- Cancer Research: Identification of somatic mutations in tumor RNA-seq samples compared to matched normal controls.
- Comparative Genomics: Detection of variants between different sample conditions or treatment groups in RNA-seq data.
- Clinical Diagnostics: Discovery of actionable somatic variants in cancer patient samples for precision medicine applications.
Features
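- RNA-seq Specific Processing: Uses GATK SplitNCigarReads to split reads that span splice junctions (N operators in the CIGAR string), so that the intron gaps in spliced alignments are handled correctly during variant calling.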
- Comprehensive Quality Control: Implements duplicate marking, base quality score recalibration (BQSR), and read group management.
- Somatic Variant Detection: Utilizes GATK Mutect2 for accurate somatic SNV and Indel calling with tumor-normal comparison capability.
- Flexible Sample Comparison: Supports both tumor-only and tumor-normal paired analysis workflows.
- Variant Annotation: Integrates Ensembl VEP for comprehensive variant effect prediction and annotation.
- GATK Best Practices: Follows the GATK-recommended workflow for RNA-seq variant calling, including reference preparation (building the GATK4 sequence dictionary for the reference genome).
- Containerized Execution: All processes run in standardized Docker containers ensuring reproducibility across environments.
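The tumor-only and tumor-normal modes described above differ mainly in how Mutect2 is invoked. The following is a minimal sketch, not the pipeline's actual code: the helper name `mutect2_command` and all file names are hypothetical, and only the standard Mutect2 arguments `-R`, `-I`, `-normal`, and `-O` are used.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bams: List[str],
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
) -> List[str]:
    """Build a `gatk Mutect2` argument list (hypothetical helper, for illustration)."""
    if not tumor_bams:
        raise ValueError("at least one tumor BAM is required")
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("normal_bam and normal_sample must be given together")

    cmd = ["gatk", "Mutect2", "-R", reference]
    for bam in tumor_bams:
        cmd += ["-I", bam]
    if normal_bam is not None:
        # -normal expects the sample name from the BAM read group (SM tag), not a file path.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    cmd += ["-O", output_vcf]
    return cmd


if __name__ == "__main__":
    # Tumor-only call; add normal_bam and normal_sample for a matched analysis.
    print(" ".join(mutect2_command("genome.fa", ["tumor_recal.bam"], "tumor_only.vcf.gz")))
```

Returning an argument list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting issues.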
Input/Output Specification
Inputs
Required
The pipeline processes BAM files from RNA-seq data along with reference materials and sample metadata for somatic variant calling.
BAM Files
- Description: Aligned RNA-seq BAM files containing mapped reads from tumor and/or normal samples
- Format: .bam
- Example File Path: /path/to/input/sample_aligned.bam
Reference Genome
- Description: Reference genome sequence in FASTA format for variant calling
- Format: .fa/.fasta
- Example File Path: /path/to/reference/genome.fa
Known Variants
- Description: VCF files containing known SNPs and Indels for base quality score recalibration
- Format: .vcf/.vcf.gz
- Example File Path: /path/to/known_sites/dbsnp.vcf.gz
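As an illustration of the required inputs listed above, a pre-flight check might look like the sketch below. The keys `bam`, `reference`, and `known_sites` and the function `check_inputs` are assumptions made for this example; the check looks only at file names, not at file contents or indexes.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# Accepted suffixes per required input, taken from the formats listed above.
# The dictionary keys are hypothetical names chosen for this sketch.
REQUIRED_FORMATS: Dict[str, Tuple[str, ...]] = {
    "bam": (".bam",),
    "reference": (".fa", ".fasta"),
    "known_sites": (".vcf", ".vcf.gz"),
}


def check_inputs(paths: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the file names look valid."""
    problems = []
    for key, suffixes in REQUIRED_FORMATS.items():
        value = paths.get(key)
        if not value:
            problems.append(f"missing required input: {key}")
        elif not Path(value).name.endswith(suffixes):
            problems.append(f"{key}: expected one of {suffixes}, got {value}")
    return problems


if __name__ == "__main__":
    print(check_inputs({
        "bam": "/path/to/input/sample_aligned.bam",
        "reference": "/path/to/reference/genome.fa",
        "known_sites": "/path/to/known_sites/dbsnp.vcf.gz",
    }))
```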
Optional Inputs
Sample Groups TSV
- Description: Tab-separated file defining sample groupings for comparison analysis
- Required Columns: Sample ID, Group assignment
- Format: Tab-separated values (.tsv)
- Example File Path: /path/to/metadata/groups.tsv
Comparisons TSV
- Description: Tab-separated file defining which sample groups to compare for variant calling
- Required Columns: Control group, Treatment group, Comparison name
- Format: Tab-separated values (.tsv)
- Example File Path: /path/to/metadata/comparisons.tsv
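The two metadata files work together: the groups file assigns samples to groups, and the comparisons file names which groups to contrast. The sketch below shows one way to resolve them into per-comparison sample lists. The column headers (`sample_id`, `group`, `control`, `treatment`, `name`) are assumptions, since the specification lists the required columns only descriptively.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def read_groups(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each group name to the sample IDs assigned to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(lines, delimiter="\t"):
        groups[row["group"]].append(row["sample_id"])
    return dict(groups)


def resolve_comparisons(groups: Dict[str, List[str]], lines: Iterable[str]) -> List[dict]:
    """Expand each comparison row into control and treatment sample lists."""
    resolved = []
    for row in csv.DictReader(lines, delimiter="\t"):
        for role in ("control", "treatment"):
            if row[role] not in groups:
                raise KeyError(f"comparison {row['name']!r} uses unknown group {row[role]!r}")
        resolved.append({
            "name": row["name"],
            "control_samples": groups[row["control"]],
            "treatment_samples": groups[row["treatment"]],
        })
    return resolved


if __name__ == "__main__":
    groups = read_groups(["sample_id\tgroup", "S1\tnormal", "S2\ttumor"])
    print(resolve_comparisons(groups, ["control\ttreatment\tname", "normal\ttumor\ttumor_vs_normal"]))
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.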
GTF Annotation
- Description: Gene annotation file for variant effect prediction
- Format: .gtf
- Example File Path: /path/to/annotation/genes.gtf
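To show how an optional GTF might feed into annotation, here is a hedged sketch of a VEP command builder. The helper `vep_command` is hypothetical and the pipeline's real VEP invocation is not documented here. The flags used (`--input_file`, `--output_file`, `--vcf`, `--gtf`, `--fasta`, `--cache`, `--offline`) are standard VEP options. Note that VEP expects a GTF passed via `--gtf` to be sorted, bgzip-compressed, and tabix-indexed, so a plain `.gtf` would presumably need that preparation first.

```python
from typing import List, Optional


def vep_command(
    input_vcf: str,
    output_vcf: str,
    gtf: Optional[str] = None,
    fasta: Optional[str] = None,
) -> List[str]:
    """Build a `vep` argument list (hypothetical helper, for illustration)."""
    cmd = ["vep", "--input_file", input_vcf, "--output_file", output_vcf, "--vcf"]
    if gtf is not None:
        # Annotating from a GTF requires the reference FASTA as well.
        if fasta is None:
            raise ValueError("a FASTA file is required when annotating from a GTF")
        cmd += ["--gtf", gtf, "--fasta", fasta]
    else:
        # Assumption: fall back to a locally installed VEP cache.
        cmd += ["--cache", "--offline"]
    return cmd


if __name__ == "__main__":
    print(" ".join(vep_command("calls.vcf.gz", "annotated.vcf", "genes.gtf.gz", "genome.fa")))
```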
Outputs
Reported Outputs
- Somatic Variants VCF:
  - Description: Called somatic variants in VCF format with quality scores and filters
  - Format: .vcf.gz
  - Example File Path: /output/somatic_variants/comparison_name.vcf.gz
  - Visualization App: IGV, UCSC Genome Browser
  - Location: Variant Calls Folder
- Annotated Variants VCF:
  - Description: Somatic variants with functional annotations from Ensembl VEP
  - Format: .vcf
  - Example File Path: /output/annotated/sample_annotated.vcf
  - Visualization App: VEP Web Interface, IGV
  - Location: Annotations Folder
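Because the reported VCFs carry filter status in the FILTER column, a quick way to inspect a result is to tally that column. The sketch below uses only the Python standard library. The function names are hypothetical, and the filter names in the usage example are illustrative rather than taken from real pipeline output.

```python
import gzip
from collections import Counter
from typing import Iterable


def filter_counts(lines: Iterable[str]) -> Counter:
    """Count how often each FILTER value occurs across VCF records."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th VCF column; multiple filters are separated by ';'.
        for name in fields[6].split(";"):
            counts[name] += 1
    return counts


def filter_counts_from_file(path: str) -> Counter:
    """Read a plain or gzip/bgzip-compressed VCF and tally its FILTER column."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return filter_counts(handle)


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tT\t.\tPASS\t.",
        "chr1\t200\t.\tG\tC\t.\tweak_evidence;strand_bias\t.",
    ]
    print(filter_counts(demo))
```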
Supporting Outputs
- Recalibrated BAM Files:
  - Description: Quality score recalibrated BAM files ready for variant calling
  - Format: .bam + .bai
  - Example File Path: /intermediate/recalibrated/sample_recal.bam
- Duplicate Metrics:
  - Description: Statistics on duplicate read identification and removal
  - Format: .txt
  - Example File Path: /intermediate/metrics/sample_dedup_metrics.txt
- Base Recalibration Tables:
  - Description: Base quality score recalibration data tables
  - Format: .txt
  - Example File Path: /intermediate/bqsr/sample_recal_data.txt
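The duplicate metrics file can be read programmatically. The sketch below assumes the file follows the usual Picard metrics layout: a `## METRICS CLASS` line, then a tab-separated header row, then data rows, ended by a blank line. The function name `parse_picard_metrics` is hypothetical, and the sample values in the usage example are made up.

```python
from typing import Dict, Iterable, List


def parse_picard_metrics(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the metrics table of a Picard-style metrics file as a list of dicts."""
    rows: List[Dict[str, str]] = []
    header = None
    in_metrics = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("## METRICS CLASS"):
            in_metrics = True
            continue
        if not in_metrics:
            continue  # still in the file preamble
        if not line.strip():
            break  # a blank line ends the metrics table
        if header is None:
            header = line.split("\t")
        else:
            rows.append(dict(zip(header, line.split("\t"))))
    return rows


if __name__ == "__main__":
    demo = [
        "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
        "LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION",
        "lib1\t1000\t0.25",
        "",
    ]
    for row in parse_picard_metrics(demo):
        print(row["LIBRARY"], row["PERCENT_DUPLICATION"])
```

Values are kept as strings so that the caller decides how to convert fields such as `PERCENT_DUPLICATION`.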
Associated Processes
- AddOrReplaceReadGroups
- Base Recalibration
- BQSR
- build gatk4 genome dictionary
- markDuplicates
- mutect2
- prepare comparisons
- splitNCigarReads
- vep
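The pre-processing steps among these processes can be sketched as a chain of GATK4 commands, each consuming the previous step's output. This is an assumption-laden illustration: the step order follows the general GATK RNA-seq recommendation rather than a confirmed pipeline definition, the helper name, intermediate file names, and read group values (`lib1`, `ILLUMINA`, `unit1`) are placeholders, and only well-known GATK4 arguments are used.

```python
from typing import List


def preprocessing_commands(
    sample: str, bam: str, reference: str, known_sites: List[str]
) -> List[List[str]]:
    """Build per-sample GATK4 pre-processing commands (hypothetical helper)."""
    rg = f"{sample}.rg.bam"
    dedup = f"{sample}.dedup.bam"
    split = f"{sample}.split.bam"
    table = f"{sample}_recal_data.txt"
    recal = f"{sample}_recal.bam"

    base_recal = ["gatk", "BaseRecalibrator", "-R", reference, "-I", split, "-O", table]
    for vcf in known_sites:
        base_recal += ["--known-sites", vcf]

    return [
        # Assign read group tags; placeholder values are used for library, platform, and unit.
        ["gatk", "AddOrReplaceReadGroups", "-I", bam, "-O", rg,
         "--RGID", sample, "--RGLB", "lib1", "--RGPL", "ILLUMINA",
         "--RGPU", "unit1", "--RGSM", sample],
        # Mark duplicates and write the metrics file.
        ["gatk", "MarkDuplicates", "-I", rg, "-O", dedup,
         "-M", f"{sample}_dedup_metrics.txt"],
        # Split reads that span splice junctions.
        ["gatk", "SplitNCigarReads", "-R", reference, "-I", dedup, "-O", split],
        # Build the recalibration table from known variant sites, then apply it.
        base_recal,
        ["gatk", "ApplyBQSR", "-R", reference, "-I", split,
         "--bqsr-recal-file", table, "-O", recal],
    ]


if __name__ == "__main__":
    for cmd in preprocessing_commands("sample", "sample_aligned.bam", "genome.fa", ["dbsnp.vcf.gz"]):
        print(" ".join(cmd))
```

The reference dictionary step is left out because it runs once per reference rather than once per sample.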
References & Additional Documentation
- Related Papers/links:
- GATK Best Practices for RNA-seq variant calling: https://gatk.broadinstitute.org/hc/en-us/articles/360035531192
- MuTect Publication (Cibulskis et al., 2013, describing the original MuTect method that preceded Mutect2): https://www.nature.com/articles/nbt.2514
- Ensembl VEP Documentation: https://useast.ensembl.org/info/docs/tools/vep/index.html
- Pipeline Repository: Contact ViaFoundry for access to pipeline source code
- Workflow Diagram: Available in the pipeline description page on ViaFoundry platform