Somatic Variant Calling (vs Control) Pipeline Specification
Pipeline Details
- Name: Somatic Variant Calling (vs Control) Pipeline
- Pipeline UUID: ncbu6ks6l004k45j6qjwdwjn8vp1hf
- Version: 1.0.0
Overview
The Somatic Variant Calling (vs Control) Pipeline identifies somatic short variants (SNVs and indels) in one or more tumor samples from a single individual, with or without a matched normal sample. It automates the complete workflow, from preprocessing of raw sequencing data through variant annotation, so that somatic variant detection is reproducible from run to run.
Key Use Cases:
- Cancer Genomics: Identification of somatic mutations in tumor samples compared to matched normal controls for oncology research.
- Tumor-Normal Comparison: Detection of acquired mutations in tumor tissue by comparing against normal tissue from the same individual.
- Multi-Sample Analysis: Processing multiple tumor samples from a single patient to identify common and unique somatic variants.
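The tumor-only, tumor-normal, and multi-tumor cases above differ mainly in which BAM files are passed to Mutect2. The following is a minimal sketch of how such a command line could be assembled, not the pipeline's actual implementation. The function name and file names are hypothetical; the `-R`, `-I`, `-normal`, and `-O` arguments are standard GATK4 Mutect2 options, and `-normal` expects the sample name (the `SM` read group tag) of the normal BAM.

```python
from typing import List, Optional


def mutect2_command(
    reference: str,
    tumor_bams: List[str],
    output_vcf: str,
    normal_bam: Optional[str] = None,
    normal_sample: Optional[str] = None,
) -> List[str]:
    """Build a GATK4 Mutect2 argument list (hypothetical helper).

    One or more tumor BAMs from the same individual are required.
    The matched normal is optional, which gives tumor-only mode.
    """
    if not tumor_bams:
        raise ValueError("at least one tumor BAM is required")
    if (normal_bam is None) != (normal_sample is None):
        raise ValueError("normal_bam and normal_sample must be given together")

    cmd = ["gatk", "Mutect2", "-R", reference]
    for bam in tumor_bams:
        cmd += ["-I", bam]
    if normal_bam is not None:
        # -normal takes the SM tag of the normal sample, not a file path.
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    cmd += ["-O", output_vcf]
    return cmd


# Tumor-normal mode with two tumor samples from one patient.
paired = mutect2_command(
    "genome.fa",
    ["Tumor_01_recal.bam", "Tumor_02_recal.bam"],
    "somatic_variants.vcf.gz",
    normal_bam="Normal_01_recal.bam",
    normal_sample="Normal_01",
)

# Tumor-only mode.
tumor_only = mutect2_command("genome.fa", ["Tumor_01_recal.bam"], "somatic_variants.vcf.gz")
```

Returning an argument list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting issues.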
Features
- GATK4-Based Variant Calling: Utilizes GATK4 Mutect2 for robust somatic variant detection with industry-standard algorithms.
- BWA MEM Alignment: High-performance read alignment using BWA MEM with proper read group handling for GATK compatibility.
- Base Quality Score Recalibration (BQSR): Implements GATK BaseRecalibrator and ApplyBQSR for improved variant calling accuracy.
- Duplicate Marking: Automated marking of PCR and optical duplicate reads using GATK MarkDuplicates, so that duplicates do not inflate the evidence for a variant.
- Variant Annotation: Integration with Ensembl VEP for comprehensive variant effect prediction and annotation.
- Flexible Input Handling: Supports both tumor-only and tumor-normal paired analysis workflows.
- Quality Control Integration: Built-in QC steps including base recalibration and duplicate metrics generation.
- Containerized Execution: All processes run in Docker containers ensuring reproducibility and consistent environments.
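The preprocessing features listed above (alignment with read groups, duplicate marking, and BQSR) can be sketched as a chain of command lines. This is an illustrative sketch under stated assumptions, not the pipeline's own process definitions: the helper names, the intermediate `*.sorted.bam` file, and the `PL:ILLUMINA` platform tag are assumptions, while the tool options shown (`bwa mem -R`, `MarkDuplicates -I/-O/-M`, `BaseRecalibrator --known-sites`, `ApplyBQSR --bqsr-recal-file`) are standard BWA and GATK4 arguments.

```python
from typing import Dict, List


def read_group(sample_id: str, platform: str = "ILLUMINA") -> str:
    """Return a read group header for `bwa mem -R`.

    The literal backslash-t sequences are converted to tabs by bwa.
    GATK tools require at least the ID and SM tags to be present.
    """
    return f"@RG\\tID:{sample_id}\\tSM:{sample_id}\\tPL:{platform}"


def preprocessing_commands(
    sample_id: str, reference: str, fastqs: List[str], known_sites: str
) -> Dict[str, List[str]]:
    """Build per-sample preprocessing commands (hypothetical helper)."""
    # Assumed intermediate: bwa writes SAM to stdout, which must be
    # sorted into a BAM before duplicate marking.
    sorted_bam = f"{sample_id}.sorted.bam"
    dedup_bam = f"{sample_id}_dedup.bam"
    metrics = f"{sample_id}_dedup_metrics.txt"
    recal_table = f"{sample_id}_recal_data.txt"
    recal_bam = f"{sample_id}_recal.bam"
    return {
        "bwa_align": ["bwa", "mem", "-R", read_group(sample_id), reference, *fastqs],
        "mark_duplicates": [
            "gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam, "-M", metrics,
        ],
        "base_recalibrator": [
            "gatk", "BaseRecalibrator", "-R", reference, "-I", dedup_bam,
            "--known-sites", known_sites, "-O", recal_table,
        ],
        "apply_bqsr": [
            "gatk", "ApplyBQSR", "-R", reference, "-I", dedup_bam,
            "--bqsr-recal-file", recal_table, "-O", recal_bam,
        ],
    }


commands = preprocessing_commands(
    "Tumor_01",
    "genome.fa",
    ["Tumor_01_R1.fastq.gz", "Tumor_01_R2.fastq.gz"],
    "dbsnp.vcf.gz",
)
```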
Input/Output Specification
Inputs
Required
Sequencing Reads
- Description: Raw sequencing reads in FASTQ format from tumor and/or normal samples
- Format: .fastq or .fastq.gz
- Example File Path: /path/to/input/sample_R1.fastq.gz
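If the reads are paired-end, the two mate files for each sample need to be matched before alignment. The sketch below groups FASTQ paths by sample, assuming a `<sample>_R1` / `<sample>_R2` naming scheme like the example path above; that naming scheme and the helper name are assumptions, and the pipeline's actual file-matching rules may differ.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# Assumed naming scheme: <sample>_R1.fastq(.gz) and <sample>_R2.fastq(.gz)
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def group_fastqs(paths: List[str]) -> Dict[str, List[str]]:
    """Group FASTQ paths by sample, with R1 listed before R2."""
    by_sample: Dict[str, Dict[str, str]] = defaultdict(dict)
    for path in paths:
        match = FASTQ_PATTERN.match(Path(path).name)
        if match is None:
            raise ValueError(f"unrecognized FASTQ file name: {path}")
        by_sample[match["sample"]][match["mate"]] = path
    return {
        sample: [mates[key] for key in sorted(mates)]
        for sample, mates in by_sample.items()
    }


grouped = group_fastqs(
    [
        "/path/to/input/Tumor_01_R2.fastq.gz",
        "/path/to/input/Tumor_01_R1.fastq.gz",
        "/path/to/input/Normal_01_R1.fastq",
    ]
)
```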
Reference Genome
- Description: Reference genome in FASTA format for read alignment and variant calling
- Format: .fa or .fasta
- Example File Path: /path/to/reference/genome.fa
Known Variants Database
- Description: VCF files containing known SNPs and indels for base quality score recalibration
- Format: .vcf or .vcf.gz
- Example File Path: /path/to/known_sites/dbsnp.vcf.gz
GTF Annotation File
- Description: Gene annotation file in GTF format for variant effect prediction
- Format: .gtf
- Example File Path: /path/to/annotation/genes.gtf
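For orientation, the sketch below shows how a GTF file can serve as the annotation source in an Ensembl VEP command line. It is an assumption about how the annotation step might be invoked, not the pipeline's documented command. When run directly, VEP's `--gtf` option expects a sorted, bgzip-compressed, tabix-indexed GTF and also requires `--fasta`; whether this pipeline performs that preparation internally is not stated here. The helper name and file names are hypothetical.

```python
from typing import List


def vep_command(input_vcf: str, output_vcf: str, gtf_bgz: str, fasta: str) -> List[str]:
    """Build a VEP argument list using a GTF as the transcript source.

    gtf_bgz is assumed to be sorted, bgzip-compressed, and tabix-indexed.
    """
    return [
        "vep",
        "--input_file", input_vcf,
        "--output_file", output_vcf,
        "--vcf",            # write annotated output in VCF format
        "--gtf", gtf_bgz,   # custom transcript annotation
        "--fasta", fasta,   # required alongside --gtf
    ]


vep_cmd = vep_command(
    "somatic_variants.vcf.gz",
    "annotated_variants.vcf",
    "genes.gtf.gz",
    "genome.fa",
)
```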
Optional Inputs
Sample Groups TSV
- Description: Tab-separated file defining sample groupings for comparison analysis
- Required Columns: Sample ID, Group (control/treatment)
- Format: Tab-separated values (.tsv)
- Example:
    Sample_ID	Group
    Normal_01	control
    Tumor_01	treatment
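A sample groups file of this shape can be read with Python's standard `csv` module. The sketch below is a minimal example that assumes the header names `Sample_ID` and `Group` from the example above; the function name is hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_sample_groups(handle: TextIO) -> Dict[str, List[str]]:
    """Map each group label to the sample IDs assigned to it."""
    groups: Dict[str, List[str]] = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        groups.setdefault(row["Group"], []).append(row["Sample_ID"])
    return groups


# The example file contents, with tab-separated columns.
example_tsv = "Sample_ID\tGroup\nNormal_01\tcontrol\nTumor_01\ttreatment\n"
sample_groups = read_sample_groups(io.StringIO(example_tsv))
```

With a file on disk, the same function can be called as `read_sample_groups(open(path, newline=""))`.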
Comparison Design TSV
- Description: File specifying which comparisons to perform between sample groups
- Format: Tab-separated values (.tsv)
- Example File Path: /path/to/comparisons.tsv
Outputs
Reported Outputs
- Somatic Variants VCF:
  - Description: Called somatic variants in Variant Call Format with quality scores and filters
  - Format: .vcf.gz
  - Example File Path: /output/variants/somatic_variants.vcf.gz
  - Visualization App: IGV, UCSC Genome Browser
  - Location: variants/
- Annotated Variants VCF:
  - Description: Somatic variants annotated with functional effects using Ensembl VEP
  - Format: .vcf
  - Example File Path: /output/annotated/annotated_variants.vcf
  - Visualization App: VEP Web Interface, IGV
  - Location: annotated/
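As a quick check on either VCF output, the records can be tallied by the value of their FILTER column using only the standard library. This is a minimal sketch with a hypothetical function name. It relies on the VCF convention that FILTER is the seventh tab-separated column, and it reports whatever values the file contains (for example `PASS` or `.`).

```python
import gzip
from collections import Counter


def count_by_filter(vcf_path: str) -> Counter:
    """Count VCF records per FILTER value; handles .vcf and .vcf.gz."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts: Counter = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            fields = line.rstrip("\n").split("\t")
            counts[fields[6]] += 1  # FILTER is column 7 (index 6)
    return counts
```

For example, `count_by_filter("/output/variants/somatic_variants.vcf.gz")` returns a `Counter` keyed by filter value.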
Supporting Outputs
- Aligned BAM Files:
  - Description: Quality-processed and recalibrated BAM files for each sample
  - Format: .bam with .bai index
  - Example File Path: /output/alignments/sample_recal.bam
- Duplicate Metrics:
  - Description: Detailed statistics on duplicate reads identified and marked
  - Format: .txt
  - Example File Path: /output/qc/sample_dedup_metrics.txt
- Base Recalibration Tables:
  - Description: GATK BaseRecalibrator output tables used for quality score recalibration
  - Format: .txt
  - Example File Path: /output/recal/sample_recal_data.txt
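The duplicate metrics file can be inspected programmatically. The sketch below assumes the usual Picard-style metrics layout written by MarkDuplicates: comment lines starting with `#`, a `## METRICS CLASS` line followed by a tab-separated header row and one row per library, then a blank line. The function name is hypothetical, and the numbers in the sample text are made up purely for illustration.

```python
from typing import Dict, List


def parse_duplication_metrics(text: str) -> List[Dict[str, str]]:
    """Return one dict per library row from a MarkDuplicates metrics file."""
    lines = text.splitlines()
    rows: List[Dict[str, str]] = []
    for index, line in enumerate(lines):
        if not line.startswith("## METRICS CLASS"):
            continue
        if index + 1 >= len(lines):
            break  # truncated file: no header row after the class line
        header = lines[index + 1].split("\t")
        for data in lines[index + 2:]:
            if not data.strip():
                break  # a blank line ends the metrics table
            rows.append(dict(zip(header, data.split("\t"))))
        break
    return rows


# Illustrative content only; the values are not real results.
sample_text = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# MarkDuplicates\n"
    "\n"
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n"
    "lib1\t1000\t50\t0.05\n"
    "\n"
    "## HISTOGRAM\tjava.lang.Double\n"
)
metrics_rows = parse_duplication_metrics(sample_text)
```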
Associated Processes
- Base Recalibration
- BQSR
- build gatk4 genome dictionary
- bwa align
- Check Build BWA
- check BWA files
- markDuplicates
- mutect2
- prepare comparisons
- vep
References & Additional Documentation
- Related Papers:
- Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
- Benjamin D, et al. Calling Somatic SNVs and Indels with Mutect2. bioRxiv. 2019.
- GATK Documentation: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
- BWA Documentation: http://bio-bwa.sourceforge.net/bwa.shtml
- Ensembl VEP Documentation: https://useast.ensembl.org/info/docs/tools/vep/index.html