Somatic Variant Calling (vs Control) Pipeline Specification

Pipeline Details

  • Name: Somatic Variant Calling (vs Control) Pipeline
  • Pipeline UUID: ncbu6ks6l004k45j6qjwdwjn8vp1hf
  • Version: 1.0.0

Overview

The Somatic Variant Calling (vs Control) Pipeline identifies somatic short variants (SNVs and indels) in one or more tumor samples from a single individual, with or without a matched normal sample. It automates the full workflow, from preprocessing of raw sequencing reads through variant calling to variant annotation, and runs each step in a container so that results are reproducible.

Key Use Cases:

  • Cancer Genomics: Identification of somatic mutations in tumor samples compared to matched normal controls for oncology research.
  • Tumor-Normal Comparison: Detection of acquired mutations in tumor tissue by comparing against normal tissue from the same individual.
  • Multi-Sample Analysis: Processing multiple tumor samples from a single patient to identify common and unique somatic variants.
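To make "common and unique somatic variants" in the multi-sample use case concrete, the sketch below splits per-sample call sets into variants shared by every tumor sample and variants private to one sample. It is only an illustration: the function name, the tuple representation of a variant, and the example calls are assumptions, not part of the pipeline.

```python
from typing import Dict, Set, Tuple

# A variant is keyed here by (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def shared_and_private(calls: Dict[str, Set[Variant]]):
    """Return (variants present in every sample, variants seen in only one sample)."""
    samples = list(calls)
    shared = set.intersection(*(calls[s] for s in samples)) if samples else set()
    private = {}
    for s in samples:
        # Union of the calls made in all other samples.
        others = set().union(*(calls[o] for o in samples if o != s))
        private[s] = calls[s] - others
    return shared, private


# Hypothetical calls for two tumor samples from the same patient.
example_calls = {
    "Tumor_01": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "Tumor_02": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "G")},
}
shared, private = shared_and_private(example_calls)
```

Keying on chromosome, position, and alleles treats two calls as the same variant only when all four match; a real comparison may also need allele normalization first.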

Features

  • GATK4-Based Variant Calling: Uses GATK4 Mutect2 to call somatic SNVs and indels.
  • BWA MEM Alignment: High-performance read alignment using BWA MEM with proper read group handling for GATK compatibility.
  • Base Quality Score Recalibration (BQSR): Implements GATK BaseRecalibrator and ApplyBQSR for improved variant calling accuracy.
  • Duplicate Marking: Marks duplicate reads with GATK MarkDuplicates so that PCR and optical duplicates do not inflate the evidence for a variant.
  • Variant Annotation: Integration with Ensembl VEP for comprehensive variant effect prediction and annotation.
  • Flexible Input Handling: Supports both tumor-only and tumor-normal paired analysis workflows.
  • Quality Control Integration: Built-in QC steps including base recalibration and duplicate metrics generation.
  • Containerized Execution: All processes run in Docker containers ensuring reproducibility and consistent environments.
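The steps listed above could chain together roughly as in the following sketch, which builds the command lines without running them. The Python helpers, intermediate file names, and read-group values are illustrative assumptions, not the pipeline's actual implementation; only the tool names and basic flags come from the BWA and GATK4 command-line interfaces. SAM-to-BAM sorting and VEP annotation are left out for brevity.

```python
from typing import List, Optional


def read_group(sample: str) -> str:
    """Minimal @RG line for `bwa mem -R`; all field values are placeholders."""
    # bwa expects a literal backslash-t between fields, hence the doubled backslash.
    return f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA\\tLB:{sample}"


def preprocess_commands(sample: str, r1: str, r2: str,
                        ref: str, known_sites: str) -> List[List[str]]:
    """Alignment, duplicate marking, and BQSR commands for one sample."""
    dedup = f"{sample}.dedup.bam"
    table = f"{sample}_recal_data.txt"
    return [
        # bwa mem writes SAM to stdout; conversion and sorting to BAM are not shown.
        ["bwa", "mem", "-R", read_group(sample), ref, r1, r2],
        ["gatk", "MarkDuplicates", "-I", f"{sample}.sorted.bam",
         "-O", dedup, "-M", f"{sample}_dedup_metrics.txt"],
        ["gatk", "BaseRecalibrator", "-I", dedup, "-R", ref,
         "--known-sites", known_sites, "-O", table],
        ["gatk", "ApplyBQSR", "-I", dedup, "-R", ref,
         "--bqsr-recal-file", table, "-O", f"{sample}_recal.bam"],
    ]


def mutect2_command(ref: str, tumor_bam: str, out_vcf: str,
                    normal_bam: Optional[str] = None,
                    normal_sample: Optional[str] = None) -> List[str]:
    """Mutect2 call; the normal BAM and its sample name are added only when given."""
    cmd = ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam]
    if normal_bam is not None and normal_sample is not None:
        cmd += ["-I", normal_bam, "-normal", normal_sample]
    return cmd + ["-O", out_vcf]


steps = preprocess_commands("Tumor_01", "tumor_R1.fastq.gz", "tumor_R2.fastq.gz",
                            "genome.fa", "dbsnp.vcf.gz")
tumor_only = mutect2_command("genome.fa", "Tumor_01_recal.bam", "somatic_variants.vcf.gz")
paired = mutect2_command("genome.fa", "Tumor_01_recal.bam", "somatic_variants.vcf.gz",
                         normal_bam="Normal_01_recal.bam", normal_sample="Normal_01")
```

The only difference between the tumor-only and tumor-normal modes in this sketch is the extra `-I` and `-normal` arguments passed to Mutect2.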

Input/Output Specification

Inputs

Required

Sequencing Reads

  • Description: Raw sequencing reads in FASTQ format from the tumor sample(s) and, optionally, a matched normal sample
  • Format: .fastq or .fastq.gz
  • Example File Path: /path/to/input/sample_R1.fastq.gz

Reference Genome

  • Description: Reference genome in FASTA format for read alignment and variant calling
  • Format: .fa or .fasta
  • Example File Path: /path/to/reference/genome.fa

Known Variants Database

  • Description: VCF files containing known SNPs and indels for base quality score recalibration
  • Format: .vcf or .vcf.gz
  • Example File Path: /path/to/known_sites/dbsnp.vcf.gz

GTF Annotation File

  • Description: Gene annotation file in GTF format for variant effect prediction
  • Format: .gtf
  • Example File Path: /path/to/annotation/genes.gtf
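A pre-flight check of the required inputs against the formats listed above could look like the sketch below. The dictionary keys and function name are made up for illustration, and the check looks only at file extensions, not file contents.

```python
from pathlib import Path

# Accepted extensions per required input, taken from the "Format" entries above.
ALLOWED_EXTENSIONS = {
    "reads": (".fastq", ".fastq.gz"),
    "reference": (".fa", ".fasta"),
    "known_sites": (".vcf", ".vcf.gz"),
    "annotation": (".gtf",),
}


def has_valid_extension(kind: str, path: str) -> bool:
    """True if the file name ends with one of the extensions allowed for `kind`."""
    return Path(path).name.endswith(ALLOWED_EXTENSIONS[kind])


def invalid_inputs(inputs: dict) -> list:
    """Return the input kinds whose path has an unexpected extension."""
    return [kind for kind, path in inputs.items()
            if not has_valid_extension(kind, path)]


example_inputs = {
    "reads": "/path/to/input/sample_R1.fastq.gz",
    "reference": "/path/to/reference/genome.fa",
    "known_sites": "/path/to/known_sites/dbsnp.vcf.gz",
    "annotation": "/path/to/annotation/genes.gtf",
}
problems = invalid_inputs(example_inputs)
```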

Optional Inputs

Sample Groups TSV

  • Description: Tab-separated file defining sample groupings for comparison analysis
  • Required Columns: Sample_ID, Group (control or treatment)
  • Format: Tab-separated values (.tsv)
  • Example:
    Sample_ID    Group
    Normal_01    control
    Tumor_01     treatment
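A file like the example above could be read and validated as sketched here. The loader function is hypothetical; the column names and the two allowed group values are the ones shown in the example, and the file is assumed to be tab-delimited with a header row.

```python
import csv
import io

ALLOWED_GROUPS = ("control", "treatment")


def load_sample_groups(handle) -> dict:
    """Map each Sample_ID to its Group, rejecting missing columns or unknown groups."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"Sample_ID", "Group"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    groups = {}
    for row in reader:
        group = row["Group"].strip()
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"unknown group {group!r} for {row['Sample_ID']}")
        groups[row["Sample_ID"].strip()] = group
    return groups


# Same content as the example above, with explicit tab separators.
example_tsv = "Sample_ID\tGroup\nNormal_01\tcontrol\nTumor_01\ttreatment\n"
sample_groups = load_sample_groups(io.StringIO(example_tsv))
```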
    

Comparison Design TSV

  • Description: File specifying which comparisons to perform between sample groups
  • Format: Tab-separated values (.tsv)
  • Example File Path: /path/to/comparisons.tsv

Outputs

Reported Outputs

Somatic Variants VCF

  • Description: Called somatic variants in Variant Call Format with quality scores and filters
  • Format: .vcf.gz
  • Example File Path: /output/variants/somatic_variants.vcf.gz
  • Visualization App: IGV, UCSC Genome Browser
  • Location: variants/

Annotated Variants VCF

  • Description: Somatic variants annotated with functional effects using Ensembl VEP
  • Format: .vcf
  • Example File Path: /output/annotated/annotated_variants.vcf
  • Visualization App: VEP Web Interface, IGV
  • Location: annotated/
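For a quick look at a reported VCF, the sketch below tallies records by the value of the FILTER column (the seventh tab-separated column in the VCF format). The helper names and example records are illustrative, and the filter names the pipeline actually emits may differ.

```python
import gzip
from typing import Dict, Iterable


def open_vcf(path: str):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_by_filter(lines: Iterable[str]) -> Dict[str, int]:
    """Count variant records per FILTER value, skipping header and blank lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        filter_value = fields[6]  # columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        counts[filter_value] = counts.get(filter_value, 0) + 1
    return counts


# Made-up records used only to exercise the function.
example_vcf_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr2\t200\t.\tG\tC\t.\tweak_evidence\t.\n",
    "chr3\t300\t.\tC\tG\t.\tPASS\t.\n",
]
filter_counts = count_by_filter(example_vcf_lines)
```

With a real file, the same function can be fed the handle returned by `open_vcf`, for example `count_by_filter(open_vcf("/output/variants/somatic_variants.vcf.gz"))`.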

Supporting Outputs

Aligned BAM Files

  • Description: Quality-processed and recalibrated BAM files for each sample
  • Format: .bam with .bai index
  • Example File Path: /output/alignments/sample_recal.bam

Duplicate Metrics

  • Description: Detailed statistics on duplicate reads identified and marked
  • Format: .txt
  • Example File Path: /output/qc/sample_dedup_metrics.txt

Base Recalibration Tables

  • Description: GATK BaseRecalibrator output tables used for quality score recalibration
  • Format: .txt
  • Example File Path: /output/recal/sample_recal_data.txt
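The duplicate metrics file can be inspected programmatically. The sketch below assumes the usual MarkDuplicates text layout, in which a "## METRICS CLASS" line is followed by a tab-separated header row and one row per library. The parser name is hypothetical, and the example content is trimmed to a few columns with made-up values.

```python
from typing import Dict, Iterable, List


def parse_duplication_metrics(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per library from the metrics table of a MarkDuplicates report."""
    rows: List[Dict[str, str]] = []
    it = iter(lines)
    for line in it:
        if line.startswith("## METRICS CLASS"):
            # The line right after the marker holds the column names.
            header = next(it, "").rstrip("\n").split("\t")
            for row in it:
                if not row.strip():  # a blank line ends the metrics table
                    break
                rows.append(dict(zip(header, row.rstrip("\n").split("\t"))))
            break
    return rows


# Trimmed, made-up example; real files carry more columns and a histogram section.
example_metrics = [
    "## htsjdk.samtools.metrics.StringHeader\n",
    "# MarkDuplicates\n",
    "\n",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n",
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION\n",
    "Tumor_01\t1000\t100\t0.1\n",
    "\n",
    "## HISTOGRAM\tjava.lang.Double\n",
]
metrics = parse_duplication_metrics(example_metrics)
```

Values are kept as strings; convert fields such as PERCENT_DUPLICATION with `float()` before comparing them against a threshold.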

Associated Processes

References & Additional Documentation

  • Related Papers:
    • Van der Auwera GA, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics. 2013;43:11.10.1-33.
    • Benjamin D, et al. Calling Somatic SNVs and Indels with Mutect2. bioRxiv. 2019.
  • GATK Documentation: https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows
  • BWA Documentation: http://bio-bwa.sourceforge.net/bwa.shtml
  • Ensembl VEP Documentation: https://useast.ensembl.org/info/docs/tools/vep/index.html