Somatic Variant Calling Pipeline (GATK) Pipeline Specification

Pipeline Details

  • Name: Somatic Variant Calling Pipeline (GATK)
  • Pipeline UUID: f931wyvxmf3xk5k6h5cg1x54nt0j4w
  • Version: 1.1.0

Overview

The Somatic Variant Calling Pipeline (GATK) identifies somatic short variants (SNVs and indels) in matched tumor and normal samples. It automates the workflow from raw FASTQ files through variant calling and annotation, following GATK best practices for somatic variant discovery so that results are reproducible.

Key Use Cases:

  • Cancer Genomics Research: Identification of somatic mutations in tumor samples compared to matched normal controls for cancer research and clinical applications.
  • Precision Medicine: Discovery of actionable somatic variants that may inform treatment decisions in oncology.
  • Comparative Genomics: Analysis of genomic differences between paired samples to understand disease mechanisms and progression.

Features

  • GATK Best Practices Implementation: Follows established GATK workflows for somatic variant calling including base quality score recalibration and duplicate marking.
  • BWA-MEM Alignment: High-performance read alignment with proper read group assignment for downstream GATK compatibility.
  • Comprehensive Quality Control: Implements duplicate removal, base recalibration, and systematic error correction for improved variant calling accuracy.
  • Variant Effect Prediction: Integrated VEP (Variant Effect Predictor) annotation for functional impact assessment of identified variants.
  • Flexible Sample Organization: Supports complex experimental designs through groups and comparison files for matched tumor-normal analysis.
  • Mutect2 Integration: Utilizes GATK's Mutect2 caller specifically designed for somatic variant detection with local assembly of haplotypes.
  • Containerized Execution: All processes run in Docker containers ensuring reproducibility and consistent environments.
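
The stages named above can be sketched as a sequence of command lines. The following is a minimal illustration, not the pipeline's actual implementation: the helper names, the intermediate file names (`.sorted.bam`, `.dedup.bam`), and the `PL:ILLUMINA` platform tag are assumptions, and steps the specification does not describe (such as coordinate sorting between alignment and duplicate marking) are omitted. The functions only build argument lists and execute nothing.

```python
from typing import Dict, List


def build_stage_commands(sample: str, ref: str, r1: str, r2: str,
                         known_sites: List[str]) -> Dict[str, List[str]]:
    """Return one argv list per per-sample stage (nothing is executed)."""
    # Read group header passed to BWA-MEM so GATK tools can identify the sample.
    # bwa expects the literal two-character sequence "\t" between fields.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"

    # One --known-sites argument per known-variants VCF.
    bqsr = ["gatk", "BaseRecalibrator", "-I", f"{sample}.dedup.bam",
            "-R", ref, "-O", f"{sample}_recal_data.txt"]
    for vcf in known_sites:
        bqsr += ["--known-sites", vcf]

    return {
        "align": ["bwa", "mem", "-R", read_group, ref, r1, r2],
        "mark_duplicates": ["gatk", "MarkDuplicates",
                            "-I", f"{sample}.sorted.bam",
                            "-O", f"{sample}.dedup.bam",
                            "-M", f"{sample}_dedup_metrics.txt"],
        "base_recalibrator": bqsr,
        "apply_bqsr": ["gatk", "ApplyBQSR", "-I", f"{sample}.dedup.bam",
                       "-R", ref,
                       "--bqsr-recal-file", f"{sample}_recal_data.txt",
                       "-O", f"{sample}_recal.bam"],
    }


def build_mutect2_command(tumor: str, normal: str, ref: str) -> List[str]:
    """Return the argv list for a matched tumor-normal Mutect2 call."""
    # -normal names the sample (the SM read-group value) to treat as the normal.
    return ["gatk", "Mutect2", "-R", ref,
            "-I", f"{tumor}_recal.bam", "-I", f"{normal}_recal.bam",
            "-normal", normal, "-O", f"{tumor}_vs_{normal}.vcf.gz"]
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect and lets a caller pass each list to `subprocess.run` unchanged inside whichever container provides the tool.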

Input/Output Specification

Inputs

Required

FASTQ Files

  • Description: Raw sequencing reads for both tumor and matching normal samples in compressed FASTQ format.
  • Format: .fastq.gz
  • Example File Path: /path/to/input/sample_R1.fastq.gz

Groups File

  • Description: Tab- or comma-separated file that assigns each sample to a group.
  • Format: .tsv or .csv
  • Required Columns: sample_name, group
  • Constraints: Sample names cannot contain spaces and must match FASTQ file prefixes
  • Example File Path: /path/to/metadata/groups.tsv
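
The constraints on the groups file can be checked before a run. The sketch below is one possible validator, not part of the pipeline: the function name `parse_groups` is hypothetical, the delimiter is guessed from the header line, and "must match FASTQ file prefixes" is interpreted as "at least one FASTQ file name starts with the sample name", which is an assumption.

```python
import csv
import io
from typing import Dict, List


def parse_groups(text: str, fastq_names: List[str]) -> Dict[str, str]:
    """Parse groups-file text and return {sample_name: group}."""
    # Guess the delimiter from the header: tab if present, otherwise comma.
    header = text.splitlines()[0] if text.strip() else ""
    delimiter = "\t" if "\t" in header else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    missing = {"sample_name", "group"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    groups: Dict[str, str] = {}
    for row in reader:
        name = row["sample_name"] or ""
        if not name or " " in name:
            raise ValueError(f"invalid sample name: {name!r}")
        # Assumed meaning of the prefix rule: some FASTQ name starts with it.
        if not any(f.startswith(name) for f in fastq_names):
            raise ValueError(f"no FASTQ file starts with {name!r}")
        groups[name] = row["group"]
    return groups
```

For example, a two-line file mapping `tumor1` to `tumor` and `normal1` to `normal` passes when files such as `tumor1_R1.fastq.gz` and `normal1_R1.fastq.gz` are present.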

Comparison File

  • Description: Specifies which groups to compare in somatic variant analysis.
  • Format: .tsv or .csv
  • Required Columns: controls, treats, names
  • Constraints: Values in treats/controls must exist in groups file; names column cannot contain forbidden filename characters
  • Example File Path: /path/to/metadata/comparisons.tsv
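
A similar pre-flight check can be sketched for the comparison file. This is an illustration only: `parse_comparisons` is a hypothetical name, and because the specification does not list the forbidden filename characters, the set used here is an assumption.

```python
import csv
import io
from typing import Dict, List, Set

# Assumed set of forbidden filename characters; the specification does not list them.
FORBIDDEN = set('/\\:*?"<>|')


def parse_comparisons(text: str, known_groups: Set[str]) -> List[Dict[str, str]]:
    """Parse comparison-file text and validate it against the known groups."""
    # Guess the delimiter from the header: tab if present, otherwise comma.
    header = text.splitlines()[0] if text.strip() else ""
    delimiter = "\t" if "\t" in header else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    required = ("controls", "treats", "names")
    missing = set(required) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    rows: List[Dict[str, str]] = []
    for row in reader:
        # Every control and treatment value must be a group from the groups file.
        for column in ("controls", "treats"):
            if row[column] not in known_groups:
                raise ValueError(f"{column} value {row[column]!r} is not a known group")
        bad = FORBIDDEN & set(row["names"] or "")
        if bad:
            raise ValueError(f"forbidden characters in name: {sorted(bad)}")
        rows.append({key: row[key] for key in required})
    return rows
```

The `known_groups` argument would typically be the set of group values read from the groups file, so both files are validated against each other before any alignment starts.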

Known Variants Files

  • Description: Reference databases of known SNPs and indels in VCF format, each with an accompanying index, used for base quality score recalibration.
  • Format: .vcf.gz with .tbi indexes
  • Example File Path: /path/to/references/known_snps.vcf.gz

Optional Inputs

GTF File for VEP

  • Description: Gene annotation file for Variant Effect Predictor functional annotation.
  • Format: .gtf
  • Example File Path: /path/to/annotations/genes.gtf

Reference Genome

  • Description: Reference genome sequence in FASTA format for alignment and variant calling.
  • Format: .fa or .fasta
  • Example File Path: /path/to/reference/genome.fa

Outputs

Reported Outputs

Annotated VCF File

  • Description: Somatic variants with functional annotations from VEP, including gene symbols and biotype information.
  • Format: .vcf
  • Example File Path: /output/variants/sample_annotated.vcf
  • Visualization App: IGV, UCSC Genome Browser
  • Location: Variants folder
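
To read the gene symbol and biotype annotations programmatically, a small parser can be sketched as below. It assumes VEP's default VCF output, where annotations sit in the `CSQ` INFO key and the field order is declared after `Format: ` in the corresponding header line. Whether this pipeline uses that key and which fields it emits are assumptions, and `iter_csq` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Iterator, List


def iter_csq(vcf_lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per VEP annotation, keyed by the CSQ header fields."""
    fields: List[str] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            # Field names follow "Format: " and are separated by "|".
            fields = line.split("Format: ", 1)[1].rstrip('">').split("|")
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            # INFO is the eighth column; entries are separated by ";".
            for entry in cols[7].split(";"):
                if entry.startswith("CSQ="):
                    # One comma-separated block per annotated transcript.
                    for annotation in entry[4:].split(","):
                        record = dict(zip(fields, annotation.split("|")))
                        record["CHROM"], record["POS"] = cols[0], cols[1]
                        yield record
```

Called as `iter_csq(open("sample_annotated.vcf"))`, it yields dictionaries from which keys such as `SYMBOL` and `BIOTYPE` can be read when those fields are present in the header.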

Supporting Outputs

Recalibrated BAM Files

  • Description: Quality-recalibrated alignment files after BQSR processing.
  • Format: .bam with .bai index
  • Example File Path: /output/alignments/sample_recal.bam

Duplicate Metrics

  • Description: Statistics on duplicate reads identified and removed during processing.
  • Format: .txt
  • Example File Path: /output/qc/sample_dedup_metrics.txt

Base Recalibration Tables

  • Description: Quality score recalibration data tables generated during BQSR.
  • Format: .txt
  • Example File Path: /output/recalibration/sample_recal_data.txt

Associated Processes

References & Additional Documentation