Germline Variant Calling Pipeline (GATK) Pipeline Specification
Pipeline Details
- Name: Germline Variant Calling Pipeline (GATK)
- Pipeline UUID: f931gy4f2onyfr3bf6bkhuydydt1ej
- Version: 2.1.2
Overview
The Germline Variant Calling Pipeline (GATK) is designed for calling variants in clonal samples, i.e. samples derived from a single individual. It uses HaplotypeCaller to call germline SNPs and indels via local re-assembly of haplotypes, and it applies Base Quality Score Recalibration (BQSR) to reduce the effect of systematic technical error on base quality scores, which improves variant detection accuracy.
Key Use cases:
- Germline Variant Detection: Identification of SNPs and indels in clonal samples with expected variant frequencies of 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids).
- Base Quality Score Recalibration: Systematic correction of base quality scores to improve variant calling accuracy.
- Variant Annotation and Effect Prediction: Functional annotation of identified variants using SnpEff to predict biological effects.
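To make the expected frequencies in the first use case concrete, the sketch below computes the alternate allele fraction at a site from read depths and labels it by the nearest expected germline value (0.5 or 1). The function names and the `tolerance` cutoff are hypothetical, chosen for illustration only; they are not part of the pipeline.

```python
def alt_allele_fraction(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads at one site that support the alternate allele."""
    total = ref_depth + alt_depth
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_depth / total


def closest_expected_state(fraction: float, tolerance: float = 0.15) -> str:
    """Label a site by the nearest expected germline frequency.

    `tolerance` is a hypothetical cutoff, not a pipeline parameter.
    """
    if abs(fraction - 1.0) <= tolerance:
        return "hom_alt_or_haploid"   # expected frequency 1
    if abs(fraction - 0.5) <= tolerance:
        return "het"                  # expected frequency 0.5
    return "unexpected"               # e.g. subclonal signal or artifact


if __name__ == "__main__":
    frac = alt_allele_fraction(ref_depth=14, alt_depth=16)
    print(frac, closest_expected_state(frac))
```

Sites labeled `unexpected` would be worth a closer look in a clonal sample, since they do not match either expected frequency.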
Features
- GATK4 Best Practices Implementation: Follows established GATK4 germline variant calling workflow with HaplotypeCaller.
- BWA MEM Alignment: Read alignment with BWA MEM, including the read group assignment that GATK tools require.
- Duplicate Marking: Automated identification and marking of PCR and optical duplicates using GATK MarkDuplicates.
- Base Quality Score Recalibration (BQSR): Two-pass BQSR implementation with recalibration report generation.
- Comprehensive Variant Filtering: Hard filtering of SNPs and indels using GATK recommended parameters (QD, FS, MQ, SOR, MQRankSum, ReadPosRankSum).
- Variant Annotation: Integration with SnpEff for functional annotation and effect prediction.
- Quality Control Metrics: Collection of alignment metrics, insert size metrics, and coverage depth analysis.
- Variant Comparison: Optional multi-sample VCF comparison and intersection analysis.
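The hard-filtering feature above can be sketched as a small helper that assembles a GATK `VariantFiltration` command for SNPs. The thresholds are the generic starting points suggested in GATK's hard-filtering documentation and are an assumption here; this specification does not state the values the pipeline actually uses. The filter names and file paths are likewise hypothetical.

```python
# Assumed SNP thresholds, taken from GATK's generic hard-filtering guidance.
# The pipeline's real values are not given in this specification.
SNP_FILTERS = {
    "QD_filter": "QD < 2.0",
    "FS_filter": "FS > 60.0",
    "MQ_filter": "MQ < 40.0",
    "SOR_filter": "SOR > 3.0",
    "MQRankSum_filter": "MQRankSum < -12.5",
    "ReadPosRankSum_filter": "ReadPosRankSum < -8.0",
}


def variant_filtration_cmd(reference, in_vcf, out_vcf, filters):
    """Build a `gatk VariantFiltration` argument list with one named filter per expression."""
    cmd = ["gatk", "VariantFiltration", "-R", reference, "-V", in_vcf, "-O", out_vcf]
    for name, expression in filters.items():
        cmd += ["--filter-name", name, "--filter-expression", expression]
    return cmd


if __name__ == "__main__":
    # Paths are placeholders; pass the list to subprocess.run(..., check=True) to execute.
    print(" ".join(variant_filtration_cmd(
        "genome.fa", "sample_raw_snps.vcf", "sample_filtered_snps.vcf", SNP_FILTERS)))
```

Using one named filter per annotation, rather than a single combined expression, records in the FILTER column which annotation caused a variant to fail.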
Input/Output Specification
Inputs
Required
Sequencing Reads
- Description: FASTQ files containing raw sequencing reads from clonal samples
- Format: .fastq or .fastq.gz
- Example File Path: /path/to/input/sample.fastq.gz
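Because GATK requires read groups, the alignment step has to attach one to the reads from these FASTQ files. A minimal sketch of how a `bwa mem` command with an `@RG` header might be assembled follows. Reusing the sample ID for the `ID`, `SM`, and `LB` tags, and defaulting `PL` to `ILLUMINA`, are assumptions for illustration; the pipeline's real read group scheme is not given in this specification.

```python
def read_group_header(sample_id: str, platform: str = "ILLUMINA") -> str:
    """Return an @RG header line for `bwa mem -R`.

    Fields are joined with a literal backslash followed by "t", which
    bwa converts to real tab characters when it writes the SAM header.
    """
    fields = ["@RG", f"ID:{sample_id}", f"SM:{sample_id}",
              f"LB:{sample_id}", f"PL:{platform}"]
    return "\\t".join(fields)


def bwa_mem_cmd(index_prefix, fastqs, sample_id, threads=4):
    """Build a `bwa mem` argument list; SAM output goes to stdout."""
    return ["bwa", "mem", "-M", "-t", str(threads),
            "-R", read_group_header(sample_id), index_prefix, *fastqs]


if __name__ == "__main__":
    # Placeholder paths; redirect stdout to a SAM file when executing.
    print(bwa_mem_cmd("genome.fa", ["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "sample"))
```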
Reference Genome
- Description: Reference genome in FASTA format for alignment and variant calling
- Format: .fa or .fasta
- Example File Path: /path/to/reference/genome.fa
Optional Inputs
BWA Index
- Description: Pre-built BWA index for the reference genome (will be created if not provided)
- Format: BWA index directory
- Example File Path: /path/to/bwa/index/
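The "created if not provided" behavior could be approximated as below: check whether the five files that `bwa index` normally writes exist next to an index prefix, and build the indexing command only when some are missing. The helper names are hypothetical, and the pipeline's own check may differ.

```python
from pathlib import Path
from typing import List, Optional

# File suffixes written by `bwa index`.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(index_prefix: str) -> List[str]:
    """Return the expected index files that are absent for this prefix."""
    return [index_prefix + suffix
            for suffix in BWA_INDEX_SUFFIXES
            if not Path(index_prefix + suffix).is_file()]


def bwa_index_cmd_if_needed(reference: str) -> Optional[List[str]]:
    """Return a `bwa index` command if the index is incomplete, else None."""
    if missing_bwa_index_files(reference):
        return ["bwa", "index", reference]
    return None


if __name__ == "__main__":
    print(bwa_index_cmd_if_needed("/path/to/reference/genome.fa"))
```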
Known Variants Database
- Description: Database identifier for SnpEff annotation (e.g., GRCh38.p7.RefSeq for human, GRCm38.75 for mouse)
- Format: SnpEff database identifier
- Example: GRCh38.p7.RefSeq
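As an illustration of how the database identifier is used, the sketch below builds a SnpEff command that annotates a VCF and writes the HTML summary. SnpEff prints the annotated VCF to standard output, so the caller redirects it to a file. The jar location and file names are placeholders, not values taken from the pipeline.

```python
import subprocess


def snpeff_cmd(database, in_vcf, stats_html, jar="snpEff.jar"):
    """Build a SnpEff argument list; the annotated VCF is written to stdout."""
    return ["java", "-jar", jar, "-v", "-stats", stats_html, database, in_vcf]


def run_snpeff(database, in_vcf, out_vcf, stats_html, jar="snpEff.jar"):
    """Run SnpEff and capture its stdout in `out_vcf` (requires Java and SnpEff)."""
    with open(out_vcf, "w") as handle:
        subprocess.run(snpeff_cmd(database, in_vcf, stats_html, jar),
                       stdout=handle, check=True)


if __name__ == "__main__":
    print(" ".join(snpeff_cmd("GRCh38.p7.RefSeq",
                              "sample_filtered_snps.vcf",
                              "sample_snpEff_summary.html")))
```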
Outputs
Reported Outputs
- Annotated VCF File:
- Description: Final filtered and annotated VCF file containing SNPs with functional annotations
- Format: .vcf
- Example File Path: /output/directory/sample_filtered_snps.ann.vcf
- Visualization App: IGV, UCSC Genome Browser
- Location: Results folder
- SnpEff Summary Report:
- Description: HTML summary report of variant annotations and effects
- Format: .html
- Example File Path: /output/directory/sample_snpEff_summary.html
- Visualization App: Web browser
- Location: Results folder
- Recalibrated BAM File:
- Description: Base quality score recalibrated alignment file
- Format: .bam
- Example File Path: /output/directory/sample_recal.bam
- Visualization App: IGV, SAMtools
- Location: Results folder
Supporting Outputs
- Alignment Metrics:
- Description: Comprehensive alignment statistics and quality metrics
- Format: .txt
- Example File Path: /intermediate/directory/sample_alignment_metrics.txt
- Insert Size Metrics:
- Description: Insert size distribution statistics and histogram
- Format: .txt, .pdf
- Example File Path: /intermediate/directory/sample_insert_metrics.txt
- BQSR Recalibration Report:
- Description: Before and after base quality score recalibration plots
- Format: .pdf
- Example File Path: /intermediate/directory/sample_recalibration_plots.pdf
- Filtered VCF Files:
- Description: Intermediate filtered SNP and indel VCF files
- Format: .vcf
- Example File Path: /intermediate/directory/sample_filtered_snps_round1.vcf
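The before-and-after recalibration plots listed among the supporting outputs imply a two-pass BQSR sequence. One plausible ordering of the GATK commands is sketched here: `BaseRecalibrator` on the input BAM, `ApplyBQSR`, `BaseRecalibrator` again on the recalibrated BAM, then `AnalyzeCovariates` to compare the two tables. The table file names and the source of the known-sites VCFs are assumptions; the specification does not say which variant set the pipeline uses for recalibration.

```python
def bqsr_two_pass_cmds(reference, bam, known_sites, prefix):
    """Return the four GATK commands of a two-pass BQSR, in execution order."""
    table_before = f"{prefix}_recal_data.table"       # hypothetical file name
    table_after = f"{prefix}_post_recal_data.table"   # hypothetical file name
    recal_bam = f"{prefix}_recal.bam"
    plots = f"{prefix}_recalibration_plots.pdf"

    def base_recalibrator(in_bam, out_table):
        cmd = ["gatk", "BaseRecalibrator", "-R", reference, "-I", in_bam, "-O", out_table]
        for vcf in known_sites:
            cmd += ["--known-sites", vcf]
        return cmd

    return [
        # Pass 1: model systematic errors in the original base qualities.
        base_recalibrator(bam, table_before),
        # Apply the model to produce the recalibrated BAM.
        ["gatk", "ApplyBQSR", "-R", reference, "-I", bam,
         "--bqsr-recal-file", table_before, "-O", recal_bam],
        # Pass 2: re-model on the recalibrated BAM for comparison.
        base_recalibrator(recal_bam, table_after),
        # Plot the before/after covariate tables.
        ["gatk", "AnalyzeCovariates", "-before", table_before,
         "-after", table_after, "-plots", plots],
    ]


if __name__ == "__main__":
    for cmd in bqsr_two_pass_cmds("genome.fa", "sample_dedup.bam",
                                  ["sample_known_sites.vcf"], "sample"):
        print(" ".join(cmd))
```

The second `BaseRecalibrator` pass does not change the BAM. It exists only so that `AnalyzeCovariates` has an "after" table to plot against the "before" table.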
Associated Processes
- AnalyzeCovariates
- applyBSQRS
- BaseRecalibrator
- build gatk4 genome dictionary
- bwa align
- Check Build BWA
- check BWA files
- getMetrics
- HaplotypeCaller
- markDuplicates
- selectVariants
- SnpEff
- VariantFiltration
- vcfcomparison
References & Additional Documentation
- Related Papers/links: GATK4 Variant Calling Pipeline - NYU Gencore
- GATK Documentation: Hard-filtering germline short variants
- Pipeline Repository: Based on GATK4 best practices workflow