
GEO Custom Download Module Pipeline Specification

Pipeline Details

  • Name: GEO Custom Download Module
  • Pipeline UUID: c663lcw25iw6suraj9n2eu25zw7gqf
  • Version: 1.1.0
  • View Pipeline:

Overview

The GEO Custom Download Module pipeline downloads GEO files, optionally including technical reads, from the Sequence Read Archive (SRA). It parses GEO collection data and downloads the corresponding FASTQ files, with a configurable option to include or exclude technical reads depending on the needs of the analysis.

Key use cases:

  • Standard GEO Data Download: Downloading biological reads from GEO/SRA datasets for standard genomic analysis workflows.
  • Technical Read Recovery: Retrieving both technical and biological reads when biological reads have been mistakenly classified as technical, so that the default filter does not discard them.
  • Barcode Information Extraction: Downloading technical reads containing essential barcode information required for downstream processing.

Features

  • Flexible Technical Read Handling: Supports both standard mode (biological reads only) and comprehensive mode (all technical and biological reads).
  • Collection-Based Processing: Integrates with GEO/NCBI tab for creating collections and batch processing multiple samples.
  • Configurable Split Modes: Supports various fastq-dump split modes including --split-files for paired-end data handling.
  • Robust Download Management: Retries failed downloads, making up to 3 attempts, so that a transient failure does not leave a sample missing or incomplete.
  • Automated File Organization: Handles file naming conventions and organizes outputs with proper FASTQ file extensions.
  • SRA Cache Management: Automatically manages SRA cache files and temporary directories for efficient storage utilization.
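
The retry behavior described under Robust Download Management can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function name run_with_retry, the delay_seconds parameter, and the injectable runner argument are assumptions introduced here; only the limit of 3 attempts comes from the specification.

```python
import subprocess
import time

MAX_ATTEMPTS = 3  # the specification states "up to 3 attempts"


def run_with_retry(cmd, max_attempts=MAX_ATTEMPTS, delay_seconds=0.0,
                   runner=subprocess.run):
    """Run cmd until it exits with status 0 or the attempts run out.

    Returns the number of attempts used. `runner` is a hypothetical hook
    (defaulting to subprocess.run) so the logic can be exercised without
    actually invoking a download tool.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Pause before the next attempt; the real pipeline's wait
            # policy (if any) is not documented, so this is an assumption.
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")
```

Raising an error after the final attempt, rather than returning silently, keeps a failed download from being mistaken for an empty but valid result.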

Input/Output Specification

Inputs

Required

Collection Data

  • Description: GEO collection information created using the GEO/NCBI tab containing sample metadata and file references.
  • Format: Collection metadata with file paths and identifiers
  • Parameters: Includes file_name, collection_type, s3_archive_dir, and archive_dir specifications

Optional Inputs

Skip Technical Reads

  • Description: Configuration parameter to control whether technical reads should be excluded from download.
  • Options: "Yes" (default - skip technical reads) or "No" (include all reads)
  • Default: "Yes" (technical reads are skipped, which suits standard biological analysis)

Split Mode

  • Description: Fastq-dump parameter controlling how reads are split and organized.
  • Format: Command-line parameter (e.g., --split-files)
  • Usage: Essential for proper handling of paired-end sequencing data
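
As a rough illustration of how the two optional inputs might translate into a fastq-dump invocation, the sketch below builds an argument list. The function name, the use of --gzip, and the placeholder accession SRR000001 are assumptions made for this example; how the pipeline actually assembles its command is not documented here. The flags --skip-technical, --split-files, and --gzip are standard fastq-dump options.

```python
def build_fastq_dump_command(accession, skip_technical="Yes",
                             split_mode="--split-files"):
    """Return a fastq-dump argument list for one SRA accession.

    skip_technical mirrors the "Skip Technical Reads" input ("Yes"/"No");
    split_mode mirrors the "Split Mode" input and may be empty.
    """
    if skip_technical not in ("Yes", "No"):
        raise ValueError('skip_technical must be "Yes" or "No"')

    cmd = ["fastq-dump", "--gzip"]  # --gzip: assumed, to get .fastq.gz output
    if skip_technical == "Yes":
        cmd.append("--skip-technical")  # keep biological reads only
    if split_mode:
        cmd.append(split_mode)  # e.g. --split-files for paired-end data
    cmd.append(accession)
    return cmd


# Hypothetical accession, used purely as a placeholder.
print(" ".join(build_fastq_dump_command("SRR000001", skip_technical="No")))
```

With "No", the --skip-technical flag is simply omitted, so any technical reads (for example, barcode reads) are written out alongside the biological ones.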

Outputs

Reported Outputs

  • FASTQ Files:
    • Description: Downloaded sequencing reads in compressed FASTQ format
    • Format: .fastq.gz
    • Naming Convention:
      • Single-end: {file_name}.fastq.gz
      • Paired-end: {file_name}_R1.fastq.gz, {file_name}_R2.fastq.gz
    • Location: reads/ directory
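
The naming convention above can be expressed as a small helper. This is only a sketch: the function name and the boolean paired_end argument are hypothetical, and the sample name is a placeholder.

```python
def expected_fastq_names(file_name, paired_end):
    """List the output paths the naming convention implies for one sample."""
    if paired_end:
        # Paired-end data yields one file per mate.
        return [f"reads/{file_name}_R1.fastq.gz",
                f"reads/{file_name}_R2.fastq.gz"]
    # Single-end data yields a single file.
    return [f"reads/{file_name}.fastq.gz"]


# "sample_1" is a made-up sample name.
print(expected_fastq_names("sample_1", paired_end=True))
```

A helper like this can be handy for checking that a finished run produced every file a downstream step expects.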

Supporting Outputs

  • CSV Metadata:
    • Description: Parsed collection information with sample names, GEO IDs, remote directories, and collection types
    • Format: .csv
    • Columns: Name, GeoId, RemoteDir, CollectionType
    • Usage: Provides mapping between original GEO identifiers and processed file names
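
For orientation, a CSV with the four listed columns could be produced along these lines. The function name and every value in the example record are invented for illustration; how the pipeline derives each column from the collection inputs is not specified here, so the sketch takes records that are already keyed by column name.

```python
import csv
import io

COLUMNS = ["Name", "GeoId", "RemoteDir", "CollectionType"]


def collection_to_csv(records):
    """Serialize records (dicts keyed by COLUMNS) to CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # raises ValueError on unexpected keys
    return buffer.getvalue()


# Entirely hypothetical record: none of these values come from a real run.
example = [{
    "Name": "sample_1",
    "GeoId": "GSM0000001",
    "RemoteDir": "s3://example-bucket/archive",
    "CollectionType": "pair",
}]
print(collection_to_csv(example))
```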

Associated Processes

References & Additional Documentation

  • SRA Toolkit Documentation: NCBI SRA Toolkit
  • GEO Database: Gene Expression Omnibus
  • Technical vs Biological Reads: Understanding the distinction between technical sequencing artifacts and biological sample reads
  • Pipeline Repository: Available through ViaFoundry platform