Load Data and QC h5 Specifications

Process Details

  • Name: Load Data and QC h5
  • Process UUID: a7gfRsaX6kDJBNZyRJ55O49RLJjxqF
  • Process Group: SingleCell

Overview

This process reads 10x Genomics h5 files (HDF5 Feature-Barcode Matrix format), creates a Seurat object, and runs quality control (QC) analysis for single-cell RNA sequencing data. It outputs an RDS file for downstream analysis, an HTML QC report, and a TSV metadata file. The process handles only h5 files from 10x Genomics, not h5ad files from scanpy or h5seurat files from Seurat.

This process is implemented in Bash, which invokes a Python script for data loading, quality control, and Seurat object creation.

Key Functionality

  • Data Loading: Reads 10x Genomics h5 format files and creates Seurat objects
  • Quality Control Filtering: Applies multiple filtering criteria including gene counts, UMI counts, and mitochondrial/ribosomal gene percentages
  • Doublet Detection: Optional removal of doublet cells with configurable detection methods
  • Data Normalization: Multiple normalization methods including LogNormalize, CLR, RC, and SCT
  • Variable Feature Selection: Identifies highly variable genes for downstream analysis
  • Report Generation: Creates comprehensive HTML quality control reports
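
To illustrate the Data Normalization step, here is a minimal sketch of two of the listed methods, LogNormalize and RC, assuming they carry their usual Seurat meaning: each count is divided by the cell's total count and multiplied by a scale factor, and LogNormalize then applies a natural-log `log1p` transform. The function names are hypothetical, and the scale factor of 10,000 is Seurat's default rather than a value confirmed for this process.

```python
from math import log1p

# Seurat's default scale.factor; assumed here, not confirmed for this process.
SCALE_FACTOR = 10_000


def relative_counts(cell_counts, scale_factor=SCALE_FACTOR):
    """RC: divide each count by the cell total, then multiply by the scale factor."""
    total = sum(cell_counts)
    if total == 0:
        # A cell with no counts stays all-zero instead of dividing by zero.
        return [0.0 for _ in cell_counts]
    return [count / total * scale_factor for count in cell_counts]


def log_normalize(cell_counts, scale_factor=SCALE_FACTOR):
    """LogNormalize: natural log of (1 + relative counts)."""
    return [log1p(value) for value in relative_counts(cell_counts, scale_factor)]
```

Both functions take the raw counts of a single cell, for example `log_normalize([0, 2, 8])`. CLR and SCT are not covered by this sketch.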

Input/Output Specification

Inputs

Required Inputs

  • h5 file
    • Description: HDF5 Feature-Barcode Matrix Format file from 10x Genomics containing single-cell gene expression data
    • Format: h5

Optional Inputs

  • inputFileTsv
    • Description: Metadata file containing sample information and cell annotations
    • Format: tsv

Outputs

  • rdsFile
    • Description: Processed Seurat object saved as RDS file for downstream analysis
    • Format: RDS
  • outputFileHTML
    • Description: Quality control report containing filtering statistics, visualizations, and data summaries
    • Format: html
  • outFileTSV
    • Description: Tab-separated file containing processed cell and gene metadata
    • Format: tsv

Parameters & Settings

These parameters can be adjusted in the Foundry UI when running this process.

  • Remove Mitochondrial Genes

    • Description: When set to TRUE, mitochondrial genes are removed from the data entirely before filtering
    • Available options: FALSE (default), TRUE
  • Remove ribosomal RNA Genes

    • Description: When set to TRUE, ribosomal RNA genes are removed from the data entirely before filtering
    • Available options: FALSE (default), TRUE
  • Minimal Genes per Cell

    • Description: Lower quantile of the genes-per-cell distribution. Cells with fewer detected genes than this quantile are removed
    • Default value: 0.01
  • Maximal Genes per Cell

    • Description: Upper quantile of the genes-per-cell distribution. Cells with more detected genes than this quantile are removed
    • Default value: 0.99
  • Minimal UMIs per Cell

    • Description: Lower quantile of the UMIs-per-cell distribution (unique transcript molecules). Cells with fewer UMIs than this quantile are removed
    • Default value: 0.01
  • Maximal UMIs per Cell

    • Description: Upper quantile of the UMIs-per-cell distribution (unique transcript molecules). Cells with more UMIs than this quantile are removed
    • Default value: 0.99
  • Maximal Percent Mitochondrial Reads per Cell

    • Description: Maximum percentage of a cell's reads that come from mitochondrial genes. Cells that exceed this percentage are removed
    • Default value: 25
  • Maximal Percent Ribosomal RNA Reads per Cell

    • Description: Maximum percentage of a cell's reads that come from ribosomal RNA genes. Cells that exceed this percentage are removed
    • Default value: 50
  • Remove Doublets

    • Description: Whether doublets should be removed from the sample. TRUE runs doublet detection and removal; FALSE skips it. DEFAULT lets the pipeline check whether the data came from a cellranger multi pipeline and, if so, skip doublet removal
    • Available options: TRUE (default), FALSE, DEFAULT
  • Doublet Percentage

    • Description: Doublet percentage to use during doublet removal
    • Default value: 0.01
  • Normalization Method

    • Description: Normalization method to apply to the count data
    • Available options: LogNormalize (default), CLR, RC, SCT
  • # of Variable Features

    • Description: Number of top-ranked variable features to keep after features are ranked by residual variance
    • Default value: 3000
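
As a rough illustration of how the quantile and percentage cutoffs above could combine, the following sketch filters cells using the listed default values. It is not the code in build_QC_report.py. The function names, the dictionary keys (`n_genes`, `n_umis`, `percent_mito`, `percent_ribo`), and the linear-interpolation quantile rule are all assumptions made for the example.

```python
from math import floor

# Default values copied from the parameter list above.
DEFAULTS = {
    "min_genes_quantile": 0.01,
    "max_genes_quantile": 0.99,
    "min_umis_quantile": 0.01,
    "max_umis_quantile": 0.99,
    "max_percent_mito": 25.0,
    "max_percent_ribo": 50.0,
}


def quantile(values, q):
    """Quantile with linear interpolation between order statistics.

    The interpolation rule is an assumption; the process may use another.
    """
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def filter_cells(cells, params=DEFAULTS):
    """Return the cells that pass every QC cutoff.

    Each cell is a dict with the hypothetical keys
    n_genes, n_umis, percent_mito, and percent_ribo.
    """
    if not cells:
        return []
    genes = [cell["n_genes"] for cell in cells]
    umis = [cell["n_umis"] for cell in cells]
    # Quantile parameters are converted to absolute cutoffs for this dataset.
    min_genes = quantile(genes, params["min_genes_quantile"])
    max_genes = quantile(genes, params["max_genes_quantile"])
    min_umis = quantile(umis, params["min_umis_quantile"])
    max_umis = quantile(umis, params["max_umis_quantile"])
    return [
        cell for cell in cells
        if min_genes <= cell["n_genes"] <= max_genes
        and min_umis <= cell["n_umis"] <= max_umis
        and cell["percent_mito"] <= params["max_percent_mito"]
        and cell["percent_ribo"] <= params["max_percent_ribo"]
    ]
```

The bounds are inclusive, following the wording of the parameter descriptions: a cell is dropped only when it falls below a minimum or exceeds a maximum. Because the gene and UMI cutoffs are quantiles, the absolute thresholds depend on each dataset's own distribution.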

References & Resources

  • Tool Documentation: Contact the team for details on build_QC_report.py
  • Related Papers: Butler, A., Hoffman, P., Smibert, P., Papalexi, E., & Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5), 411-420.