
Load Data and QC h5 copy Specifications

Process Details

  • Name: Load Data and QC h5 copy
  • Process UUID: c663z69ihaki4sbit5tom9oakqtcks
  • Process Group: SingleCell

Overview

This process reads HDF5 Feature-Barcode Matrix Format files (h5 files from 10x Genomics), creates a Seurat object, performs comprehensive quality control analysis, and outputs a filtered RDS file for downstream single-cell RNA-seq analysis. The process includes empty droplet removal, doublet detection, cell filtering based on gene expression metrics, and data normalization.

This process is implemented in Bash, which invokes Perl and R scripts for data processing, quality control analysis, and report generation.

Key Functionality

  • Data Loading and Empty Droplet Removal: Reads h5 files and applies DropletUtils emptyDrops algorithm to remove empty droplets from raw count matrices
  • Quality Control Analysis: Calculates and visualizes key QC metrics including number of genes, UMIs, and mitochondrial/ribosomal content percentages
  • Doublet Detection: Uses DoubletFinder to identify and classify doublets/multiplets that could bias downstream analysis
  • Cell Filtering: Applies quantile-based thresholds to remove low-quality cells and those with excessive mitochondrial/ribosomal content
  • Data Normalization: Performs data normalization using various methods (LogNormalize, CLR, RC, or SCTransform) to make cells comparable
  • Report Generation: Creates comprehensive HTML reports with visualizations of all QC steps and filtering results
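
The QC metrics and the LogNormalize option listed above can be illustrated with a minimal Python sketch. This is not the process's own code, which runs in R. The toy matrix, the function names, the "MT-" and "RPS"/"RPL" gene-symbol prefixes (human naming conventions), and the scale factor of 10,000 (the default documented for Seurat's LogNormalize) are all assumptions made for illustration.

```python
import math

# Hypothetical toy data: rows are genes, columns are cells (raw UMI counts).
GENES = ["MT-CO1", "RPS3", "ACTB", "GAPDH"]
COUNTS = [
    [5, 0, 40],   # MT-CO1 (mitochondrial, by the assumed "MT-" prefix)
    [10, 2, 10],  # RPS3   (ribosomal, by the assumed "RPS"/"RPL" prefixes)
    [60, 0, 30],  # ACTB
    [25, 8, 20],  # GAPDH
]


def qc_metrics(genes, counts, mito_prefix="MT-", ribo_prefixes=("RPS", "RPL")):
    """Return one dict per cell with n_umi, n_genes, percent_mt, percent_ribo."""
    n_cells = len(counts[0])
    metrics = []
    for j in range(n_cells):
        column = [row[j] for row in counts]
        n_umi = sum(column)                        # total UMIs in the cell
        n_genes = sum(1 for c in column if c > 0)  # genes with at least one UMI
        mito = sum(c for g, c in zip(genes, column) if g.startswith(mito_prefix))
        ribo = sum(c for g, c in zip(genes, column) if g.startswith(ribo_prefixes))
        metrics.append({
            "n_umi": n_umi,
            "n_genes": n_genes,
            "percent_mt": 100.0 * mito / n_umi if n_umi else 0.0,
            "percent_ribo": 100.0 * ribo / n_umi if n_umi else 0.0,
        })
    return metrics


def log_normalize(column, scale_factor=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(column)
    return [math.log1p(c / total * scale_factor) for c in column]


cell_metrics = qc_metrics(GENES, COUNTS)
first_cell_normalized = log_normalize([row[0] for row in COUNTS])
```

Dividing by each cell's total before the log transform is what makes cells with different sequencing depths comparable.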

Input/Output Specification

Inputs

Required Inputs

  • h5_file
    • Description: HDF5 Feature-Barcode Matrix Format file from 10x Genomics single-cell gene expression pipelines
    • Format: h5

Optional Inputs

  • inputFileTsv
    • Description: Metadata file containing sample information and additional attributes
    • Format: tsv

Outputs

  • rdsFile

    • Description: Filtered and normalized Seurat object containing quality-controlled single-cell RNA-seq data
    • Format: RDS
  • outputFileHTML

    • Description: Comprehensive QC report with visualizations of filtering steps, doublet detection, and normalization results
    • Format: html

Parameters & Settings

These parameters can be adjusted in the Foundry UI when running this process.

  • Min Transcripts

    • Description: Lower quantile cutoff for the number of unique transcript molecules (UMIs) per cell; cells below this quantile are removed
    • Default value: 0.01
  • Max Transcripts

    • Description: Upper quantile cutoff for the number of unique transcript molecules (UMIs) per cell; cells above this quantile are removed
    • Default value: 0.99
  • Min Genes

    • Description: Lower quantile cutoff for the number of genes detected per cell; cells below this quantile are removed
    • Default value: 0.01
  • Max Genes

    • Description: Upper quantile cutoff for the number of genes detected per cell; cells above this quantile are removed
    • Default value: 0.99
  • percent_mt

    • Description: Maximum percentage of mitochondrial content per cell; cells with a higher percentage than the entered value are removed
    • Default value: 25
  • percent_ribo

    • Description: Maximum percentage of ribosomal content per cell; cells with a higher percentage than the entered value are removed
    • Default value: 50
  • # of Variable Features

    • Description: Number of features to keep as variable features after ranking by residual variance
    • Default value: 3000
  • Normalization Method

    • Description: Normalization method to apply to the data; SCT refers to SCTransform
    • Available options: LogNormalize (default), CLR, RC, SCT
  • DoubletRemoval

    • Description: Whether doublets should be removed from the sample. TRUE runs doublet detection and removal; FALSE skips it. With DEFAULT, the pipeline tries to detect whether the data comes from a cellranger multi pipeline and, if so, skips doublet removal.
    • Available options: TRUE (default), FALSE, DEFAULT
  • RemoveMitoGenes

    • Description: Whether to remove mitochondrial genes from the data used in downstream analysis
    • Available options: TRUE, FALSE (default)
  • RemoveRiboGenes

    • Description: Whether to remove ribosomal genes from the data used in downstream analysis
    • Available options: TRUE, FALSE (default)
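
As a rough illustration of how the filtering and DoubletRemoval parameters above could interact, the following Python sketch applies the default values to per-cell QC metrics. It is not the process's own code, which runs in R. The function names, the dictionary keys, the linear-interpolation quantile rule (the same rule as R's default `quantile` type), the inclusive bounds, and the `is_cellranger_multi` flag are assumptions.

```python
def quantile(values, q):
    """Linear-interpolation quantile (assumed rule; matches R's default type 7)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def filter_cells(cells, min_transcripts=0.01, max_transcripts=0.99,
                 min_genes=0.01, max_genes=0.99,
                 percent_mt=25, percent_ribo=50):
    """Keep cells inside the quantile bounds and at or below both percentage cutoffs.

    `cells` is a list of dicts with the hypothetical keys
    n_umi, n_genes, percent_mt and percent_ribo.
    """
    umi = [c["n_umi"] for c in cells]
    genes = [c["n_genes"] for c in cells]
    # Quantile parameters are converted into absolute thresholds per sample.
    umi_lo, umi_hi = quantile(umi, min_transcripts), quantile(umi, max_transcripts)
    gene_lo, gene_hi = quantile(genes, min_genes), quantile(genes, max_genes)
    return [
        c for c in cells
        if umi_lo <= c["n_umi"] <= umi_hi          # inclusive bounds are an assumption
        and gene_lo <= c["n_genes"] <= gene_hi
        and c["percent_mt"] <= percent_mt          # higher than the cutoff is removed
        and c["percent_ribo"] <= percent_ribo
    ]


def should_run_doublet_removal(option, is_cellranger_multi):
    """Resolve the DoubletRemoval option (TRUE, FALSE or DEFAULT) to a boolean."""
    if option == "TRUE":
        return True
    if option == "FALSE":
        return False
    if option == "DEFAULT":
        # DEFAULT skips doublet removal for cellranger multi data.
        return not is_cellranger_multi
    raise ValueError(f"unknown DoubletRemoval option: {option!r}")
```

Because the transcript and gene cutoffs are quantiles rather than absolute counts, the same defaults adapt to samples with different sequencing depths, while the mitochondrial and ribosomal cutoffs stay fixed percentages.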

References & Resources

  • Tool Documentation: Contact the team for details on the R markdown script for quality control analysis
  • Related Papers:
    • Lun, A.T., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20, 63 (2019). https://doi.org/10.1186/s13059-019-1662-y
    • McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst 8, 329-337.e4 (2019). https://doi.org/10.1016/j.cels.2019.03.003