Load Data and QC h5 copy Specifications
Process Details
- Name: Load Data and QC h5 copy
- Process UUID: c663z69ihaki4sbit5tom9oakqtcks
- Process Group: SingleCell
Overview
This process reads HDF5 Feature-Barcode Matrix Format files (h5 files from 10x Genomics), creates a Seurat object, performs comprehensive quality control analysis, and outputs a filtered RDS file for downstream single-cell RNA-seq analysis. The process includes empty droplet removal, doublet detection, cell filtering based on gene expression metrics, and data normalization.
This process is implemented in Bash, which invokes Perl and R scripts for data processing, quality control analysis, and report generation.
Key Functionality
- Data Loading and Empty Droplet Removal: Reads h5 files and applies the emptyDrops algorithm from DropletUtils to remove empty droplets from the raw count matrices
- Quality Control Analysis: Calculates and visualizes key QC metrics including number of genes, UMIs, and mitochondrial/ribosomal content percentages
- Doublet Detection: Uses DoubletFinder to identify and classify doublets/multiplets that could bias downstream analysis
- Cell Filtering: Applies quantile-based thresholds to remove low-quality cells and those with excessive mitochondrial/ribosomal content
- Data Normalization: Performs data normalization using various methods (LogNormalize, CLR, RC, or SCTransform) to make cells comparable
- Report Generation: Creates comprehensive HTML reports with visualizations of all QC steps and filtering results
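As a rough illustration of the QC-metric and normalization steps listed above, the following Python sketch computes per-cell gene and UMI counts, mitochondrial and ribosomal percentages, and a LogNormalize-style transform. The process itself runs R scripts, so this is not its code. The `MT-` and `RPS`/`RPL` gene-name prefixes (human gene symbols), the function names, and the scale factor of 10,000 are assumptions made for the example.

```python
import math

# Assumed gene-symbol prefixes (human naming); the real process may differ.
MITO_PREFIX = "MT-"
RIBO_PREFIXES = ("RPS", "RPL")


def qc_metrics(counts):
    """Return QC metrics for one cell given a {gene: UMI count} mapping."""
    n_umis = sum(counts.values())
    n_genes = sum(1 for c in counts.values() if c > 0)
    mito = sum(c for g, c in counts.items() if g.startswith(MITO_PREFIX))
    ribo = sum(c for g, c in counts.items() if g.startswith(RIBO_PREFIXES))

    def pct(part):
        # Guard against cells with no counts at all.
        return 100.0 * part / n_umis if n_umis else 0.0

    return {
        "n_genes": n_genes,
        "n_umis": n_umis,
        "percent_mt": pct(mito),
        "percent_ribo": pct(ribo),
    }


def log_normalize(counts, scale_factor=10_000):
    """LogNormalize-style transform: log1p(count / total * scale_factor).

    The scale factor is an assumed value, not taken from the process.
    """
    total = sum(counts.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    return {g: math.log1p(c / total * scale_factor) for g, c in counts.items()}


# Hypothetical cell with four genes.
cell = {"MT-CO1": 5, "RPS3": 10, "ACTB": 35, "GAPDH": 0}
metrics = qc_metrics(cell)
normalized = log_normalize(cell)
print(metrics)
```

The RC option corresponds to the same scaling without the logarithm; CLR and SCTransform use different models and are not sketched here.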
Input/Output Specification
Inputs
Required Inputs
- h5_file
- Description: HDF5 Feature-Barcode Matrix Format file from 10x Genomics single-cell gene expression pipelines
- Format: h5
Optional Inputs
- inputFileTsv
- Description: Metadata file containing sample information and additional attributes
- Format: tsv
Outputs
- rdsFile
- Description: Filtered and normalized Seurat object containing quality-controlled single-cell RNA-seq data
- Format: RDS
- outputFileHTML
- Description: Comprehensive QC report with visualizations of filtering steps, doublet detection, and normalization results
- Format: html
Parameters & Settings
These parameters can be adjusted in the Foundry UI when running this process.
- Min Transcripts
- Description: Quantile cutoff for the minimum number of unique transcript molecules (UMIs) in a cell
- Default value: 0.01
- Max Transcripts
- Description: Quantile cutoff for the maximum number of unique transcript molecules (UMIs) in a cell
- Default value: 0.99
- Min Genes
- Description: Quantile cutoff for the minimum number of genes detected in a cell
- Default value: 0.01
- Max Genes
- Description: Quantile cutoff for the maximum number of genes detected in a cell
- Default value: 0.99
- percent_mt
- Description: Removes cells whose percentage of mitochondrial content is higher than the entered value
- Default value: 25
- percent_ribo
- Description: Removes cells whose percentage of ribosomal content is higher than the entered value
- Default value: 50
- # of Variable Features
- Description: Number of features to use as variable features after ranking by residual variance
- Default value: 3000
- Normalization Method
- Description: Name of the normalization method to use (SCT denotes SCTransform)
- Available options: LogNormalize (default), CLR, RC, SCT
- DoubletRemoval
- Description: Whether doublets should be removed from the sample. There are three options: TRUE, FALSE, and DEFAULT. TRUE runs doublet detection and removal; FALSE skips it. With DEFAULT, the pipeline tries to detect whether the data come from a cellranger multi pipeline and, if so, does not run doublet removal.
- Available options: TRUE (default), FALSE, DEFAULT
- RemoveMitoGenes
- Description: Whether to remove mitochondrial genes from the data used in downstream analysis. The default is FALSE.
- Available options: TRUE, FALSE (default)
- RemoveRiboGenes
- Description: Whether to remove ribosomal genes from the data used in downstream analysis. The default is FALSE.
- Available options: TRUE, FALSE (default)
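To make the parameter semantics above concrete, here is a minimal Python sketch of how quantile cutoffs and percentage cutoffs could combine into a cell filter, and how the three DoubletRemoval options could resolve to a yes/no decision. It is an illustration under stated assumptions, not the process's R implementation: the function names, the dictionary keys, the linear-interpolation quantile rule, the inclusive bounds, and running doublet removal under DEFAULT when the data are not from cellranger multi are all assumptions.

```python
def quantile(values, q):
    """Quantile by linear interpolation between order statistics (assumed rule)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("quantile() needs at least one value")
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def filter_cells(cells, min_transcripts=0.01, max_transcripts=0.99,
                 min_genes=0.01, max_genes=0.99,
                 percent_mt=25, percent_ribo=50):
    """Keep cells inside the quantile bounds and at or below the percentage cutoffs.

    Each cell is a dict with hypothetical keys:
    n_umis, n_genes, percent_mt, percent_ribo.
    """
    if not cells:
        return []
    umis = [c["n_umis"] for c in cells]
    genes = [c["n_genes"] for c in cells]
    umi_lo, umi_hi = quantile(umis, min_transcripts), quantile(umis, max_transcripts)
    gene_lo, gene_hi = quantile(genes, min_genes), quantile(genes, max_genes)
    return [
        c for c in cells
        if umi_lo <= c["n_umis"] <= umi_hi          # inclusive bounds (assumption)
        and gene_lo <= c["n_genes"] <= gene_hi
        and c["percent_mt"] <= percent_mt           # drop only cells above the cutoff
        and c["percent_ribo"] <= percent_ribo
    ]


def should_run_doublet_removal(option, is_cellranger_multi):
    """Resolve the DoubletRemoval option to a boolean."""
    if option == "TRUE":
        return True
    if option == "FALSE":
        return False
    if option == "DEFAULT":
        # Assumption: removal runs when the data are not from cellranger multi.
        return not is_cellranger_multi
    raise ValueError(f"unknown DoubletRemoval option: {option!r}")


# Hypothetical cells; wider quantiles than the defaults so the toy data shows an effect.
cells = [
    {"n_umis": 100, "n_genes": 50, "percent_mt": 5, "percent_ribo": 10},
    {"n_umis": 200, "n_genes": 100, "percent_mt": 5, "percent_ribo": 10},
    {"n_umis": 300, "n_genes": 150, "percent_mt": 30, "percent_ribo": 10},
    {"n_umis": 400, "n_genes": 200, "percent_mt": 5, "percent_ribo": 10},
    {"n_umis": 500, "n_genes": 250, "percent_mt": 5, "percent_ribo": 10},
]
kept = filter_cells(cells, min_transcripts=0.25, max_transcripts=0.75,
                    min_genes=0.25, max_genes=0.75)
print([c["n_umis"] for c in kept])
```

Because the transcript and gene cutoffs are quantiles rather than absolute counts, the effective thresholds move with each sample's distribution, while percent_mt and percent_ribo are fixed percentages.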
References & Resources
- Tool Documentation: Contact the team for details on the R markdown script for quality control analysis
- Related Papers:
- Lun, A.T., Riesenfeld, S., Andrews, T. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20, 63 (2019). https://doi.org/10.1186/s13059-019-1662-y
- McGinnis, C.S., Murrow, L.M. & Gartner, Z.J. DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst 8, 329-337.e4 (2019). https://doi.org/10.1016/j.cels.2019.03.003