Sample and QC filtering#
Motivation#
Quality control (QC) is an essential step in single-cell RNA-seq projects. Low-quality cells, contaminants, and doublets can greatly impair data interpretability, for example by concealing genuine biological signals. QC in single-cell data is also challenging because its outcome can only be evaluated through downstream analyses, which makes it an iterative and case-specific procedure. The major goals of performing QC and filtering include:
QC Metrics and Filtering Methods#
- Filtering by UMI Counts: This helps eliminate barcodes that do not represent single cells; thresholds vary across the literature.
- Filtering by Number of Features: Similar to UMI counts, this removes likely multiplets, but thresholds can vary.
- Filtering by Percent of Mitochondrial (mt) Reads: Elevated mt RNA suggests unhealthy cells; thresholds vary. Be cautious when high mt gene expression is biologically meaningful.
- Doublet Detection: Detects multiplets using tools like DoubletFinder. Setting thresholds is subjective and data-dependent.
- Identifying Empty Droplets: Distinguishes cell-containing droplets from empty ones, with methods like emptyDrops and other tools available.
- Removing Ambient RNAs: Addresses contamination from ambient RNAs using statistical approaches.
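The three threshold-based filters in the list above (UMI counts, number of features, percent of mitochondrial reads) can be sketched in a few lines of Python. This is only an illustration and not the pipeline's implementation: the `Cell` record, the function names, the example barcodes and counts, and the `min_umi` value are hypothetical, and the comparisons are assumed to be inclusive. Doublet detection, empty-droplet identification, and ambient RNA removal rely on dedicated tools and are not covered here.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Per-barcode summary used for QC (hypothetical structure)."""
    barcode: str
    n_umi: int       # total UMI counts for the barcode
    n_features: int  # number of genes detected
    mito_umi: int    # UMI counts assigned to mitochondrial genes


def percent_mito(cell: Cell) -> float:
    """Percentage of UMIs coming from mitochondrial genes."""
    if cell.n_umi == 0:
        return 0.0
    return 100.0 * cell.mito_umi / cell.n_umi


def passes_qc(cell: Cell,
              min_umi: int = 500,          # hypothetical value
              min_features: int = 300,
              max_features: int = 7500,
              max_percent_mito: float = 25.0) -> bool:
    """Keep a cell only if it satisfies every threshold (inclusive bounds assumed)."""
    return (cell.n_umi >= min_umi
            and min_features <= cell.n_features <= max_features
            and percent_mito(cell) <= max_percent_mito)


# Made-up cells, one per typical failure mode.
cells = [
    Cell("AAAC", n_umi=5000, n_features=2000, mito_umi=250),    # healthy cell
    Cell("CCCG", n_umi=400, n_features=150, mito_umi=20),       # too few UMIs and features
    Cell("GGGT", n_umi=3000, n_features=1200, mito_umi=1200),   # 40% mitochondrial
    Cell("TTTA", n_umi=60000, n_features=9000, mito_umi=1000),  # likely multiplet
]

kept = [cell.barcode for cell in cells if passes_qc(cell)]
print(kept)
```

Because every threshold is an argument, the same function can be rerun with looser or stricter values while iterating on QC.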
Step-by-step#
The current pipeline version covers the most common QC metrics, including the parameters described below. In this tutorial, we will cover the basic steps of Cellranger alignment and sample filtering.
1. Running pipeline#
To improve reproducibility, we suggest several thresholds based on multiple reports in the literature (see below). In addition, for this training we will use a dataset derived from MSK Spectrum [1]. The dataset can be downloaded through the BTC Buckets.
1.1. On HPC#
By default, the command line below uses the following thresholds:
- workflow_level = Basic
- thr_estimate_n_cells = 300
- thr_mean_reads_per_cells = 25000
- thr_median_genes_per_cell = 900
- thr_median_umi_per_cell = 1000
- thr_nFeature_RNA_min = 300
- thr_nFeature_RNA_max = 7500
- thr_percent_mito = 25
- thr_n_observed_cells = 300
nextflow run main.nf --workflow_level Basic --project_name Training --sample_csv sample_table.csv --meta_data meta_data.csv --cancer_type Ovarian -resume -profile seadragon
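To keep track of which thresholds differ from the defaults, the command line can be assembled programmatically. The sketch below is a convenience wrapper, not part of the pipeline: `build_command` and `DEFAULT_THRESHOLDS` are hypothetical names. It assumes that every threshold listed above can be overridden with a `--<name> <value>` option, which is how Nextflow pipeline parameters are normally passed.

```python
import shlex

# Default thresholds as listed in this section.
DEFAULT_THRESHOLDS = {
    "thr_estimate_n_cells": 300,
    "thr_mean_reads_per_cells": 25000,
    "thr_median_genes_per_cell": 900,
    "thr_median_umi_per_cell": 1000,
    "thr_nFeature_RNA_min": 300,
    "thr_nFeature_RNA_max": 7500,
    "thr_percent_mito": 25,
    "thr_n_observed_cells": 300,
}


def build_command(overrides=None, profile="seadragon"):
    """Return the nextflow command, spelling out only non-default thresholds."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(DEFAULT_THRESHOLDS))
    if unknown:
        raise KeyError(f"unknown threshold(s): {unknown}")

    args = [
        "nextflow", "run", "main.nf",
        "--workflow_level", "Basic",
        "--project_name", "Training",
        "--sample_csv", "sample_table.csv",
        "--meta_data", "meta_data.csv",
        "--cancer_type", "Ovarian",
    ]
    # Assumption: each threshold is exposed as a `--<name> <value>` option.
    for name, value in overrides.items():
        if value != DEFAULT_THRESHOLDS[name]:
            args += [f"--{name}", str(value)]
    args += ["-resume", "-profile", profile]
    return shlex.join(args)


default_cmd = build_command()
print(default_cmd)
```

Called without arguments, the function reproduces the command shown above. Passing a dictionary such as `{"thr_percent_mito": 20}` (an arbitrary illustrative value) appends only that option, and a misspelled threshold name raises an error instead of being passed along unnoticed.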
1.2. On Cirro#
Alternatively, this task can be executed on Cirro.
Cirro
- Defining the pipeline entrypoint = Basic
- Estimated number of cells = 300
- Mean reads per cell = 25000
- Median genes per cell = 900
- Median UMI per cell = 1000
- Minimum features per cell = 300
- Maximum features per cell = 7500
- Percentage of mitochondrial genes = 25
- Number of observed cells = 300
On Cirro, users should do the following (do not run):
- Navigate to the Pipelines tab and enter "BTC scRNA Pipeline" in the search engine.
- Change the Dataset to BTC Training dataset and the Copy Parameters From option to Run_01.
- Double-check the aforementioned parameters and click Run.
2. Inspecting report#
A fundamental component of the pipeline is the generation of HTML reports. Throughout the tutorials, we will browse several HTML reports and discuss key features of each analysis. The first report, "Rendering QC report", produces an interactive table reporting estimated and observed metrics for each sample. For convenience, the figures can be found in the Test_project_metric_report.html report within the Run_02 dataset.
The QC table displays metrics related to multiple samples, along with a QC label indicating the status of each sample (SUCCESS, FIXABLE, or FAILURE). The filtering system was developed with a focus on traceability, allowing users to inspect which metrics do not meet expectations and make necessary adjustments. Additionally, it enables users to determine whether the samples are failing at the library preparation stage or due to cell-level quality issues (see below).
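The idea of a traceable QC label can be illustrated with a small Python sketch. The rule that maps metrics to labels is not spelled out in this section, so the mapping below is an assumption made only for illustration: a sample is marked FAILURE when a library-level metric misses its threshold, FIXABLE when only the number of observed cells after filtering is too low, and SUCCESS otherwise. The function name, metric keys, and example values are hypothetical.

```python
# Library-level minimums, taken from the default thresholds in this tutorial.
LIBRARY_THRESHOLDS = {
    "estimate_n_cells": 300,
    "mean_reads_per_cell": 25000,
    "median_genes_per_cell": 900,
    "median_umi_per_cell": 1000,
}


def qc_label(metrics, n_observed_cells, thr_n_observed_cells=300):
    """Return (label, failing_metrics) for one sample.

    Assumed rule, not confirmed by the pipeline documentation:
    library-level failures give FAILURE, a cell-level shortfall gives FIXABLE.
    """
    failing = [name for name, minimum in LIBRARY_THRESHOLDS.items()
               if metrics[name] < minimum]
    if failing:
        return "FAILURE", failing
    if n_observed_cells < thr_n_observed_cells:
        return "FIXABLE", ["n_observed_cells"]
    return "SUCCESS", []


# Hypothetical sample: good library, but few cells left after filtering.
sample_a = {
    "estimate_n_cells": 450,
    "mean_reads_per_cell": 40000,
    "median_genes_per_cell": 1500,
    "median_umi_per_cell": 3200,
}

strict = qc_label(sample_a, n_observed_cells=270)
relaxed = qc_label(sample_a, n_observed_cells=270, thr_n_observed_cells=250)
print(strict, relaxed)
```

Returning the list of failing metrics alongside the label mirrors the traceability goal: it shows which threshold to revisit and whether the problem arises at the library level or the cell level.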
3. Exercise: Adjusting filtering thresholds#
3.1. On HPC#
Now that we have assessed the quality control reports, we will proceed with the analysis by adjusting a threshold. In this case, we will be more permissive in order to include the SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY sample. To achieve this, we will lower thr_n_observed_cells from its default of 300 to 250, so that a sample needs only 250 cells remaining after filtering by mitochondrial RNA percentage. Please note that this adjustment is specific to this training subset, which contains only a fraction of the cells per sample.
nextflow run main.nf --project_name Training --sample_csv sample_table.csv --meta_data meta_data.csv --cancer_type Ovarian --thr_n_observed_cells 250 -resume -profile seadragon
Tip
The Nextflow caching system ensures that the alignment step is not rerun. As a result, only the QC filtering will be executed, along with the generation of the new project report.
3.2. On Cirro#
Please note: When configuring the pipeline on Cirro, ensure that the Dataset is set to BTC Training dataset and select Run_02 for the Copy Parameters From option. Additionally, set the Entrypoint parameter to Basic.