Batch correction and evaluation#
Motivation#
Batch correction in single-cell transcriptomics mitigates unwanted technical variation inherent in single-cell RNA sequencing (scRNA-seq) data. These technical (batch) effects can stem from differing experimental conditions or sequencing platforms. If left unaddressed, they can obscure genuine biological signals. To tackle this, the pipeline incorporates several batch correction methods, including CCA, FastMNN, RPCA, and Harmony. Additionally, the pipeline can compute various quality metrics to determine which method performs best for a specific dataset.
Step-by-step#
The batch module consists of two steps: batch effect correction, and assessment of the correction using quality metrics. The criteria for these quality metrics follow the scIB publication. Additionally, we use kBET, a widely used method for assessing batch correction in single-cell projects.
1. Running pipeline#
1.1. On HPC#
By default, the command line below runs with the following parameter values.
HPC
- Defining the pipeline entrypoint = nonMalignant
- input_integration_method = all
- input_target_variables = batch
- input_integration_evaluate = all
- thr_cell_proportion = 0.30
- input_lisi_variables = cLISI;iLISI
nextflow run main.nf --workflow_level nonMalignant --project_name Training --sample_csv sample_table.csv --meta_data meta_data.csv --cancer_type Ovarian -resume -profile seadragon
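The defaults listed above can be overridden on the command line. The following is a minimal sketch, not part of the pipeline, that assembles such a command in Python. It assumes the listed parameter names are accepted as Nextflow-style `--name value` options, which is the usual Nextflow convention but is not confirmed by this page; `build_command` is a hypothetical helper introduced only for illustration.

```python
import shlex

# Base command copied from the text above.
BASE_COMMAND = [
    "nextflow", "run", "main.nf",
    "--workflow_level", "nonMalignant",
    "--project_name", "Training",
    "--sample_csv", "sample_table.csv",
    "--meta_data", "meta_data.csv",
    "--cancer_type", "Ovarian",
    "-resume", "-profile", "seadragon",
]

# Default values as listed in the parameter list above.
DEFAULTS = {
    "input_integration_method": "all",
    "input_target_variables": "batch",
    "input_integration_evaluate": "all",
    "thr_cell_proportion": "0.30",
    "input_lisi_variables": "cLISI;iLISI",
}


def build_command(overrides=None):
    """Return the command as a list of arguments (hypothetical helper).

    Assumption: each parameter is passed as `--name value`.
    """
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    command = list(BASE_COMMAND)
    for name, value in params.items():
        command += [f"--{name}", str(value)]
    return command


# Example: evaluate the metrics on a larger proportion of cells.
command = build_command({"thr_cell_proportion": 0.5})
# shlex.join quotes values such as cLISI;iLISI, where ';' is special in a shell.
print(shlex.join(command))
```

Building the command as a list and quoting it with `shlex.join` avoids the shell interpreting the `;` in `cLISI;iLISI` as a command separator.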
1.2. On Cirro#
Alternatively, we execute this task on Cirro.
Cirro
- Batch correction / Integration methods = all
- Target variable for batch correction = batch
- Define methods to be evaluated = all
- Cell proportion for Batch evaluation = 0.30
- Define LISI types for Density plot = cLISI;iLISI
On Cirro, users should (Do not run):
- Navigate to the Pipelines tab and enter "BTC scRNA Pipeline" in the search engine.
- Change the Dataset to BTC Training dataset and the Copy Parameters From option to Run_01.
- Double-check the aforementioned parameters and click Run.
2. Inspecting report#
For convenience, the figures can be found in the Test_evaluation_report.html report within the Run_02 dataset.
2.1. Batch evaluation table#
To ensure interpretability, we incorporated multiple quality metrics. These metrics relate to both biological conservation and clustering quality.
Furthermore, we use the scPOP z-score to aggregate the individual metrics into a single score per method. It is a simple approach, but it helps to select the batch correction method that performs best on a specific dataset.
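The idea of aggregating metrics by z-score can be sketched as follows. This is an illustrative sketch, not the scPOP implementation: each metric is standardised across methods, the sign is flipped for metrics where lower is better, and the z-scores are summed per method. The method names, metric names, and values are made-up toy inputs, and `zscore_rank` is a hypothetical helper.

```python
import numpy as np


def zscore_rank(scores, higher_is_better):
    """Rank methods by the sum of per-metric z-scores (illustrative only).

    scores: {method: {metric: value}}
    higher_is_better: {metric: bool}
    """
    methods = list(scores)
    metrics = list(higher_is_better)
    values = np.array(
        [[scores[m][k] for k in metrics] for m in methods], dtype=float
    )
    # Flip metrics where lower is better so that larger always means better.
    signs = np.array([1.0 if higher_is_better[k] else -1.0 for k in metrics])
    values = values * signs
    # Standardise each metric (column) across methods.
    std = values.std(axis=0)
    std[std == 0] = 1.0  # a constant metric contributes zero to every method
    z = (values - values.mean(axis=0)) / std
    total = z.sum(axis=1)
    order = np.argsort(-total)  # best (largest summed z-score) first
    return [(methods[i], float(total[i])) for i in order]


# Toy values for illustration only; they are not results from the pipeline.
toy_scores = {
    "method_A": {"metric_up": 0.8, "metric_down": 1.1},
    "method_B": {"metric_up": 0.5, "metric_down": 1.6},
    "method_C": {"metric_up": 0.6, "metric_down": 1.3},
}
direction = {"metric_up": True, "metric_down": False}

for method, total in zscore_rank(toy_scores, direction):
    print(f"{method}: {total:+.2f}")
```

Because every metric is standardised before summing, metrics on different scales contribute equally to the final ranking, which is the main design reason for using z-scores rather than raw values.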
2.2. UMAP and LISI plots#
The pipeline also allows visual inspection of the results through UMAP and LISI plots.
LISI plots comprise two components: iLISI and cLISI. iLISI evaluates the mixing of datasets (batches) within local neighborhoods, indicating how effectively the data were integrated. cLISI, on the other hand, relates to cell-type conservation, i.e., it measures whether cells of the same type remain grouped together across datasets. For iLISI, higher values are preferable (neighborhoods contain cells from several batches), while for cLISI, lower values are desired (neighborhoods contain a single cell type).
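To make the direction of the two scores concrete, the following is a simplified sketch of a LISI-style score. It assumes an unweighted k-nearest-neighbor neighborhood, whereas the published LISI method uses perplexity-based distance weighting, so it is not the computation performed by the pipeline. `simple_lisi` is a hypothetical name; passing batch labels gives an iLISI-like score and passing cell-type labels gives a cLISI-like score.

```python
import numpy as np


def simple_lisi(embedding, labels, k=30):
    """Inverse Simpson's index of `labels` among each cell's k nearest neighbors.

    Simplification: neighbors are unweighted (the published LISI weights
    them by distance). Values range from 1 (one label in the neighborhood)
    up to the number of distinct labels (perfectly mixed neighborhood).
    """
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    n_cells = embedding.shape[0]
    k = min(k, n_cells - 1)

    # Brute-force pairwise squared Euclidean distances (fine for small data).
    sq_norm = (embedding ** 2).sum(axis=1)
    dist2 = sq_norm[:, None] + sq_norm[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist2, np.inf)  # exclude each cell from its own neighbors

    # Indices of the k nearest neighbors of every cell (order not needed).
    neighbors = np.argpartition(dist2, k - 1, axis=1)[:, :k]

    _, codes = np.unique(labels, return_inverse=True)
    n_labels = int(codes.max()) + 1

    lisi = np.empty(n_cells)
    for i in range(n_cells):
        proportions = np.bincount(codes[neighbors[i]], minlength=n_labels) / k
        lisi[i] = 1.0 / np.sum(proportions ** 2)
    return lisi


# Synthetic example: two batches drawn from the same distribution (well mixed).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))
batch = np.arange(200) % 2
ilisi_like = simple_lisi(embedding, batch, k=30)
print("median iLISI-like score:", round(float(np.median(ilisi_like)), 2))
```

With two batches, an iLISI-like score close to 2 indicates good mixing, while a cLISI-like score close to 1 indicates that each neighborhood contains a single cell type.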
Warning
Please note that because we are using a reduced dataset (cell subsampling), the LISI plots might not reflect what would be observed on the full dataset.
3. Exercise: Selecting your favorite batch correction model#
Question
How does the batch correction method influence subsequent results? Furthermore, does a change in cell proportion affect the quality metrics?
A: Run_Harmony and Run_Harmony_Metrics
Please note: When configuring the pipeline on Cirro, ensure that the Dataset is set to BTC Training dataset and select Run_02 for the Copy Parameters From option. Additionally, configure the Entrypoint parameter to nonMalignant.
Tip: Accelerate the process by skipping DEG and Doublets analyses