Dimensionality reduction and clustering#
Motivation#
Dimensionality reduction and cell clustering are fundamental steps in single-cell RNA-sequencing (scRNA-seq) analysis. In this tutorial, we focus on the main parameters that control dimensionality reduction and clustering. We will also look at PCA loadings, metrics for evaluating clustering, and UMAP visualization.
Step-by-step#
In this hands-on section, we will cover several analytical steps. Initially, the pipeline will combine sample matrices into one Seurat object. Subsequently, it will carry out normalization, dimensionality reduction, and clustering. Collectively, these procedures will generate a preliminary clustering used for distinguishing between malignant and non-malignant cells.
1. Running pipeline#
Currently, the pipeline lets users modify several parameters, including key thresholds such as the number of highly variable genes and the clustering resolution. See the list below.
1.1. On the HPC#
By default, the command line uses the following thresholds.
HPC
workflow_level = Basic
thr_n_features = 2000
input_group_plot = source_name,Sort
thr_resolution = 0.25
thr_proportion = 0.25
nextflow run main.nf --workflow_level Basic --project_name Training --sample_csv sample_table.csv --meta_data meta_data.csv --cancer_type Ovarian -resume -profile seadragon
1.2. On Cirro#
Alternatively, we can execute this task on Cirro.
Cirro
Defining the pipeline entrypoint = Basic
Number features for FindVariableFeatures = 2000
Meta-data columns for UMAP plot = source_name,Sort
Resolution threshold = 0.25
Cell proportion for ROGUE calculation = 0.25
On Cirro, users should (Do not run):
- Navigate to the Pipelines tab and enter "BTC scRNA Pipeline" in the search engine.
- Change the Dataset to BTC Training dataset and the Copy Parameters From option to Run_01.
- Double-check the aforementioned parameters and click Run.
2. Inspecting report#
For convenience, the figures can be found in the Test_merged_report.html and Test_main_cluster_report.html reports. These reports are located within the Run_02 dataset.
2.1. Highly variable genes (HVG)#
The first report produces multiple figures, including the HVG distribution in the dataset. In addition, the user can check which genes contribute (loadings) to each principal component.
As mentioned earlier, loadings represent the contribution of each gene/feature to a principal component or other reduced dimension. Visualizing loadings shows which genes drive the separation observed along a particular component. In turn, these genes will strongly affect the clustering process.
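To make the idea of loadings concrete, here is a minimal sketch in plain Python. It is not the pipeline's code (the pipeline relies on Seurat); the toy matrix, the gene names, and the helper `first_pc_loadings` are hypothetical, and power iteration is used only because it needs no external libraries.

```python
import math

def first_pc_loadings(matrix, n_iter=200):
    """Return the loadings (one value per gene) of the first principal component.

    `matrix` is a list of cells, each a list of expression values per gene.
    The dominant eigenvector of the gene-by-gene covariance matrix is
    approximated by power iteration.
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    # Center each gene (column) at zero mean.
    means = [sum(row[j] for row in matrix) / n_cells for j in range(n_genes)]
    centered = [[row[j] - means[j] for j in range(n_genes)] for row in matrix]
    # Gene-by-gene covariance matrix.
    cov = [[sum(r[a] * r[b] for r in centered) / (n_cells - 1)
            for b in range(n_genes)] for a in range(n_genes)]
    # Repeatedly multiply and normalize; the vector converges to PC1.
    vec = [1.0] * n_genes
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * vec[b] for b in range(n_genes))
               for a in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical toy data: 6 cells x 3 genes. GeneA and GeneB separate two
# groups of cells, while GeneC is nearly constant.
genes = ["GeneA", "GeneB", "GeneC"]
cells = [
    [9, 8, 1],
    [10, 9, 1],
    [8, 9, 2],
    [1, 2, 1],
    [2, 1, 2],
    [1, 1, 1],
]

loadings = first_pc_loadings(cells)
# Rank genes by absolute loading: the top genes drive the component.
ranked = sorted(zip(genes, loadings), key=lambda gl: abs(gl[1]), reverse=True)
for gene, value in ranked:
    print(f"{gene}: {value:.3f}")
```

In this toy example the two genes that differ between the cell groups receive large loadings of the same sign, while the nearly constant gene gets a loading close to zero, which is the pattern to look for in the report's loading plots.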
2.2. UMAP and cluster composition#
The next step in the pipeline performs clustering on the reduced dimensions (e.g., PCs). Here, we can assess the clustering profile, composition, and quality.
The barplot illustrates the cluster composition per sample. Clusters dominated by a single sample might indicate populations of malignant cells.
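The composition shown in the barplot can be reproduced from two per-cell label vectors. The sketch below is illustrative only: the labels are made up, and the helper names and the 0.9 dominance cutoff are assumptions, not pipeline settings.

```python
from collections import Counter, defaultdict

def cluster_composition(cluster_labels, sample_labels):
    """Return {cluster: {sample: fraction of the cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, sample in zip(cluster_labels, sample_labels):
        counts[cluster][sample] += 1
    return {
        cluster: {s: n / sum(c.values()) for s, n in c.items()}
        for cluster, c in counts.items()
    }

def dominated_clusters(composition, cutoff=0.9):
    """List clusters where one sample contributes at least `cutoff` of the cells.

    The cutoff is an arbitrary illustrative value.
    """
    return sorted(c for c, fractions in composition.items()
                  if max(fractions.values()) >= cutoff)

# Hypothetical toy labels: one entry per cell.
clusters = ["0", "0", "0", "0", "1", "1", "1", "1", "2", "2"]
samples = ["S1", "S2", "S1", "S2", "S3", "S3", "S3", "S3", "S1", "S3"]

composition = cluster_composition(clusters, samples)
flagged = dominated_clusters(composition)
print(composition)
print(flagged)
```

A flagged cluster is only a candidate for closer inspection; sample-specific clusters can also arise from batch effects, so this check does not replace a dedicated malignancy analysis.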
2.3. Clustering assessment#
ROGUE is an entropy-based metric designed to evaluate cluster purity in single-cell RNA sequencing (scRNA-seq) data. In essence, it assists in refining clustering by suggesting when clusters should be split or merged. High ROGUE scores correlate with purity, meaning clusters consist of cells displaying similar transcriptional backgrounds. Conversely, low ROGUE scores indicate cluster heterogeneity, suggesting clusters comprise varied cell populations and should be further subdivided.
The ROGUE score can also provide insights regarding data quality. For example, samples with a low average ROGUE score may contain a higher proportion of doublets.
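As a rough illustration of the per-sample check, the sketch below averages ROGUE scores across clusters for each sample and lists the samples that fall below a cutoff. The scores are invented, the function names are hypothetical, and the 0.8 cutoff is an arbitrary assumption rather than a value recommended by ROGUE or used by the pipeline.

```python
from statistics import mean

# Hypothetical ROGUE scores: {sample: {cluster: score}}.
rogue_scores = {
    "S1": {"0": 0.92, "1": 0.88},
    "S2": {"0": 0.70, "1": 0.74},
    "S3": {"0": 0.85, "1": 0.95},
}

def average_rogue(scores):
    """Average the per-cluster ROGUE scores for each sample."""
    return {sample: mean(per_cluster.values())
            for sample, per_cluster in scores.items()}

def low_purity_samples(scores, cutoff=0.8):
    """Return samples whose average score falls below `cutoff` (assumed value)."""
    averages = average_rogue(scores)
    return sorted(s for s, avg in averages.items() if avg < cutoff)

averages = average_rogue(rogue_scores)
low = low_purity_samples(rogue_scores)
print(averages)
print(low)
```

Samples listed here would be the first ones to revisit for doublet detection or stricter quality filtering.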
3. Exercise: Playing around with multiple parameters#
Question
What would happen if we changed the number of features and the resolution threshold?
A: Run_Dimensionality and Run_Clustering
Please note: When configuring the pipeline on Cirro, ensure that the Dataset is set to BTC Training dataset and select Run_02 for the Copy Parameters From option. Additionally, configure the Entrypoint parameter to Basic.