Getting started#

Installation#

1. Nextflow and third-party software#

Nextflow can be used on any POSIX-compatible system (Linux, OS X, WSL). It requires Bash 3.2 (or later) and Java 11 (or later, up to 18) to be installed.
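
The version requirements above can be checked before installing. The following is a minimal sketch, not part of the pipeline; the function names are illustrative, and it assumes that `java -version` reports a quoted version string such as "17.0.2".

```shell
#!/usr/bin/env bash
# Sketch of a pre-flight check; all function names here are illustrative.

# Print the major version from a Java version string ("17.0.2" or legacy "1.8.0_292").
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;    # legacy scheme: "1.8.0" means Java 8
    esac
    echo "${v%%[!0-9]*}"       # keep the leading digits only
}

# Succeed when the Java major version is in the supported range 11..18.
java_supported() {
    local major
    major="$(java_major "$1")"
    [ -n "$major" ] && [ "$major" -ge 11 ] && [ "$major" -le 18 ]
}

# Succeed when a Bash version string such as "3.2.57(1)-release" is 3.2 or later.
bash_supported() {
    local v="$1" major rest minor
    major="${v%%.*}"
    rest="${v#*.}"
    minor="${rest%%.*}"
    [ "$major" -gt 3 ] 2>/dev/null && return 0
    [ "$major" -eq 3 ] 2>/dev/null && [ "$minor" -ge 2 ] 2>/dev/null
}

if bash_supported "${BASH_VERSION:-0.0}"; then
    echo "bash: ok (${BASH_VERSION})"
else
    echo "bash: 3.2 or later not detected"
fi

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, hence the 2>&1 redirection.
    found="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
    if java_supported "$found"; then
        echo "java: ok (${found})"
    else
        echo "java: ${found:-unknown version} is outside the 11-18 range"
    fi
else
    echo "java: not found"
fi
```

The check only reports what it finds; it does not install or change anything.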

wget -qO- https://get.nextflow.io | bash

After the download finishes, two short steps remain:

Make the binary executable on your system by running chmod +x nextflow.
Optionally, move the nextflow file to a directory accessible by your $PATH variable (this is only required to avoid remembering and typing the full path to nextflow each time you need to run it).
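
The two steps above can be sketched as one small shell function. This is only an illustration: `install_launcher` is a hypothetical helper name, and `$HOME/bin` in the usage comment is an assumed target directory that you should replace with one that is on your $PATH.

```shell
#!/usr/bin/env bash
# Make a downloaded launcher executable and move it into a chosen directory.
# Usage: install_launcher <file> <target_dir>
install_launcher() {
    local src="$1" dest_dir="$2"
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    chmod +x "$src" || return 1          # step 1: make the binary executable
    mkdir -p "$dest_dir" || return 1
    mv "$src" "$dest_dir/" || return 1   # step 2: move it to the target directory
    # Warn when the target directory is not on PATH yet.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) echo "note: $dest_dir is not on PATH; add it in your shell profile" >&2 ;;
    esac
}

# Example, run from the directory where the installer left the nextflow file:
#   install_launcher ./nextflow "$HOME/bin"
```

Skipping the move is fine; in that case the launcher is invoked with its full path, for example ./nextflow.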

2. Containerization#

Like most contemporary pipelines, the BTC scRNA pipeline runs its steps inside multiple Docker containers. Different computational environments rely on different container technologies, such as Docker (v20.10.22) or Singularity (v3.7.0). HPC systems, for instance, typically rely on Singularity, so the container engine should be set explicitly in the profile configuration. For details, refer to the advanced section and the containers repository.

Warning

Please note that Docker/Singularity images will be downloaded automatically by the pipeline.
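
Before choosing a profile, it can help to confirm which container engine the current machine provides. The sketch below only looks for the `docker` and `singularity` executables on the PATH; the function name is illustrative, and the profile names themselves are defined by the pipeline configuration, not by this snippet.

```shell
#!/usr/bin/env bash
# Print the first container engine found on PATH: docker, singularity, or none.
detect_container_runtime() {
    local tool
    for tool in docker singularity; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "none"
}

runtime="$(detect_container_runtime)"
echo "container engine: $runtime"
```

On a machine where both engines are installed, this reports docker because it is checked first; reorder the list if Singularity should take precedence.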

3. Cloning scRNA-Seq Pipeline#

git clone --recurse-submodules https://github.com/WangLab-ComputationalBiology/btc-scrna-pipeline

4. Running single-cell pipeline#

The pipeline requires four parameters: project_name, sample_csv, meta_data, and cancer_type. In particular, the files passed as sample_csv (the sample table) and meta_data (the metadata) must follow the format described below.

4.1. Preparing inputs#

The sample table must be a CSV file containing three columns: sample, fastq_1, and fastq_2. The values in the sample column label every report generated by the pipeline and are also used to merge the metadata into the Seurat object. Example sample sheet:

sample	fastq_1	fastq_2
SPECTRUM-OV-009_S1_CD45N_BOWEL	path/to/fastq/SPECTRUM-OV-009_S1_CD45N_BOWEL_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45N_BOWEL_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY	path/to/fastq/SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-009_S1_CD45P_ASCITES	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_ASCITES_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_ASCITES_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-009_S1_CD45P_BOWEL	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_BOWEL_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_BOWEL_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA	path/to/fastq/SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-022_S1_CD45P_BOWEL	path/to/fastq/SPECTRUM-OV-022_S1_CD45P_BOWEL_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-022_S1_CD45P_BOWEL_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA	path/to/fastq/SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM	path/to/fastq/SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-065_S1_CD45P_ASCITES	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_ASCITES_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_ASCITES_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM_S1_L001_R2_001.fastq.gz
SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY_S1_L001_R1_001.fastq.gz	path/to/fastq/SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY_S1_L001_R2_001.fastq.gz
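
A quick structural check of the sample table can catch formatting problems before a run starts. The following sketch assumes a comma-separated file with exactly the header sample,fastq_1,fastq_2 and no quoted fields containing commas; `check_sample_table` is a hypothetical helper, not a pipeline command.

```shell
#!/usr/bin/env bash
# Check a sample table: the header must be sample,fastq_1,fastq_2 and
# every data row needs three non-empty fields. Exits non-zero on problems.
check_sample_table() {
    local file="$1"
    awk -F',' '
        { sub(/\r$/, "") }                 # tolerate Windows line endings
        NR == 1 {
            if ($0 != "sample,fastq_1,fastq_2") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        NF == 0 { next }                   # skip blank lines
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": expected 3 non-empty fields"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$file"
}

# Example:
#   check_sample_table path/to/sample_table.csv && echo "sample table looks fine"
```

The check is purely structural; it does not verify that the FASTQ paths exist.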

The metadata file, also in .csv format, should include columns that describe the experimental design, such as batch and cell sorting status. It can also contain additional biological information about each sample. The batch variable is used to correct for technical effects. In this version of the pipeline, correction is based on a single variable. Example metadata:

patient_id	sample_id	Sort	source_name	batch
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_BOWEL	CD45-	Bowel	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_INFRACOLIC_OMENTUM	CD45-	Omentum	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY	CD45-	Adnexa	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_LEFT_UPPER_QUADRANT	CD45-	UQ	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_PELVIC_PERITONEUM	CD45-	Peritoneum	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_RIGHT_OVARY	CD45-	Adnexa	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45N_RIGHT_UPPER_QUADRANT	CD45-	UQ	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_ASCITES	CD45+	Ascites	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_BOWEL	CD45+	Bowel	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_INFRACOLIC_OMENTUM	CD45+	Omentum	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_LEFT_OVARY	CD45+	Adnexa	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT	CD45+	UQ	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_PELVIC_PERITONEUM	CD45+	Peritoneum	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_RIGHT_OVARY	CD45+	Adnexa	SPECTRUM-OV-009
SPECTRUM-OV-009	SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT	CD45+	UQ	SPECTRUM-OV-009
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45N_ASCITES	CD45-	Ascites	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45N_BOWEL	CD45-	Bowel	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA	CD45-	Adnexa	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA	CD45-	Adnexa	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45P_ASCITES	CD45+	Ascites	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45P_BOWEL	CD45+	Bowel	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45P_LEFT_ADNEXA	CD45+	Adnexa	SPECTRUM-OV-022
SPECTRUM-OV-022	SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA	CD45+	Adnexa	SPECTRUM-OV-022
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45N_ASCITES	CD45-	Ascites	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM	CD45-	Omentum	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45N_RIGHT_FALLOPIAN_TUBE	CD45-	Adnexa	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45N_RIGHT_OVARY	CD45-	Adnexa	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45P_ASCITES	CD45+	Ascites	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM	CD45+	Omentum	SPECTRUM-OV-065
SPECTRUM-OV-065	SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY	CD45+	Adnexa	SPECTRUM-OV-065

Warning

Internally, the pipeline expects a column named batch in the metadata file. This column is used to perform the batch correction.
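
Both requirements on the metadata, the batch column and the link to the sample table, can be checked with a short script. This is a sketch under two assumptions that the examples suggest but the text does not state outright: the files are comma-separated, and the sample column of the sample table is matched against the sample_id column of the metadata. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Print the 1-based position of a named column in a CSV header; fail if absent.
column_index() {
    local file="$1" name="$2"
    awk -F',' -v name="$name" '
        NR == 1 {
            sub(/\r$/, "")
            for (i = 1; i <= NF; i++)
                if ($i == name) { print i; found = 1; exit }
            exit
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# List samples from the sample table that have no row in the metadata file.
# Assumes the metadata column to match against is called sample_id.
missing_in_metadata() {
    local samples="$1" meta="$2" col
    col="$(column_index "$meta" sample_id)" || {
        echo "no sample_id column in $meta" >&2
        return 2
    }
    awk -F',' -v col="$col" '
        { sub(/\r$/, "") }
        FNR == 1 { next }                        # skip both header lines
        NR == FNR { seen[$col] = 1; next }       # first file: metadata
        $1 != "" && !($1 in seen) { print $1 }   # second file: sample table
    ' "$meta" "$samples"
}

# Example:
#   column_index path/to/meta_data.csv batch >/dev/null || echo "batch column missing"
#   missing_in_metadata path/to/sample_table.csv path/to/meta_data.csv
```

An empty result from `missing_in_metadata` means every sample in the table has a metadata row. The metadata may list more samples than the sample table, as in the examples above.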

4.2. Minimal command-line#

To execute the pipeline, use the command-line structure outlined below. Note the difference between options with one dash (-), which are Nextflow options, and options with two dashes (--), which are pipeline parameters. Two-dash parameters control pipeline-specific behavior, such as filtering thresholds in the single-cell analysis.

nextflow run main.nf --project_name <PROJECT> --sample_csv <path/to/sample_table.csv> --meta_data <path/to/meta_data.csv> --cancer_type <CANCER TYPE> -resume -profile <PROFILE>

The pipeline creates a folder named after the value of --project_name; this folder contains all the results. The -resume option leverages Nextflow caching: completed steps are reused rather than rerun, which avoids unnecessary computation.
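
The command above can be wrapped in a small function so that values containing spaces, such as a cancer type, are quoted consistently. This is a sketch only: `run_pipeline` and the `NF_BIN` variable are hypothetical names introduced here, and the function must be called from the directory that contains main.nf.

```shell
#!/usr/bin/env bash
# Run the pipeline with the four required parameters plus a profile.
# NF_BIN is a hypothetical override for the launcher; NF_BIN=echo gives a dry run.
run_pipeline() {
    if [ "$#" -ne 5 ]; then
        echo "usage: run_pipeline <project> <sample_csv> <meta_data> <cancer_type> <profile>" >&2
        return 2
    fi
    # Two dashes: pipeline parameters. One dash: Nextflow options.
    "${NF_BIN:-nextflow}" run main.nf \
        --project_name "$1" \
        --sample_csv "$2" \
        --meta_data "$3" \
        --cancer_type "$4" \
        -resume \
        -profile "$5"
}

# Dry-run example (prints the arguments instead of launching Nextflow):
#   NF_BIN=echo run_pipeline BTC-CANCER-X path/to/sample_table.csv \
#       path/to/meta_data.csv "CANCER TYPE X" <PROFILE>
```

Quoting each positional argument keeps a multi-word cancer type as a single value when it reaches the pipeline.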

4.3. Staging images and genome indexes#

The pipeline needs to stage (download) multiple components before it can run. This can be a problem in HPC environments with strict network policies. As a workaround, consider running with the -stub option on a node that has a network connection. A -stub run stages all the necessary components without executing any analysis, so it serves as a bootstrap run for the pipeline. Please note that a stub run generates dummy outputs.

nextflow run main.nf --project_name <PROJECT> --sample_csv <path/to/sample_table.csv> --meta_data <path/to/meta_data.csv> --cancer_type <CANCER TYPE> -resume -profile <PROFILE> -stub

4.4. Shorten command-line#

Long command lines are error-prone. Nextflow's -params-file option simplifies them: it points to a JSON file that holds all the parameters for a specific run. If you are trying out different settings, it is good practice to keep a separate file for each test, e.g., PARAMS_TEST_01.json or PARAMS_TEST_02.json.

{
 "project_name": "BTC-CANCER-X",
 "sample_csv": "path/to/sample_table.csv",
 "meta_data": "path/to/meta_data.csv",
 "cancer_type": "CANCER TYPE X",
 "thr_mean_reads_per_cells": 10000
}

Note that other parameters can be added to the -params-file. For the full list, please check the command-line documentation.

nextflow run main.nf -params-file <PARAMS.json> -resume -profile <PROFILE>
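
Keeping one parameter file per test, as suggested above, is easy to script. The sketch below writes a file that differs only in thr_mean_reads_per_cells; `write_params` is a hypothetical helper, and the value 20000 in the usage comment is an arbitrary illustration, not a recommended threshold.

```shell
#!/usr/bin/env bash
# Write a parameter file for one test run; only the read threshold varies.
# Usage: write_params <output.json> <thr_mean_reads_per_cells>
write_params() {
    local out="$1" threshold="$2"
    # Reject anything that is not a non-negative integer, so the JSON stays valid.
    case "$threshold" in
        ''|*[!0-9]*) echo "threshold must be an integer: $threshold" >&2; return 1 ;;
    esac
    cat > "$out" <<EOF
{
 "project_name": "BTC-CANCER-X",
 "sample_csv": "path/to/sample_table.csv",
 "meta_data": "path/to/meta_data.csv",
 "cancer_type": "CANCER TYPE X",
 "thr_mean_reads_per_cells": ${threshold}
}
EOF
}

# Example:
#   write_params PARAMS_TEST_01.json 10000
#   write_params PARAMS_TEST_02.json 20000
#   nextflow run main.nf -params-file PARAMS_TEST_02.json -resume -profile <PROFILE>
```

Generating the files from one template keeps the shared settings identical across tests, so any difference in results can be traced to the parameter that was changed.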