Getting started
Installation
1. Nextflow and third-party software
Nextflow can be used on any POSIX-compatible system (Linux, OS X, WSL). It requires Bash 3.2 (or later) and Java 11 (or later, up to 18) to be installed. Install Nextflow by running:
wget -qO- https://get.nextflow.io | bash
After that, two easy steps remain:
- Make the binary executable on your system by running chmod +x nextflow.
- Optionally, move the nextflow file to a directory accessible by your $PATH variable (this is only required to avoid remembering and typing the full path to nextflow each time you need to run it).
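The Java requirement stated above (version 11 to 18) can be checked before running Nextflow. The following is a minimal Python sketch, not part of the pipeline: the function names are hypothetical, and it assumes that java is on the PATH and reports its version in the usual version "X.Y.Z" form.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Legacy numbering: Java 8 and earlier report themselves as "1.8.0_x".
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def java_is_supported(major: int, minimum: int = 11, maximum: int = 18) -> bool:
    """Apply the range given in this guide: Java 11 or later, up to 18."""
    return minimum <= major <= maximum


def installed_java_major() -> int:
    """Query the local Java installation (assumes `java` is on the PATH)."""
    # `java -version` writes its banner to standard error, not standard output.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr)
```

Calling java_is_supported(installed_java_major()) returns True when the local Java falls inside the supported range.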
2. Containerization
Like many contemporary pipelines, the BTC scRNA pipeline is powered by multiple Docker containers. Different computational environments rely on different container technologies, such as Docker (v20.10.22) and Singularity (v3.7.0). For instance, HPC systems strongly depend on Singularity, so it should be explicitly defined in the profile configuration. For a better understanding, refer to the advanced section. Additionally, check the containers repository.
Warning
Please note that Docker/Singularity images will be automatically downloaded by the pipeline.
3. Cloning scRNA-Seq Pipeline
git clone --recurse-submodules https://github.com/WangLab-ComputationalBiology/btc-scrna-pipeline
4. Running single-cell pipeline
The pipeline requires four parameters: project_name, sample_csv (the sample table), meta_data, and cancer_type. In particular, the sample table and the metadata file must follow the mandatory formats described below.
4.1. Preparing inputs
The sample table must be a CSV file containing three columns: sample, fastq_1, and fastq_2. The sample column will be linked to all reports generated by the pipeline. Additionally, it's essential for merging the metadata with the Seurat object. Example sample sheet:
sample | fastq_1 | fastq_2 |
---|---|---|
SPECTRUM-OV-009_S1_CD45N_BOWEL | path/to/fastq/SPECTRUM-OV-009_S1_CD45N_BOWEL_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45N_BOWEL_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY | path/to/fastq/SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-009_S1_CD45P_ASCITES | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_ASCITES_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_ASCITES_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-009_S1_CD45P_BOWEL | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_BOWEL_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_BOWEL_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA | path/to/fastq/SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-022_S1_CD45P_BOWEL | path/to/fastq/SPECTRUM-OV-022_S1_CD45P_BOWEL_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-022_S1_CD45P_BOWEL_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA | path/to/fastq/SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM | path/to/fastq/SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-065_S1_CD45P_ASCITES | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_ASCITES_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_ASCITES_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM_S1_L001_R2_001.fastq.gz |
SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY_S1_L001_R1_001.fastq.gz | path/to/fastq/SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY_S1_L001_R2_001.fastq.gz |
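Because the sample table has a mandatory format, it can be worth checking the file before launching a run. The sketch below is a hypothetical helper, not part of the pipeline: it assumes a comma-separated file and only checks that the three required columns exist and that no required field is empty.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "fastq_1", "fastq_2")


def validate_sample_table(handle):
    """Return a list of problems found in a sample table; empty means none."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for row in reader:
        # reader.line_num is the line number of the row that was just read.
        for col in REQUIRED_COLUMNS:
            if not (row.get(col) or "").strip():
                problems.append(f"line {reader.line_num}: empty value in '{col}'")
    return problems


# Small in-memory example with placeholder paths.
example = io.StringIO(
    "sample,fastq_1,fastq_2\n"
    "S1,path/to/fastq/S1_R1.fastq.gz,path/to/fastq/S1_R2.fastq.gz\n"
)
print(validate_sample_table(example))  # an empty list means no problems were found
```

For a file on disk, pass an open handle instead, e.g. open("path/to/sample_table.csv", newline="").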
The metadata file, in .csv format, should include columns pertinent to the experimental design, such as batch and cell sorting status. It can also contain additional biological information about the sample. The batch variable is used to correct for technical effects. In this version of the pipeline, correction is based on a single variable. Example metadata:
patient_id | sample_id | Sort | source_name | batch |
---|---|---|---|---|
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_BOWEL | CD45- | Bowel | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_INFRACOLIC_OMENTUM | CD45- | Omentum | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_LEFT_OVARY | CD45- | Adnexa | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_LEFT_UPPER_QUADRANT | CD45- | UQ | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_PELVIC_PERITONEUM | CD45- | Peritoneum | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_RIGHT_OVARY | CD45- | Adnexa | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45N_RIGHT_UPPER_QUADRANT | CD45- | UQ | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_ASCITES | CD45+ | Ascites | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_BOWEL | CD45+ | Bowel | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_INFRACOLIC_OMENTUM | CD45+ | Omentum | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_LEFT_OVARY | CD45+ | Adnexa | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_LEFT_UPPER_QUADRANT | CD45+ | UQ | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_PELVIC_PERITONEUM | CD45+ | Peritoneum | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_RIGHT_OVARY | CD45+ | Adnexa | SPECTRUM-OV-009 |
SPECTRUM-OV-009 | SPECTRUM-OV-009_S1_CD45P_RIGHT_UPPER_QUADRANT | CD45+ | UQ | SPECTRUM-OV-009 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45N_ASCITES | CD45- | Ascites | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45N_BOWEL | CD45- | Bowel | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45N_LEFT_ADNEXA | CD45- | Adnexa | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45N_RIGHT_ADNEXA | CD45- | Adnexa | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45P_ASCITES | CD45+ | Ascites | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45P_BOWEL | CD45+ | Bowel | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45P_LEFT_ADNEXA | CD45+ | Adnexa | SPECTRUM-OV-022 |
SPECTRUM-OV-022 | SPECTRUM-OV-022_S1_CD45P_RIGHT_ADNEXA | CD45+ | Adnexa | SPECTRUM-OV-022 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45N_ASCITES | CD45- | Ascites | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45N_INFRACOLIC_OMENTUM | CD45- | Omentum | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45N_RIGHT_FALLOPIAN_TUBE | CD45- | Adnexa | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45N_RIGHT_OVARY | CD45- | Adnexa | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45P_ASCITES | CD45+ | Ascites | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45P_INFRACOLIC_OMENTUM | CD45+ | Omentum | SPECTRUM-OV-065 |
SPECTRUM-OV-065 | SPECTRUM-OV-065_S1_CD45P_RIGHT_OVARY | CD45+ | Adnexa | SPECTRUM-OV-065 |
Warning
Internally, the pipeline expects a column named batch. This column is used to perform batch correction.
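The metadata requirements can be checked in the same spirit. The sketch below is a hypothetical helper, not part of the pipeline. It verifies that the batch column exists and is filled in, and it assumes (based on the examples above, not on a documented rule) that each sample name from the sample table should appear in the sample_id column of the metadata.

```python
import csv
import io


def check_metadata(meta_handle, sample_names, id_column="sample_id"):
    """Return a list of problems found in a metadata CSV; empty means none."""
    reader = csv.DictReader(meta_handle)
    header = reader.fieldnames or []
    problems = []
    if "batch" not in header:
        problems.append("missing required column: batch")
    if id_column not in header:
        # Assumption: samples are matched through this column.
        problems.append(f"missing column: {id_column}")
        return problems
    rows = list(reader)
    known = {(row.get(id_column) or "").strip() for row in rows}
    for name in sorted(set(sample_names) - known):
        problems.append(f"sample not found in metadata: {name}")
    if "batch" in header:
        for row in rows:
            if not (row.get("batch") or "").strip():
                problems.append(f"empty batch for {row.get(id_column)}")
    return problems


# Small in-memory example following the column layout shown above.
example = io.StringIO(
    "patient_id,sample_id,Sort,source_name,batch\n"
    "P1,P1_S1_CD45N_BOWEL,CD45-,Bowel,P1\n"
)
print(check_metadata(example, ["P1_S1_CD45N_BOWEL"]))
```

The sample names passed as the second argument would typically be read from the sample column of the sample table.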
4.2. Minimal command-line
To execute the pipeline, use the command-line structure outlined below. Note the semantic difference between one dash (-), used for Nextflow options, and two dashes (--), used for pipeline parameters. Options with two dashes are reserved for pipeline-specific settings, such as adjusting filters or thresholds in the single-cell analysis.
nextflow run main.nf --project_name <PROJECT> --sample_csv <path/to/sample_table.csv> --meta_data <path/to/meta_data.csv> --cancer_type <CANCER TYPE> -resume -profile <PROFILE>
The output folder is determined by the --project_name option. This folder contains all the results. The -resume option leverages Nextflow caching, i.e., it resumes previous executions to avoid excessive computational time.
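When runs are launched from a script, the one-dash versus two-dash distinction can be made explicit in code. The following is a minimal sketch under stated assumptions: build_command is a hypothetical helper, the profile name is a placeholder, and the parameter names are the ones used in the command above.

```python
import shlex


def build_command(pipeline_params, profile, resume=True, script="main.nf"):
    """Assemble a `nextflow run` argument list from pipeline parameters."""
    cmd = ["nextflow", "run", script]
    # Pipeline parameters take two dashes.
    for name, value in pipeline_params.items():
        cmd += [f"--{name}", str(value)]
    # Nextflow's own options take a single dash.
    if resume:
        cmd.append("-resume")
    cmd += ["-profile", profile]
    return cmd


cmd = build_command(
    {
        "project_name": "BTC-CANCER-X",
        "sample_csv": "path/to/sample_table.csv",
        "meta_data": "path/to/meta_data.csv",
        "cancer_type": "CANCER TYPE X",
    },
    profile="MY_PROFILE",  # placeholder; use a profile defined for your environment
)
# shlex.join quotes values that contain spaces, so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would launch the run without any shell quoting issues.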
4.3. Staging images and genome indexes
The pipeline requires staging (downloading) multiple components to operate. This can pose challenges in HPC environments with strict network policies. As a workaround, consider using the -stub option on a node with a network connection. The -stub option stages all the necessary components without actually executing any analysis. Thus, it serves as a bootstrap run for the pipeline. Please note that a stub run will generate dummy outputs.
nextflow run main.nf --project_name <PROJECT> --sample_csv <path/to/sample_table.csv> --meta_data <path/to/meta_data.csv> --cancer_type <CANCER TYPE> -resume -profile <PROFILE> -stub
4.4. Shorten command-line
Long command lines can be tricky. Thankfully, Nextflow's -params-file option makes things simpler. It takes a JSON file that holds all the parameters for a specific run. If you're trying out different settings, it is good practice to maintain a separate file for each test, e.g., PARAMS_TEST_01.json or PARAMS_TEST_02.json.
{
  "project_name": "BTC-CANCER-X",
  "sample_csv": "path/to/sample_table.csv",
  "meta_data": "path/to/meta_data.csv",
  "cancer_type": "CANCER TYPE X",
  "thr_mean_reads_per_cells": 10000
}
Note that other parameters can be added to the -params-file. For your convenience, please check the command-line documentation.
nextflow run main.nf -params-file <PARAMS.json> -resume -profile <PROFILE>
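When several test configurations are maintained side by side, generating the params files from code helps avoid JSON syntax slips such as a missing comma. The sketch below is a hypothetical helper, not part of the pipeline; the list of required keys is taken from the four parameters named in this guide.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("project_name", "sample_csv", "meta_data", "cancer_type")


def write_params_file(path, params):
    """Write params as JSON and return the content as read back from disk."""
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n", encoding="utf-8")
    # Reading the file back confirms that it parses as valid JSON.
    return json.loads(path.read_text(encoding="utf-8"))


# Example: write one test configuration into a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_params_file(
        Path(tmp_dir) / "PARAMS_TEST_01.json",
        {
            "project_name": "BTC-CANCER-X",
            "sample_csv": "path/to/sample_table.csv",
            "meta_data": "path/to/meta_data.csv",
            "cancer_type": "CANCER TYPE X",
            "thr_mean_reads_per_cells": 10000,
        },
    )
    print(sorted(written))
```

The resulting file can then be passed to the run through -params-file, as in the command above.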