Advanced configuration

1. Creating Nextflow profiles

For users running the pipeline in an HPC environment, it is necessary to set up a profile. Essentially, Nextflow profiles define specific settings related to the job scheduler of the HPC. Various HPCs may employ different engines for job scheduling, such as SLURM, TORQUE, and LSF.

1.1. SLURM profile

Profiles should be written in the nextflow.config file. On HPCs, the Singularity module must be loaded before the pipeline can run containers. The module setting in the profile below replaces a manual module load command.

profiles {
    institution {
        singularity {
            enabled      = true
        }
        process {
            // module is a process directive, so it belongs in the process scope
            module       = 'singularity/3.7.0'
            executor     = 'slurm'
            queue        = 'medium'
        }
        params {
            max_memory   = 128.GB
            max_cpus     = 32
            max_time     = 24.h
        }
    }
}

The previous snippet creates a profile for an institution whose cluster uses SLURM. It loads the singularity/3.7.0 module, enables Singularity, and selects the job scheduling engine with the executor property. Nextflow works with many engines; for more details, refer to the official documentation. Please note that the values in the process and params scopes (queue, memory, CPUs, and time) should match the HPC specs.
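The value given to executor depends on the scheduler installed on the cluster. As a rough sketch, the helper below guesses it from the submission commands found on PATH. detect_executor is a hypothetical name and is not part of the pipeline. It assumes that sbatch indicates SLURM, bsub indicates LSF, and qsub indicates TORQUE/PBS; the printed values slurm, lsf, and pbs are Nextflow executor names. Treat the result as a hint only, since other schedulers such as SGE also provide qsub.

```shell
# Heuristic sketch: guess a Nextflow executor name from the job submission
# commands available on PATH. detect_executor is a hypothetical helper.
detect_executor() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "slurm"   # SLURM submits jobs with sbatch
    elif command -v bsub >/dev/null 2>&1; then
        echo "lsf"     # LSF submits jobs with bsub
    elif command -v qsub >/dev/null 2>&1; then
        echo "pbs"     # TORQUE/PBS use qsub (SGE also ships qsub, so verify)
    else
        echo "local"   # no scheduler command found on this machine
    fi
}
```

Running detect_executor on a login node prints a single word that can be compared against the executor value in the profile.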

1.2. Container and Nextflow pipelines

Why use Singularity/Docker? Nextflow pipelines integrate with containers to ensure reproducibility and portability. During pipeline execution, each module runs within a predefined computational environment that includes all required software, libraries, and language runtimes. As a result, workflows are expected to yield consistent results across platforms, and they can be shared and run by collaborators in different computing environments.

2. Pipeline entrypoints

This pipeline was designed with multiple entry points in mind. It is still a work in progress, but the pipeline already offers ways to skip a few modules (processes) and entire routines (sub-workflows). The entry points are named based on major functions: Basic, Stratification, Annotation, nonMalignant, and Malignant. Details are provided below.

3. Adding gene signatures and meta-programs

The pipeline allows flexibility regarding cell or meta-program markers. Currently, we provide two databases with curated cell markers [1] and meta-programs [2].

3.1. Cell Markers

To customize the cell markers database, follow the pipeline conventions. The table should consist of four columns, as shown below. Most importantly, the cell_type column determines the annotation level at which the pipeline operates.

cell_subset	cell_type	annotation	markers
Lineage markers Subsets	Major cells	T-Cells	CD3D
Lineage markers Subsets	Major cells	T-Cells	CD3E
Lineage markers Subsets	Major cells	T-Cells	CD4
Lineage markers Subsets	Major cells	T-Cells	CD8A
Lineage markers Subsets	Major cells	T-Cells	CD8B
Lineage markers Subsets	Major cells	NK Cells	NCAM1
Lineage markers Subsets	Major cells	NK Cells	KLRG1
Lineage markers Subsets	Major cells	NK Cells	FCGR3A
Lineage markers Subsets	Major cells	NK Cells	NKG7

Tip

Note that each marker (gene) is presented in a single row. This long format aligns with best practices for data analysis.
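The conventions above can be checked before running the pipeline. The sketch below is one possible check, not the pipeline's own validation. check_markers is a hypothetical helper that assumes a tab-separated file with the four-column header shown above. It also assumes that a comma, semicolon, or space in the markers column means several genes were placed in one row.

```shell
# check_markers FILE
# Hypothetical helper: verifies the header, the column count, and that each
# row holds exactly one marker. Returns non-zero if any problem is found.
check_markers() {
    awk -F'\t' '
        NR == 1 {
            if (NF != 4 || $1 != "cell_subset" || $2 != "cell_type" ||
                $3 != "annotation" || $4 != "markers") {
                print "line 1: unexpected header"
                bad = 1
            }
            next
        }
        NF != 4 {
            print "line " NR ": expected 4 columns, found " NF
            bad = 1
            next
        }
        $4 == "" || $4 ~ /[,; ]/ {
            print "line " NR ": markers must hold exactly one gene"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling check_markers with the path to a custom table prints one message per offending line and returns a non-zero status if the table breaks the long format.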

3.2. Meta-programs

Similar to customizing cell markers, users will need to adhere to pre-established standards. The source column will serve as the anchor that the pipeline uses to subset the meta-programs database.

source	meta_program	gene_marker
Malignant	Astrocytes	SPARCL1
Malignant	Astrocytes	GFAP
Malignant	Astrocytes	CLU
Malignant	Astrocytes	CRYAB
Malignant	Astrocytes	TTYH1
Malignant	Astrocytes	SLC1A3
Malignant	Astrocytes	CST3
Malignant	Astrocytes	ID3
Malignant	Astrocytes	AGT
Malignant	Astrocytes	APOE
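To illustrate how the source column can act as a subset key, the sketch below filters a tab-separated meta-programs table by source. Both helpers (list_meta_programs and genes_for_program) are hypothetical and only mirror the table layout shown above; the pipeline's internal subsetting may work differently.

```shell
# list_meta_programs FILE SOURCE
# Prints the distinct meta_program names recorded for one source.
list_meta_programs() {
    awk -F'\t' -v src="$2" 'NR > 1 && $1 == src { print $2 }' "$1" | sort -u
}

# genes_for_program FILE SOURCE META_PROGRAM
# Prints the gene markers of one meta-program, in file order.
genes_for_program() {
    awk -F'\t' -v src="$2" -v prog="$3" \
        'NR > 1 && $1 == src && $2 == prog { print $3 }' "$1"
}
```

For the table above, list_meta_programs with the source Malignant would list Astrocytes, and genes_for_program with Malignant and Astrocytes would list the ten genes, one per line.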

4. Adding custom genomes to the pipeline (Cellranger alignment)

The pipeline employs Cellranger for alignment. We set up the reference genome based on GENCODE (v46) and GRCh38, following the tutorial provided in the 10x Genomics documentation.

Alternatively, users can substitute it with a version of their choice, but certain conventions must be observed. The mkref output should be stored in a folder that adheres to the following structure:

Genomes/Homo_sapiens/ANNOTATION/GENOME_VERSION/

The terms ANNOTATION and GENOME_VERSION should be replaced with the user's preferred choices. Next, the user will need to edit a few lines in conf/igenomes.config within the btc-scrna-pipeline folder.

params {
    // illumina iGenomes reference file paths
    genomes {
        'GRCh38' {
            cellranger       = "${params.igenomes_base}/Genomes/Homo_sapiens/Gencode46/GRCh38"
        }
        // Add your custom genome here
        'Custom' {
            cellranger       = "${params.igenomes_base}/Genomes/Homo_sapiens/ANNOTATION/GENOME_VERSION"
        }
    }
}
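Before launching the pipeline, it can help to confirm that the folder referenced by the 'Custom' entry exists. The sketch below rebuilds the expected path from its three parts and checks it. genome_dir and check_genome_dir are hypothetical helpers, and the first argument stands for whatever value will be passed as igenomes_base.

```shell
# genome_dir BASE ANNOTATION GENOME_VERSION
# Prints the folder that conf/igenomes.config resolves for a custom genome.
genome_dir() {
    printf '%s/Genomes/Homo_sapiens/%s/%s\n' "$1" "$2" "$3"
}

# check_genome_dir BASE ANNOTATION GENOME_VERSION
# Returns 0 if the expected folder exists, 1 otherwise.
check_genome_dir() {
    if [ -d "$(genome_dir "$1" "$2" "$3")" ]; then
        echo "found: $(genome_dir "$1" "$2" "$3")"
    else
        echo "missing: $(genome_dir "$1" "$2" "$3")" >&2
        return 1
    fi
}
```

For example, check_genome_dir path/to ANNOTATION GENOME_VERSION reports whether the mkref output folder is where the configuration expects it.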

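Finally, the command line should include both the genome and igenomes_base parameters. Because conf/igenomes.config appends Genomes/Homo_sapiens/ANNOTATION/GENOME_VERSION to igenomes_base, that parameter should point to the directory that contains the Genomes folder (shown here as path/to).

nextflow run main.nf --project_name Training --sample_csv sample_table.csv --meta_data meta_data.csv --cancer_type Ovarian --genome Custom --igenomes_base path/to -resume -profile seadragon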