Data Processing Overview¶

This section describes how reSCRP processes single-cell RNA sequencing (scRNA-seq) data, from raw input files to the scrp-module format.

Processing Pipeline¶

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Raw Data      │    │  Preprocessing  │    │   Database      │
│                 │    │                 │    │                 │
│ • Seurat RDS    │───▶│ • R Scripts     │───▶│ • MariaDB       │
│ • Metadata      │    │ • Filtering     │    │                 │
│ • Annotations   │    │ • Normalization │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                  │                       │
                                  │                       │
                                  ▼                       ▼
                         ┌─────────────────┐    ┌─────────────────┐
                         │   TSV Export    │    │  Web Interface  │
                         │                 │    │                 │
                         │ • Expression    │    │ • Interactive   │
                         │ • Metadata      │    │                 │
                         │ • Gene Lists    │    │                 │
                         └─────────────────┘    └─────────────────┘

Data Sources¶

Input Data Types¶

1. Seurat Objects (RDS files)¶

# Example clustered Seurat object structure
seurat_obj
├── @assays
│   └── RNA
│       ├── counts      # Raw count matrix
│       ├── data        # Normalized expression
│       └── scale.data  # Scaled expression
├── @meta.data
│   ├── seurat_clusters # Cluster assignments
│   ├── Patient         # Clinical metadata
│   ├── TissueType      # Sample information
│   └── CancerType      # Disease classification
└── @reductions
    └── umap            # Dimensionality reduction
        └── embeddings  # 2D coordinates

2. Differential Expression Results¶

# DEG analysis output format
deg_results
├── p_val          # Unadjusted p-value
├── avg_logFC      # Average log fold change (cluster vs. other cells)
├── pct.1          # Fraction of cells expressing the gene in the cluster
├── pct.2          # Fraction of cells expressing the gene in other cells
├── p_val_adj      # p-value adjusted for multiple testing
├── cluster        # Cluster identifier
└── gene           # Gene symbol

3. Clinical Metadata¶

# Required metadata columns
metadata
├── Patient        # Patient identifier
├── Sample         # Sample identifier
├── TissueType     # Tumor/Normal/etc
├── CancerType     # Disease type
├── OrganSite      # Anatomical location
└── Disease        # Specific diagnosis
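
A check for the required columns listed above can be sketched in a few lines. This is a minimal Python illustration, not the project's R preprocessing code; the function name `missing_metadata_columns` and the sample header are hypothetical.

```python
# Required metadata columns, as listed in the documentation above.
REQUIRED_COLUMNS = (
    "Patient", "Sample", "TissueType", "CancerType", "OrganSite", "Disease",
)


def missing_metadata_columns(columns):
    """Return the required columns absent from `columns`, in documented order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical header, e.g. taken from an exported metadata table.
header = ["Patient", "Sample", "TissueType", "CancerType", "seurat_clusters"]

missing = missing_metadata_columns(header)
if missing:
    print("Missing required metadata columns:", ", ".join(missing))
```

Reporting every missing column at once, rather than failing on the first, tends to make it easier to fix an input file in one pass.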

Preprocessing Workflow¶

1. Data Loading and Validation¶

The steps performed by the preprocessing script are not yet documented here.

*** Work in progress ***

Database Schema¶

Expression Tables¶

CREATE TABLE {CellType}_EXP (
    Barcode TEXT NOT NULL,     -- Cell identifier
    Marker TEXT NOT NULL,      -- Gene symbol
    EXP DOUBLE NOT NULL,       -- Normalized expression
    INDEX idx_marker_barcode (Marker, Barcode),
    INDEX idx_barcode_marker (Barcode, Marker)
);
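
The expression table stores one row per (cell, gene) pair, so a genes × cells matrix has to be reshaped into long format before loading. The sketch below illustrates that reshaping in Python, using the standard-library `sqlite3` module as a stand-in for MariaDB. The cell type name `TCell`, the barcodes and the values are hypothetical, and the sketch keeps zero values; whether the actual pipeline stores or drops zeros is not stated here. SQLite does not accept inline `INDEX` clauses, so the index is created separately.

```python
import sqlite3


def to_long_rows(genes, barcodes, matrix):
    """Yield (Barcode, Marker, EXP) tuples from a genes x cells matrix."""
    for gene, row in zip(genes, matrix):
        for barcode, value in zip(barcodes, row):
            yield (barcode, gene, float(value))


# Hypothetical normalized expression: rows are genes, columns are cells.
genes = ["CD3E", "MS4A1"]
barcodes = ["AAAC-1", "AAAG-1", "AAAT-1"]
matrix = [
    [1.2, 0.0, 2.5],
    [0.0, 3.1, 0.0],
]

conn = sqlite3.connect(":memory:")  # stand-in for the MariaDB database
conn.execute(
    "CREATE TABLE TCell_EXP ("
    "Barcode TEXT NOT NULL, Marker TEXT NOT NULL, EXP DOUBLE NOT NULL)"
)
conn.execute("CREATE INDEX idx_marker_barcode ON TCell_EXP (Marker, Barcode)")
conn.executemany(
    "INSERT INTO TCell_EXP VALUES (?, ?, ?)",
    to_long_rows(genes, barcodes, matrix),
)

# Per-gene lookup, the access pattern served by the (Marker, Barcode) index.
cd3e = conn.execute(
    "SELECT Barcode, EXP FROM TCell_EXP WHERE Marker = ? ORDER BY Barcode",
    ("CD3E",),
).fetchall()
print(cd3e)
```

The two composite indexes in the schema cover both lookup directions: all cells for one gene, and all genes for one cell.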

Metadata Tables¶

-- Full metadata schema
CREATE TABLE {CellType}_meta (
    Barcode TEXT PRIMARY KEY,
    TissueType TEXT,
    CancerType TEXT,
    Patient TEXT,
    Sample TEXT,
    CellClusterType TEXT,
    UMAP1 DOUBLE,
    UMAP2 DOUBLE
);

-- Simplified schema (for subsets)
CREATE TABLE {Subset}_meta (
    Barcode TEXT PRIMARY KEY,
    UMAP1 DOUBLE,
    UMAP2 DOUBLE,
    CellClusterType TEXT
);
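
One plausible use of the metadata tables is to join UMAP coordinates with the expression of a single gene, for example to colour cells in a plot. The Python sketch below shows such a join, with `sqlite3` standing in for MariaDB. The table names `TCell_meta` and `TCell_EXP`, the sample rows, and the choice to treat cells without an expression row as 0 are assumptions made for illustration, not documented behaviour of reSCRP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MariaDB database
conn.executescript(
    """
    CREATE TABLE TCell_meta (
        Barcode TEXT PRIMARY KEY, TissueType TEXT, CancerType TEXT,
        Patient TEXT, Sample TEXT, CellClusterType TEXT,
        UMAP1 DOUBLE, UMAP2 DOUBLE
    );
    CREATE TABLE TCell_EXP (
        Barcode TEXT NOT NULL, Marker TEXT NOT NULL, EXP DOUBLE NOT NULL
    );
    """
)

# Hypothetical cells and expression values.
conn.executemany(
    "INSERT INTO TCell_meta VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("AAAC-1", "Tumor", "LUAD", "P1", "S1", "CD8 T", 1.0, 2.0),
        ("AAAG-1", "Normal", "LUAD", "P1", "S2", "CD4 T", -0.5, 0.75),
    ],
)
conn.executemany(
    "INSERT INTO TCell_EXP VALUES (?, ?, ?)",
    [("AAAC-1", "CD3E", 1.2), ("AAAC-1", "CD8A", 2.0)],
)

# LEFT JOIN keeps every cell; COALESCE fills 0.0 where no row exists
# for the requested gene (an assumption of this sketch).
QUERY = """
SELECT m.Barcode, m.UMAP1, m.UMAP2, m.CellClusterType,
       COALESCE(e.EXP, 0.0) AS EXP
FROM TCell_meta AS m
LEFT JOIN TCell_EXP AS e
       ON e.Barcode = m.Barcode AND e.Marker = ?
ORDER BY m.Barcode
"""
points = conn.execute(QUERY, ("CD3E",)).fetchall()
for point in points:
    print(point)
```

Placing the gene filter in the `ON` clause rather than in `WHERE` is what preserves cells that have no row for the gene.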

DEG Tables¶

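CREATE TABLE {CellType}_DEG (
    p_val DOUBLE,        -- Unadjusted p-value
    avg_logFC DOUBLE,    -- Average log fold change
    pct1 DOUBLE,         -- pct.1 in the DEG output (in cluster)
    pct2 DOUBLE,         -- pct.2 in the DEG output (in other cells)
    p_val_adj DOUBLE,    -- p-value adjusted for multiple testing
    cluster TEXT,        -- Cluster identifier
    gene TEXT            -- Gene symbol
);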
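
The DEG output uses the column names `pct.1` and `pct.2`, while the table uses `pct1` and `pct2`, so the columns need renaming at load time. The following Python sketch shows one way to do that for a tab-separated DEG file. The function name `deg_rows` and the sample record are hypothetical; the project's own loader may work differently.

```python
import csv
import io

# DEG output column -> DEG table column (only pct.1 and pct.2 change).
COLUMN_MAP = {
    "p_val": "p_val",
    "avg_logFC": "avg_logFC",
    "pct.1": "pct1",
    "pct.2": "pct2",
    "p_val_adj": "p_val_adj",
    "cluster": "cluster",
    "gene": "gene",
}
NUMERIC_COLUMNS = {"p_val", "avg_logFC", "pct1", "pct2", "p_val_adj"}


def deg_rows(handle):
    """Yield one dict per DEG record, keyed by the table's column names."""
    for record in csv.DictReader(handle, delimiter="\t"):
        row = {}
        for source, target in COLUMN_MAP.items():
            value = record[source]
            # DOUBLE columns are parsed as floats; TEXT columns stay strings.
            row[target] = float(value) if target in NUMERIC_COLUMNS else value
        yield row


# Hypothetical single-record DEG file.
sample = io.StringIO(
    "p_val\tavg_logFC\tpct.1\tpct.2\tp_val_adj\tcluster\tgene\n"
    "1e-30\t1.85\t0.92\t0.15\t2e-26\t0\tCD3E\n"
)
rows = list(deg_rows(sample))
print(rows[0])
```

Keeping `cluster` as a string matches the `TEXT` column type in the table, even when cluster identifiers look numeric.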

Output Formats¶

TSV Files¶

Expression: {CellType}_EXP_{Date}.tsv
Metadata: {CellType}_meta_{Date}.tsv
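
The naming pattern above can be illustrated with a short Python sketch that builds a file name and writes tab-separated rows. The `{Date}` format is not specified in this document; `YYYYMMDD` is assumed here, and the helper names `tsv_filename` and `write_tsv` are hypothetical.

```python
import csv
import io
from datetime import date


def tsv_filename(cell_type, kind, on):
    """Build '{CellType}_{kind}_{Date}.tsv'; the YYYYMMDD date is an assumption."""
    if kind not in ("EXP", "meta"):
        raise ValueError(f"unknown export kind: {kind!r}")
    return f"{cell_type}_{kind}_{on.strftime('%Y%m%d')}.tsv"


def write_tsv(handle, header, rows):
    """Write a header line followed by tab-separated rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical cell type and export date.
name = tsv_filename("TCell", "EXP", date(2024, 1, 15))

# An in-memory buffer keeps the sketch free of file-system side effects;
# in practice the handle would come from open(name, "w", newline="").
buffer = io.StringIO()
write_tsv(buffer, ["Barcode", "Marker", "EXP"], [("AAAC-1", "CD3E", 1.2)])
print(name)
print(buffer.getvalue(), end="")
```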