BTC Data Reference
The aim of this document is to serve as a central point of reference for data handling in Break Through Cancer (BTC), capturing the consensus rubric under which data are generated, annotated, aggregated, governed, and accessed. This includes metadata capture during patient enrollment and sample acquisition, standards and processes for molecular assay data generation and pipelines, data flow diagrams providing simplified views, and a FAQ for common questions. In concert with BTC Disease TeamLabs, these norms are being codified by the Data Science TeamLab (DST) as a key element of the data science proposal.
The system and infrastructure that serve as the convergence point for data science activity within BTC are code-named DASH, short for DAta Science Hub. DASH is informed by numerous standards, including F.A.I.R. data practices and NIH guidelines, and draws heavily on lessons learned and software developed in earlier and sister projects, including HTAN, GDC, and TCGA.
Data Life Cycle
Within BTC the flow of data naturally subdivides along the boundaries of institutional and BTC-wide ecosystems:
- Institutional: data, research, and clinical trial activity typically within the walls of a single institution, performed upon and annotated by institutional systems per their internal practice. Examples include pre-clinical or basic research results from a single PI lab, or the exported and de-identified data from a clinical trial data system.
- DASH: data shared by TeamLabs across institutional boundaries, by way of DASH
How DASH maintains identifiers for and provenance of these data is described in more detail below.
Planning for Data Sharing
Please contact the BTC Programs Team when planning to share data used by or generated in your TeamLab, whether pre-clinical, clinical, or publication data. The Programs Team will help assess whether there are contractual, institutional, or other encumbrances upon the data that must be cleared before they may be shared. While planning, it is also important to note that:
- Sharing non-public data via DASH does not confer access to other TeamLabs: such data are initially embargoed, with access limited to members of the submitting TeamLab until the embargo period ends
- Patient IDs, sample IDs, and related metadata are defined within clinical lab manuals and SOPs: this helps BTC maintain the provenance of material and data as they travel from one institution to another
- SOPs are in place to remove PHI prior to sharing
Sharing/Submitting Data
When data are ready to be shared/analyzed via DASH, please contact the data coordinator(s) at your institution:
- MD Anderson: Kenna Shaw (PI: Caroline Chung), Tracee Burnsteel
- MIT: Charles Demurjian (PI: Stuart Levine)
- Johns Hopkins: Meredith Wetzel (PI: Elana Fertig)
- Memorial Sloan Kettering: Eli Havasov (PI: Sohrab Shah)
- Dana Farber: (interim) Jenn Gantchev, Michael Regan; Siri Palreddy (mid-Sept 2024, email TBD)
The role of the BTC data coordinator is described here. Your data coordinator and the DASH team will be happy to guide data providers through the process and help decide which staging method to use (SharePoint or cloud buckets). The staging area is a data lake-style abstraction: a semi-organized collection of storage bins, stratified by TeamLab, whose contents are provided "as is," with little (or no) formal validation or annotation. TeamLab members may elect to use such staged data immediately, prior to further curation, as needs dictate.
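For cloud-bucket staging, a transfer might look roughly like the sketch below. This is a minimal illustration only: the bucket name, TeamLab prefix, and file name are hypothetical placeholders, and your data coordinator will supply the actual staging location and credentials.

```python
# Minimal sketch of staging a file to a cloud bucket, assuming an S3-style
# staging area stratified by TeamLab. All names below are hypothetical
# placeholders, not actual DASH locations.
import boto3

s3 = boto3.client("s3")

TEAMLAB = "gbm-teamlab"               # hypothetical TeamLab prefix
STAGING_BUCKET = "btc-dash-staging"   # hypothetical staging bucket

s3.upload_file(
    Filename="sample01_scRNAseq.fastq.gz",            # local file to stage
    Bucket=STAGING_BUCKET,
    Key=f"{TEAMLAB}/raw/sample01_scRNAseq.fastq.gz",   # "as is" staging key
)
```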
Data Curation
After staging, many (but not necessarily all) BTC datasets will also be curated, which includes steps such as: validating that shared files do not include PHI, verifying that trial-specific IDs (when applicable) have been associated so that the data may be tracked, and ensuring sufficient metadata have been submitted so that others may productively use the data. As described below, BTC-specific IDs will also be assigned within the DASH database to maintain provenance, enable pan-cancer analysis, and promote data governance (such as sandboxed use during embargo periods). These BTC IDs supplement, rather than replace, any IDs attached at the point of data origination (e.g. in a clinical trial or PI lab) and the mapping between the two will be retained internally by DASH as metadata.
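As an illustration of the kind of record retained during curation, the sketch below shows a hypothetical mapping between originating IDs and DASH-assigned BTC IDs. The field names and ID formats are invented for illustration and are not the DASH schema itself.

```python
# Illustrative curation record linking DASH-assigned BTC IDs to the IDs
# attached at the point of data origination. Field names and ID formats
# are hypothetical.
curation_record = {
    "btc_subject_id": "BTC-GBM1-0001",          # assigned by DASH (hypothetical format)
    "btc_biospecimen_id": "BTC-GBM1-0001-S1",   # assigned by DASH (hypothetical format)
    "source_ids": {                             # IDs from the originating trial or lab
        "trial_subject_id": "TRIAL1-PT-017",
        "trial_sample_id": "TRIAL1-PT-017-BX3",
    },
    "phi_screened": True,                       # curation step: PHI removal verified
    "embargoed_until": "2025-06-30",            # governance: sandboxed during embargo
}
```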
Accessing Data
Both staged and curated data are summarized in our high-level DASHBoard and may be downloaded interactively from the data browser (temporarily disabled during AWS migration) or programmatically via the aws cli for many or large files or for pipelined analysis. Additional information is available in the technical instructions for data sharing. Finally, recall that staged data are offered "as is," while curated data will soon support more advanced use cases such as querying with fine-grained parameters (e.g. gene name) or visual exploration in familiar tools like cBioPortal, cellxgene, or Minerva as appropriate.
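For programmatic access, a bulk download via the aws cli might look like the sketch below; the bucket and prefix are placeholders, and the technical instructions for data sharing describe the actual locations and required credentials.

```python
# Minimal sketch of a programmatic bulk download with the aws cli, as an
# alternative to the interactive data browser. Bucket and prefix are
# hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "aws", "s3", "sync",
        "s3://btc-dash-curated/gbm-teamlab/level3/",  # hypothetical curated prefix
        "./gbm_level3/",                              # local destination
    ],
    check=True,  # raise if the sync fails
)
```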
Analysis and Pipelines
The DST has established joint Protocol and Analysis working groups with disease teams to: identify analysis needs across TeamLabs; align common SOPs for data generation and QC; coordinate changes as new technologies and needs arise; tie each central analytic pipeline to consumers of its data in each TeamLab; and perform regular check-ins to ensure analytic approaches continue to suit disease TeamLab objectives.
For each data type generated across two or more TeamLabs, the DST will provide best-practices pipelines and/or tools for each data level: from Levels 1 and 2 primary data generation (L1, L2) through subsequent L3 and L4 analyses. This work is very active, and a number of analysis pipelines are already available from DFCI, JHU, MD Anderson, NextFlow nf-core, and the MIT BioMicroCenter.
As these and other pipelines harden toward robust, self-service tools, they are also being added to a common, increasingly multi-modal analysis workflow in the BTC Cirro instance. The BTC Cirro instance should be immediately accessible to anyone in BTC via single sign-on with the same credentials used to access other BTC resources (e.g. SharePoint or Teams). The Cirro platform runs NextFlow and WDL pipelines, and provides powerful tools for computationalists in a modern, easy-to-use interface that is also approachable for non-computationalists. There are over 170 pipeline configurations already available, and we are making project spaces available for each TeamLab that are pre-loaded with that team's data, include all available analysis pipelines, and are by default sandboxed for access only by members of the team. Please contact the DASH team if you would like to contribute pipelines, perform analyses, or simply learn more.
BTC Identifier Scheme
As data are added to DASH they are tagged with subject and sample (biospecimen) identifiers as follows. At minimum, such IDs will be attached to BTC clinical trial data, and ideally to BTC pre-clinical, basic research, and external data as appropriate (all by way of the scheme described here). Note that "biospecimen" is preferred over "sample" for generality and to capture that samples can be subdivided into multiple portions. The association between data file and biospecimen is maintained as metadata within the DASH database, not within the file identifier itself. To see how this ID scheme might play out, consider a hypothetical data tree for the first subject (patient) of the first BTC glioblastoma multiforme trial. Here, 6 needle biopsy cores (samples) were extracted, the first of which has been characterized in multiple assays, yielding 8 distinct molecular data output files (i.e. one per data type). Each interventional timepoint in a longitudinal trial would yield a new sample (or samples) for that subject. Finally, note that BTC may devise study names which encompass the entire set of data generated by a TeamLab (e.g. a study name of "GBM" that might include ALL data from trial 1, trial 2, and so forth).
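A purely illustrative rendering of that tree is sketched below. The ID strings and data-type placeholders are hypothetical, since the concrete BTC ID format is assigned within the DASH database.

```python
# Illustrative data tree for the hypothetical example above: the first
# subject of the first BTC GBM trial, with 6 needle biopsy cores and 8
# molecular data files (one per data type) for the first core. All names
# are placeholders, not the actual BTC ID format.
data_tree = {
    "study": "GBM_Trial1",                # hypothetical study name
    "subject": "GBM_Trial1_Subject1",     # first enrolled subject
    "biospecimens": {
        # six needle biopsy cores; only the first has been assayed so far
        "GBM_Trial1_Subject1_Core1": [f"datafile_datatype{i}" for i in range(1, 9)],
        **{f"GBM_Trial1_Subject1_Core{i}": [] for i in range(2, 7)},
    },
}
```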
Scope
It will help to amplify a point made above in the context of data curation, namely that BTC identifiers apply only to data AFTER sharing into DASH. These IDs do not replace identifiers attached to data at their point of origin (e.g. in a clinical trial data system, a PI lab research project, or an external publication); in fact those original IDs will be carried along, to the greatest extent possible, as metadata when data are shared into DASH. In other words, DASH does not seek to legislate how data generators identify data within the context of their original use case(s), only how data are identified once they are shared within DASH.
Data Entities and Levels
As data progress through the BTC life cycle they are processed and transformed through a series of “data levels.” For each data context (e.g. clinical, imaging, sequencing, spatial) the constituent files of each data level may differ, but in all cases Level 1 represents raw or uncurated data (e.g. directly from an instrument) and each successive level represents a maturation of that data towards analysis and, eventually, publication and wider utilization.
Unless explicitly stated otherwise, we propose BTC data infrastructure, processing and analysis adopt existing GDC + HTAN standards and nomenclature, including for clinical, biospecimen, sequencing and imaging data. For convenience, some of those standards are excerpted below, but we refer the reader back to the original material at the given links for an exhaustive treatment. Each disease study in BTC contributes data from one or more enrolled Subjects, who have donated Biospecimens. Data files are generated when biospecimens are assayed by an instrument/protocol or processed in downstream SOPs or analyses. Captured metadata enables tracing of any data file back to its source biospecimen. Level 1 raw data files are derived directly from the corresponding biospecimen, whereas processed Level 2-4 data files are linked to lower-level parent data files.
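To make these relationships concrete, here is a minimal sketch using hypothetical class and field names (the authoritative models are the GDC + HTAN standards referenced above): Level 1 files link directly to a biospecimen, while Level 2-4 files link to a lower-level parent file, so any file can be traced back to its source.

```python
# Sketch of the entity relationships described above; class and field names
# are hypothetical, not the DASH schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Biospecimen:
    biospecimen_id: str
    subject_id: str                       # donor of this biospecimen

@dataclass
class DataFile:
    file_id: str
    level: int                            # 1 = raw ... 4 = sample-level summary
    biospecimen_id: Optional[str] = None  # set for Level 1 raw files
    parent_file_id: Optional[str] = None  # set for Level 2-4 processed files

def trace_to_biospecimen(f: DataFile, files: dict[str, DataFile]) -> str:
    """Walk parent links until reaching the Level 1 file's biospecimen."""
    while f.biospecimen_id is None:
        f = files[f.parent_file_id]
    return f.biospecimen_id
```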
Clinical Data Levels (Tiers)
The BTC clinical data model consists of three tiers. Tier 1 aligns with Genomic Data Commons (GDC) guidelines for clinical data, while Tiers 2 and 3 are informed by the HTAN extensions to the GDC model. Clinical data in BTC are still evolving as we develop trial forms and SOPs across institutions. Only Tier 1 is encompassed thus far, but no data will be ingested into DASH unless they are (a) fully de-identified and (b) accompanied by minimally viable metadata (biospecimen and/or clinical).
Omic Data Levels
In alignment with TCGA and the NCI Genomic Data Commons, BTC will categorize multi-omic data into four levels:
| Level | Definition | Examples |
|---|---|---|
| 1 | Raw data | FASTQs, unaligned BAMs |
| 2 | Aligned primary data | Aligned BAMs |
| 3 | Derived biomolecular data | Gene expression matrix files, VCFs, etc. |
| 4 | Sample-level summary data | t-SNE plot coordinates, etc. |
These levels will apply to the multiple assay and sequencing modalities (omic data types) envisioned for BTC, including single-cell and single-nucleus RNA Seq (sc/snRNASeq), single-cell ATAC Seq, bulk RNAseq, and bulk DNAseq.
We propose that BTC follow the latest GENCODE version for gene annotations, GENCODE Version 43. GENCODE is used for gene definitions by many consortia, including ENCODE, NCI Genomic Data Commons, Human Cell Atlas, and PCAWG (Pan-Cancer Analysis of Whole Genomes). Ensembl gene content is essentially identical to that of GENCODE (FAQ) and interconversion is possible.
We further propose BTC adopt the GENCODE 43 Gene Transfer Format (GTF) basic gene annotation file (GENCODE 43 GTF) and filtered files (GENCODE 43 GTF with genes only; GENCODE 43 GTF with genes only and retaining only the chromosome X copy of the pseudoautosomal region) for gene annotation. BTC may also include external data generated with other gene models, as the process of implementing the standard is ongoing. Within BTC metadata files, we propose the reference genome in use be specified in columns/attributes named "Genomic Reference" and "Genomic Reference URL".
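As a rough illustration of the proposal above, the sketch below filters a GENCODE 43 basic annotation GTF down to gene records only and records the proposed reference-genome attributes. The file names, output path, and attribute values shown are placeholders rather than prescribed BTC conventions.

```python
# Minimal sketch: derive a genes-only filtered annotation from a GENCODE 43
# basic GTF and record the proposed "Genomic Reference" attributes. File
# names and attribute values are illustrative placeholders.
import gzip

with gzip.open("gencode.v43.basic.annotation.gtf.gz", "rt") as src, \
        open("gencode.v43.genes_only.gtf", "w") as dst:
    for line in src:
        # GTF is tab-separated; the third column holds the feature type
        if line.startswith("#") or line.split("\t")[2] == "gene":
            dst.write(line)

file_metadata = {
    "Genomic Reference": "GRCh38 / GENCODE 43",  # illustrative value
    "Genomic Reference URL": "https://www.gencodegenes.org/human/release_43.html",
}
```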
External Data
Data generated through efforts funded by other organizations are referred to as "external data". BTC investigators are free (and encouraged) to utilize external data in BTC work, which will typically play out in one of two ways:
- Ad-hoc: the investigator or their staff downloads external data to local institutional resources (e.g. on-prem compute or cloud) and references it in their local BTC analyses; here the investigator and/or staff initiates and performs the collecting, aggregating, and storing of external data on their institutional systems
- DASH: the investigator requests that BTC make external data easily accessible to other BTC collaborators via DASH, either in raw form directly from the staging area or more formally with BTC identifiers attached so that the identity and provenance are clear
In the latter case of external data being assigned BTC IDs, a unique study name will be created to indicate that the dataset is from an external source, following the schema EXT_. Tagging external data in this way has several benefits:
- Enabling the data to be seamlessly mingled with internal BTC data
- Allowing them to be processed and analyzed at scale in downstream pipelines or analysis tools
- Making the external identity and provenance of the data clear
- Simplifying later bookkeeping and database tracing when assembling data for publication
Assigning BTC IDs in this way should not, however, be interpreted as a claim that BTC now "owns" or is attempting to "re-brand" those external data.