Conceptual Description of the Level 1 C2M2¶

Maintainers: Rick Wagner

Version: 0.1

License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

Objectives¶

This is a conceptual and narrative description of the Level 1 Crosscut Metadata Model (C2M2). It covers the things (proper nouns) in the Level 1 C2M2 and their relationships, and describes the tables used to represent them. The last section covers the internal controlled vocabularies used for a few attributes. These notes do not go heavily into things like the format (syntax) of the columns or the specific primary key and foreign key relationships.

Things (Proper Nouns) Described¶

The Level 1 C2M2 includes tables to describe the following things (entities), and the relationships among them.

Namespaces
Project
File
Subject
Biosample
Collection

This section has descriptions of each thing and a list of its attributes (fields).

Namespaces¶

A namespace is a logical grouping of things, used to avoid collisions among the names used by different Data Coordination Centers (DCCs). We assume that each DCC assigns a unique local name to each thing that it manages. (If this assumption is violated--if, for example, biosamples and files may be assigned the same local name--then additional local structure may be needed.) Then, anything from any DCC can be given a unique global name by concatenating the namespace id of the DCC from which the thing originates with the local name assigned to the thing by that DCC. Thus, for example, two things originating from DCC1 and DCC2, each assigned the local name Sample1, will have distinct C2M2 names: DCC1:Sample1 and DCC2:Sample1.
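The concatenation rule can be sketched in a few lines of Python. This is only an illustration: the function name `global_name` is hypothetical, and the ":" separator is taken from the DCC1:Sample1 example above rather than from a formal specification.

```python
def global_name(namespace_id: str, local_name: str, sep: str = ":") -> str:
    """Join a namespace id and a DCC-local name into one global name.

    The ":" separator mirrors the DCC1:Sample1 example; treat it as an
    assumption, not as a statement about the C2M2 column syntax.
    """
    if not namespace_id or not local_name:
        raise ValueError("both a namespace id and a local name are required")
    return f"{namespace_id}{sep}{local_name}"


# Two DCCs reuse the same local name, yet the global names stay distinct.
names = {global_name("DCC1", "Sample1"), global_name("DCC2", "Sample1")}
```

Here `names` holds two different strings, which is the collision-avoidance property the namespace is meant to provide.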

Attributes¶

namespace A globally unique ID representing this namespace
abbreviation A short display label for this namespace
name A short, human-readable, machine-read-friendly label for this namespace
description A human-readable description of this namespace

Project¶

There can be a single project for each DCC, or things like studies can be represented as subprojects. The persistent id field could be a website for the project, or a DOI for a paper. When we get to collections to describe datasets or cohorts, we'll show what project they were part of.

abbreviation A short display label for this project
name A short, human-readable, machine-read-friendly label for this project
description A human-readable description of this project
persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this project

File¶

file id The unique name for this file, composed of:
- namespace Namespace for the DCC or file creator
- id An ID representing this file, unique within this namespace
project Which project or subproject created this file
persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC (using, e.g., our minid server) and attached to this file
creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this file's creation time

Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
size_in_bytes The size of a file in bytes
sha256 The output of the SHA-256 cryptographic hash function after being run on this file: one or both of sha256 and md5 is required; sha256 is preferred
md5 The output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file: one or both of sha256 and md5 is required; sha256 is preferred
filename A filename with no prepended PATH information
file_format An EDAM CV term ID identifying the digital format of this file

Ex. TSV or FASTQ
data_type An EDAM CV term ID identifying the type of information stored in this file

Ex. RNA sequence reads
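As a rough sketch of how the computed File attributes might be filled in, the Python below derives size_in_bytes, sha256, md5 and filename from a file on disk and formats a creation_time in the YYYY-MM-DDTHH:MM:SS±NN:NN shape shown above. The helper names `describe_file` and `rfc3339` are hypothetical, and writing the digests as lowercase hexadecimal is an assumption; these notes do not specify the encoding.

```python
import hashlib
import os
from datetime import datetime, timezone


def describe_file(path: str, chunk_size: int = 1 << 20) -> dict:
    """Compute the size, checksums and bare filename for one file."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    # Read in chunks so that large files are never loaded into memory whole.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {
        "size_in_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),  # hex encoding is an assumption
        "md5": md5.hexdigest(),        # hex encoding is an assumption
        "filename": os.path.basename(path),  # no prepended PATH information
    }


def rfc3339(moment: datetime) -> str:
    """Format a timezone-aware datetime as YYYY-MM-DDTHH:MM:SS±NN:NN."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.isoformat(timespec="seconds")


creation_time = rfc3339(datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
```

Computing both digests in a single pass satisfies the "one or both" requirement without reading the file twice.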

Subject¶

subject id The unique name for this subject, composed of:
- namespace Namespace for the DCC or subject provider
- id An ID representing this subject, unique within this namespace
project Which project or subproject created this subject
persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this subject
creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this subject record's creation time

Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
granularity A CFDE CV term categorizing this subject by multiplicity (see Subject Granularity under Controlled Vocabularies). One of:
- single organism
- symbiont system
- host-pathogen system
- microbiome
- cell line
- synthetic

Biosample¶

biosample id The unique name for this biosample, composed of:
- namespace Namespace for the DCC or biosample owner
- id An ID representing this biosample, unique within this namespace
project Which project or subproject created this biosample
persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this biosample
creation_time An ISO 8601 -- RFC 3339 (subset)-compliant timestamp documenting this biosample's creation time

Ex. YYYY-MM-DDTHH:MM:SS±NN:NN
assay_type An OBI CV term ID describing the type of material represented by this biosample
anatomy An UBERON CV term ID used to locate the origin of this biosample within the physiology of its source or host organism

Collection¶

Like projects, collections can have subcollections. Collections can hold files, biosamples, or subjects; membership is recorded in mapping tables (see Things in Collections under Relationships).

collection id The unique name for this collection, composed of:
- namespace Namespace for the DCC or collection creator
- id An ID representing this collection, unique within this namespace
persistent id A persistent, resolvable (not nec. retrievable) URI generated by a DCC and attached to this collection
abbreviation A very short display label for this collection
name A short, human-readable, machine-read-friendly label for this collection
description A human-readable description of this collection

Relationships¶

There are several relationships between things that can be described, like which subject a biosample comes from. These are often mapping tables between the unique names (namespace, id) of different things.

Things in Collections¶

Collections can contain one or more files, biosamples, or subjects. A collection may contain a combination of different types. There is one table per type that maps items of that type into their collections. Each item is identified by its namespace and id, as is the collection. Effectively, the tables look like the following:

Attributes¶

Files in collection
- file id The unique name (namespace, id) of the file
- collection id The unique name (namespace, id) of the collection

Biosamples in collection
- biosample id The unique name (namespace, id) of the biosample
- collection id The unique name (namespace, id) of the collection

Subjects in collection
- subject id The unique name (namespace, id) of the subject
- collection id The unique name (namespace, id) of the collection
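To make the three membership tables concrete, here is a minimal in-memory sketch in Python. Every identifier and value is made up for illustration (`Name`, `members`, `contents`, and the sample ids); the real tables are tabular files whose column syntax these notes deliberately leave out.

```python
from typing import NamedTuple


class Name(NamedTuple):
    """A unique name: the (namespace, id) pair used throughout the model."""
    namespace: str
    id: str


cohort = Name("DCC1", "cohort-A")  # hypothetical collection

# One mapping table per item type; each row pairs an item with a collection.
files_in_collection = {(Name("DCC1", "file-001"), cohort)}
biosamples_in_collection = {(Name("DCC1", "bs-01"), cohort)}
subjects_in_collection = {
    (Name("DCC1", "subj-1"), cohort),
    (Name("DCC1", "subj-2"), cohort),
}


def members(table, collection):
    """Return the items that one mapping table places in a collection."""
    return sorted(item for item, owner in table if owner == collection)


def contents(collection):
    """Gather a collection's members across all three item types."""
    return {
        "files": members(files_in_collection, collection),
        "biosamples": members(biosamples_in_collection, collection),
        "subjects": members(subjects_in_collection, collection),
    }
```

Calling `contents(cohort)` returns one list per type, which shows how a single collection can mix files, biosamples and subjects without any of them sharing a table.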

Biosamples and Subjects¶

To allow for multiple subjects to be represented in a single biosample and vice versa, there is a mapping table between biosamples and subjects.

Attributes¶

biosample id The unique name (namespace, id) of the biosample
subject id The unique name (namespace, id) of the subject
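Because the mapping is many-to-many, a lookup in either direction can return several results. The Python sketch below builds both indexes from a handful of made-up rows; the variable names and sample ids are hypothetical.

```python
from collections import defaultdict

# Rows of the biosample-subject mapping table: (biosample name, subject name),
# where each name is a (namespace, id) pair. All values are invented.
biosample_subject_rows = [
    (("DCC1", "bs-01"), ("DCC1", "subj-1")),
    (("DCC1", "bs-01"), ("DCC1", "subj-2")),  # one biosample, two subjects
    (("DCC1", "bs-02"), ("DCC1", "subj-1")),  # one subject, two biosamples
]

subjects_of = defaultdict(set)    # biosample name -> subject names
biosamples_of = defaultdict(set)  # subject name -> biosample names
for biosample, subject in biosample_subject_rows:
    subjects_of[biosample].add(subject)
    biosamples_of[subject].add(biosample)
```

After the loop, `subjects_of[("DCC1", "bs-01")]` holds two subjects and `biosamples_of[("DCC1", "subj-1")]` holds two biosamples, which a single foreign-key column on either table could not express.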

Files Describing Subjects and Biosamples¶

To show a relationship between a file and a subject or a biosample, like a sequence file generated from a biosample, there are two more mapping tables.

Attributes¶

Files describing biosamples
- file id The unique name (namespace, id) of the file
- biosample id The unique name (namespace, id) of the biosample

Files describing subjects
- file id The unique name (namespace, id) of the file
- subject id The unique name (namespace, id) of the subject
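If such a mapping table were stored as tab-separated text, reading it could look like the sketch below. Both the TSV storage and the four column names are assumptions made for this example; these notes do not define the column syntax.

```python
import csv
import io

# Hypothetical serialization of the "files describing biosamples" table.
# The column names are illustrative, not taken from the C2M2 specification.
TABLE_TEXT = (
    "file_namespace\tfile_id\tbiosample_namespace\tbiosample_id\n"
    "DCC1\tfile-001\tDCC1\tbs-01\n"
    "DCC1\tfile-002\tDCC1\tbs-01\n"
)


def read_pairs(text: str):
    """Parse rows into ((file ns, file id), (biosample ns, biosample id))."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (
            (row["file_namespace"], row["file_id"]),
            (row["biosample_namespace"], row["biosample_id"]),
        )
        for row in reader
    ]


pairs = read_pairs(TABLE_TEXT)
# Files that describe one particular biosample.
files_for_bs01 = [f for f, b in pairs if b == ("DCC1", "bs-01")]
```

Splitting each unique name into a namespace column and an id column keeps the composite (namespace, id) key explicit in the file.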

Subject Role and Taxonomy¶

This table links a subject, a subject_role (a named organism-level constituent component of a subject, such as 'host', 'pathogen', 'endosymbiont', or 'taxon detected inside a microbiome subject'), and a taxonomic label, which is thereby assigned to that particular subject_role within that particular subject.

Attributes¶

subject
- namespace The namespace of the subject
- id The ID of this subject
role The role assigned to this organism-level constituent component of this subject (see Subject Role under Controlled Vocabularies). One of:
- single organism
- host
- symbiont
- pathogen
- microbiome taxon
- cell line ancestor
- synthetic
taxonomy_id An NCBI Taxonomy Database ID identifying this taxon

CFDE Controlled Vocabularies¶

Subject Granularity¶

Term	Description
single organism	One organism
symbiont system	A mixed system consisting of two or more organisms (symbionts) in symbiosis (living colocated in time and space): one such symbiont may optionally be identified as a host
host-pathogen system	A special case of a symbiont system consisting of one symbiont, designated as a host, plus one or more other symbionts acting to create or sustain disease within the host organism
microbiome	A symbiont system consisting of a collection of (potentially unknown or partially characterized) taxa, where the environment in which the system resides is well-characterized, but the taxonomic composition of the system may be unknown; optionally contains one symbiont specially identified as a host
cell line	A cell line derived from one or more species or strains
synthetic	A synthetic biological entity

Subject Role¶

Term	Description
single organism	The organism represented by a subject in the 'single organism' granularity category
host	Any organism identified as a host for a subject assigned to the 'symbiont system', 'host-pathogen system', or 'microbiome' granularity categories
microbiome taxon	A constituent taxon of a subject assigned to the 'microbiome' granularity category; where that subject includes a host, a taxon in the microbiome (non-host) portion of the subject [NB: This role is probably not appropriate for Level 1, because it necessitates the post-facto attachment of downstream analysis procedures (subject -> sample -> library prep -> sequencing -> bioinformatics -> taxonomic classification results) to a subject which was originally uncharacterized at this level]
symbiont	An organism identified as a symbiont within a subject assigned to the 'symbiont system' granularity category
pathogen	An organism identified as a pathogen symbiont in a subject assigned to the 'host-pathogen system' granularity category
cell line ancestor	A taxon identified as a source organism for a subject assigned to the 'cell line' granularity category
synthetic	A synthetic biological entity
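The role descriptions above imply which granularity categories each role can appear under, and that can be turned into a simple consistency check. The Python below is a sketch, not a rule stated by the model: the names `ALLOWED_GRANULARITIES` and `role_fits` are hypothetical, and two pairings are assumptions ('microbiome taxon' with 'microbiome', and 'synthetic' with 'synthetic').

```python
# Role -> granularity categories it may appear under, read off the
# Subject Role descriptions. Entries marked "assumed" are not spelled
# out in the tables and are guesses made for this sketch.
ALLOWED_GRANULARITIES = {
    "single organism": {"single organism"},
    "host": {"symbiont system", "host-pathogen system", "microbiome"},
    "microbiome taxon": {"microbiome"},  # assumed
    "symbiont": {"symbiont system"},
    "pathogen": {"host-pathogen system"},
    "cell line ancestor": {"cell line"},
    "synthetic": {"synthetic"},  # assumed
}


def role_fits(role: str, granularity: str) -> bool:
    """Check whether a subject role is plausible for a subject granularity."""
    if role not in ALLOWED_GRANULARITIES:
        raise ValueError(f"unknown subject role: {role!r}")
    return granularity in ALLOWED_GRANULARITIES[role]


ok = role_fits("host", "microbiome")      # a microbiome may have a host
bad = role_fits("pathogen", "cell line")  # not covered by the descriptions
```

A check like this could run over the subject role and taxonomy table before submission to flag rows whose role does not match the granularity recorded for the same subject.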

Conclusions¶

This document provides a concise overview of the key objects and concepts covered by the C2M2. It should be viewed as an initial contact point for anyone interested in mapping data into the C2M2 in preparation for a full ETL process.

What to read next?¶

CFDE namespaces

CFDE selected terminologies?