Your First Import
This tutorial takes you through the process of creating, then importing your own minimal UnifyBio dataset. Before attempting to these steps, make sure you have a local environment capable of completing all the steps in the quickstart.
The import we walk through is the synthetic example-import included in the
Pattern UnifyBio distribution. It should be fairly simple to adapt to other
small datasets; we encourage you to work through the tutorial a second time
with some of your own data, once you've completed it for the example case.
At any point, feel free to refer to the import config docs for a more thorough explanation of any special unify forms or syntax you encounter during the tutorial.
Organizing Files
The following shows an example directory structure for our simple import, which corresponds to conventions used in the Pattern Data Commons system:
example-import
├── config.yaml
├── mappings.yaml
└── data
└── raw
├── clinical.tsv
├── measurements.tsv
├── patients.tsv
└── samples.tsv
For smaller and medium sized imports, Pattern uses ordinary git version control (sometimes with LFS enabled) to manage the state of the import that is the basis for the UnifyBio dataset. This ensures that database states can be linked back to git repos and commit hashes, guaranteeing a traceable history which can be used to source causes of error, or forward propagate changes if issues are found with the underlying dataset.
You do not necessarily need to use this exact scheme in your own work, but some similar discipline about storing and tracking versions of data as it is imported into UnifyBio is recommended.
Importantly, this captures the typical components of any import:
- the
config.yamlandmappings.yamlfiles which contain the mappings and import directives which specify to the Unify CLI how the data should be imported. - raw data (stored in
data/raw)
You will also typically see:
scripts/for preprocessing the raw data into the form imported into the UnifyBio system.- the data as transformed by the scripts, in
data/processed/
Using TSVs
The present version of the Unify CLI can only process TSV files for input. Unify uses tab-separated files to avoid any of the ambiguities and problems that occur with arbitrarily delimited files, given that compact and obscure data encodings and fulltext natural language sentences and similar are often imported and harmonized into clinical and molecular dataset. The TSV files do not need to correspond to any particular data model, but there should be a column to column mappings to attributes in the schema.
Creating an Import Config
We will create our own import config file for the purposes of this tutorial. We will store it
as a sibling to the current reference config file provided in the distribution and name it
tutorial-config.yaml.
To get started, provide the top level structure in the YAML file.
Setting Up Your Editor
The Unify CLI can generate a JSON Schema you can apply to your YAML editing process in
order to provide static checks on the structure and content of the import, as well as
to provide autocomplete, attribute documentation, and more. If you are using the Pattern UnifyBio distribution,
this has been pregenerated for you and can be found under the distribution directry
as the file candel-import-config-schema.json. If you need to generate it, you can do so
with this CLI command:
Below are links for how to configure common editors to use JSON Schema when editing YAML files:
- VSCode, via the Red Hat YAML extension steps are documented here.
- JetBrains products such as IntelliJ, Dataspell, and PyCharm, using the processes documented here.
These are summarized in the Unify import config docs as well.
Providing Entries for each File
We have four files to provide mappings for:
In the following subsections you will write the import config for each of them.
Patient / Subject data
The patients.tsv file contains data on our patients. Patients are currently modeled
under the subject entity in the Pattern schema. A list of available attributes can
be found here.
If you have configured your editor, you should get autocomplete suggestions as you type
which contain the same documented information.
In the file, we have the following columns and table shape:
| id | age | sex | disease |
|---|---|---|---|
| subject-group-A-860 | 104 | M | lung adenocarcinoma |
| subject-group-B-478 | 35 | F | lung adenocarcinoma |
| subject-group-A-437 | 49 | male | lung adenocarcinoma |
Looking at the patient data in the schema, we can find the appropriate attributes to map to:
(Well, the last one may not be obvious, but for the purpose of the tutorial, we'll have you use freetext instead of mapping the terms into an ontology.)
In its simplest form, the yaml config file is just a literal translation of the mapping from columns to attributes that we just identified. The required update has been higlighted for you in the snippet below:
:unify/import:
user: "some.user@gmail.com"
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
There's actually one more thing we have to tend to here, though. :subject/sex
values should be mapped to an enum, not simply read into the schema. You can
find the enums in the schema visualizer.
Mapping enums requires an additional mappings.yaml file. For now, we won't worry
about writing it, but link to the existing mappings.yaml in the example import,
then inspect the relevant contents.
:unify/import:
user: "some.user@gmail.com"
mappings: mappings.yaml
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
Here are the contents of mappings.yaml relevant to the :subject/sex attribute:
unify/mappings:
enum/subject.sex:
subject.sex/male:
- "Male"
- "M"
- "m"
- "male"
subject.sex/female:
- "Female"
- "female"
- "F"
- "f"
unify/variables:
subject/sex: enum/subject.sex
:subject.sex/male, and likewise for "F" or "female" to :subject.sex/female.
We are now ready to add more data! Note, that if you want to, you can incrementally run prepare or even transact as you go (assuming you're using local dev databases), just to check for errors or omitted mappings. For the purpose of the tutorial, we'll continue, but at any point you can refer back to the commonds from the quickstart and run the commands there on your import in progress.
Samples data
Next we will map in samples.tsv. The columns and data layout are as below:
| anatomic-location | id | patient-id | timepoint |
|-------------------|------------------|---------------------|-----------|
| lung | sample-HIKIKDGG2 | subject-group-A-860 | baseline |
| lung | sample-JJGDJCEB1 | subject-group-B-478 | baseline |
| pancreas | sample-BEFKJJCB7 | subject-group-A-437 | baseline |
| pancreas | sample-IBHIBFDG3 | subject-group-B-62 | baseline |
Again, we will consult the schema dashboard for samples to find the right attributes. The new mappings are provided below:
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
samples:
- unify/input-tsv-file: "data/raw/samples.tsv"
id: "id"
subject: "patient-id"
freetext-anatomic-site: "anatomic-location"
timepoint: "timepoint"
There are two attributes used of a new kind here — timepoint and subject are
both ref attributes. That means they refer to other entities, in this case,
the subject and timepoint entities. You will notice, if you look through
the tables, that we do not have a timepoints.tsv file. However, as you can
quick verify with grep, R, pandas, or spreadsheet software, we only have
two timepoints, the baseline timepoint and the eos timepoint.
It is not necessary to add every data point in the dataset as a table, we can actually just use data literals in the import.
Adding timepoints
Data literal entities lack a unify/input-tsv-file directive. We'll add one for timepoints
to our import config. First we'll consult the
schema docs
and see which attributes are available that we have the information to provide. It looks
like we need to provide a timepoint/id and timepoint/relative-order. Since we only
have two timepoints, we'll use 1 and 2 for the relative ordering. We can also click
on the enum information for
timepoint/type
and we see our timepoint names baseline and eos map neatly on to two types, so we'll provide
those values as well.
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
samples:
- unify/input-tsv-file: "data/raw/samples.tsv"
id: "id"
subject: "patient-id"
freetext-anatomic-site: "anatomic-location"
timepoint: "timepoint"
timepoints:
- id: "baseline"
relative-order: 1
type: ":timepoint.type/baseline"
- id: "eos"
relative-order: 2
type: ":timepoint.type/eos"
Note: we supply the full enum attribute name as a string, e.g. ":timepoint.type/baseline". Unlike
with attributes that are property names (i.e. map keys), Unify does not infer namespace information or keyword vs string type for literals
that are property values. But it will attempt to parse any string in value position starting with ":" as a keyword.
Mapping in measurements.tsv
Measurements, as captured by the measurement entity/namespace, is a very broad
entity type in the Pattern and CANDEL schemas. In general, any molecular assay
or measurement taken in a lab is a measurement. When we refer to the
measurement schema we can
see a host of different measurement types available.
You should think of measurements as duck typed, in that the kind of measurement
represented is inferred from the attributes. Certain kinds of measurements have
certain targets: proteomic assays target proteins, RNA-seq targets gene expression
by way of transcript fragments, and so on. Most measurements are the result of
an assay performed on a sample. UnifyBio's validator will ensure in the
CLI bin/validate step that the combination of attributes on any measurement
makes sense and corresponds to specs and expectations.
In our case, we have a fairly straight forward RNA-seq assay to map in:
| fpkm | hugo | sample |
|---|---|---|
| 393.43107371469574 | SNORA24B | sample-HIKIKDGG2 |
| 203.42482862644985 | TCAF1P1 | sample-HIKIKDGG2 |
| 965.3775593576936 | ARHGEF35-AS1 | sample-HIKIKDGG2 |
| 320.0502226355575 | PCDH19 | sample-HIKIKDGG2 |
| 486.5600564050051 | ANKRD49P1 | sample-HIKIKDGG2 |
It might not be obvious to non-biologists, but the hugo columns refers to genes
(through the indirection of gene-product entities in the Pattern schema), and
we can see :measurement/fpkm in the measurement entity in the schema, which
stands for fragments per kilobase million, i.e. a normalized metric for
RNA transcript counts.
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
samples:
- unify/input-tsv-file: "data/raw/samples.tsv"
id: "id"
subject: "patient-id"
freetext-anatomic-site: "anatomic-location"
timepoint: "timepoint"
timepoints:
- id: "baseline"
relative-order: 1
type: ":timepoint.type/baseline"
- id: "eos"
relative-order: 2
type: ":timepoint.type/eos"
assays:
- name: "RNAseq"
technology: ":assay.technology/RNA-seq"
measurement-sets:
- name: "standard fpkm"
measurements:
- unify/input-tsv-file: "data/raw/measurements.tsv"
sample: "sample"
fpkm: "fpkm"
gene-product: "hugo"
A note on the tree structure: :dataset/assays contains all assays in a dataset, and
for some datasets, like the one used in the Prince Study
these can get quite large and detailed. Assays contain one or more measurement sets, which contain one
or more measurement inputs, to provide groupings that might represent different experimental conditions,
or different ways of processing data compuationally, etc. Sometimes these introduce derived entities,
like cell populations
grouped through flow cytometry, or
single cells
detected and separated through chemical or fluidic aspects of assays, or defined computationally through segmentation.
Our case is pretty straightforward: we're referring to samples from elsewhere in our config, genes by names provided by the HGNC standard and already present as reference data.
You'll note that as we've gone, also, we've built up the relational complexity of the dataset without doing a lot of work. For instance, each measurement is:
- a value or values representing an instrument reading in the real world, to which some kind of data processing was applied.
- of a particular target, a known biological entity, which is a transcript (gene product) derived from a gene
- taken by a particular assay of a particular known technology type
- with certain processing and/or experimental conditions shared by other measurements grouped into the same measurement set.
- of a sample that was derived
- from a patient (subject)
- at a particular timepoint
Not every data model that uses UnifyBio's tool ecosystem or integrations, or every use of the Unify CLI for that matter, has to apply this sort of relational complexity. But this kind of relational complexity exists in many domains, and with minimal declarative effort, we are able to infer it from a combination of the structure of our data, and the details of our Unify schema and metamodel.
clinical.tsv
The last data we have to import is our clinical data. A view of it is below:
| os | patient | timepoint |
|---|---|---|
| 98 | subject-group-A-860 | eos |
| 39 | subject-group-B-478 | eos |
| 153 | subject-group-A-437 | eos |
We have the overall survival data (as os), an ID for our patients, and a timepoint. In this case, all the ref targets (timepoints and patients) have been supplied, so the mapping is straight forward. Here it is in the context of the entire (and now complete) import:
:unify/import:
user: "some.user@gmail.com"
mappings: "mappings.yaml"
dataset:
name: "my-first-import"
subjects:
- unify/input-tsv-file: "data/raw/patients.tsv"
id: "id"
sex: "sex"
age: "age"
samples:
- unify/input-tsv-file: "data/raw/samples.tsv"
id: "id"
subject: "patient-id"
freetext-anatomic-site: "anatomic-location"
timepoint: "timepoint"
timepoints:
- id: "baseline"
relative-order: 1
type: ":timepoint.type/baseline"
- id: "eos"
relative-order: 2
type: ":timepoint.type/eos"
assays:
- name: "RNAseq"
technology: ":assay.technology/RNA-seq"
measurement-sets:
- name: "standard fpkm"
measurements:
- unify/input-tsv-file: "data/raw/measurements.tsv"
sample: "sample"
fpkm: "fpkm"
gene-product: "hugo"
clinical-observation-sets:
- name: "clinical observations"
clinical-observations:
- unify/input-tsv-file: "data/raw/clinical.tsv"
subject: "patient"
os: "os"
timepoint: "timepoint"
You can refer to the clinical observations schema to see what other information can be encoded in clinical observations. As present, they play a similar role to measurements, being typed implicitly through the presence or lack of certain attributes. Entities can be grouped arbitrarily within clinical observation sets, representing different sites, contexts within a study or other medical interactions, labs, and so on.
Running Unify CLI import steps with your dataset
You can now run all the same steps you ran in the quickstart to transact your data into a database. Because this is a new, unique dataset, you can transact it either into an existing dev database you already created, or a new one. You should probably make sure it passes validations first, though, before you consider making it available to others!
The steps are:
bin/unify prepare --import-config PATH/TO/example-import/tutorial-config.yaml --working-directory TEMP-DIR/tutorial-attempt-1
bin/unify request-db --database my-first-db
bin/unify transact --working-directory TEMP-DIR/tutorial-attempt-1 --database my-first-db
bin/unfy validate --database my-first-db
--dataset arg, it defaults to the most recent dataset
in the database.
If your file contains typos or misaligned columns and attributes, you might encounter errors in a number of places:
- the config file may fail to parse with an error
- while data is being prepared, it might encounter an invalid data value or file vs. mapping situation, e.g. a column specified in the config might not exist in the file.
- while transacting data, you might find that some values that are supposed to be unique conflict with each other, or some reference data specified doesn't actually exist. E.g. you might have a gene name that comes fom a different standard, or a different version of the HGNC standard even.
- while validating data, you might find that you created measurements with a nonsensical combination of
attributes, like a
:measurement/fpkmattribute with a:measurement/epitopeprotein target. Or you might find that you had a measurement that referred to sample a sample ID that didn't actually exist in your sample table.
Given these situations, it's completely normal and expected that you will hit hiccups and need to debug them. The Unify CLI tries to provide error messages that are as specific and direct as possible to make this as painless as possible. It can't magically provide the missing information from the bogus xslx file the vendor put in your Box folder, but it will do its best to identify any issues those sorts of things introduce.
Further Steps
To see most of what's possible in a Unify import, refer to the template dataset and the import config docs. In addition, RCRF intends to make several curated imports and datasets public. To learn more, get in touch!.