Quickstart
This section contains step-by-step instructions for importing the template dataset, an example dataset constructed from multiple public data sources that exercises a large surface area of Unify's import functionality. Running these steps will help you verify that everything is set up correctly on your system, and will also give you an overview of the import process.
Setup
This quickstart assumes you have a local UnifyBio distribution downloaded, such as the one provided by the Rare Cancer Research Foundation (RCRF)'s Pattern Data Commons. At present, UnifyBio is in an early alpha phase and only available in packaged form to RCRF's research partners. To get access to the Pattern distribution, you will need to contact RCRF.
Local System via Docker Compose
To use the provided local system, you need to install Docker and ensure you can stand up a local system with Docker Compose. You will also need a JVM (version 21 or later required); you can check from a terminal whether one is already available. If you do not have a JVM installed, Temurin 21 or 22 is a good default.
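To check whether a JVM is already installed, run:

```shell
java -version
```

If this reports version 21 or later, no further JVM setup is needed.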
Additional configuration
The current Pattern distribution requires hostname remapping at the OS level
to ensure the Unify CLI's peer process can communicate with the other system
components using the same configuration. Edit your /etc/hosts file and
add these two entries:
Starting the local system
- Download and unzip your UnifyBio distribution. If you have not been provided one, you can use the Pattern distribution provided by RCRF.
- Navigate to the directory created when you unzip or clone the UnifyBio distribution.
- Start a local system with:
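The distribution ships with a Docker Compose configuration. Assuming a standard Compose setup at the distribution root, starting it looks like the following; the exact command and any Compose file path may differ in your distribution, so consult its README if this does not work:

```shell
# Start all local system components in the background
docker compose up -d
```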
All subsequent commands assume you are in the UnifyBio distribution directory and have a local system running with Docker Compose.
Simple Import Workflow
This quickstart shows an example of working with a minimal import, using the Unify CLI and the template dataset. The Pattern distribution contains a copy of the template dataset in the template-dataset/ subdirectory. You can also browse the copy used in Unify's test systems.
Prepare
The first step is to transform data from tables into entity map representations in edn form, using the Unify CLI's prepare task:
```shell
bin/unify prepare --import-config PATH-TO-TEMPLATE-DATASET/config.yaml --working-directory PATH-TO-TEMP-DIRECTORY
```
When prepare has completed successfully, you will be able to transact the data into a database.
Request Database
Before we can transact data, we need a database to transact it into. For this quickstart, we'll request a dev database from the local UnifyBio system:
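A hedged sketch of the request command follows; the task name and flags are assumptions based on the other CLI tasks in this guide (note that the transact step below refers back to a `--database` flag), so consult your distribution's docs for the exact invocation:

```shell
# Request a new dev database from the local system;
# YOUR-DB-NAME is a placeholder for a name you choose
bin/unify request-db --database YOUR-DB-NAME
```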
This will create a new Datomic database in your local system. In most UnifyBio workflows, you will work iteratively with dev databases to create, debug, and update the dataset import.
Note: Local dev databases are suitable for learning Unify, but it is expected that production-ready datasets will be published or otherwise blessed for use in a centralized system through an org-specified process.
Transact
Now that we have prepared a dataset for import and have a database to import it into, we are ready to run transact. Note that YOUR-DB-NAME is the same --database value you used in the request step.
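A hedged sketch of the transact invocation, assuming flags parallel to the prepare task above (the exact flags may differ in your distribution):

```shell
# Transact the prepared entity data into the requested database;
# the working directory is the one prepare wrote its output to
bin/unify transact --database YOUR-DB-NAME --working-directory PATH-TO-TEMP-DIRECTORY
```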
Validate
UnifyBio distributions include a validate task, which ensures that data in a database
conforms to org-provided specifications. These ensure data are structurally and
semantically sound, as well as referentially consistent (all targets of entity reference
attributes refer to other entities that actually exist).
Validate your import with this command:
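A hedged sketch of the validate invocation, assuming it targets a database the same way the other tasks do (check your distribution's docs for the exact flags):

```shell
# Validate the imported data against the org's specifications
bin/unify validate --database YOUR-DB-NAME
```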
Query Your Data
You can now run queries against the local database! If you are using the Pattern Data Commons distribution and schema,
you can install the patternq Python or R libraries
for free-form queries.
The Pattern distribution contains an example query you can run with Python. This
script accesses data through the Query service included in the local docker compose system.
You will need to have the requests package in your local Python environment to run it.
You can optionally pass a second argument to local-test-query.py: a file path to a JSON file containing a query request body.
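Assuming the script sits at the distribution root (the script name comes from this guide; the JSON file name below is purely illustrative), running it looks like:

```shell
# Run the example query against the local Query service
python local-test-query.py

# Optionally supply a JSON file containing a query request body
python local-test-query.py my-query.json
```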
While the full contents of a query request body and the query language are outside the scope of this tutorial, you can inspect the contents of the Python script, consult some of the examples in the query service repo, and use the live schema browser as guides for constructing more advanced queries.