BioThings Studio tutorial

This tutorial will guide you through BioThings Studio, a pre-configured environment used to build and administer BioThings APIs. It shows how to convert a simple flat file into a fully operational BioThings API with as little effort as possible.

Note

You may also want to read the developer’s guide for more detailed information.

Note

The following tutorial is only valid for BioThings Studio release 0.1e. Check all available releases for more.

What you’ll learn

Through this guide, you’ll learn:

  • how to obtain a Docker image to run your favorite API
  • how to run that image inside a Docker container and how to access the BioThings Studio application
  • how to integrate a new data source by defining a data plugin
  • how to define a build configuration and create data releases
  • how to create a simple, fully operational BioThings API serving the integrated data

Prerequisites

Using BioThings Studio requires a Docker server up and running, as well as some basic knowledge of the commands used to run and manage containers. Images have been tested on Docker >=17. On AWS, you can use our public AMI biothings_demo_docker (ami-44865e3c in the Oregon region) with Docker pre-configured and ready for studio deployment. The instance type depends on the size of the data you want to integrate and on parser performance. For this tutorial, we recommend an instance type with at least 4GiB of RAM, such as t2.medium. The AMI comes with an extra 30GiB EBS volume, which is more than enough for the scope of this tutorial.

Alternatively, you can install your own Docker server (on recent Ubuntu systems, sudo apt-get install docker.io is usually enough). You may need to point the Docker images directory to a specific hard drive to get enough space, using the -g option:

# /mnt/docker points to a hard drive with enough disk space
sudo echo 'DOCKER_OPTS="-g /mnt/docker"' >> /etc/default/docker
# restart to make this change active
sudo service docker restart

Downloading and running BioThings Studio

BioThings Studio is available as a Docker image that you can download by following this link using your favorite web browser, or with wget:

$ wget http://biothings-containers.s3-website-us-west-2.amazonaws.com/biothings_studio/biothings_studio_latest.docker

Typing docker ps should return all running containers, or at least an empty list, as in the following example. Depending on the system and configuration, you may have to prefix this command with sudo to access the Docker server.

$ docker ps
  CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS      NAMES

Once downloaded, the image can be loaded into the server:

$ docker image load < biothings_studio_latest.docker
$ docker image list
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
biothings_studio    0.1e                742a8c502280        2 months ago        1.81 GB

Notice the value of TAG; we’ll need it to run the container (here, 0.1e).

A BioThings Studio instance exposes several services on different ports:

  • 8080: BioThings Studio web application port
  • 7022: BioThings Hub SSH port
  • 7080: BioThings Hub REST API port
  • 9200: ElasticSearch port
  • 27017: MongoDB port
  • 8000: BioThings API, once created; it can be any non-privileged (>1024) port
  • 9000: Cerebro, a web application used to access the ElasticSearch server (see below)

We will map and expose those ports to the host server using option -p so we can access BioThings services without having to enter the container:

$ docker run --name studio -p 8080:8080 -p 7022:7022 -p 7080:7080 -p 9200:9200 -p 27017:27017 -p 8000:8000 -p 9000:9000 -d biothings_studio:0.1e

Note

We need to add the release number after the image name: biothings_studio:0.1e. Should you use another release (including unstable releases, tagged as master), you would need to adjust this parameter accordingly.

Note

BioThings Studio and the Hub are not designed to be publicly accessible, so those ports should not be exposed. Instead, SSH tunneling can be used to safely access the services from outside. For example, ssh -L 7080:localhost:7080 -L 8080:localhost:8080 user@mydockerserver will forward the web application and REST API ports to your computer, so you can access the webapp at http://localhost:8080 and the API at http://localhost:7080. See https://www.howtogeek.com/168145/how-to-use-ssh-tunneling for more details.

We can follow the starting sequence using docker logs command:

$ docker logs -f studio
Waiting for mongo
tcp        0      0 127.0.0.1:27017         0.0.0.0:*               LISTEN      -
* Starting Elasticsearch Server
...
now run webapp
not interactive

Please refer to Filesystem overview and Services check for more details about the Studio’s internals.

By default, the studio will auto-update its source code to the latest available version and install all required dependencies. This behavior can be skipped by adding no-update at the end of the command line.

We can now access BioThings Studio using the dedicated web application (see webapp overview).

Creating an API from a flat file

In this section we’ll dive into more detail on using the BioThings Studio and Hub. We will integrate a simple flat file as a new datasource within the Hub, declare a build configuration using that datasource, create a build from that configuration, then a data release, and finally instantiate a new API service and use it to query our data.

Input data, parser and data plugin

For this tutorial, we will integrate data from the Cancer Genome Interpreter (CGI). This datasource is used in MyVariant.info, one of the most used BioThings APIs. The input file is available here: https://www.cancergenomeinterpreter.org/data/cgi_biomarkers_latest.zip.

The parser itself is not the main topic of this tutorial; the full code for the parser can be found here, in MyVariant’s github repository.

From a single flat file, it produces JSON documents looking like this:

{
"_id": "chr9:g.133747570A>G",
  "cgi": {
    "association": "Resistant",
    "cdna": "c.877A>G",
    "drug": "Imatinib (BCR-ABL inhibitor 1st gen&KIT inhibitor)",
    "evidence_level": "European LeukemiaNet guidelines",
    "gene": "ABL1",
    "primary_tumor_type": "Chronic myeloid leukemia",
    "protein_change": "ABL1:I293V",
    "region": "inside_[cds_in_exon_5]",
    "source": "PMID:21562040",
    "transcript": "ENST00000318560"
  }
}

Note

The _id key is mandatory and represents a unique identifier for this document. Its value must be a string. The _id key is used when data from multiple datasources are merged together: that process is done according to its value, so all documents sharing the same _id across datasources are merged together.
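
To illustrate the idea, here is a toy snippet (purely illustrative, not the Hub’s actual merge code) showing how two documents coming from different datasources but sharing the same _id end up combined into a single document:

# toy illustration only -- not the Hub's merge implementation
doc_from_cgi = {"_id": "chr9:g.133747570A>G", "cgi": {"gene": "ABL1"}}
doc_from_other = {"_id": "chr9:g.133747570A>G", "other": {"field": "value"}}  # hypothetical second source

merged = {"_id": "chr9:g.133747570A>G"}
for doc in (doc_from_cgi, doc_from_other):
    merged.update({k: v for k, v in doc.items() if k != "_id"})

# merged now contains both the "cgi" and "other" keys under a single _id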

We can easily create a new datasource and integrate data using BioThings Studio by declaring a data plugin. Such a plugin is defined by:

  • a folder containing a manifest.json file, where the parser and the input file location are declared
  • all necessary files supporting the declarations in the manifest, such as a Python file containing the parsing function.

This folder must be located in the plugins directory (by default /data/biothings_studio/plugins), where the Hub monitors changes and reloads itself accordingly to register data plugins. Another way to declare such a plugin is to register a github repository containing everything needed for the datasource. This is what we’ll do in the following section.

Note

Whether the plugin comes from a github repository or is found directly in the plugins directory doesn’t really matter. In the end, the code ends up in that same plugins directory, whether it comes from a git clone command run while registering the github URL or from folders and files manually created in that location. When developing a plugin, it’s however easier to work on local files first, so we don’t have to regularly update the plugin code (git pull) from the webapp to fetch the latest version. That said, since the plugin is already defined on github in our case, we’ll use the github repository registration method.

The corresponding data plugin repository can be found at https://github.com/sirloon/mvcgi. The manifest file looks like this:

{
    "version": "0.2",
    "__metadata__" : {
        "license_url" : "https://www.cancergenomeinterpreter.org/faq#q11c",
        "licence" : "CC BY-NC 4.0",
        "url" : "https://www.cancergenomeinterpreter.org"
    },
    "dumper" : {
        "data_url" : "https://www.cancergenomeinterpreter.org/data/cgi_biomarkers_latest.zip",
        "uncompress" : true,
    },
    "uploader" : {
        "parser" : "parser:load_data",
        "on_duplicates" : "ignore"
    }
}
  • the dumper section declares where the input file is, using the data_url key. Since the input file is a ZIP archive, we first need to uncompress it, using uncompress : true.
  • the uploader section tells the Hub how to upload JSON documents to MongoDB. parser has a special format, module_name:function_name. Here, the parsing function is named load_data and can be found in the parser.py module. "on_duplicates" : "ignore" tells the Hub to ignore any duplicated records (documents with the same _id).

For more information about the other fields, please refer to the plugin specification.
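
To make the parser:load_data convention more concrete, here is a minimal sketch of what such a parser module could look like. This is not the actual mvcgi parser (the real code lives in MyVariant’s github repository, linked above); the file name and column names below are placeholders:

# parser.py -- minimal sketch only, NOT the actual mvcgi parser
import csv
import os

def load_data(data_folder):
    # the Hub calls the parsing function with the folder where the dump was uncompressed
    infile = os.path.join(data_folder, "cgi_biomarkers.tsv")  # placeholder file name
    with open(infile) as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # each yielded document must carry a unique, string "_id"
            yield {
                "_id": row["variant_id"],  # placeholder column names
                "cgi": {
                    "gene": row["gene"],
                    "drug": row["drug"],
                },
            }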

Let’s register that data plugin using the Studio. First, copy the repository URL:

../_images/githuburl.png

Moving back to the Studio, click on the sources tab, then on the menu icon; this will open a side bar on the left. Click on New data plugin; you will be asked to enter the github URL. Click “OK” to register the data plugin.

../_images/registerdp.png

By interpreting the manifest coming with the plugin, BioThings Hub has automatically created for us:

  • a dumper using HTTP protocol, pointing to the remote file on the CGI website. When downloading (or dumping) the data source, the dumper will automatically check whether the remote file is more recent than the one we may have locally, and decide whether a new version should be downloaded.
  • and an uploader to which it “attached” the parsing function. This uploader will fetch JSON documents from the parser and store those in MongoDB.

At this point, the Hub has detected a change in the datasource code, as the new data plugin source code has been pulled from github locally inside the container. In order to take this new plugin into account, the Hub needs to restart to load the code. The webapp should detect that reload and should ask whether we want to reconnect, which we’ll do!

../_images/hub_restarting.png

Upon registration, the new data source appears:

../_images/listdp.png
  • dumpicon is used to trigger the dumper and (if necessary) download remote data
  • uploadicon will trigger the uploader (note it’s automatically triggered if a new version of the data is available)
  • inspecticon can be used to “inspect” the data; more on that later

Let’s open the datasource by clicking on its title to get more information. The Dumper and Uploader tabs are rather empty since neither of these steps has been launched yet. The Plugin tab, though, shows information about the actual source code pulled from the github repository. As shown, we’re currently at the HEAD version of the plugin, but if needed, we could freeze the version by specifying a git commit hash or a branch/tag name.

../_images/plugintab.png

Without further ado, let’s trigger a dump to integrate this new datasource. Either go to the Dump tab and click on dumplabelicon, or click on sources to go back to the sources list and click on dumpicon at the bottom of the datasource.

The dumper is triggered and, after a few seconds, the uploader is automatically triggered as well. Commands can be listed by clicking at the top of the page. So far we’ve run 3 commands: registering the plugin, dumping the data and uploading the JSON documents to MongoDB. All succeeded.

../_images/allcommands.png

We also have new notifications as shown by the red number on the right. Let’s have a quick look:

../_images/allnotifs.png

Going back to the source’s details, we can see the Dumper tab has been populated. We now know the release number, the data folder, when the last download occurred, how long it took to download the file, etc…

../_images/dumptab.png

Same for the Uploader tab: we now have 323 documents uploaded to MongoDB.

../_images/uploadtab.png

Inspecting the data

Now that we have integrated a new datasource, we can move forward. Ultimately, data will be sent to ElasticSearch, an indexing engine. In order to do so, we need to tell ElasticSearch how the data is structured and which fields should be indexed (and which should not). This step consists of creating a “mapping”, describing the data in ElasticSearch terminology. This can be a tedious process as we would need to dig into some tough technical details and manually write this mapping. Fortunately, we can ask BioThings Studio to inspect the data and suggest a mapping for it.

In order to do so, click on Mapping tab, then click on inspectlabelicon.

We’re asked where the Hub can find the data to inspect. Since we successfully uploaded the data, we now have a Mongo collection so we can directly use this. Click on “OK” to let the Hub work and generate a mapping for us.

../_images/inspectmenu.png

Since the collection is very small, inspection is fast; you should have a mapping generated within a few seconds.

../_images/inspected.png
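
The generated mapping itself is just a nested description of each field and of how ElasticSearch should index it. For reference, a hand-written fragment for this data could look like the sketch below; the field types are assumptions for illustration only, and the Hub’s inspection decides the actual ones:

# illustrative only -- the Hub's inspection may choose different types or analyzers
suggested_mapping = {
    "cgi": {
        "properties": {
            "gene": {"type": "text"},
            "drug": {"type": "text"},
            "evidence_level": {"type": "text"},
        }
    }
}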

For each field highlighted in blue, you can decide whether you want the field to be searchable or not, and whether the field should be searched by default when querying the API. Let’s click on the “gene” field and make it searched by default.

../_images/genefield.png

Indeed, by checking the “Search by default” checkbox, we will be able to search for gene “ABL1”, for instance, with /query?q=ABL1 instead of /query?q=cgi.gene:ABL1.

After this modification, you should see edited at the top of the mapping; let’s save our changes by clicking on savelabelicon. Also, before moving forward, we want to make sure the mapping is valid, so let’s click on validatelabelicon. You should see this success message:

../_images/validated.png

Note

“Validate on test” means the Hub will send the mapping to ElasticSearch, creating a temporary, empty index to make sure the mapping syntax and content are valid. It’s immediately deleted after validation (whether successful or not). Also, “test” is the name of an environment; by default, and without further manual configuration, it is the only development environment available in the Studio, pointing to the embedded ElasticSearch server.

Everything looks fine. One last step is to “commit” the mapping, meaning we’re OK to use it as the official, registered mapping, the one that will actually be used by ElasticSearch. Indeed, the left side of the page shows the inspected mapping; we can re-launch the inspection as many times as we want without impacting the active/registered mapping (this is useful when the data structure changes). Click on commit then “OK”, and you should now see the final, registered mapping on the right:

../_images/registered.png

Defining and creating a build

Once we have integrated data and have a valid ElasticSearch mapping, we can move forward by creating a build configuration. A build configuration tells the Hub which datasources should be merged together, and how. Click on builds, then menu, and finally click on newbuildconf.

../_images/buildconfform.png
  • enter a name for this configuration. We’re going to create only one configuration in this tutorial, so the name doesn’t matter much; let’s make it “default”
  • the document type represents the kind of documents stored in the merged collection. It gives its name to the annotate API endpoint (eg. /variant). This source is about variants, so “variant” it is…
  • open the dropdown list and select the sources you want to be part of the merge. We only have one, “mvcgi”
  • in root sources, we can declare which sources are allowed to create new documents in the merged collection. Documents coming from the other sources are merged only if a corresponding document (same _id) already exists in the merged collection. This is useful when data from one source relates to data from another source (it only makes sense to merge that related data if the data itself is present). If root sources are declared, the Hub will first merge them, then the others. In our case, we can leave it empty (no root sources specified, so all sources can create documents in the merged collection)
  • the other fields are for advanced usage and are out of scope for this tutorial

Click “OK” and open the menu again; you should see the new configuration available in the list.

../_images/buildconflist.png

Click on it and create a new build.

../_images/newbuild.png

You can give that build a specific name, or let the Hub generate one for you. Click “OK”; after a few seconds, you should see the new build displayed on the page.

../_images/builddone.png

Open it by clicking on its name. You can explore the tabs for more information about it (sources involved, build times, etc…). The “Release” tab is the one we’re going to use next.

Creating a data release

If not there yet, open the newly created build and go to the “Release” tab. This is the place where we can create new data releases. Click on newrelease.

../_images/newreleaseform.png

Since we only have one build available, we can’t generate an incremental release, so we’ll have to select full this time. Click “OK” to launch the process.

Note

Should there be a new build available (coming from the same configuration), and should there be data differences, we could generate an incremental release. In this case, the Hub would compute a diff between the previous and new builds and generate diff files (using the JSON diff format). Incremental releases are usually smaller than full releases and usually take less time to deploy (applying the diff data), unless the diff content is too big (there’s a threshold between using an incremental and a full release, depending on the hardware and the data, because applying a diff requires first fetching the document from ElasticSearch, patching it, and then saving it back).
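
To see why applying a diff works document by document, here is a rough conceptual sketch (not the Hub’s actual code or diff format) using the Python requests library against ElasticSearch’s REST API; index, doc_type and the flat changed_fields dict are all hypothetical:

# conceptual sketch only -- not the Hub's implementation or diff format
import requests

def apply_patch(es_url, index, doc_type, doc_id, changed_fields):
    """Fetch a document from ElasticSearch, apply hypothetical field changes, save it back."""
    doc = requests.get(f"{es_url}/{index}/{doc_type}/{doc_id}").json()["_source"]
    doc.update(changed_fields)  # real diff files are more structured than a flat update
    requests.put(f"{es_url}/{index}/{doc_type}/{doc_id}", json=doc)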

The Hub will directly index the data on its locally installed ElasticSearch server (the test environment). After a few seconds, a new full release is created.

../_images/newfullrelease.png

We can easily access the ElasticSearch server using the Cerebro application, which comes pre-configured with the studio. Let’s access it through http://localhost:9000/#/connect (assuming ports 9200 and 9000 have been properly mapped, as mentioned earlier). Cerebro provides an easy way to manage ElasticSearch and to check/query indices.

Click on the pre-configured server named BioThings Studio.

../_images/cerebro_connect.png

Clicking on an index gives access to different information, such as the mapping, which also contains metadata (sources involved in the build, releases, counts, etc…)

../_images/cerebro_index.png
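
If you prefer the command line over Cerebro, the same kind of information can be pulled directly from ElasticSearch’s REST API on port 9200. For instance, this short Python snippet (assuming the requests library is installed and port 9200 is mapped as shown earlier) lists the indices:

# list ElasticSearch indices from the host, assuming port 9200 is mapped
import requests

print(requests.get("http://localhost:9200/_cat/indices?v").text)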

Generating a BioThings API

At this stage, a new index containing our data has been created on ElasticSearch; it is now time for the final step. Click on api, then menu, and finally newapi.

../_images/apilist.png

To turn this API instance on, just click on playicon; you should then see a running label in the top right corner, meaning the API can be accessed:

../_images/apirunning.png

Note

When the API is running, queries such as /metadata and /query?q=* are provided as examples. They contain a hostname set by Docker though (the Docker instance’s hostname), which probably means nothing outside of Docker’s context. In order to use the API you may need to replace this hostname with the one actually used to access the Docker instance.

Accessing the API

Assuming the API is accessible through http://localhost:8000, we can easily query it with curl, for instance. The /metadata endpoint gives information about the datasources and the build date:

$ curl localhost:8000/metadata
{
  "build_date": "2018-06-05T18:32:23.604840",
  "build_version": "20180605",
  "src": {
    "mvcgi": {
      "stats": {
        "mvcgi": 323
      },
      "version": "2018-04-24"
    }
  },
  "src_version": {
    "mvcgi": "2018-04-24"
  },
  "stats": {}

Let’s query the data using a gene name (results truncated):

$ curl localhost:8000/query?q=ABL1
{
  "max_score": 2.5267246,
  "took": 24,
  "total": 93,
  "hits": [
    {
      "_id": "chr9:g.133748283C>T",
      "_score": 2.5267246,
      "cgi": [
        {
          "association": "Responsive",
          "cdna": "c.944C>T",
          "drug": "Ponatinib (BCR-ABL inhibitor 3rd gen&Pan-TK inhibitor)",
          "evidence_level": "NCCN guidelines",
          "gene": "ABL1",
          "primary_tumor_type": "Chronic myeloid leukemia",
          "protein_change": "ABL1:T315I",
          "region": "inside_[cds_in_exon_6]",
          "source": "PMID:21562040",
          "transcript": "ENST00000318560"
        },
        {
          "association": "Resistant",
          "cdna": "c.944C>T",
          "drug": "Bosutinib (BCR-ABL inhibitor  3rd gen)",
          "evidence_level": "European LeukemiaNet guidelines",
          "gene": "ABL1",
          "primary_tumor_type": "Chronic myeloid leukemia",
          "protein_change": "ABL1:T315I",
          "region": "inside_[cds_in_exon_6]",
          "source": "PMID:21562040",
          "transcript": "ENST00000318560"
        },
        ...

Note

We don’t have to specify cgi.gene, the field in which the value “ABL1” should be searched, because we explicitly asked ElasticSearch to search that field by default (see fieldbydefault).

Finally, we can fetch a variant by its ID:

$ curl "localhost:8000/variant/chr19:g.4110584A>T"
{
  "_id": "chr19:g.4110584A>T",
  "_version": 1,
  "cgi": [
    {
      "association": "Resistant",
      "cdna": "c.373T>A",
      "drug": "BRAF inhibitors",
      "evidence_level": "Pre-clinical",
      "gene": "MAP2K2",
      "primary_tumor_type": "Cutaneous melanoma",
      "protein_change": "MAP2K2:C125S",
      "region": "inside_[cds_in_exon_3]",
      "source": "PMID:24265153",
      "transcript": "ENST00000262948"
    },
    {
      "association": "Resistant",
      "cdna": "c.373T>A",
      "drug": "MEK inhibitors",
      "evidence_level": "Pre-clinical",
      "gene": "MAP2K2",
      "primary_tumor_type": "Cutaneous melanoma",
      "protein_change": "MAP2K2:C125S",
      "region": "inside_[cds_in_exon_3]",
      "source": "PMID:24265153",
      "transcript": "ENST00000262948"
    }
  ]
}
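
The same queries can of course be issued programmatically. Here is a short sketch using the Python requests library, assuming the API is reachable at http://localhost:8000 as above:

# query the newly created BioThings API with the requests library
import requests

base = "http://localhost:8000"

meta = requests.get(base + "/metadata").json()
print(meta["build_version"])          # e.g. "20180605"

res = requests.get(base + "/query", params={"q": "ABL1"}).json()
print(res["total"], "hits for ABL1")

variant = requests.get(base + "/variant/chr19:g.4110584A>T").json()
print(variant["cgi"][0]["drug"])      # "BRAF inhibitors" in the example above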

Conclusions

We’ve been able to easily convert a remote flat file to a fully operational BioThings API:

  • by defining a data plugin, we told the BioThings Hub where the remote data was and what the parser function was
  • BioThings Hub then generated a dumper to download data locally on the server
  • It also generated an uploader to run the parser and store resulting JSON documents
  • We defined a build configuration to include the newly integrated datasource, and then triggered a new build
  • Data was indexed internally on local ElasticSearch by creating a full release
  • Then we created a BioThings API instance pointing to that new index

The final step would then be to deploy that API as a cluster in the cloud. This last step is currently under development, stay tuned!

Troubleshooting

We test and make sure, as much as we can, that the BioThings Studio image is up-to-date and running properly. But things can still go wrong…

First make sure all services are running. Enter the container and type netstat -tnlp; you should see services listening on the expected ports (see the usual running services). If the services on ports 7080 or 7022 aren’t running, it means the Hub has not started. If you just started the instance, wait a little longer, as services may take a while before they’re fully started and ready.
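
Assuming port 7080 is mapped to the host as shown earlier, a quick way to check whether the Hub REST API answers, without entering the container, is to hit its /status route (visible in the route list of the startup log below):

# quick check of the Hub REST API from the host (requires the requests library and port 7080 mapped)
import requests

try:
    print(requests.get("http://localhost:7080/status", timeout=5).status_code)
except requests.ConnectionError:
    print("Hub REST API not reachable (not started yet, or port not mapped)")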

If, after ~1 min, you still don’t see the Hub running, log in as user biothings and check the starting sequence.

Note

The Hub runs in a tmux session, under user biothings

# sudo su - biothings
$ tmux a # recall tmux session

$ python bin/hub.py
DEBUG:asyncio:Using selector: EpollSelector
INFO:root:Hub DB backend: {'uri': 'mongodb://localhost:27017', 'module': 'biothings.utils.mongo'}
INFO:root:Hub database: biothings_src
DEBUG:hub:Last launched command ID: 14
INFO:root:Found sources: []
INFO:hub:Loading data plugin 'https://github.com/sirloon/mvcgi.git' (type: github)
DEBUG:hub:Creating new GithubAssistant instance
DEBUG:hub:Loading manifest: {'dumper': {'data_url': 'https://www.cancergenomeinterpreter.org/data/cgi_biomarkers_latest.zip',
            'uncompress': True},
 'uploader': {'ignore_duplicates': False, 'parser': 'parser:load_data'},
 'version': '0.1'}
INFO:indexmanager:{}
INFO:indexmanager:{'test': {'max_retries': 10, 'retry_on_timeout': True, 'es_host': 'localhost:9200', 'timeout': 300}}
DEBUG:hub:for managers [<SourceManager [0 registered]: []>, <AssistantManager [1 registered]: ['github']>]
INFO:root:route: ['GET'] /job_manager => <class 'biothings.hub.api.job_manager_handler'>
INFO:root:route: ['GET'] /command/([\w\.]+)? => <class 'biothings.hub.api.command_handler'>
...
INFO:root:route: ['GET'] /api/list => <class 'biothings.hub.api.api/list_handler'>
INFO:root:route: ['PUT'] /restart => <class 'biothings.hub.api.restart_handler'>
INFO:root:route: ['GET'] /status => <class 'biothings.hub.api.status_handler'>
DEBUG:tornado.general:sockjs.tornado will use json module
INFO:hub:Monitoring source code in, ['/home/biothings/biothings_studio/hub/dataload/sources', '/home/biothings/biothings_studio/plugins']:
['/home/biothings/biothings_studio/hub/dataload/sources',
 '/home/biothings/biothings_studio/plugins']

You should see something like the output above. If not, you should see the actual error and, depending on the error, you may be able to fix it (not enough disk space, etc…). The BioThings Hub can be started again using python bin/hub.py from within the application directory (in our case, /home/biothings/biothings_studio).

Note

Press Control-B then D to detach the tmux session and leave the Hub running in the background.

By default, logs are available in /home/biothings/biothings_studio/data/logs.

Finally, you can report issues and ask for help by joining the BioThings Google Group (https://groups.google.com/forum/#!forum/biothings)