BioThings Web

In this tutorial we will start a Biothings API and learn to customize it step by step, overriding the default behaviors and adding new features with increasingly advanced techniques. By the end, you will be able to build your own Biothings API, run other production APIs such as Mygene.info, and customize and extend those projects.

Attention

Before starting the tutorial, you should have the biothings package installed and an Elasticsearch instance running with only one index, populated with this dataset using this mapping. For the best experience following this tutorial, you may also want a JSON Formatter browser extension (for Chrome).

1. Starting an API server

First, assuming your Elasticsearch service is running on the default port 9200, we can run a Biothings API with all default settings to explore the data, simply by creating a config.py under your project folder. After creating the file, run python -m biothings.web to start the API server. You should be able to see the following console output:

[I 211130 22:21:57 launcher:28] Biothings API 0.10.0
[I 211130 22:21:57 configs:86] <module 'config' from 'C:\\Users\\Jerry\\code\\biothings.tutorial\\config.py'>
[INFO biothings.web.connections:31] <Elasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.connections:31] <AsyncElasticsearch([{'host': 'localhost', 'port': 9200}])>
[INFO biothings.web.applications:137] API Handlers:
    [('/', <class 'biothings.web.handlers.services.FrontPageHandler'>, {}),
    ('/status', <class 'biothings.web.handlers.services.StatusHandler'>, {}),
    ('/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
    ('/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
    ('/v1/spec/?', <class 'biothings.web.handlers.services.APISpecificationHandler'>, {}),
    ('/v1/doc(?:/([^/]+))?/?', <class 'biothings.web.handlers.query.BiothingHandler'>, {'biothing_type': 'doc'}),
    ('/v1/metadata/fields/?', <class 'biothings.web.handlers.query.MetadataFieldHandler'>, {}),
    ('/v1/metadata/?', <class 'biothings.web.handlers.query.MetadataSourceHandler'>, {}),
    ('/v1/query/?', <class 'biothings.web.handlers.query.QueryHandler'>, {})]
[INFO biothings.web.launcher:99] Server is running on "0.0.0.0:8000"...
[INFO biothings.web.connections:25] Elasticsearch Package Version: 7.13.4
[INFO biothings.web.connections:27] Elasticsearch DSL Package Version: 7.3.0
[INFO biothings.web.connections:51] localhost:9200: docker-cluster 7.9.3

Note the console log shows the API version, the config file it uses, its database connections, HTTP routes, service port, important python dependency package versions, as well as the database cluster details.

Note

The cluster details appear on the last line, sometimes with a delay, because the lookup is scheduled asynchronously at start time and executed after the main program has launched. The default implementation of our application is asynchronous and non-blocking, built on the asyncio and tornado.ioloop interfaces. The specific logic in this case is implemented in the biothings.web.connections module.
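
The ordering described in this note can be reproduced with plain asyncio. The following is a minimal sketch, not the code in biothings.web.connections; the function names and messages are made up for illustration.

```python
import asyncio

events = []  # records the order in which things happen


async def log_cluster_details():
    # Hypothetical stand-in for the delayed cluster lookup.
    events.append("cluster details")


async def launch():
    # Creating the task only queues it on the event loop; it does not run yet.
    task = asyncio.get_running_loop().create_task(log_cluster_details())
    events.append("server is running")
    # The queued task gets to run once the launcher yields control.
    await task


asyncio.run(launch())
print(events)
```

The startup message is recorded first even though the lookup was scheduled before it, which mirrors why the cluster line shows up last in the console log.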

Of all the information provided, note that the server is running on port 8000. This is the default port we use when we start a Biothings API. In most cases, it means you can access the API by opening http://localhost:8000/ in your browser.

Note

If this port is occupied, you can pass the “port” parameter during startup to change it, for example by running python -m biothings.web --port=9000. The links in the tutorial assume the service is running on the default port 8000. If you are running the service on a different port, you need to modify the URLs provided in the tutorial before opening them in the browser.

Now open the browser and access localhost:8000. We should see the biothings welcome page, which lists the public routes as regular expressions:

/
/status
/metadata/fields/?
/metadata/?
/v1/spec/?
/v1/doc(?:/([^/]+))?/?
/v1/metadata/fields/?
/v1/metadata/?
/v1/query/?
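
To get a feel for how these patterns map to concrete paths, the sketch below matches a few paths against them with the standard re module. It only approximates the routing done by the web server, assuming each pattern must match the whole path; match_route is a hypothetical helper, not part of biothings.

```python
import re

# Route patterns copied from the welcome page above.
ROUTES = [
    r"/",
    r"/status",
    r"/metadata/fields/?",
    r"/metadata/?",
    r"/v1/spec/?",
    r"/v1/doc(?:/([^/]+))?/?",
    r"/v1/metadata/fields/?",
    r"/v1/metadata/?",
    r"/v1/query/?",
]


def match_route(path):
    """Return (pattern, captured groups) for the first route matching path."""
    for pattern in ROUTES:
        match = re.fullmatch(pattern, path)
        if match:
            return pattern, match.groups()
    return None  # no route matches


print(match_route("/v1/query/"))
print(match_route("/v1/doc/1017"))
print(match_route("/v2/query"))
```

The trailing /? makes the final slash optional, and the capture group in the doc route is what carries the document id.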

2. Exploring an API endpoint

The last route on the welcome page shows the URL pattern of the query API. Let’s use this pattern to access the query endpoint. Accessing http://localhost:8000/v1/query/ returns a JSON document containing 10 results from our elasticsearch index.

Let’s explore some Biothings API features here by adding a query parameter “fields” to limit the fields returned by the API, and another parameter “size” to limit the number of returned documents. If you used the dataset mentioned at the start of the tutorial, accessing http://localhost:8000/v1/query?fields=symbol,alias,name&size=1 should return a document like this:

{
    "took": 15,
    "total": 1030,
    "max_score": 1,
    "hits": [
        {
            "_id": "1017",
            "_score": 1,
            "alias": [
                "CDKN2",
                "p33(CDK2)"
            ],
            "name": "cyclin dependent kinase 2",
            "symbol": "CDK2"
        }
    ]
}

The most commonly used parameter is the “q” parameter. Try http://localhost:8000/v1/query?q=cdk2 and see that all the returned results contain “cdk2”, the value specified for the “q” parameter.
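
If you prefer to assemble these URLs in code rather than by hand, a small helper along the following lines may be convenient. It is a sketch that only covers the three parameters introduced so far; build_query_url and BASE_URL are illustrative names, and the base URL assumes the default port.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumes the default port used in this tutorial


def build_query_url(q=None, fields=None, size=None):
    """Build a /v1/query URL from the parameters introduced in this section."""
    params = {}
    if q is not None:
        params["q"] = q
    if fields:
        params["fields"] = ",".join(fields)
    if size is not None:
        params["size"] = size
    query_string = urlencode(params, safe=",")  # keep commas readable
    return f"{BASE_URL}/v1/query" + (f"?{query_string}" if query_string else "")


print(build_query_url(fields=["symbol", "alias", "name"], size=1))
print(build_query_url(q="cdk2"))
```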

Note

For a list of the supported parameters, visit Biothings API Specifications. The documentation for our most popular service, https://mygene.info/, also covers many features that are available in all biothings applications. Read more on Gene Query Service and Gene Annotation Service.

3. Customizing an API through the config file

In the previous step, we explored documents by searching their content. Is there a way to access individual documents directly by their “_id” or other id fields? The annotation endpoint does exactly that.

By default, this endpoint is accessible through a URL pattern like /<ver>/doc/<_id>, where “ver” refers to the API version. In our case, to access the document with an id of “1017”, one of the documents that showed up in the previous example, we can try: http://localhost:8000/v1/doc/1017

Note

To configure an API version other than “v1” for your program, to add a prefix to all API patterns, like /api/<ver>/…, or to remove these patterns, modify the settings prefixed with “APP” in the config file, as those control the web application behavior. A web application is basically a collection of routes and settings that can be understood by a web server. See the biothings.web.settings.default source code for the current configuration, and refer to biothings.web.applications to see how the settings are turned into routes in different web frameworks.

In this dataset, we know the document type can be best described as “gene”s. We can enable a widely-used feature, document type URL templating, by providing more information to the biothings app in the config.py file. Write the following lines to the config file:

ES_HOST = "localhost:9200" # optional
ES_INDICES = {"gene": "<your index name>"}

ANNOTATION_DEFAULT_SCOPES = ["_id", "symbol"]

Note

The ES_HOST setting is a common parameter that you will see in config files. Although it does not make a difference here, you can change its value to ask biothings.web to connect to a different Elasticsearch server, for example one hosted remotely in the cloud. The ANNOTATION_DEFAULT_SCOPES setting specifies the document fields we consider as id fields. By default, only the “_id” field, a must-have field in Elasticsearch, is considered the biothings id field. We additionally added the “symbol” field, to allow the user to use it to find documents in this demo API.

Restart your program and, if you pay close attention to the console log, see that the annotation route is now prefixed with /v1/gene. Now try the following URLs:

http://localhost:8000/v1/gene/1017

http://localhost:8000/v1/gene/CDK2

Both URLs take you straight to the document previously mentioned. Note that using the symbol “CDK2” may yield multiple documents, because multiple documents may have the same key-value pair. This also means “symbol” may not be a good choice for a key field to support in the URL.
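
The ambiguity of “symbol” can be illustrated without Elasticsearch. The sketch below looks up a term across the configured scopes in a small in-memory list holding three CDK2 documents from the tutorial dataset. It assumes case-insensitive matching, which in a real index depends on the mapping; find_by_scopes is a hypothetical helper.

```python
# A tiny in-memory stand-in for the Elasticsearch index.
DOCS = [
    {"_id": "1017", "symbol": "CDK2", "taxid": 9606},
    {"_id": "12566", "symbol": "Cdk2", "taxid": 10090},
    {"_id": "362817", "symbol": "Cdk2", "taxid": 10116},
]

SCOPES = ["_id", "symbol"]  # mirrors ANNOTATION_DEFAULT_SCOPES


def find_by_scopes(term, docs=DOCS, scopes=SCOPES):
    """Return every document where any scope field equals the term."""
    term = term.lower()  # assumption: matching ignores case
    return [
        doc for doc in docs
        if any(str(doc.get(scope, "")).lower() == term for scope in scopes)
    ]


print([doc["_id"] for doc in find_by_scopes("1017")])  # one document
print([doc["_id"] for doc in find_by_scopes("CDK2")])  # several documents
```

An “_id” is unique by construction, whereas a gene symbol is shared across species, so only the former is a reliable key.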

These two endpoints, annotation and query, are the pillars for Biothings API. You can additionally customize these endpoints to work better with your data.

For example, if you think the result returned by the query endpoint is too verbose by default, and you want to include only limited information unless the user specifically asks for more, you can set a default value for the “fields” parameter used in the previous example. Open config.py and add:

from biothings.web.settings.default import QUERY_KWARGS
QUERY_KWARGS['*']['_source']['default'] = ['name', 'symbol', 'taxid', 'entrezgene']

Restart your program after changing the config file and visit http://localhost:8000/v1/query to see the effect of specifying default fields to return, like this:

{
    "took": 9,
    "total": 100,
    "max_score": 1,
    "hits": [
        {
            "_id": "1017",
            "_score": 1,
            "entrezgene": "1017",
            "name": "cyclin dependent kinase 2",
            "symbol": "CDK2",
            "taxid": 9606
        },
        {
            "_id": "12566",
            "_score": 1,
            "entrezgene": "12566",
            "name": "cyclin-dependent kinase 2",
            "symbol": "Cdk2",
            "taxid": 10090
        },
        {
            "_id": "362817",
            "_score": 1,
            "entrezgene": "362817",
            "name": "cyclin dependent kinase 2",
            "symbol": "Cdk2",
            "taxid": 10116
        },
        ...
    ]
}
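
The two config lines in this section modify the imported QUERY_KWARGS dictionary in place. If you would rather leave the library defaults untouched, one option is to work on a deep copy. The sketch below uses a hypothetical stand-in dictionary rather than the real biothings defaults, whose full structure is not reproduced here, and it assumes that a QUERY_KWARGS name defined in config.py takes precedence over the default.

```python
import copy

# Hypothetical stand-in for the library defaults; only the keys used in
# this section are modeled.
LIBRARY_DEFAULTS = {"*": {"_source": {"default": None}}}

# Deep copy first, so nested dictionaries are not shared with the original.
QUERY_KWARGS = copy.deepcopy(LIBRARY_DEFAULTS)
QUERY_KWARGS["*"]["_source"]["default"] = ["name", "symbol", "taxid", "entrezgene"]

print(LIBRARY_DEFAULTS["*"]["_source"]["default"])  # still the original value
print(QUERY_KWARGS["*"]["_source"]["default"])      # the customized value
```

A shallow copy would not be enough here, because the nested “_source” dictionary would still be shared between the two objects.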

4. Customizing an API through pipeline stages

In the previous example, the numbers in the “entrezgene” field are typed as strings. Let’s modify the internal logic called the query pipeline to convert these values to integers just to show what we can do in customization.

Note

The pipeline is one of the biothings.web.services. It defines the intermediate steps or stages we take to execute a query. See biothings.web.query to learn more about the individual stages.

Add to config.py:

ES_RESULT_TRANSFORM = "pipeline.MyFormatter"

And create a file pipeline.py to include:

from biothings.web.query import ESResultFormatter


class MyFormatter(ESResultFormatter):

    def transform_hit(self, path, doc, options):

        if path == '' and 'entrezgene' in doc:  # root level
            try:
                doc['entrezgene'] = int(doc['entrezgene'])
            except (TypeError, ValueError):
                pass  # leave non-numeric values unchanged

Save your changes and restart the webserver process. Run some queries and you should see the “entrezgene” field now showing as integers:

{
    "_id": "1017",
    "_score": 1,
    "entrezgene": 1017, # instead of the quoted "1017" (str)
    "name": "cyclin dependent kinase 2",
    "symbol": "CDK2",
    "taxid": 9606
}

In this example, we made changes to the query transformation stage, controlled by the biothings.web.query.formatter.ESResultFormatter class. This is one of the three stages that define the query pipeline. The two stages that come before it are represented by biothings.web.query.engine.AsyncESQueryBackend and biothings.web.query.builder.ESQueryBuilder.
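
The conversion performed in transform_hit can be tried in isolation, without a running server. The sketch below is a standalone function, not the formatter class itself; coerce_entrezgene is an illustrative name. It shows that values which cannot be read as integers are left as they are.

```python
def coerce_entrezgene(doc):
    """Convert doc['entrezgene'] to int in place when possible."""
    if "entrezgene" in doc:
        try:
            doc["entrezgene"] = int(doc["entrezgene"])
        except (TypeError, ValueError):
            pass  # leave values that are not integer-like unchanged
    return doc


print(coerce_entrezgene({"_id": "1017", "entrezgene": "1017"}))
print(coerce_entrezgene({"_id": "x", "entrezgene": "n/a"}))
```

Catching only TypeError and ValueError keeps unrelated errors visible instead of silently swallowing them.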

Let’s try to modify the query builder stage to add another feature. We’ll incorporate domain knowledge here to deliver more user-friendly search results, scoring the documents with a few rules to increase result relevancy. Additionally add to the pipeline.py file:

from biothings.web.query import ESQueryBuilder
from elasticsearch_dsl import Search

class MyQueryBuilder(ESQueryBuilder):

    def apply_extras(self, search, options):

        search = Search().query(
            "function_score",
            query=search.query,
            functions=[
                {"filter": {"term": {"name": "pseudogene"}}, "weight": "0.5"},  # downgrade
                {"filter": {"term": {"taxid": 9606}}, "weight": "1.55"},
                {"filter": {"term": {"taxid": 10090}}, "weight": "1.3"},
                {"filter": {"term": {"taxid": 10116}}, "weight": "1.1"},
            ], score_mode="first")

        return super().apply_extras(search, options)

Make sure our application can pick up the change by adding this line to config.py:

ES_QUERY_BUILDER = "pipeline.MyQueryBuilder"

Note

We wrapped our original query logic in a function score query, an Elasticsearch compound query. For more on writing python-friendly Elasticsearch queries, see the Elasticsearch DSL package, one of the dependencies used in biothings.web.

Save the file and restart the webserver process. Search for something and compare with the application before; you may notice that some result rankings have changed. This change is not easy to spot if you are not familiar with the data, so visit http://localhost:8000/v1/query?q=kinase&rawquery instead and see that our code is indeed making a difference: it gets passed to elasticsearch and affects the query result ranking. Note that “rawquery” is a feature in our program that intercepts the raw query we send to elasticsearch, for debugging.
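
To build intuition for how the rules in MyQueryBuilder affect ranking, the sketch below re-implements them in plain Python. It assumes the usual function score behavior: with score_mode set to first, only the first matching rule applies, its weight multiplies the original score, and a document matching no rule keeps its score. The pseudogene check is a rough approximation of a term filter on an analyzed text field.

```python
# Same rules as in MyQueryBuilder, expressed as (predicate, weight) pairs.
RULES = [
    (lambda doc: "pseudogene" in doc.get("name", "").lower().split(), 0.5),
    (lambda doc: doc.get("taxid") == 9606, 1.55),
    (lambda doc: doc.get("taxid") == 10090, 1.3),
    (lambda doc: doc.get("taxid") == 10116, 1.1),
]


def rescore(doc, base_score=1.0):
    """Apply the weight of the first matching rule to the base score."""
    for matches, weight in RULES:
        if matches(doc):
            return base_score * weight
    return base_score  # no rule matched, score unchanged


print(rescore({"name": "cyclin dependent kinase 2", "taxid": 9606}))
print(rescore({"name": "kinase pseudogene", "taxid": 9606}))
print(rescore({"name": "cyclin dependent kinase 2", "taxid": 7955}))
```

Because the pseudogene rule comes first, a human pseudogene is downgraded rather than boosted, which is the point of ordering the rules this way.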

5. Customizing an API through pipeline services

Taking it one step further, we can add more procedures or stages to the pipeline by overriding the Pipeline class. Add to the config file:

ES_QUERY_PIPELINE = "pipeline.MyQueryPipeline"

and add the following code to pipeline.py:

from biothings.web.query import AsyncESQueryPipeline


class MyQueryPipeline(AsyncESQueryPipeline):

    async def fetch(self, id, **options):

        if id == "tutorial":
            res = {"_welcome": "to the world of biothings.api"}
            res.update(await super().fetch("1017", **options))
            return res

        res = await super().fetch(id, **options)
        return res

We have now made ourselves a tutorial page that shows what annotation results can look like. Visiting http://localhost:8000/v1/gene/tutorial shows what http://localhost:8000/v1/gene/1017 would typically give you, plus the additional welcome message:

{
    "_welcome": "to the world of biothings.api",
    "_id": "1017",
    "_version": 1,
    "entrezgene": 1017,
    "name": "cyclin dependent kinase 2",
    "symbol": "CDK2",
    "taxid": 9606
}

Note

In this example, we modified the query pipeline’s “fetch” method, the one used in the annotation endpoint, to include some additional logic before executing what we would typically do. The call to the “super” function executes the typical query building, executing and formatting stages.
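
The override pattern can be exercised on its own with a stand-in base class. In the sketch below, BasePipeline is hypothetical and simply reads from a dictionary; only the structure of the subclass mirrors the pipeline code in this step.

```python
import asyncio


class BasePipeline:
    """Hypothetical stand-in for the library's query pipeline."""

    DOCS = {"1017": {"_id": "1017", "symbol": "CDK2"}}

    async def fetch(self, id, **options):
        return dict(self.DOCS[id])  # return a copy of the stored document


class TutorialPipeline(BasePipeline):

    async def fetch(self, id, **options):
        if id == "tutorial":
            # Special case: prepend a welcome message to a known document.
            res = {"_welcome": "to the world of biothings.api"}
            res.update(await super().fetch("1017", **options))
            return res
        # Everything else falls through to the regular behavior.
        return await super().fetch(id, **options)


result = asyncio.run(TutorialPipeline().fetch("tutorial"))
print(result)
```

Since the base method is a coroutine, the call to super().fetch must be awaited; forgetting the await would merge a coroutine object instead of a document.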

6. Customizing an API through the web app

The examples above demonstrated the customizations you can make on top of our pre-defined APIs. For the most demanding tasks, you can additionally add your own API routes to the web app.

Modify the config file as a usual first step. Declare a new route by adding:

from biothings.web.settings.default import APP_LIST

APP_LIST = [
    *APP_LIST, # keep the original ones
    (r"/{ver}/echo/(.+)", "handlers.EchoHandler"),
]

Let’s make an echo handler that simply echoes what the user puts in the URL. Create a handlers.py and add:

from biothings.web.handlers import BaseAPIHandler


class EchoHandler(BaseAPIHandler):

    def get(self, text):
        self.write({
            "status": "ok",
            "result": text
        })

Now we have added a completely new feature that is not based on any of the existing biothings offerings, and which can be as simple or as complex as you need. Visiting http://localhost:8000/v1/echo/hello gives you:

{
    "status": "ok",
    "result": "hello"
}

Here, the “hello” in the “result” field is the input we gave the application in the URL.
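
How the route pattern turns a URL path into the handler argument can be sketched with the standard library. This is an approximation under two assumptions: that the {ver} placeholder is filled in with the API version through string formatting, and that the pattern must match the whole path. The echo function is a hypothetical stand-in for the route plus handler, and it ignores URL decoding.

```python
import re

ROUTE_TEMPLATE = r"/{ver}/echo/(.+)"  # pattern declared in APP_LIST


def echo(path, ver="v1"):
    """Return the payload the echo route would produce, or None if no match."""
    pattern = ROUTE_TEMPLATE.format(ver=ver)  # assumed templating step
    match = re.fullmatch(pattern, path)
    if match is None:
        return None
    # The captured group plays the role of the "text" argument of get().
    return {"status": "ok", "result": match.group(1)}


print(echo("/v1/echo/hello"))
print(echo("/v1/echo/"))
```

Note that (.+) requires at least one character, so a path ending right after /echo/ does not reach the handler.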

7. Customizing an API through the app launcher

Another convenient place to customize the API is to have a launching module, typically called index.py, and pass parameters to the starting function, provided as biothings.web.launcher.main(). Create an index.py in your project folder:

from biothings.web.launcher import main
from tornado.web import RedirectHandler

if __name__ == '__main__':
    main([
        (r"/v2/query(.*)", RedirectHandler, {"url": "/v1/query{0}"})
    ], {
        "static_path": "static"
    })

Create another folder called “static” and add a file of random content named “file.txt” under it. In this step, we added a redirection for a v2 query API that will launch later, temporarily pointing it to the v1 API. We also passed a static file configuration that asks tornado, the default webserver we use, to serve files under the static folder we specified. The static folder is named “static” and contains only one file in this example.

Note

For more on configuring route redirections and other application features in tornado, see RedirectHandler and Application configuration.

After making the changes, visiting http://localhost:8000/v2/query/?q=cdk2 redirects you to http://localhost:8000/v1/query/?q=cdk2, and visiting http://localhost:8000/static/file.txt shows the random content you previously created. Note that in this step, you should run the launcher module directly, with something like python index.py, instead of running the first command we introduced. Running the launcher directly is also how we start most of our user-facing products that require complex configurations, like http://mygene.info/. Its code is publicly available at https://github.com/biothings/mygene.info under the Biothings Organization.
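
The redirect rule in index.py combines a capture group in the route pattern with a {0} placeholder in the target URL. The sketch below approximates that substitution with the standard library. It assumes the query string is carried over to the new location, as the URLs in this step suggest; redirect_location is a hypothetical helper, not a tornado function.

```python
import re

REDIRECT_PATTERN = r"/v2/query(.*)"  # route pattern from index.py
REDIRECT_TARGET = "/v1/query{0}"     # "url" argument from index.py


def redirect_location(path, query_string=""):
    """Approximate the URL that the redirect rule sends a request to."""
    match = re.fullmatch(REDIRECT_PATTERN, path)
    if match is None:
        return None  # this rule does not apply to the path
    # Each captured group fills the placeholder with the same index.
    location = REDIRECT_TARGET.format(*match.groups())
    return f"{location}?{query_string}" if query_string else location


print(redirect_location("/v2/query/", "q=cdk2"))
print(redirect_location("/v2/query"))
```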

The End

Having finished this tutorial, you have completed the most common steps to customize biothings.api. The customization starts with passing a different parameter at launch time and evolves to modifying the app code at different levels. We hope you now feel confident running a Biothings API. Please check out the documentation page for more details on customizing APIs.