DataTransform Module

A key problem when merging data from multiple data sources is finding a common identifier. To ameliorate this problem, we have written a DataTransform module to convert identifiers from one type to another. Frequently, this conversion process has multiple steps, where an identifier is converted to one or more intermediates before reaching its final value. To describe these steps, the user defines a graph where each node represents an identifier type and each edge represents a conversion. The module processes documents using this graph to convert their identifiers to their final form.

A graph is a mathematical model describing how different things are connected. In our model, the graph connects different identifier types. Each connection is an identifier conversion or lookup process. For example, a simple graph could describe how pubchem identifiers can be converted to drugbank identifiers using MyChem.info.
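
The idea of converting through intermediates can be sketched in plain Python, independent of the module itself. In this minimal sketch each conversion is a lookup table and an identifier is converted step by step along a list of node names. The names CONVERSIONS and convert_along_path, and all identifier values, are hypothetical placeholders and not part of the DataTransform API.

```python
# Toy lookup tables: (source type, target type) -> {source id: target id}.
# All identifiers here are made-up placeholders, not real database records.
CONVERSIONS = {
    ("inchi", "pubchem"): {"InChI=example-1": "1001"},
    ("pubchem", "inchikey"): {"1001": "EXAMPLEKEY-ONE"},
}


def convert_along_path(identifier, path):
    """Convert an identifier step by step along a list of node names.

    Returns None as soon as one step has no match, which stands in for a
    failed conversion.
    """
    current = identifier
    for source, target in zip(path, path[1:]):
        table = CONVERSIONS.get((source, target), {})
        current = table.get(current)
        if current is None:
            return None
    return current


path = ["inchi", "pubchem", "inchikey"]
result = convert_along_path("InChI=example-1", path)
missing = convert_along_path("InChI=unknown", path)
print(result)
print(missing)
```

Here pubchem acts as the intermediate: the first table produces a pubchem value, which the second table converts to the final inchikey value.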

Graph Definition

The following graph facilitates conversion from inchi to inchikey using pubchem as an intermediate:

from biothings.hub.datatransform import MongoDBEdge
import networkx as nx

graph_mychem = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('inchi')
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')

graph_mychem.add_edge('inchi', 'pubchem',
                      object=MongoDBEdge('pubchem', 'pubchem.inchi', 'pubchem.cid'))

graph_mychem.add_edge('pubchem', 'inchikey',
                      object=MongoDBEdge('pubchem', 'pubchem.cid', 'pubchem.inchi_key'))

To set up a graph, one must define nodes and edges. There should be a node for each type of identifier and an edge which describes how to convert from one identifier to another. Node names can be arbitrary; the user is allowed to choose what an identifier should be called. Edge classes, however, must be defined precisely for conversion to be successful.

Edge Classes

The following edge classes are supported by the DataTransform module. One of these edge classes must be selected when defining an edge connecting two nodes in a graph.

MongoDBEdge

class biothings.hub.datatransform.MongoDBEdge(collection_name, lookup, field, weight=1, label=None, check_index=True)

The MongoDBEdge uses data within a MongoDB collection to convert one identifier to another. The input identifier is used to search a collection. The output identifier values are read out of that collection.

Parameters:

collection_name (str) – The name of the MongoDB collection.
lookup (str) – The field that will match the input identifier in the collection.
field (str) – The output identifier field that will be read out of matching documents.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

The example above uses the MongoDBEdge class to convert from inchi to inchikey.
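
The roles of the lookup and field parameters can be illustrated without a database. The following is a minimal sketch, assuming a collection behaves like a list of nested dictionaries. The helpers get_dotted and lookup_identifiers and the sample records are hypothetical and do not reflect how MongoDBEdge is implemented.

```python
def get_dotted(document, dotted_field):
    """Read a nested value such as 'pubchem.cid' from a dictionary."""
    value = document
    for key in dotted_field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def lookup_identifiers(collection, lookup, field, identifier):
    """Return the `field` values of documents whose `lookup` value matches."""
    results = []
    for document in collection:
        if get_dotted(document, lookup) == identifier:
            value = get_dotted(document, field)
            if value is not None:
                results.append(value)
    return results


# Stand-in for the 'pubchem' collection; the records are invented placeholders.
pubchem_collection = [
    {"_id": "doc1", "pubchem": {"cid": "1001", "inchi": "InChI=example-1",
                                "inchi_key": "EXAMPLEKEY-ONE"}},
    {"_id": "doc2", "pubchem": {"cid": "1002", "inchi": "InChI=example-2",
                                "inchi_key": "EXAMPLEKEY-TWO"}},
]

# Mirrors the two edges of the example graph: inchi -> pubchem -> inchikey.
cids = lookup_identifiers(pubchem_collection, "pubchem.inchi", "pubchem.cid",
                          "InChI=example-1")
keys = lookup_identifiers(pubchem_collection, "pubchem.cid",
                          "pubchem.inchi_key", "1001")
print(cids, keys)
```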

MyChemInfoEdge

class biothings.hub.datatransform.MyChemInfoEdge(lookup, field, weight=1, label=None, url=None)

The MyChemInfoEdge uses the MyChem.info API to convert identifiers.

Parameters:

lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

This example graph uses the MyChemInfoEdge class to convert from pubchem to inchikey. The pubchem.cid and pubchem.inchi_key fields are returned by MyChem.info and are listed by its /metadata/fields endpoint.

from biothings.hub.datatransform import MyChemInfoEdge
import networkx as nx

graph_mychem = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph_mychem.add_node('pubchem')
graph_mychem.add_node('inchikey')

graph_mychem.add_edge('pubchem', 'inchikey',
                      object=MyChemInfoEdge('pubchem.cid', 'pubchem.inchi_key'))

MyGeneInfoEdge

class biothings.hub.datatransform.MyGeneInfoEdge(lookup, field, weight=1, label=None, url=None)

The MyGeneInfoEdge uses the MyGene.info API to convert identifiers.

Parameters:

lookup (str) – The field in the API to search with the input identifier.
field (str) – The field in the API to convert to.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
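
The weight parameter shared by the edge classes above states that the path with the lowest weight is preferred. A minimal sketch of that preference, using a standard lowest-cost search over a plain dictionary, is shown below. The function lowest_weight_path and the weights are hypothetical; how the module itself ranks or tries paths is not described here.

```python
import heapq


def lowest_weight_path(edges, start, goal):
    """Return (total weight, path) for the cheapest path, or None if unreachable.

    `edges` maps a node name to a list of (neighbor, weight) pairs.
    """
    queue = [(0, [start])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in edges.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, path + [neighbor]))
    return None


# Hypothetical weights: the direct edge is made deliberately expensive.
edges = {
    "inchi": [("inchikey", 5), ("pubchem", 1), ("drugbank", 2)],
    "pubchem": [("inchikey", 1)],
    "drugbank": [("inchikey", 1)],
}

best = lowest_weight_path(edges, "inchi", "inchikey")
print(best)
```

With these weights the two-step route through pubchem (total weight 2) is chosen over the direct edge (weight 5) and the route through drugbank (total weight 3).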

RegExEdge

class biothings.hub.datatransform.RegExEdge(from_regex, to_regex, weight=1, label=None)

The RegExEdge allows an identifier to be transformed using a regular expression. POSIX regular expressions are supported.

Parameters:

from_regex (str) – The first parameter of the regular expression substitution.
to_regex (str) – The second parameter of the regular expression substitution.
weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

This example graph uses the RegExEdge class to convert pubchem identifiers to a shorter form. The CID: prefix is removed by the regular expression substitution:

from biothings.hub.datatransform import RegExEdge
import networkx as nx

graph = nx.DiGraph()

###############################################################################
# DataTransform Nodes and Edges
###############################################################################
graph.add_node('pubchem')
graph.add_node('pubchem-short')

graph.add_edge('pubchem', 'pubchem-short',
               object=RegExEdge('CID:', ''))
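
The effect of this substitution can be previewed with Python's re module. This is a small sketch that assumes the edge's substitution behaves like re.sub with from_regex as the pattern and to_regex as the replacement; the sample identifier is a placeholder.

```python
import re

# Same arguments as RegExEdge('CID:', '') in the graph above.
from_regex = "CID:"
to_regex = ""

short_id = re.sub(from_regex, to_regex, "CID:2244")
unchanged = re.sub(from_regex, to_regex, "2244")
print(short_id, unchanged)
```

An identifier that does not contain the prefix passes through unchanged.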

Example Usage

A complex graph developed for use with MyChem.info is shown here. This file includes a definition of the MyChemKeyLookup class which is used to call the module on the data source. In general, the graph and class should be supplied to the user by the BioThings.api maintainers.

To apply the DataTransform module to a Biothings Uploader, the following definition is used:

keylookup = MyChemKeyLookup(
        [('inchi', 'pharmgkb.inchi'),
         ('pubchem', 'pharmgkb.xrefs.pubchem.cid'),
         ('drugbank', 'pharmgkb.xrefs.drugbank'),
         ('chebi', 'pharmgkb.xrefs.chebi')])

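# Defined inside the uploader class, where keylookup (above) is a class attribute.
# The load_data passed to keylookup is assumed to be the parser function that
# yields documents; os must be imported at the top of the file.
def load_data(self, data_folder):
    input_file = os.path.join(data_folder, "drugs.tsv")
    return self.keylookup(load_data)(input_file)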

The parameter passed to MyChemKeyLookup is a list of input types. The first element of each input type is a node name, which must match a node in the graph. The second element is a field in dotstring notation describing where the identifier should be read from in a document.
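
Dotstring notation can be illustrated with a short sketch that walks into nested dictionaries and reports which input types are present in a document. The helpers read_field and available_inputs and the sample document are hypothetical; they only illustrate the notation and are not the module's own code.

```python
from functools import reduce


def read_field(document, dotstring):
    """Follow a dotstring such as 'pharmgkb.xrefs.drugbank' into nested dicts."""
    def step(value, key):
        return value.get(key) if isinstance(value, dict) else None
    return reduce(step, dotstring.split("."), document)


def available_inputs(document, input_types):
    """List (node name, identifier) pairs found in the document, in order."""
    found = []
    for node, field in input_types:
        value = read_field(document, field)
        if value is not None:
            found.append((node, value))
    return found


input_types = [
    ("inchi", "pharmgkb.inchi"),
    ("pubchem", "pharmgkb.xrefs.pubchem.cid"),
    ("drugbank", "pharmgkb.xrefs.drugbank"),
    ("chebi", "pharmgkb.xrefs.chebi"),
]

# Invented document: it carries a DrugBank and a PubChem cross-reference only.
document = {
    "_id": "PA0000",
    "pharmgkb": {"xrefs": {"drugbank": "DB-EXAMPLE", "pubchem": {"cid": "1001"}}},
}

inputs = available_inputs(document, input_types)
print(inputs)
```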

The following report was produced when using the DataTransform module with PharmGKB. Reports have a section for document conversion and a section describing conversion along each edge. The document section shows which inputs were used to produce which outputs. The edge section is useful in debugging graphs, ensuring that different conversion edges are working properly.

{
     'doc_report': {
          "('inchi', 'pharmgkb.inchi')-->inchikey": 1637,
          "('pubchem', 'pharmgkb.xrefs.pubchem.cid')-->inchikey": 46,
          "('drugbank', 'pharmgkb.xrefs.drugbank')-->inchikey": 41,
          "('drugbank', 'pharmgkb.xrefs.drugbank')-->drugbank": 25,
     },
     'edge_report': {
          'inchi-->chembl': 1109,
          'inchi-->drugbank': 319,
          'inchi-->pubchem': 209,
          'chembl-->inchikey': 1109,
          'drugbank-->inchikey': 360,
          'pubchem-->inchikey': 255,
          'drugbank-->drugbank': 25,
     },
}

As an example, the number of identifiers converted from inchi to inchikey is 1637. However, these conversions are done via intermediates. One of these intermediates is chembl, and the number of identifiers converted from inchi to chembl is 1109. Some identifiers are converted directly from pubchem and drugbank. The inchi field is used to look up several intermediates (chembl, drugbank, and pubchem). Eventually, most of these intermediates are converted to inchikey.
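
The two sections of the report can be cross-checked with a little arithmetic. The sketch below copies the counts from the report above and sums them by source or target; the helper total is a hypothetical convenience and not part of the module.

```python
report = {
    "doc_report": {
        "('inchi', 'pharmgkb.inchi')-->inchikey": 1637,
        "('pubchem', 'pharmgkb.xrefs.pubchem.cid')-->inchikey": 46,
        "('drugbank', 'pharmgkb.xrefs.drugbank')-->inchikey": 41,
        "('drugbank', 'pharmgkb.xrefs.drugbank')-->drugbank": 25,
    },
    "edge_report": {
        "inchi-->chembl": 1109,
        "inchi-->drugbank": 319,
        "inchi-->pubchem": 209,
        "chembl-->inchikey": 1109,
        "drugbank-->inchikey": 360,
        "pubchem-->inchikey": 255,
        "drugbank-->drugbank": 25,
    },
}


def total(section, source=None, target=None):
    """Sum the counts whose 'source-->target' key matches the given ends."""
    count = 0
    for key, value in section.items():
        key_source, _, key_target = key.rpartition("-->")
        if source is not None and key_source != source:
            continue
        if target is not None and key_target != target:
            continue
        count += value
    return count


from_inchi = total(report["edge_report"], source="inchi")
into_inchikey = total(report["edge_report"], target="inchikey")
docs_to_inchikey = total(report["doc_report"], target="inchikey")
print(from_inchi, into_inchikey, docs_to_inchikey)
```

The edges leaving inchi sum to 1109 + 319 + 209 = 1637, which matches the inchi entry in the document section, and the edges arriving at inchikey sum to 1109 + 360 + 255 = 1724, which matches the 1637 + 46 + 41 documents converted to inchikey.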

Advanced Usage - DataTransform MDB

The DataTransformMDB class was written as a decorator class which is intended to be applied to the load_data function of a Biothings Uploader. This class can be sub-classed to simplify application within a Biothings service.

class biothings.hub.datatransform.DataTransformMDB(graph, *args, **kwargs)

Convert document identifiers from one type to another.

The DataTransformMDB class was written as a decorator class which should be applied to the load_data function of a Biothings Uploader. The load_data function yields documents, which are then post-processed by the decorator's call, and the ‘_id’ key conversion is performed.

Parameters:

graph (nx.DiGraph) – The configuration graph (networkx 2.1).
input_types – A list of input types of the form (identifier, field), where identifier matches a node and field is an optional dotstring field for where the identifier should be read from (the default is ‘_id’).
output_types (list(str)) – A priority list of identifiers to convert to. These identifiers should match nodes in the graph.
id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.
skip_on_failure (bool) – If True, documents where identifier conversion fails will be skipped in the final document list.
skip_w_regex (str) – Do not perform conversion if the identifier matches the regular expression provided to this argument. By default, this option is disabled.
skip_on_success (bool) – If True, documents where identifier conversion succeeds will be skipped in the final document list.
idstruct_class (class) – Override an internal data structure used by this module (advanced usage).
copy_from_doc (bool) – If True, an identifier is copied from the input source document regardless of whether it matches an edge or not (advanced usage).

Note: Prefixes can be defined at the node level using graph.add_node("chebi", prefix="CHEBI"). When an identifier is converted to a node with a prefix attribute, the prefix will be automatically added to the _id.

An example of how to apply this class is shown below:

keylookup = DataTransformMDB(graph, input_types, output_types,
                             skip_on_failure=False, skip_w_regex=None,
                             idstruct_class=IDStruct, copy_from_doc=False)

def load_data(self, data_folder):
    input_file = os.path.join(data_folder, "drugs.tsv")
    return self.keylookup(load_data)(input_file)
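
The decorator pattern used here, in which an object wraps a load_data function and post-processes the documents it yields, can be sketched with a toy class. ToyKeyLookup, its id_table argument, and the sample rows are hypothetical stand-ins; the sketch only imitates the general shape of the pattern and the documented meaning of skip_on_failure, not the behavior of DataTransformMDB.

```python
class ToyKeyLookup:
    """Hypothetical stand-in for a decorator class; not the DataTransformMDB API."""

    def __init__(self, id_table, skip_on_failure=False):
        self.id_table = id_table  # maps an input _id to a converted _id
        self.skip_on_failure = skip_on_failure

    def __call__(self, load_data_func):
        def wrapped(*args, **kwargs):
            for doc in load_data_func(*args, **kwargs):
                new_id = self.id_table.get(doc["_id"])
                if new_id is not None:
                    # Conversion succeeded: emit a copy with the new _id.
                    yield dict(doc, _id=new_id)
                elif not self.skip_on_failure:
                    # Conversion failed: keep the document unconverted.
                    yield doc
        return wrapped


def load_data(rows):
    """Toy parser that yields one document per (identifier, name) row."""
    for identifier, name in rows:
        yield {"_id": identifier, "name": name}


rows = [("1001", "drug one"), ("9999", "drug without a match")]
table = {"1001": "EXAMPLEKEY-ONE"}

kept = list(ToyKeyLookup(table)(load_data)(rows))
skipped = list(ToyKeyLookup(table, skip_on_failure=True)(load_data)(rows))
print([doc["_id"] for doc in kept])
print([doc["_id"] for doc in skipped])
```

The call shape ToyKeyLookup(table)(load_data)(rows) mirrors self.keylookup(load_data)(input_file): the first call configures the decorator, the second wraps the function, and the third runs it.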

It is possible to extend the DataTransformEdge type and define custom edges. This could be useful, for example, if the user wanted to define a computation that transforms one identifier into another. For instance, an inchikey may be computed directly by performing a hash on the inchi identifier.

Document Maintainers

Greg Taylor (@gregtaylor)
Chunlei Wu (@chunleiwu)