DataTransform Module #################### A key problem when merging data from multiple data sources is finding a common identifier. To ameliorate this problem, we have written a DataTransform module to convert identifiers from one type to another. Frequently, this conversion process has multiple steps, where an identifier is converted to one or more intermediates before having its final value. To describe these steps, the user defines a graph where each node represents an identifier type and each edge represents a conversion. The module processes documents using the network to convert their identifiers to their final form. A graph is a mathematical model describing how different things are connected. Using our model, our module is connecting different identifiers together. Each connection is an identifier conversion or lookup process. For example, a simple graph could describe how `pubchem` identifiers could be converted to `drugbank` identifiers using `MyChem.info`. Graph Definition ---------------- The following graph facilitates conversion from `inchi` to `inchikey` using `pubchem` as an intermediate: .. code-block:: python from biothings.hub.datatransform import MongoDBEdge import networkx as nx graph_mychem = nx.DiGraph() ############################################################################### # DataTransform Nodes and Edges ############################################################################### graph_mychem.add_node('inchi') graph_mychem.add_node('pubchem') graph_mychem.add_node('inchikey') graph_mychem.add_edge('inchi', 'pubchem', object=MongoDBEdge('pubchem', 'pubchem.inchi', 'pubchem.cid')) graph_mychem.add_edge('pubchem', 'inchikey', object=MongoDBEdge('pubchem', 'pubchem.cid', 'pubchem.inchi_key')) To setup a graph, one must define nodes and edges. There should be a node for each type of identifier and an edge which describes how to convert from one identifier to another. Node names can be arbitrary; the user is allowed to chose what an identifier should be called. Edge classes, however, must be defined precisely for conversion to be successful. Edge Classes ------------ The following edge classes are supported by the `DataTransform` module. One of these edge classes must be selected when defining an edge connecting two nodes in a graph. MongoDBEdge ~~~~~~~~~~~ .. autoclass:: biothings.hub.datatransform.MongoDBEdge The example above uses the `MongoDBEdge` class to convert from `inchi` to `inchikey`. MyChemInfoEdge ~~~~~~~~~~~~~~ .. autoclass:: biothings.hub.datatransform.MyChemInfoEdge This example graph uses the `MyChemInfoEdge` class to convert from `pubchem` to `inchikey`. The `pubchem.cid` and `pubchem.inchi_key` fields are returned by `MyChem.info` and are listed by `/metadata/fields `_. .. code-block:: python from biothings.hub.datatransform import MyChemInfoEdge import networkx as nx graph_mychem = nx.DiGraph() ############################################################################### # DataTransform Nodes and Edges ############################################################################### graph_mychem.add_node('pubchem') graph_mychem.add_node('inchikey') graph_mychem.add_edge('pubchem', 'inchikey', object=MyChemInfoEdge('pubchem.cid', 'pubchem.inchi_key')) MyGeneInfoEdge ~~~~~~~~~~~~~~ .. autoclass:: biothings.hub.datatransform.MyGeneInfoEdge RegExEdge ~~~~~~~~~ .. autoclass:: biothings.hub.datatransform.RegExEdge This example graph uses the `RegExEdge` class to convert from `pubchem` to a shorter form. The `CID:` prefix is removed by the regular expression substitution: .. code-block:: python from biothings.hub.datatransform import RegExEdge import networkx as nx graph = nx.DiGraph() ############################################################################### # DataTransform Nodes and Edges ############################################################################### graph.add_node('pubchem') graph.add_node('pubchem-short') graph.add_edge('pubchem', 'pubchem-short', object=RegExEdge('CID:', '')) Example Usage ------------- A complex graph developed for use with `MyChem.info `_ is shown `here `_. This file includes a definition of the `MyChemKeyLookup` class which is used to call the module on the data source. In general, the graph and class should be supplied to the user by the BioThings.api maintainers. To call the `DataTransform` module on the `Biothings Uploader`, the following definition is used: .. code-block:: python keylookup = MyChemKeyLookup( [('inchi', 'pharmgkb.inchi'), ('pubchem', 'pharmgkb.xrefs.pubchem.cid'), ('drugbank', 'pharmgkb.xrefs.drugbank'), ('chebi', 'pharmgkb.xrefs.chebi')]) def load_data(self,data_folder): input_file = os.path.join(data_folder,"drugs.tsv") return self.keylookup(load_data)(input_file) The parameters passed to `MyChemKeyLookup` are a list of input types. The first element in an input type is the node name that must match the graph. The second element is the field in `dotstring` notation which should describe where the identifier should be read from in a document. The following report was reported when using the `DataTransform` module with PharmGKB. Reports have a section for document conversion and a section describing conversion along each edge. The document section shows which inputs were used to produce which outputs. The edge section is useful in debugging graphs, ensuring that different conversion edges are working properly. .. code-block:: python { 'doc_report': { "('inchi', 'pharmgkb.inchi')-->inchikey": 1637, "('pubchem', 'pharmgkb.xrefs.pubchem.cid')-->inchikey": 46 "('drugbank', 'pharmgkb.xrefs.drugbank')-->inchikey": 41, "('drugbank', 'pharmgkb.xrefs.drugbank')-->drugbank": 25, } 'edge_report': { 'inchi-->chembl': 1109, 'inchi-->drugbank': 319, 'inchi-->pubchem': 209, 'chembl-->inchikey': 1109, 'drugbank-->inchikey': 360, 'pubchem-->inchikey': 255 'drugbank-->drugbank': 25, }, } As an example, the number identifiers converted from `inchi` to `inchikey` is 1637. However, these conversions are done via intermediates. One of these intermediates is `chembl` and the number of identifiers converted from `inchi` to `chembl` is 319. Some identifiers are converted directly from `pubchem` and `drugbank`. The `inchi` field is used to lookup several intermediates (`chembl`, `drugbank`, and `pubchem`). Eventually, most of these intermediates are converted to `inchikey`. Advanced Usage - DataTransform MDB ---------------------------------- The `DataTransformMDB` module was written as a decorator class which is intended to be applied to the `load_data` function of a `Biothings Uploader`. This class can be sub-classed to simplify applification within a Biothings service. .. autoclass:: biothings.hub.datatransform.DataTransformMDB An example of how to apply this class is shown below: .. code-block:: python keylookup = DataTransformMDB(graph, input_types, output_types, skip_on_failure=False, skip_w_regex=None, idstruct_class=IDStruct, copy_from_doc=False) def load_data(self,data_folder): input_file = os.path.join(data_folder,"drugs.tsv") return self.keylookup(load_data)(input_file) It is possible to extend the `DataTransformEdge` type and define custom edges. This could be useful for example if the user wanted to define a computation that transforms one identifier to another. For example `inchikey` may be computed directly by performing a hash on the `inchi` identifier. Document Maintainers -------------------- * Greg Taylor (@gregtaylor) * Chunlei Wu (@chunleiwu)