biothings.hub.datatransform

biothings.hub.datatransform.ciidstruct

CIIDStruct - case-insensitive id matching data structure

class biothings.hub.datatransform.ciidstruct.CIIDStruct(field=None, doc_lst=None)[source]

Bases: IDStruct

CIIDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.

This is a case-insensitive version of IDStruct.

Initialize the structure.

Parameters:
  • field – field for documents to use as an initial id (optional)

  • doc_lst – list of documents to use when building an initial list (optional)

add(left, right)[source]

Add an (original_id, current_id) pair to the list. All string values are converted to lowercase.

find(where, ids)[source]

Case insensitive lookup of ids
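The pair structure described above can be illustrated with a small stand-alone sketch. It is not the library's implementation: the class name `CaseInsensitivePairs`, its attributes, and the sample identifiers are hypothetical, and it only assumes that string ids are lowercased on insertion and on lookup.

```python
class CaseInsensitivePairs:
    """Toy stand-in for a list of (original_id, current_id) pairs."""

    def __init__(self):
        self.forward = {}  # original_id -> list of current_ids

    @staticmethod
    def _norm(value):
        # Lowercase strings only; ints and other types are left untouched.
        return value.lower() if isinstance(value, str) else value

    def add(self, left, right):
        left, right = self._norm(left), self._norm(right)
        current = self.forward.setdefault(left, [])
        if right not in current:
            current.append(right)

    def find_left(self, ids):
        # Case-insensitive lookup: the query ids are normalized the same way.
        for key in ids:
            key = self._norm(key)
            for right in self.forward.get(key, []):
                yield key, right


pairs = CaseInsensitivePairs()
pairs.add("BRCA1", "ENSG00000012048")
pairs.add("brca1", "672")
matches = list(pairs.find_left(["Brca1"]))
# matches == [("brca1", "ensg00000012048"), ("brca1", "672")]
```

Both `add` calls land under the same lowercased key, so a query in any casing finds both current ids.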

biothings.hub.datatransform.datatransform_api

DataTransformAPI - classes around API-based key lookup.

class biothings.hub.datatransform.datatransform_api.BiothingsAPIEdge(lookup, fields, weight=1, label=None, url=None)[source]

Bases: DataTransformEdge

APIEdge - IDLookupEdge object for API calls

Initialize the class.

Parameters:
  • label – A label can be used for debugging purposes.

property client

property getter for client

client_name = None
edge_lookup(keylookup_obj, id_strct, debug=False)[source]

Follow an edge given a key.

This method uses the data in the edge object to convert one key to another key using an API.

init_state()[source]

initialize state - pickleable member variables

prepare_client()[source]

Load the biothings_client for the class.

class biothings.hub.datatransform.datatransform_api.DataTransformAPI(input_types, output_types, *args, **kwargs)[source]

Bases: DataTransform

Perform key lookup or key conversion from one key type to another using an API endpoint as a data source.

This class uses BioThings APIs to convert from one key type to another. Subclasses are used with the decorator syntax shown below:

@IDLookupMyChemInfo(input_types, output_types)
def load_document(doc_lst):
    for d in doc_lst:
        yield d

Lookup fields are configured in the ‘lookup_fields’ object, examples of which can be found in ‘IDLookupMyGeneInfo’ and ‘IDLookupMyChemInfo’.

Required Options:
  • input_types
    • ‘type’

    • (‘type’, ‘nested_source_field’)

    • [(‘type1’, ‘nested.source_field1’), (‘type2’, ‘nested.source_field2’), …]

  • output_types:
    • ‘type’

    • [‘type1’, ‘type2’]

Additional Options: see DataTransform class
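The accepted shapes of input_types listed above can be made concrete with a short sketch. The helper `normalize_input_types` and the field names are hypothetical and are not part of the library; the sketch only assumes that a bare type falls back to the documented default source, ‘_id’.

```python
DEFAULT_SOURCE = "_id"  # mirrors the documented default_source attribute


def normalize_input_types(input_types):
    """Return a list of (type, source_field) tuples (hypothetical helper)."""
    if isinstance(input_types, str):
        # 'type'
        return [(input_types, DEFAULT_SOURCE)]
    if isinstance(input_types, tuple):
        # ('type', 'nested.source_field')
        return [input_types]
    normalized = []
    for item in input_types:
        # A list may mix bare types and (type, field) pairs.
        if isinstance(item, str):
            normalized.append((item, DEFAULT_SOURCE))
        else:
            normalized.append(tuple(item))
    return normalized


single = normalize_input_types("inchikey")
pair = normalize_input_types(("inchikey", "source.inchikey"))
several = normalize_input_types([("chebi", "source.chebi"), "unii"])
```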

Initialize the IDLookupAPI object.

batch_size = 10
default_source = '_id'
key_lookup_batch(batchiter)[source]

Look up all keys for ids given in the batch iterator (1 block).

Parameters:
  • batchiter – 1 block of records to look up keys for
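The batch_size attribute suggests that documents are looked up in fixed-size blocks. A minimal sketch of that kind of batching, assuming a plain iterator of documents, is shown here; `iter_batches` is a hypothetical helper and not the library's own batching code.

```python
from itertools import islice


def iter_batches(iterable, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 25 illustrative documents split into blocks of 10.
docs = ({"_id": str(i)} for i in range(25))
sizes = [len(batch) for batch in iter_batches(docs, 10)]
# sizes == [10, 10, 5]
```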

lookup_fields = {}
class biothings.hub.datatransform.datatransform_api.DataTransformMyChemInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]

Bases: DataTransformAPI

Single key lookup for MyChemInfo

Initialize the class by setting up the client object.

lookup_fields = {'chebi': 'chebi.chebi_id', 'chembl': 'chembl.molecule_chembl_id', 'drugbank': 'drugbank.drugbank_id', 'drugname': ['drugbank.name', 'unii.preferred_term', 'chebi.chebi_name', 'chembl.pref_name'], 'inchi': ['drugbank.inchi', 'chembl.inchi', 'pubchem.inchi'], 'inchikey': ['drugbank.inchi_key', 'chembl.inchi_key', 'pubchem.inchi_key'], 'pubchem': 'pubchem.cid', 'rxnorm': ['unii.rxcui'], 'unii': 'unii.unii'}
output_types = ['inchikey', 'unii', 'rxnorm', 'drugbank', 'chebi', 'chembl', 'pubchem', 'drugname']
class biothings.hub.datatransform.datatransform_api.DataTransformMyGeneInfo(input_types, output_types=None, skip_on_failure=False, skip_w_regex=None)[source]

Bases: DataTransformAPI

Deprecated.

Initialize the class by setting up the client object.

lookup_fields = {'ensembl': 'ensembl.gene', 'entrezgene': 'entrezgene', 'symbol': 'symbol', 'uniprot': 'uniprot.Swiss-Prot'}
class biothings.hub.datatransform.datatransform_api.MyChemInfoEdge(lookup, field, weight=1, label=None, url=None)[source]

Bases: BiothingsAPIEdge

The MyChemInfoEdge uses the MyChem.info API to convert identifiers.

Parameters:
  • lookup (str) – The field in the API to search with the input identifier.

  • field (str) – The field in the API to convert to.

  • weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

client_name = 'drug'
class biothings.hub.datatransform.datatransform_api.MyGeneInfoEdge(lookup, field, weight=1, label=None, url=None)[source]

Bases: BiothingsAPIEdge

The MyGeneInfoEdge uses the MyGene.info API to convert identifiers.

Parameters:
  • lookup (str) – The field in the API to search with the input identifier.

  • field (str) – The field in the API to convert to.

  • weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

client_name = 'gene'
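The weight parameter determines which conversion path is preferred. The sketch below illustrates the idea with a plain dictionary graph and Dijkstra's algorithm. It assumes that a path's weight is the sum of its edge weights, which the documentation does not state explicitly; the function name and the weights are illustrative only.

```python
import heapq


def lowest_weight_path(edges, start, target):
    """edges maps a node to a list of (neighbor, weight) pairs."""
    queue = [(0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in edges.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None  # no path between start and target


# Two routes from 'symbol' to 'uniprot'; the weights are made up.
graph = {
    "symbol": [("entrezgene", 1), ("ensembl", 1)],
    "ensembl": [("uniprot", 1)],
    "entrezgene": [("uniprot", 3)],
}
result = lowest_weight_path(graph, "symbol", "uniprot")
# result == (2, ["symbol", "ensembl", "uniprot"])
```

Raising the weight of an edge therefore steers conversions away from it whenever a cheaper route exists.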

biothings.hub.datatransform.datatransform_mdb

DataTransform MDB module - class for performing key lookup using conversions described in a networkx graph.

class biothings.hub.datatransform.datatransform_mdb.CIMongoDBEdge(collection_name, lookup, field, weight=1, label=None)[source]

Bases: MongoDBEdge

Case-insensitive MongoDBEdge

Parameters:
  • collection_name (str) – The name of the MongoDB collection.

  • lookup (str) – The field that will match the input identifier in the collection.

  • field (str) – The output identifier field that will be read out of matching documents.

  • weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

collection_find(id_lst, lookup, field)[source]

Abstract out (as one line) the call to collection.find and use a case-insensitive collation

class biothings.hub.datatransform.datatransform_mdb.DataTransformMDB(graph, *args, **kwargs)[source]

Bases: DataTransform

Convert document identifiers from one type to another.

The DataTransformNetworkX module was written as a decorator class which should be applied to the load_data function of a Biothings Uploader. The load_data function yields documents, which are then post-processed by the decorator's __call__ method, where the ‘id’ key conversion is performed.

Parameters:
  • graph – nx.DiGraph (networkx 2.1) configuration graph

  • input_types – A list of input types of the form (identifier, field), where identifier matches a node and field is an optional dotstring field for where the identifier should be read from (the default is ‘_id’).

  • output_types (list(str)) – A priority list of identifiers to convert to. These identifiers should match nodes in the graph.

  • id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.

  • skip_on_failure (bool) – If True, documents where identifier conversion fails will be skipped in the final document list.

  • skip_w_regex (bool) – Do not perform conversion if the identifier matches the regular expression provided to this argument. By default, this option is disabled.

  • skip_on_success (bool) – If True, documents where identifier conversion succeeds will be skipped in the final document list.

  • idstruct_class (class) – Override an internal data structure used by this module (advanced usage)

  • copy_from_doc (bool) – If True, an identifier is copied from the input source document regardless of whether it matches an edge or not. (advanced usage)
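The interaction of skip_w_regex and skip_on_failure can be sketched without a graph or a database. In this sketch a plain dictionary stands in for the conversion step, and `convert_docs` is a hypothetical function. It assumes that documents matching the regular expression pass through unchanged and that failed conversions are kept unchanged unless skip_on_failure is set.

```python
import re


def convert_docs(docs, mapping, skip_on_failure=False, skip_w_regex=None):
    """Toy conversion loop; mapping stands in for the real lookup."""
    pattern = re.compile(skip_w_regex) if skip_w_regex else None
    for doc in docs:
        current = doc["_id"]
        if pattern and pattern.match(current):
            yield doc  # identifier already matches; no conversion attempted
            continue
        converted = mapping.get(current)
        if converted is None:
            if skip_on_failure:
                continue  # drop documents whose conversion failed
            yield doc  # assumption: otherwise keep the document as-is
            continue
        yield {**doc, "_id": converted}


docs = [{"_id": "CHEBI:15377"}, {"_id": "aspirin"}, {"_id": "unknown"}]
mapping = {"aspirin": "CHEBI:15365"}  # illustrative mapping
kept = list(convert_docs(docs, mapping, skip_on_failure=True, skip_w_regex=r"CHEBI:\d+"))
```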

batch_size = 1000
default_source = '_id'
key_lookup_batch(batchiter)[source]

Look up all keys for ids given in the batch iterator (1 block).

Parameters:
  • batchiter – 1 block of records to look up keys for

travel(input_type, target, doc_lst)[source]

Traverse a graph from a start key type to a target key type using precomputed paths.

Parameters:
  • start – key type to start from

  • target – key type to end at

  • key – key value of type ‘start’

Returns:

class biothings.hub.datatransform.datatransform_mdb.MongoDBEdge(collection_name, lookup, field, weight=1, label=None, check_index=True)[source]

Bases: DataTransformEdge

The MongoDBEdge uses data within a MongoDB collection to convert one identifier to another. The input identifier is used to search a collection. The output identifier values are read out of the matching documents.

Parameters:
  • collection_name (str) – The name of the MongoDB collection.

  • lookup (str) – The field that will match the input identifier in the collection.

  • field (str) – The output identifier field that will be read out of matching documents.

  • weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.
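The roles of lookup and field can be illustrated with an in-memory list of documents in place of a MongoDB collection. This is only a sketch of the idea: `collection_lookup` and the document fields are hypothetical, and only flat (non-nested) fields are handled.

```python
def collection_lookup(collection, ids, lookup, field):
    """Return (input_id, output_id) pairs for documents matching the input ids."""
    wanted = set(ids)
    pairs = []
    for doc in collection:
        # `lookup` is matched against the input ids; `field` is read out.
        if doc.get(lookup) in wanted and field in doc:
            pairs.append((doc[lookup], doc[field]))
    return pairs


# Illustrative documents standing in for a collection.
collection = [
    {"symbol": "CDK2", "entrezgene": "1017"},
    {"symbol": "BRCA1", "entrezgene": "672"},
    {"symbol": "TP53"},  # no output field, so it yields nothing
]
found = collection_lookup(collection, ["BRCA1", "TP53"], "symbol", "entrezgene")
# found == [("BRCA1", "672")]
```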

property collection

getter for the collection member variable

collection_find(id_lst, lookup, field)[source]

Abstract out (as one line) the call to collection.find

edge_lookup(keylookup_obj, id_strct, debug=False)[source]

Follow an edge given a key.

An edge represents a document, and this method uses the data in the edge object to convert one key to another key using exactly one MongoDB lookup.

init_state()[source]

initialize the state of pickleable objects

prepare_collection()[source]

Load the MongoDB collection specified by collection_name.

biothings.hub.datatransform.datatransform

DataTransform Module - IDStruct - DataTransform (superclass)

class biothings.hub.datatransform.datatransform.DataTransform(input_types, output_types, id_priority_list=None, skip_on_failure=False, skip_w_regex=None, skip_on_success=False, idstruct_class=<class 'biothings.hub.datatransform.datatransform.IDStruct'>, copy_from_doc=False, debug=False)[source]

Bases: object

DataTransform class. This class is the public interface for the DataTransform module. Much of the core logic is in the subclass.

Initialize the keylookup object and precompute paths from the start key to all target keys.

The decorator is intended to be applied to the load_data function of an uploader. The load_data function yields documents, which are then post-processed by the decorator's __call__ method, where the ‘id’ key conversion is performed.

Parameters:
  • G – nx.DiGraph (networkx 2.1) configuration graph

  • collections – list of mongodb collection names

  • input_type – key type to start key lookup from

  • output_types – list of all output types to convert to

  • id_priority_list (list(str)) – A priority list of identifiers to sort input and output types by.

  • idstruct_class – IDStruct used to manage/fetch IDs from docs

  • copy_from_doc – if the transform fails using the graph, try to get the value from the document itself when output_type == input_type. No check is performed; it’s a straight copy. If checks are needed (e.g., checking that an ID referenced in the doc actually exists in another collection), nodes with self-loops can be used, so ID resolution is forced to go through these loops to ensure the data exists.

DEFAULT_WEIGHT = 1
batch_size = 1000
debug = False
default_source = '_id'
property id_priority_list

Property method for getting id_priority_list

key_lookup_batch(batchiter)[source]

Core method for looking up all keys in a batch (iterator).

lookup_one(doc)[source]

KeyLookup on document. This method is called as a function call instead of a decorator on a document iterator.

sort_input_by_priority_list(input_types)[source]

Reorder the given input_types to follow a priority list. Inputs not in the priority list should remain in their given order at the end of the list.

sort_output_by_priority_list(output_types)[source]

Reorder the given output_types to follow a priority list. Outputs not in the priority list should remain in their given order at the end of the list.
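The reordering rule described for both sort methods can be sketched with a stable sort. The function `sort_by_priority` is hypothetical and handles plain type names only, not (type, field) tuples.

```python
def sort_by_priority(types, priority_list):
    """Order types by priority; unknown types keep their order at the end."""
    rank = {name: index for index, name in enumerate(priority_list)}
    # sorted() is stable, so all types missing from the priority list share
    # the same rank and stay in their given order after the ranked ones.
    return sorted(types, key=lambda name: rank.get(name, len(priority_list)))


ordered = sort_by_priority(
    ["drugname", "unii", "pubchem", "inchikey"],
    ["inchikey", "unii"],
)
# ordered == ["inchikey", "unii", "drugname", "pubchem"]
```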

class biothings.hub.datatransform.datatransform.DataTransformEdge(label=None)[source]

Bases: object

DataTransformEdge. This class contains information needed to transform one key to another.

Initialize the class.

Parameters:
  • label – A label can be used for debugging purposes.

edge_lookup(keylookup_obj, id_strct, debug=False)[source]

Virtual method for edge lookup. Each edge class is responsible for its own lookup procedures given a keylookup_obj and an id_strct.

Parameters:
  • id_strct – list of tuples (orig_id, current_id)

init_state()[source]

initialize the state of pickleable objects

property logger

getter for the logger property

prepare(state=None)[source]

Prepare class state objects (pickleable objects)

setup_log()[source]

setup the logger member variable

unprepare()[source]

Reset anything that’s not picklable (so self can be pickled). Return what’s been reset as a dict, so self can be restored once pickled.
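The prepare/unprepare pattern can be sketched as follows. The class `EdgeSketch` is hypothetical, and a `threading.Lock` stands in for an unpicklable resource such as a database or API client; the real classes may organize their state differently.

```python
import pickle
import threading


class EdgeSketch:
    """Toy object that sets aside unpicklable state before pickling."""

    def __init__(self, label=None):
        self.label = label
        self.init_state()

    def init_state(self):
        # Only picklable placeholders live here.
        self._state = {"client": None}

    def prepare(self, state=None):
        if state:
            self._state = state  # restore previously saved state
        elif self._state["client"] is None:
            # A lock cannot be pickled, much like a live connection.
            self._state["client"] = threading.Lock()

    def unprepare(self):
        state = self._state
        self.init_state()  # reset to picklable placeholders
        return state


edge = EdgeSketch(label="demo")
edge.prepare()
saved = edge.unprepare()                     # edge is now safe to pickle
restored = pickle.loads(pickle.dumps(edge))  # round trip succeeds
edge.prepare(saved)                          # put the live resource back
```

The copy produced by the round trip starts with empty state and can call `prepare()` to create its own resource.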

class biothings.hub.datatransform.datatransform.IDStruct(field=None, doc_lst=None)[source]

Bases: object

IDStruct - id structure for use with the DataTransform classes. The basic idea is to provide a structure that provides a list of (original_id, current_id) pairs.

Initialize the structure.

Parameters:
  • field – field for documents to use as an initial id (optional)

  • doc_lst – list of documents to use when building an initial list (optional)

add(left, right)[source]

Add an (original_id, current_id) pair to the list

static find(where, ids)[source]

Find all ids in dictionary where

find_left(ids)[source]

Find left values given a list of ids

find_right(ids)[source]

Find the first id found by searching the (_, right) identifiers

get_debug(key)[source]

Get debug information for a given key

property id_lst

Build up a list of current ids

import_debug(other)[source]

Import debug information from the entire IDStruct object

left(key)[source]

Determine if the id (left, _) is registered

lookup(left, right)[source]

Find if a (left, right) pair is already in the list

right(key)[source]

Determine if the id (_, right) is registered

set_debug(left, label, right)[source]

Set (left, right) debug values for the structure

static side(_id, where)[source]

Find if an _id is a key in where

transfer_debug(key, other)[source]

transfer debug information for one key in the IDStruct object

class biothings.hub.datatransform.datatransform.RegExEdge(from_regex, to_regex, weight=1, label=None)[source]

Bases: DataTransformEdge

The RegExEdge allows an identifier to be transformed using a regular expression. POSIX regular expressions are supported.

Parameters:
  • from_regex (str) – The first parameter of the regular expression substitution.

  • to_regex (str) – The second parameter of the regular expression substitution.

  • weight (int) – Weights are used to prefer one path over another. The path with the lowest weight is preferred. The default weight is 1.

edge_lookup(keylookup_obj, id_strct, debug=False)[source]

Transform identifiers using a regular expression substitution.
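A regular expression substitution over identifiers can be sketched with Python's `re.sub`, taking from_regex and to_regex as its first two arguments. That the edge is implemented this way is an assumption, the pattern syntax is that of Python's `re` module, and `regex_transform` and the sample ids are illustrative.

```python
import re


def regex_transform(ids, from_regex, to_regex):
    """Return (original_id, transformed_id) pairs."""
    return [(original, re.sub(from_regex, to_regex, original)) for original in ids]


# Strip a 'CHEBI:' prefix where present; ids without it are unchanged.
pairs = regex_transform(["CHEBI:15365", "15377"], r"^CHEBI:", "")
# pairs == [("CHEBI:15365", "15365"), ("15377", "15377")]
```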

biothings.hub.datatransform.datatransform.nested_lookup(doc, field)[source]

Performs a nested lookup of doc using a period (.) delimited list of fields. This is a nested dictionary lookup.

Parameters:
  • doc – document to perform lookup on

  • field – period delimited list of fields
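A period-delimited lookup over nested dictionaries can be sketched as below. The function `nested_lookup_sketch` is a stand-in, and returning None for a missing field is an assumption rather than documented behavior.

```python
def nested_lookup_sketch(doc, field):
    """Walk nested dictionaries following a 'a.b.c' style field path."""
    value = doc
    for key in field.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # assumption: a missing field yields None
        value = value[key]
    return value


doc = {"_id": "x1", "drugbank": {"name": "aspirin", "ids": {"inchi_key": "ABC"}}}
top = nested_lookup_sketch(doc, "_id")
deep = nested_lookup_sketch(doc, "drugbank.ids.inchi_key")
missing = nested_lookup_sketch(doc, "drugbank.missing.key")
```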

biothings.hub.datatransform.histogram

DataTransform Histogram class - track keylookup statistics

class biothings.hub.datatransform.histogram.Histogram[source]

Bases: object

Histogram - track keylookup statistics

update_edge(vert1, vert2, size)[source]

Update the edge histogram

update_io(input_type, output_type, size)[source]

Update the input/output histogram
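The two update methods suggest a pair of counters, one keyed by edge and one keyed by input/output type. A minimal sketch using `collections.Counter` follows; the class name and attribute names are hypothetical, and the real class may store its statistics differently.

```python
from collections import Counter


class HistogramSketch:
    """Toy tracker for lookup statistics."""

    def __init__(self):
        self.edges = Counter()  # (vert1, vert2) -> number of ids converted
        self.io = Counter()     # (input_type, output_type) -> number of ids

    def update_edge(self, vert1, vert2, size):
        self.edges[(vert1, vert2)] += size

    def update_io(self, input_type, output_type, size):
        self.io[(input_type, output_type)] += size


hist = HistogramSketch()
hist.update_edge("symbol", "ensembl", 40)
hist.update_edge("symbol", "ensembl", 2)
hist.update_io("symbol", "uniprot", 42)
```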