biothings.hub.datainspect

biothings.hub.datainspect.inspector

exception biothings.hub.datainspect.inspector.InspectorError[source]

Bases: Exception

class biothings.hub.datainspect.inspector.InspectorManager(upload_manager, build_manager, *args, **kwargs)[source]

Bases: BaseManager

clean_stale_status()[source]

During startup, search for actions in progress that would have been interrupted and change their state to “canceled”. For example, some downloading processes could have been interrupted; at startup, their “downloading” status should be changed to “canceled” to reflect the actual state of these datasources. This must be overridden in a subclass.
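The cleanup described above can be illustrated with a minimal, standalone sketch. It does not use the biothings API: the `cancel_stale` helper, the `IN_PROGRESS` set, and the in-memory record layout are all hypothetical, and a real subclass would presumably update whatever store the hub uses to track status.

```python
from typing import Dict, List, Set

# Hypothetical set of "in progress" states; only "downloading" is named in
# the documentation above, "inspecting" is an assumption for illustration.
IN_PROGRESS: Set[str] = {"downloading", "inspecting"}


def cancel_stale(records: List[Dict[str, str]]) -> int:
    """Mark interrupted in-progress records as canceled.

    Mutates the records in place and returns how many were changed.
    """
    changed = 0
    for record in records:
        if record.get("status") in IN_PROGRESS:
            record["status"] = "canceled"
            changed += 1
    return changed


if __name__ == "__main__":
    records = [
        {"_id": "clinvar", "status": "downloading"},
        {"_id": "dbsnp", "status": "success"},
    ]
    print(cancel_stale(records), records)
```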

flatten(data_provider, mode=('type', 'stats'), do_validate=True)[source]
get_backend_provider_info(data_provider)[source]
inspect(data_provider, mode='type', batch_size=10000, limit=None, sample=None, **kwargs)[source]

Inspect the given data provider, which can be:

  • a backend definition (see bt.hub.databuild.create_backend for supported formats), e.g. “merged_collection” or (“src”, “clinvar”)

  • or a callable yielding documents

Mode:

  • “type”: will inspect and report the type map found in the data (internal/non-standard format)

  • “mapping”: will inspect and return a map compatible with later Elasticsearch mapping generation (see bt.utils.es.generate_es_mapping)

  • “stats”: will inspect and report types plus different counts found in the data, giving a detailed overview of the volume of each field and sub-field

  • “jsonschema”: same as “type”, but the result is formatted according to the JSON Schema standard

Other parameters:

  • limit: when set to an integer, only that many documents are inspected.

  • sample: combined with limit; for each document, if random.random() <= sample (a float), the document is inspected. This option makes it possible to inspect only a sample of the data.
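How limit and sample interact can be sketched with a small standalone generator. This is not the library's implementation: `sample_documents` and its `rng` argument are hypothetical names, and the sketch assumes that limit counts the documents actually inspected rather than the documents seen.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Optional


def sample_documents(
    docs: Iterable[Dict[str, Any]],
    limit: Optional[int] = None,
    sample: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield the documents that would be inspected.

    Assumption: `limit` caps the number of inspected (yielded) documents.
    """
    rng = rng or random.Random()
    inspected = 0
    for doc in docs:
        if limit is not None and inspected >= limit:
            break
        # Keep the document only if the random draw is <= sample,
        # mirroring the rule stated in the documentation.
        if sample is not None and rng.random() > sample:
            continue
        inspected += 1
        yield doc


if __name__ == "__main__":
    docs = [{"_id": i} for i in range(1000)]
    picked = list(sample_documents(docs, limit=5, sample=0.1))
    print([d["_id"] for d in picked])
```

With sample left unset, the generator reduces to taking the first limit documents; with a small sample value, the inspected documents are spread more widely across the input.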

setup_log()[source]

Set up and return a logger instance.

biothings.hub.datainspect.inspector.inspect_data(backend_provider, ids, mode, pre_mapping, **kwargs)[source]