biothings.hub.autoupdate

biothings.hub.autoupdate.dumper

class biothings.hub.autoupdate.dumper.BiothingsDumper(*args, **kwargs)[source]

Bases: HTTPDumper

This dumper is used to keep a BioThings API up-to-date. BioThings data is available either as an Elasticsearch snapshot (for full updates) or as a collection of diff files (for incremental updates). The dumper will either download incremental updates and apply the diffs, or trigger an Elasticsearch restore if the latest version is a full update. It can also be configured with precedence rules: when both a full and an incremental update are available, rules can be set so the full update is preferred over the incremental one (size can also be considered when selecting the preferred method).

AUTO_UPLOAD = False
AWS_ACCESS_KEY_ID = None
AWS_SECRET_ACCESS_KEY = None
SRC_NAME = None
SRC_ROOT_FOLDER = None
TARGET_BACKEND = None
VERSION_URL = None
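
For illustration, a concrete dumper would subclass this class and fill in the attributes above. Everything in the sketch below (names, URL, folder) is a hypothetical placeholder, not a real endpoint:

    from biothings.hub.autoupdate.dumper import BiothingsDumper

    class MyGeneDumper(BiothingsDumper):
        # Hypothetical values; a real hub would point these at its own
        # source name, data folder and release listing.
        SRC_NAME = "mygene"
        SRC_ROOT_FOLDER = "/data/hub/datasources/mygene"
        VERSION_URL = "https://example.com/mygene.info/versions.json"
        TARGET_BACKEND = "mygene_backend"
        # Leave both keys as None for anonymous downloads; set both to
        # enable authenticated (auth_download) access.
        AWS_ACCESS_KEY_ID = None
        AWS_SECRET_ACCESS_KEY = None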
anonymous_download(remoteurl, localfile, headers=None)[source]
auth_download(bucket_name, key, localfile, headers=None)[source]
property base_url
check_compat(build_meta)[source]
choose_best_version(versions)[source]

Out of all compatible versions, choose the best:

1. choose incremental vs. full according to preferences
2. version must be the highest (most up-to-date)
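
A minimal sketch of that selection logic, assuming each version is a dict carrying a 'type' ('full' or 'incremental') and a sortable 'build_version', with a hypothetical prefer_full flag standing in for the configured preferences:

    def choose_best(versions, prefer_full=False):
        # Rule 1: keep only the preferred update type when both are available,
        # falling back to all versions if none matches the preference.
        preferred_type = "full" if prefer_full else "incremental"
        candidates = [v for v in versions if v.get("type") == preferred_type] or versions
        # Rule 2: among the candidates, take the highest (most recent) build.
        return max(candidates, key=lambda v: v["build_version"])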

compare_remote_local(remote_version, local_version, orig_remote_version, orig_local_version)[source]
create_todump_list(force=False, version='latest', url=None)[source]

Fill the self.to_dump list with dict("remote": remote_path, "local": local_path) elements. This is the todo list for the dumper. It's a good place to check whether a file actually needs to be downloaded. If 'force' is True, though, all files will be considered for download.
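
For reference, a populated to_dump list has this shape (the URL and path below are made up):

    # Each element pairs a remote path with its local destination.
    to_dump = [
        {
            "remote": "https://example.com/mygene.info/20171003.json.gz",
            "local": "/data/hub/datasources/mygene/20171003.json.gz",
        },
    ]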

download(remoteurl, localfile, headers=None)[source]

Download 'remoteurl' to the local location defined by 'localfile'. Return relevant information about the remote file (depends on the actual client).

find_update_path(version, backend_version=None)[source]

Explore available versions and find the path to update the hub up to "version", starting from the given backend_version (typically the current version found in the ES index). If backend_version is None (typically when there is no index yet), a complete path will be returned, from the last compatible "full" release up to the latest "diff" update. The return value is a list of dicts, where each dict is a build metadata element containing information about one update (see versions.json); the order of the list is the order in which the updates should be performed.
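
A hedged usage sketch (dumper stands for an already-configured instance; the version strings are made up):

    # Walk the computed update path, oldest update first.
    path = dumper.find_update_path("20200906", backend_version="20200823")
    for build_meta in path:
        print(build_meta["build_version"], build_meta["type"])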

async get_target_backend()[source]

Example:

[{
    'host': 'es6.mygene.info:9200',
    'index': 'mygene_allspecies_20200823_ufkwdv79',
    'index_alias': 'mygene_allspecies',
    'version': '20200906',
    'count': 38729977
}]

async info(version='latest')[source]

Display version information (release note, etc.) for the given version:

{
    "info": ...,
    "release_note": ...
}

load_remote_json(url)[source]
post_dump(*args, **kwargs)[source]

Placeholder to add a custom process once the whole resource has been dumped. Optional.

prepare_client()[source]

Depending on the presence of credentials, inject authentication into client.get()
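
A minimal sketch of the idea, not the actual implementation; real S3 access would require proper request signing rather than plain headers:

    import functools
    import requests

    def make_client(access_key=None, secret_key=None, auth_headers=None):
        # Hypothetical helper: pre-bind headers onto get() only when
        # credentials are configured, otherwise stay anonymous.
        client = requests.Session()
        if access_key and secret_key and auth_headers:
            client.get = functools.partial(client.get, headers=auth_headers)
        return client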

remote_is_better(remotefile, localfile)[source]

Determine whether the remote file is better (i.e., should be downloaded) than the local one.

Override if necessary.
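
A minimal override sketch, with a deliberately naive policy (the real method may compare versions or timestamps instead):

    import os
    from biothings.hub.autoupdate.dumper import BiothingsDumper

    class MyDumper(BiothingsDumper):
        def remote_is_better(self, remotefile, localfile):
            # Naive hypothetical policy: re-download whenever the local
            # copy is missing or empty.
            return not os.path.exists(localfile) or os.path.getsize(localfile) == 0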

async reset_target_backend()[source]
property target_backend
async versions()[source]

Display all available versions.

Example:

[{
    'build_version': '20171003',
    'url': 'https://biothings-releases.s3.amazonaws.com:443/mygene.info/20171003.json',
    'release_date': '2017-10-06T11:58:39.749357',
    'require_version': None,
    'target_version': '20171003',
    'type': 'full'
}, ...]

biothings.hub.autoupdate.uploader

class biothings.hub.autoupdate.uploader.BiothingsUploader(*args, **kwargs)[source]

Bases: BaseSourceUploader

db_conn_info is a database connection info tuple (host, port) used to fetch/store information about the datasource's state.

AUTO_PURGE_INDEX = False
SYNCER_FUNC = None
TARGET_BACKEND = None
async apply_diff(build_meta, job_manager, **kwargs)[source]
clean_archived_collections()[source]
get_snapshot_repository_config(build_meta)[source]

Return a (name, config) tuple from build_meta, where name is the repository name and config is the repository configuration.
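
The returned pair mirrors an Elasticsearch snapshot repository definition; the values below are made up:

    # Hypothetical (name, config) tuple as this method might return it.
    repo_name = "biothings_releases"
    repo_config = {
        "type": "s3",
        "settings": {
            "bucket": "biothings-snapshots",
            "base_path": "mygene.info",
            "readonly": True,
        },
    }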

async load(*args, **kwargs)[source]

Main resource load process. Reads data from doc_c in chunks of size batch_size. steps defines the different processes used to load the resource:

- "data": will store actual data into single collections
- "post": will perform post-data-load operations
- "master": will register the master document in src_master

name = None
async restore_snapshot(build_meta, job_manager, **kwargs)[source]
property syncer_func
property target_backend
async update_data(batch_size, job_manager, **kwargs)[source]

Look in data_folder and either restore a snapshot to ES or apply a diff to the current ES index.
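
The dispatch it performs could be sketched as follows (the 'type' key and the two callables are assumptions based on the methods above):

    async def update_data_sketch(build_meta, restore_snapshot, apply_diff):
        # Hypothetical dispatch: a "full" release restores an ES snapshot,
        # anything else is treated as an incremental diff to apply.
        if build_meta.get("type") == "full":
            await restore_snapshot(build_meta)
        else:
            await apply_diff(build_meta)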