biothings.web.query

biothings.web.query.builder

Biothings Query Builder

Translate the BioThings query language into that of the underlying database. The interface consists of a query term (q) and query options.

Depending on the underlying database choice, the data type of the query term and query options vary. At a minimum, a query builder should support:

q: str, a query term. When not provided, always perform a match-all query. When provided as an empty string, always match none.

options: dotdict, optional query options.

scopes: List[str], the fields in which to look for the query term.
The meaning of scopes being an empty list, None, or not provided is determined by the specific class implementation, or is left undefined.

_source: List[str], fields to return in the result.
size: int, maximum number of hits to return.
from_: int, starting index of result to return.
sort: str, customized sort keys for result list.

aggs: str, customized aggregation string.
post_filter: str, when provided, the search hits are filtered after the aggregations are calculated.
facet_size: int, maximum number of agg results.
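The contract above can be sketched in a few lines. The following is a minimal, hypothetical builder (`build_body` is not part of biothings) that maps q and a handful of the options onto an Elasticsearch-style request body. The body keys are standard Elasticsearch search parameters; using a `query_string` query for a non-empty term is an assumption made purely for illustration.

```python
def build_body(q=None, **options):
    """Hypothetical sketch: map q and a few options to an ES-style request body."""
    if q is None:
        query = {"match_all": {}}    # q not provided: match everything
    elif q == "":
        query = {"match_none": {}}   # empty string: match nothing
    else:
        # Assumed default for illustration only.
        query = {"query_string": {"query": q}}

    body = {"query": query}
    if "_source" in options:
        body["_source"] = list(options["_source"])
    if "size" in options:
        body["size"] = int(options["size"])
    if "from_" in options:
        # The trailing underscore avoids clashing with the Python keyword.
        body["from"] = int(options["from_"])
    if "sort" in options:
        body["sort"] = [options["sort"]]
    return body


print(build_body())
print(build_body("", size=5))
print(build_body("cdk2", _source=["symbol"], from_=10))
```

The point of the sketch is the distinction between a missing q and an empty q, which is easy to lose when a handler passes query parameters through unchanged.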

class biothings.web.query.builder.ESQueryBuilder(user_query: str | ESUserQuery = None, scopes_regexs: Iterable[Tuple[str | Pattern, str | Iterable]] = None, scopes_default: Tuple[str] = ('_id',), pattern_default: Tuple[str | Pattern, str | Iterable] = (re.compile('(?P<scope>[^:]+):(?P<term>[\\W\\w]+)'), ()), allow_random_query: bool = True, allow_nested_query: bool = False, metadata: BiothingsMetadata = None, formatter: ESResultFormatter = None)[source]

Bases: object

Build an Elasticsearch query with elasticsearch-dsl.

ES Query Builder Architecture

--------------------------↓↓↓--------------------------
_build_one
(dispatch based on scopes, then apply_extras(..))
------------↓↓↓------------------------↓↓↓-------------
_build_string_query      |    _build_match_query
(__all__, userquery,..)  |   (compound match query)
------------↓↓↓------------------------↓↓↓-------------
default_string_query     |    default_match_query
(map to ES query string) |   (map to ES match query)

apply_extras(search, options)[source]: Process non-query options and customize their behaviors. Customized aggregation syntax string is translated here.

build(q=None, **options)[source]

Build a query according to q and options. This is the public method called by API handlers.

Regarding scopes:
    scopes: [str] nonempty, match query.
    scopes: NoneType, or [], no scope, so query string query.

Additionally support these options:
    explain: include es scoring information
    userquery: customized function to interpret q

Additional keywords are passed through as es keywords, for example: 'explain', 'version' …

Multi-search is supported when q is a list. All queries are built individually and then sent in one request.

default_match_query(q, scopes, options)[source]: Override this to customize default match query. By default it implements a multi_match query.

default_string_query(q, options)[source]: Override this to customize default string query. By default it implements a query string query.
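The dispatch described for build() and the two default_* hooks can be pictured with plain dictionaries. This is a sketch under stated assumptions, not the class's code: `build_one` and `build_sketch` are hypothetical names, and the exact parameters the real builder passes to the multi_match and query string queries are not shown in this reference.

```python
def build_one(q, scopes=None):
    """Hypothetical dispatch: non-empty scopes -> match query, else query string."""
    if scopes:
        # Compound match across the requested fields.
        return {"multi_match": {"query": q, "fields": list(scopes)}}
    # None or [] means no scope, so fall back to a query string query.
    return {"query_string": {"query": q}}


def build_sketch(q, scopes=None):
    """A list of terms yields one query per term (multi-search)."""
    if isinstance(q, list):
        return [build_one(term, scopes) for term in q]
    return build_one(q, scopes)


print(build_sketch("cdk2", scopes=["symbol", "alias"]))
print(build_sketch("symbol:cdk2"))
print(build_sketch(["1017", "1018"], scopes=["_id"]))
```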

class biothings.web.query.builder.ESScrollID(seq: object)[source]: Bases: UserString

class biothings.web.query.builder.ESUserQuery(path)[source]

Bases: object

get_filter(named_query)[source]

get_query(named_query, **kwargs)[source]

has_filter(named_query)[source]

has_query(named_query)[source]

property logger

class biothings.web.query.builder.Group(term, scopes)

Bases: tuple

Create new instance of Group(term, scopes)

scopes: Alias for field number 1

term: Alias for field number 0

class biothings.web.query.builder.MongoQueryBuilder(default_scopes=('_id',))[source]

Bases: object

build(q, **options)[source]

class biothings.web.query.builder.QStringParser(default_scopes: Tuple[str] = None, patterns: Iterable[Tuple[str | Pattern, str | Iterable]] = None, default_pattern: Tuple[str | Pattern, str | Iterable] = (re.compile('(?P<scope>[^:]+):(?P<term>[\\W\\w]+)'), ()), gpnames: Tuple[str] = None, formatter: ESResultFormatter = None)[source]

Bases: object

parse(query: str, metadata: BiothingsMetadata) → Query[source]

Parsing method for the QStringParser object

Inputs:
    query: string query to search the elasticsearch instance
    metadata: BiothingsMetadata object. Typically the BiothingsESMetadata object defined in the namespace configuration

Flow:

1) It first attempts to load the metadata fields associated with the endpoint we’re querying against. The cache for the BiothingsESMetadata object may never have been refreshed, due to the asynchronous nature of the connection, so we can’t assume that the data will be loaded.

2) We then iterate over the regex patterns provided by the configuration. It greedily searches the patterns supplied via <self.patterns> up to the first match in the list. The search breaks after the first match, so the order of self.patterns is important when setting the configuration.

3) If a match is found, we then attempt to extract the two main matching groups from the expression. The parser class defines the gpname property, a namedtuple with the following structure:

>>> Group = namedtuple("Group", ("term", "scopes"))

The regex patterns typically follow a structure of roughly <scope>:<term>, as in the default pattern, with the <term> group referring to the search term and the <scope> group matching the fields to search against. The matched regex pattern attempts to find these named groups and pull them out. However, neither term nor scope is required, so we use an order of precedence for storing the term_query and scope_fields:

<structure> (highest priority[variable name] << higher priority << lower priority << lowest priority)

<term> (regex term[self.gpname.term] << raw input query[query])

<scope> (regex_scope[self.gpname.scopes] << regex pattern[pattern_fields] << default scope[self.default_scopes])

Using this priority structure, we build the Query object. This is also a named tuple with the exact same structure as the previously defined Group

>>> Query = namedtuple("Query", ("term", "scopes"))

4) After exiting the loop, we perform the metadata check. If we have metadata fields to validate against, we check whether the generated scope fields are a subset of the metadata fields. If they are, we do nothing and continue with the same query_object instance. If they are not, we reset the query_object to the default.

5) The final check is to see whether we have a defined query_object. If no regex pattern matched the query, we simply set the query_object to the default instance.

6) We return the constructed Query instance to the caller.
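The flow above can be condensed into a short, self-contained sketch. It is an approximation written from the description, not the parser's code: `parse_sketch` and `metadata_fields` are hypothetical names, the use of `re.fullmatch` is an assumption, and the named groups follow the default pattern shown in the class signature.

```python
import re
from collections import namedtuple

Query = namedtuple("Query", ("term", "scopes"))


def parse_sketch(query, patterns, default_scopes=("_id",), metadata_fields=None):
    """Hypothetical walk-through of steps 2 to 6 of the flow."""
    default = Query(term=query, scopes=list(default_scopes))
    result = None

    for regex, pattern_fields in patterns:
        match = re.fullmatch(regex, query)  # matching strategy is an assumption
        if match is None:
            continue
        groups = match.groupdict()
        # Term precedence: regex group, then the raw input query.
        term = groups.get("term") or query
        # Scope precedence: regex group, then pattern fields, then defaults.
        if groups.get("scope"):
            scopes = [groups["scope"]]
        elif pattern_fields:
            scopes = [pattern_fields] if isinstance(pattern_fields, str) else list(pattern_fields)
        else:
            scopes = list(default_scopes)
        result = Query(term, scopes)
        break  # first match wins, so pattern order matters

    # Step 4: scopes must be a subset of the known metadata fields.
    if result is not None and metadata_fields:
        if not set(result.scopes) <= set(metadata_fields):
            result = default

    # Step 5: no pattern matched, fall back to the default.
    return result if result is not None else default


patterns = [(re.compile(r"(?P<scope>[^:]+):(?P<term>[\W\w]+)"), ())]
print(parse_sketch("symbol:cdk2", patterns, metadata_fields={"symbol", "_id"}))
print(parse_sketch("bogus:cdk2", patterns, metadata_fields={"symbol", "_id"}))
print(parse_sketch("cdk2", patterns))
```

The second call shows the reset in step 4: the pattern matches, but the extracted scope is not a known field, so the raw query and the default scopes are used instead.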

class biothings.web.query.builder.Query(term, scopes)

Bases: tuple

Create new instance of Query(term, scopes)

scopes: Alias for field number 1

term: Alias for field number 0

exception biothings.web.query.builder.RawQueryInterrupt(data)[source]: Bases: Exception

class biothings.web.query.builder.SQLQueryBuilder(tables, default_scopes=('id',), default_limit=10)[source]

Bases: object

build(q, **options)[source]

biothings.web.query.engine

Search Execution Engine

Take the output of the query builder and feed it to the corresponding database engine. This stage typically resolves the database destination from a biothing_type and applies presentation and/or networking parameters.

Example:

>>> from biothings.web.query import ESQueryBackend
>>> from elasticsearch import Elasticsearch
>>> from elasticsearch.dsl import Search

>>> backend = ESQueryBackend(Elasticsearch())
>>> backend.execute(Search().query("match", _id="1017"))

>>> _["hits"]["hits"][0]["_source"].keys()
dict_keys(['taxid', 'symbol', 'name', ... ])

class biothings.web.query.engine.AsyncESQueryBackend(client, indices=None, scroll_time='1m', scroll_size=1000, multisearch_concurrency=5, total_hits_as_int=True)[source]

Bases: ESQueryBackend

Execute an Elasticsearch query

async execute(query, **options)[source]

Execute the corresponding query. Must return an awaitable. May override to add more. Handle uncaught exceptions.

Options:
    fetch_all: also return a scroll_id for this query (default: false)
    biothing_type: which type’s corresponding indices to query (default in config.py)
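One way to picture the biothing_type option is as a lookup into the indices mapping given to the backend, followed by an optional adjustment hook in the spirit of adjust_index. The sketch below is an assumption about the shape of that mapping (a dict with a None key holding the default); `resolve_index` and the index names are hypothetical.

```python
def resolve_index(indices, biothing_type=None, adjust=None):
    """Hypothetical lookup: biothing_type -> index pattern, then an optional hook."""
    index = indices[biothing_type]  # the None key holds the assumed default
    if adjust is not None:
        index = adjust(index)
    return index


# Illustrative mapping; these index names are made up.
indices = {None: "genes_current", "gene": "genes_current", "variant": "variants_current"}

print(resolve_index(indices))
print(resolve_index(indices, "variant"))
print(resolve_index(indices, "variant", adjust=lambda name: name + "_hg38"))
```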

class biothings.web.query.engine.ESQueryBackend(client, indices=None)[source]

Bases: object

adjust_index(original_index, query, **options)[source]: Override to get specific ES index.

execute(query, **options)[source]

exception biothings.web.query.engine.EndScrollInterrupt[source]: Bases: ResultInterrupt

class biothings.web.query.engine.MongoQueryBackend(client, collections)[source]

Bases: object

execute(query, **options)[source]

exception biothings.web.query.engine.RawResultInterrupt(data)[source]: Bases: ResultInterrupt

exception biothings.web.query.engine.ResultInterrupt(data)[source]: Bases: Exception

class biothings.web.query.engine.SQLQueryBackend(client)[source]

Bases: object

execute(query, **options)[source]

biothings.web.query.formatter

Search Result Formatter

Transform the raw query result into consumption-friendly structures by possibly removing from, adding to, and/or flattening the raw response from the database engine for one or more individual queries.

class biothings.web.query.formatter.Doc(dict=None, /, **kwargs)[source]

Bases: FormatterDict

{
    "_id": ... ,
    "_score": ... ,
    ...
}

class biothings.web.query.formatter.ESResultFormatter(licenses=None, license_transform=None, field_notes=None, excluded_keys=())[source]

Bases: ResultFormatter

Class to transform the results of the Elasticsearch query generated earlier in the pipeline. It contains the functions that extract the final document from the Elasticsearch query result, as well as the code to flatten a document, among other transformations.

transform(response, **options)[source]

Transform the query response to a user-friendly structure. Mainly deconstruct the elasticsearch response structure and hand over to transform_doc to apply the options below.

Options:

# generic transformations for dictionaries
# ------------------------------------------
dotfield: flatten a dictionary using dotfield notation
_sorted: sort keys alphabetically in ascending order
always_list: ensure the fields specified are lists or wrapped in a list
allow_null: ensure the fields specified are present in the result,
    the fields may be provided as type None or [].

# additional multisearch result transformations
# ------------------------------------------------
template: base dict for every result, for example: {"success": true}
templates: a different base for every result, replaces the setting above
template_hit: a dict to update every positive hit result, default: {"found": true}
template_miss: a dict to update every query with no hit, default: {"found": false}

# document format and content management
# ---------------------------------------
biothing_type: result document type to apply customized transformation.
    for example, add license field based on document type's metadata.
one: return the individual document if there's only one hit. ignore this setting
    if there are multiple hits. return None if there is no hit. this option is
    not effective when aggregation results are also returned in the same query.
native: bool, if the returned result is in python primitive types.
version: bool, if _version field is kept.
score: bool, if _score field is kept.
with_total: bool, if True, the response will include max_total documents,
    and a message to tell how many query terms return greater than the max_size of hits.
    The default is False.
    An example when with_total is True:
    {
        'max_total': 100,
        'msg': '12 query terms return > 1000 hits, using from=1000 to retrieve the remaining hits',
        'hits': [...]
    }
jmespath: passed as "<target_field>|<jmes_query>" to transform any target field in hit using
    jmespath query syntax.
jmespath_exclude_empty: bool, if True, exclude the hit from the hits list if the transformed value
    is empty (e.g. [] or None). Default is False.
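To make the first two generic options concrete, here is a minimal sketch of what dotfield flattening and always_list wrapping amount to. The helpers `flatten` and `always_list` are hypothetical stand-ins, not the formatter's functions, and the sketch leaves list values untouched, which may differ from how the library handles lists. The sample document is invented.

```python
def flatten(doc, prefix=""):
    """Hypothetical dotfield flattening: nested dict keys become dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value  # lists are kept as-is in this sketch
    return flat


def always_list(doc, fields):
    """Wrap the named fields in a list unless they already are one."""
    for field in fields:
        if field in doc and not isinstance(doc[field], list):
            doc[field] = [doc[field]]
    return doc


# Made-up document for illustration.
hit = {"_id": "x1", "name": "demo", "pos": {"chr": "1", "start": 100}}
print(always_list(flatten(hit), ["name"]))
```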

transform_aggs(res)[source]

Transform the aggregations field and make it more presentable. For example, these are the fields of a two level nested aggregations:

aggregations.<term>.doc_count_error_upper_bound
aggregations.<term>.sum_other_doc_count
aggregations.<term>.buckets.key
aggregations.<term>.buckets.key_as_string
aggregations.<term>.buckets.doc_count
aggregations.<term>.buckets.<nested_term>.* (recursive)

After the transformation, we’ll have:

facets.<term>._type
facets.<term>.total
facets.<term>.missing
facets.<term>.other
facets.<term>.terms.count
facets.<term>.terms.term
facets.<term>.terms.<nested_term>.* (recursive)

Note the first level key change doesn’t happen here.
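A rough sketch of such a renaming is shown below. The pairing of input and output fields is an assumption inferred from the two listings (for example doc_count to count, key to term, sum_other_doc_count to other), as is computing total by summing bucket counts and deriving missing from doc_count_error_upper_bound. `transform_aggs_sketch` is a hypothetical name, and the sample values are invented.

```python
def transform_aggs_sketch(aggs):
    """Hypothetical renaming of ES terms aggregations into a facets-like shape."""
    facets = {}
    for name, agg in aggs.items():
        terms = []
        for bucket in agg.get("buckets", []):
            entry = {
                # Prefer the string form of the key when ES provides one.
                "term": bucket.get("key_as_string", bucket.get("key")),
                "count": bucket.get("doc_count", 0),
            }
            # Nested aggregations are dict values that carry their own buckets.
            nested = {
                k: v for k, v in bucket.items()
                if isinstance(v, dict) and "buckets" in v
            }
            entry.update(transform_aggs_sketch(nested))
            terms.append(entry)
        facets[name] = {
            "_type": "terms",
            "terms": terms,
            "total": sum(t["count"] for t in terms),                  # assumed
            "other": agg.get("sum_other_doc_count", 0),               # assumed
            "missing": agg.get("doc_count_error_upper_bound", 0),     # assumed
        }
    return facets


aggs = {
    "taxid": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 5,
        "buckets": [{"key": 9606, "doc_count": 3}, {"key": 10090, "doc_count": 2}],
    }
}
print(transform_aggs_sketch(aggs))
```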

transform_hit(path, obj, doc, options)[source]

Transform an individual search hit result. By default add licenses for the configured fields.

If a source has a license URL in its metadata, add a “_license” key to the corresponding fields. Field aliases in dot-field notation are supported.

If we have the following settings in web_config.py

LICENSE_TRANSFORM = {
    "exac_nontcga": "exac",
    "snpeff.ann": "snpeff"
},

Then GET /v1/variant/chr6:g.38906659G>A should look like:
{
    "exac": {
        "_license": "http://bit.ly/2H9c4hg",
        "af": 0.00002471},
    "exac_nontcga": {
        "_license": "http://bit.ly/2H9c4hg",         <---
        "af": 0.00001883},
    ...
}
And GET /v1/variant/chr14:g.35731936G>C could look like:
{
    "snpeff": {
        "_license": "http://bit.ly/2suyRKt",
        "ann": [{"_license": "http://bit.ly/2suyRKt",  <---
                 "effect": "intron_variant",
                 "feature_id": "NM_014672.3",
                 ...},
                {"_license": "http://bit.ly/2suyRKt",  <---
                 "effect": "intron_variant",
                 "feature_id": "NM_001256678.1",
                 ...},
                ...]
    },
    ...
}

The arrow marked fields would not exist without the setting lines.

This method can be overridden to add more transformations in a customized Formatter class.

transform_mapping(mapping, prefix=None, search=None)[source]: Transform Elasticsearch mapping definition to user-friendly field definitions metadata results.

trasform_jmespath(path: str, obj, doc, options) → None[source]

Transform any target field in obj using jmespath query syntax. The jmespath query parameter value should have the pattern of “<target_list_fieldname>|<jmespath_query_expression>”. <target_list_fieldname> can be any sub-field of the input obj using dot notation, e.g. “aaa.bbb”. If empty or “.”, it will be the root field.

The flexible jmespath syntax allows filtering or transforming any nested objects in the input obj on the fly. The output of the jmespath transformation will then be used to replace the original target field value.

Examples

filtering an array sub-field:
    jmespath=tags|[?name==`Metadata`]  # filter tags array by name field
    jmespath=aaa.bbb|[?(sub_a==`val_a`||sub_a==`val_aa`)%26%26sub_b==`val_b`]  # use %26%26 for &&

obj: the object to be transformed, which corresponds to the current path
doc: the whole document we are traversing in the upstream _transform_hit method,
    passed here in case we need to make changes to the whole document.
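The parameter format lends itself to a small parsing sketch. The helper below is hypothetical (`split_jmespath_param` is not a biothings function) and assumes the value is split on the first "|" only, since a jmespath expression may itself contain pipes. It also shows that %26%26 is simply the URL-encoded form of &&, decoded here with the standard library.

```python
from urllib.parse import unquote


def split_jmespath_param(value):
    """Hypothetical parser for "<target_list_fieldname>|<jmespath_query_expression>"."""
    target, sep, jmes_query = value.partition("|")  # split on the first pipe only
    if not sep:
        raise ValueError("expected '<target_field>|<jmes_query>'")
    target = target.strip()
    if target in ("", "."):
        target = ""  # empty or "." stands for the root field
    return target, jmes_query


print(split_jmespath_param("tags|[?name==`Metadata`]"))
print(split_jmespath_param(".|[0]"))

# A raw URL value needs decoding before it is a valid jmespath expression.
raw = "aaa.bbb|[?sub_a==`val_a`%26%26sub_b==`val_b`]"
print(split_jmespath_param(unquote(raw)))
```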

static trasform_jmespath_obj(obj, parent_path: str, target_field: str, doc, jmes_query, jmespath_exclude_empty=False) → None[source]

static traverse(obj, leaf_node=False)

Output path-dictionary pairs. For example, input:
{
    'exac_nontcga': {'af': 0.00001883},
    'gnomad_exome': {'af': {'af': 0.0000119429, 'af_afr': 0.000123077}},
    'snpeff': {'ann': [{'effect': 'intron_variant',
                        'feature_id': 'NM_014672.3'},
                       {'effect': 'intron_variant',
                        'feature_id': 'NM_001256678.1'}]}
}
will be translated to a generator:
(
    ("exac_nontcga", {"af": 0.00001883}),
    ("gnomad_exome.af", {"af": 0.0000119429, "af_afr": 0.000123077}),
    ("gnomad_exome", {"af": {"af": 0.0000119429, "af_afr": 0.000123077}}),
    ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_014672.3"}),
    ("snpeff.ann", {"effect": "intron_variant", "feature_id": "NM_001256678.1"}),
    ("snpeff.ann", [{ ... },{ ... }]),
    ("snpeff", {"ann": [{ ... },{ ... }]}),
    ('', {'exac_nontcga': {...}, 'gnomad_exome': {...}, 'snpeff': {...}})
)
or when traversing leaf nodes:
(
    ('exac_nontcga.af', 0.00001883),
    ('gnomad_exome.af.af', 0.0000119429),
    ('gnomad_exome.af.af_afr', 0.000123077),
    ('snpeff.ann.effect', 'intron_variant'),
    ('snpeff.ann.feature_id', 'NM_014672.3'),
    ('snpeff.ann.effect', 'intron_variant'),
    ('snpeff.ann.feature_id', 'NM_001256678.1')
)
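A post-order generator along the following lines reproduces the two outputs listed above. It is a sketch written to match the documented behavior (`traverse_sketch` is a hypothetical name), not necessarily how the static method is implemented.

```python
def traverse_sketch(obj, leaf_node=False):
    """Yield (dotted_path, value) pairs, children before their parents."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            for sub_path, val in traverse_sketch(value, leaf_node):
                # Join the key with the child's path; strip the dot left by "".
                yield f"{key}.{sub_path}".strip("."), val
        if not leaf_node:
            yield "", obj
    elif isinstance(obj, list):
        # Lists do not add a path level; their items share the parent's path.
        for item in obj:
            yield from traverse_sketch(item, leaf_node)
        if not leaf_node:
            yield "", obj
    elif leaf_node:
        yield "", obj  # scalars are only reported in leaf mode


doc = {
    "exac_nontcga": {"af": 0.00001883},
    "gnomad_exome": {"af": {"af": 0.0000119429, "af_afr": 0.000123077}},
    "snpeff": {"ann": [
        {"effect": "intron_variant", "feature_id": "NM_014672.3"},
        {"effect": "intron_variant", "feature_id": "NM_001256678.1"},
    ]},
}
for path, value in traverse_sketch(doc, leaf_node=True):
    print(path, value)
```

Because parents are yielded after their children, a caller can modify a sub-document and still see the change reflected when the enclosing object comes around.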

class biothings.web.query.formatter.FormatterDict(dict=None, /, **kwargs)[source]

Bases: UserDict

collapse(key)[source]

exclude(keys)[source]

include(keys)[source]

wrap(key, kls)[source]

class biothings.web.query.formatter.Hits(dict=None, /, **kwargs)[source]

Bases: FormatterDict

{
    "total": ... ,
    "hits": [
        { ... },
        { ... },
        ...
    ]
}

class biothings.web.query.formatter.MongoResultFormatter[source]

Bases: ResultFormatter

transform(result, **options)[source]

class biothings.web.query.formatter.ResultFormatter[source]

Bases: object

transform(response)[source]

transform_mapping(mapping, prefix=None, search=None)[source]

exception biothings.web.query.formatter.ResultFormatterException[source]: Bases: Exception

class biothings.web.query.formatter.SQLResultFormatter[source]

Bases: ResultFormatter

transform(result, **options)[source]

biothings.web.query.pipeline

class biothings.web.query.pipeline.AsyncESQueryPipeline(builder, backend, formatter, **settings)[source]

Bases: QueryPipeline

async fetch(**kwargs)[source]

async search(**kwargs)[source]

class biothings.web.query.pipeline.ESQueryPipeline(builder=None, backend=None, formatter=None, *args, **kwargs)[source]

Bases: QueryPipeline

fetch(id, **options)[source]

search(q, **options)[source]

class biothings.web.query.pipeline.MongoQueryPipeline(builder, backend, formatter, **settings)[source]: Bases: QueryPipeline

class biothings.web.query.pipeline.QueryPipeline(builder, backend, formatter, **settings)[source]

Bases: object

fetch(id, **options)[source]

search(q, **options)[source]
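Given the constructor signature (builder, backend, formatter), a pipeline can be thought of as chaining the three stages documented in the modules above. The sketch below uses stand-in classes only; every class here is hypothetical, and the assumption that search() calls build, execute and transform in that order is inferred from the module descriptions rather than confirmed by this reference.

```python
class EchoBuilder:
    """Stand-in builder: wraps q in a query string query."""
    def build(self, q, **options):
        return {"query_string": {"query": q}}


class InMemoryBackend:
    """Stand-in backend: 'executes' the query against a list of dicts."""
    def __init__(self, docs):
        self.docs = docs

    def execute(self, query, **options):
        term = query["query_string"]["query"]
        hits = [doc for doc in self.docs if term in doc.values()]
        return {"hits": {"total": len(hits), "hits": hits}}


class PlainFormatter:
    """Stand-in formatter: unwraps the nested response."""
    def transform(self, response, **options):
        return {"total": response["hits"]["total"], "hits": response["hits"]["hits"]}


class SketchPipeline:
    def __init__(self, builder, backend, formatter, **settings):
        self.builder = builder
        self.backend = backend
        self.formatter = formatter
        self.settings = settings

    def search(self, q, **options):
        query = self.builder.build(q, **options)            # stage 1: build
        response = self.backend.execute(query, **options)   # stage 2: execute
        return self.formatter.transform(response, **options)  # stage 3: format


# Illustrative documents only.
docs = [{"_id": "1017", "symbol": "cdk2"}, {"_id": "1018", "symbol": "cdk3"}]
pipeline = SketchPipeline(EchoBuilder(), InMemoryBackend(docs), PlainFormatter())
print(pipeline.search("cdk2"))
```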

exception biothings.web.query.pipeline.QueryPipelineException(code: int = 500, summary: str = '', details: object = None)[source]

Bases: Exception

code: int = 500

details: object = None

summary: str = ''

exception biothings.web.query.pipeline.QueryPipelineInterrupt(data)[source]: Bases: QueryPipelineException

class biothings.web.query.pipeline.SQLQueryPipeline(builder, backend, formatter, **settings)[source]: Bases: QueryPipeline

biothings.web.query.pipeline.capturesESExceptions(func)[source]