Query strategy performance

From phenoscape
Revision as of 16:05, 18 February 2009 by Crk18 (talk | contribs) (Stored procedure #5: getAnatomyInfo)

This page discusses the requirements of the various data access modules of the Phenoscape application to be demonstrated at the ASIH workshop, and documents the performance of the various querying strategies being tested for these modules. Familiary with SQL and PostgreSQL stored procedures may be necessary to fully comprehend these contents.

Anatomy data services

Given a specific anatomical entity A, the requirements of the anatomy data service module are as follows:

  1. Identify ALL unique phenotypes {P} that are composed of the entity A or any of its subtypes.
  2. For each identified phenotype P which is of the form inheres_in(Q, E)
    1. Identify the quality Q.
      1. In addition, identify the character C which subsumes Q. This requires navigating the PATO hierarchy recursively in some cases
    2. Identify the entity E. E may be the same as A, or any valid subtype of A.
    3. Identify the set of taxa {T} or genotypes {GT} that exhibit the phenotype P.
      1. For a genotype GT, identify the gene G that encodes it.

The result should be a list in the format shown below

<javascript> <Search-Anatomical-Entity> <Phenotype> <Entity> <Quality> <Character> <List of taxa>OR<List of genes> </javascript>


Strategy #1

This strategy is the simplest way to try and get all the necessary information related to the anatomical entity being searched for. Below is the actual query which attempts to do this. This query returns the unique phenotypes associated with the anatomical entity, and the qualities and specific anatomical entities associated with the phenotype, as well as the list of taxa that exhibit each of these phenotypes in one fell table join (uh, swoop).

Query 1 (keep it simple, no stored procedure invocations)

<sql> SELECT DISTINCT phenotype_node.uid AS phenotype, taxon_node.uid AS taxon, quality_node.uid AS quality, anatomy_node.uid AS entity FROM link AS inheres_in_link, link AS search_link, link AS is_a_link, link AS exhibits_link, node AS taxon_node, node AS exhibits_pred_node, node AS phenotype_node, node AS search_pred_node, node AS search_node, node AS inheres_in_pred_node, node AS anatomy_node, node AS is_a_pred_node, node AS quality_node WHERE exhibits_pred_node.uid = 'PHENOSCAPE:exhibits' AND is_a_pred_node.uid = 'OBO_REL:is_a' AND inheres_in_pred_node.uid = 'OBO_REL:inheres_in' AND search_pred_node.uid = 'OBO_REL:inheres_in' AND search_node.uid = 'TAO:0000108' AND inheres_in_link.is_inferred = 'f' AND is_a_link.is_inferred = 'f' AND exhibits_link.node_id IS NOT NULL AND search_link.node_id = phenotype_node.node_id AND search_link.predicate_id = search_pred_node.node_id AND search_link.object_id = search_node.node_id AND inheres_in_link.node_id = phenotype_node.node_id AND inheres_in_link.predicate_id = inheres_in_pred_node.node_id AND inheres_in_link.object_id = anatomy_node.node_id AND is_a_link.node_id = phenotype_node.node_id AND is_a_link.predicate_id = is_a_pred_node.node_id AND is_a_link.object_id = quality_node.node_id AND exhibits_link.node_id = taxon_node.node_id AND exhibits_link.predicate_id = exhibits_pred_node.node_id AND exhibits_link.object_id = phenotype_node.node_id; </sql>

Query Execution Plan

Execution Details

  • Rows returned: 968
  • Phenoscape data revision: 449
  • Time: 0.25 ~ 9 s

Discussion

This query (referred to as QfS1, short for "Query for Strategy 1") returns each phenotype associated with the 'TAO:0000108' and also the specific entities and qualities associated with each phenotype. In addition, the taxa that exhibit the phenotype and the genotypes are also returned. If several taxa {T} or genotypes (GT} exhibit the same phenotype P, which is of the form, inheres_in(Q, E), the results are returned in rows in the format shown below.

<javascript> P1 E1 Q1 T1 P1 E1 Q1 T2 . . P1 E1 Q1 Tn P2 E2 Q2 GT1 . . P2 E2 Q2 GTn </javascript>

Note that for each quality Q in the result above, the character that subsumes it needs to be determined. This cannot be done directly from the query because a recursive traversal of the PATO hierarchy may be required in many cases. Recursive traversals are best implemented by stored procedures. This will definitely add to the execution times documented for this query. Further, note that only the genotypes associated with each phenotype are returned. To find the genes that encode these genotypes, another query needs to be executed. It is not possible to unilaterally join this gene query with QfS1 because both taxa as well as genotypes are returned by QfS1. What is required is a procedural attachment that invokes the gene query if the returned result is a genotype. Again, this procedural attachment can only be implemented as a stored procedure.

Finally, the performance of the query QfS1 varied widely from 0.25 (very good) to 9 seconds (mildly acceptable) in a random set of executions issued from the command line. The query execution planning process is responsible for this. If instead of searching for 'TAO:0000108' (fin), a more general term (like TAO:0100000, the parent term for all term definitions in TAO) were to be searched for, the performance of this query would be much more unpredictable and much slower too.

Query 2 (Bring on the stored procedures, some of them to start)

Let us examine the performance of the very same query where specific stored procedures for identifying the characters subsuming the qualities and genes encoding the genotypes are invoked as part of QfS1.

<sql> SELECT DISTINCT phenotype_node.uid AS phenotype, quality_node.uid AS quality_id, quality_node.label AS quality, getAttributeForQuality(quality_node.node_id) AS attribute, anatomy_node.uid AS entity_id, anatomy_node.label AS entity, getTaxaForPhenotype(phenotype_node.node_id) AS taxonOrGene FROM link AS inheres_in_link, link AS search_link, link AS is_a_link, link AS exhibits_link, node AS taxon_node, node AS exhibits_pred_node, node AS phenotype_node, node AS search_pred_node, node AS search_node, node AS inheres_in_pred_node, node AS anatomy_node, node AS is_a_pred_node, node AS quality_node WHERE exhibits_pred_node.uid = 'PHENOSCAPE:exhibits' AND is_a_pred_node.uid = 'OBO_REL:is_a' AND inheres_in_pred_node.uid = 'OBO_REL:inheres_in' AND search_pred_node.uid = 'OBO_REL:inheres_in' AND search_node.uid = 'TAO:0000108' AND inheres_in_link.is_inferred = 'f' AND is_a_link.is_inferred = 'f' AND exhibits_link.node_id IS NOT NULL AND search_link.node_id = phenotype_node.node_id AND search_link.predicate_id = search_pred_node.node_id AND search_link.object_id = search_node.node_id AND inheres_in_link.node_id = phenotype_node.node_id AND inheres_in_link.predicate_id = inheres_in_pred_node.node_id AND inheres_in_link.object_id = anatomy_node.node_id AND is_a_link.node_id = phenotype_node.node_id AND is_a_link.predicate_id = is_a_pred_node.node_id AND is_a_link.object_id = quality_node.node_id AND exhibits_link.node_id = taxon_node.node_id AND exhibits_link.predicate_id = exhibits_pred_node.node_id AND exhibits_link.object_id = phenotype_node.node_id; </sql>

Execution Details

  • Rows returned: 69
  • Execution time: 7 - 30 seconds

Discussion

This query returns all the required information for an input anatomical entity. Note the taxa or genes exhibiting or responsible for a specific phenotype are all grouped together in one row with the said phenotype. The character which subsumes the quality of a phenotype is retrieved as well. But since 13 tables are joined together, query execution planning may still be a significant cause to the unpredictable and relatively slow performance of this query. Moreover, note these results are being obtained for a query for fin. If the query was for TAO:0100000, the root term of TAO, the performance would be much slower and unpredictable.

Strategy #2

This strategy tries to save on query execution planning times by specifying the order of table joins in the query itself.

Query

<sql> SELECT DISTINCT phenotype_node.uid AS phenotype, quality_node.uid AS quality_id, quality_node.label AS quality, getAttributeForQuality(quality_node.node_id) AS attribute, anatomy_node.uid AS entity_id, anatomy_node.label AS entity, getTaxaForPhenotype(phenotype_node.node_id) AS taxonOrGene FROM link AS inheres_in_link, link AS is_a_link, node AS taxon_node INNER JOIN link AS exhibits_link INNER JOIN node AS phenotype_node INNER JOIN link AS search_link INNER JOIN node AS search_node ON (search_link.object_id = search_node.node_id) ON (phenotype_node.node_id = search_link.node_id) ON (exhibits_link.object_id = phenotype_node.node_id) ON (taxon_node.node_id = exhibits_link.node_id), node AS exhibits_pred_node, node AS search_pred_node, node AS inheres_in_pred_node, node AS anatomy_node, node AS is_a_pred_node, node AS quality_node WHERE exhibits_pred_node.uid = 'PHENOSCAPE:exhibits' AND is_a_pred_node.uid = 'OBO_REL:is_a' AND inheres_in_pred_node.uid = 'OBO_REL:inheres_in' AND search_pred_node.uid = 'OBO_REL:inheres_in' AND search_node.uid = 'TAO:0000108' AND inheres_in_link.is_inferred = 'f' AND is_a_link.is_inferred = 'f' AND exhibits_link.node_id IS NOT NULL AND search_link.node_id = phenotype_node.node_id AND search_link.predicate_id = search_pred_node.node_id AND search_link.object_id = search_node.node_id AND inheres_in_link.node_id = phenotype_node.node_id AND inheres_in_link.predicate_id = inheres_in_pred_node.node_id AND inheres_in_link.object_id = anatomy_node.node_id AND is_a_link.node_id = phenotype_node.node_id AND is_a_link.predicate_id = is_a_pred_node.node_id AND is_a_link.object_id = quality_node.node_id AND exhibits_link.node_id = taxon_node.node_id AND exhibits_link.predicate_id = exhibits_pred_node.node_id AND exhibits_link.object_id = phenotype_node.node_id; </sql>

Execution details

  • Rows returned: 69
  • Execution time: Approx. 9.8 seconds

Discussion

The results obtained by this query are consistent with the requirements of the anatomy data service. The query performance is also very stable (approx. 9.8 seconds) over repeated executions. This time was recorded for querying over fins. Querying for the root entity of TAO will expectedly result in much longer execution times.

Strategy #3

This strategy uses back-end stored procedures to minimize execution planning times. All the stored procedures are introduced here.

Stored procedures

Stored procedure #1: getAttributeForQuality

This procedure navigates the PATO hierarchy recursively if needed to identify the character that is associated with a given quality. Facets have been provided in this procedure to handle qualities which do not belong to attribute slim or value slim subsets, and to handle qualities that are specified as aliases for other qualities. Note the PATO think tank is about to replace attribute slims with character slim subset specifications. The functionality of this procedure will however remain the same.

<sql> CREATE OR REPLACE FUNCTION getAttributeForQuality(INT) RETURNS VARCHAR AS $$ DECLARE res VARCHAR; attr_id INT; slim VARCHAR; slim_id INT; temp_id INT; super_id INT; BEGIN SELECT DISTINCT object_id, node_uid(object_id) INTO slim_id, slim FROM link WHERE node_id = $1 AND predicate_id = (SELECT node_id FROM node WHERE uid = 'oboInOwl:inSubset') AND object_id IN (SELECT node_id FROM node WHERE uid IN ('value_slim', 'attribute_slim')); IF (slim IS NULL) THEN SELECT object_id FROM link INTO super_id WHERE node_id = $1 AND predicate_id = (SELECT node_id FROM node WHERE uid = 'OBO_REL:is_a') AND is_inferred = 'f'; IF (super_id IS NULL) THEN SELECT DISTINCT node_id FROM link INTO temp_id WHERE object_id = $1 AND predicate_id = (SELECT node_id FROM node WHERE uid = 'oboInOwl:hasDbXref'); IF temp_id IS NULL THEN RETURN 'Attribute Undefined'; ELSE RETURN getAttributeForQuality(temp_id); END IF; ELSE RETURN getAttributeForQuality(super_id); END IF; ELSIF (slim = 'attribute_slim') THEN res := node_uid($1) || '(' || node_label($1) || ')'; RETURN res; ELSE SELECT object_id INTO attr_id FROM link WHERE node_id = $1 AND predicate_id = (SELECT node_id FROM node WHERE uid = 'OBO_REL:is_a') AND is_inferred = 'f'; RETURN getAttributeForQuality(attr_id); END IF; END $$ LANGUAGE 'plpgsql'; </sql>

Stored procedure #2: getTaxaForPhenotype

This procedure retrieves all the taxa and genes that are associated with a given phenotype. Given that genes are indirectly associated with phenotypes (genotypes being the intermediary), this procedure navigates the link from genotypes to genes to find the association.

<sql> CREATE OR REPLACE FUNCTION getTaxaForPhenotype(INT) RETURNS VARCHAR[] AS $$ DECLARE rec RECORD; gene RECORD; taxa VARCHAR[20]; ct int := 0; BEGIN FOR rec IN SELECT node_id, node_uid(node_id) AS uid, node_label(node_id) AS label FROM link WHERE predicate_id = (SELECT node_id FROM node WHERE uid = 'PHENOSCAPE:exhibits') AND object_id = $1 LOOP ct := ct + 1; IF (rec.uid LIKE '%TTO%') THEN taxa[ct] := rec.uid || '(' || rec.label || ')'; ELSE FOR gene IN SELECT node_uid(node_id) AS gId, node_label(node_id) AS gLabel FROM link WHERE predicate_id = (SELECT node_id FROM node WHERE uid = 'PHENOSCAPE:has_allele') AND object_id = rec.node_id LOOP taxa[ct] := gene.gId || '(' || gene.gLabel || ')'; END LOOP; END IF; END LOOP; RETURN taxa; END $$ LANGUAGE 'plpgsql'; </sql>

Stored procedure #3: getQualityFromPhenotype

This procedure finds the quality associated with the given phenotype

<sql> CREATE or REPLACE FUNCTION getQualityFromPhenotype(INT) RETURNS RECORD AS $$ SELECT object_id AS quality_id, node_uid(object_id) || ' (' || node_label(object_id) || ')' AS quality FROM link WHERE predicate_id = (SELECT node_id FROM node WHERE uid = 'OBO_REL:is_a') AND node_id = $1 AND is_inferred = 'f'; $$ LANGUAGE 'sql'; </sql>

Stored procedure #4: getEntityFromPhenotype

This procedure finds the anatomical entity directly associated with the given phenotype

<sql> CREATE OR REPLACE FUNCTION getEntityFromPhenotype(INT) RETURNS VARCHAR AS $$ DECLARE entity_id VARCHAR; BEGIN SELECT node_uid(object_id) || ' ' || node_label(object_id) INTO entity_id FROM link WHERE predicate_id = (SELECT node_id FROM node WHERE uid = 'OBO_REL:inheres_in') AND node_id = $1 AND is_inferred = 'f'; RETURN entity_id; END; $$ LANGUAGE 'plpgsql'; </sql>

Stored procedure #5: getAnatomyInfo

This is the master stored procedure which can be invoked directly from the client side. It invokes the other stored procedures in turn to retrieve all the information associated with the searched anatomical entity in one go.

<sql> CREATE TABLE anatomyrow (search_id VARCHAR, phenotype_id INT, phenotype_label VARCHAR, entity_info VARCHAR, quality_info VARCHAR, attribute_info VARCHAR, taxa_info VARCHAR);

CREATE OR REPLACE FUNCTION getAnatomyInfo(VARCHAR) RETURNS SETOF anatomyrow AS $$ DECLARE result anatomyrow%rowtype; qualityRecord RECORD; phenotype INT; quality VARCHAR; entity VARCHAR; taxa VARCHAR; gene VARCHAR; attribute VARCHAR; qualityAndAttribute VARCHAR[2]; BEGIN FOR result IN SELECT DISTINCT $1 AS search_id, phenotype_node.node_id as phenotype_id, phenotype_node.uid AS phenotype_uid, 'entity' AS entity_info, 'quality' AS quality_info, 'attribute' AS attribute_info, 'taxa' AS taxa_info FROM node AS taxon_node INNER JOIN link AS exhibits_link INNER JOIN node AS phenotype_node INNER JOIN link AS inheres_in_link INNER JOIN node AS search_node ON (inheres_in_link.object_id = search_node.node_id) ON (phenotype_node.node_id = inheres_in_link.node_id) ON (exhibits_link.object_id = phenotype_node.node_id) ON (taxon_node.node_id = exhibits_link.node_id), node AS inheres_in_pred_node, node AS exhibits_pred_node WHERE search_node.uid = $1 AND exhibits_pred_node.uid = 'PHENOSCAPE:exhibits' AND inheres_in_pred_node.uid = 'OBO_REL:inheres_in' AND exhibits_link.node_id IS NOT NULL AND exhibits_link.predicate_id = exhibits_pred_node.node_id AND inheres_in_link.predicate_id = inheres_in_pred_node.node_id LOOP phenotype := result.phenotype_id; qualityRecord := getQualityFromPhenotype(phenotype); attribute := getAttributeForQuality(qualityRecord.quality_id); quality := qualityRecord.quality; taxa := getTaxaForPhenotype(phenotype); entity := getEntityFromPhenotype(phenotype); result.search_id := $1; result.entity_info := entity; result.quality_info := quality; result.attribute_info := attribute; result.taxa_info := taxa; RETURN NEXT result; END LOOP; END $$ LANGUAGE 'plpgsql'; </sql>

Execution details

The master stored procedure getAnatomyInfo() can be invoked from the client side using a command similar to the one below

<sql> SELECT * FROM getAnatomyInfo('TAO:0000108'); </sql>

  • Rows returned: 69
  • Execution time: Approx. 0.25 seconds

Discussion

This strategy resulted in the fastest execution times. All the required information is obtained. Moreover, when querying for TAO:0100000, the root PATO term, this strategy returned approximately 2500 unique phenotypes with their associated sets of taxa and genes, qualities, entities, and atttributes in approximately 8 seconds. That is presumptively every annotation in the database!

Strategy #4

This strategy will be tested in the week of Feb 16 - 20.

Conclusion and potential areas for improvement

From the above discussion, it can be seen the performance of queries from the SICB prototype can be significantly improved and scaled. At present, (Feb 11, 2009), the database contains approximately 122500 phenotype assertions from the Phenoscape project and roughly 23000 assertions from ZFIN. While significant additions to the ZFIN assertions are not expected, the number of assertions in the Phenoscape project are expected to go up to 7.5 million. However, assuming a logarithmic increase in performance metrics with a linear increase in data (as is standard in most industrial strength DBMSs), the scalability of the querying strategies does not appear to be a major concern at this time.

Materializing views for Attribute-Quality (or Character-State) combinations

Given that stored procedures display the fastest execution times, the invocation of a procedure that recursively executes SQL queries to traverse the PATO hierarchy in order to identify characters that subsume qualities is a bottleneck of its own. This can be overcome by the creation of a materialized view, which contains the mapping information from several qualities to a common attribute (or from several states to a common character), as shown below.

<javascript>

Quality Attribute


---------

Blue Color Red Color Round Shape Bifurcated Shape . . </javascript>

The attribute identification for a quality can be done by simply reading from this view, which should lead to more savings in query execution time.

Gene data services

Taxon data services

Publication data services