Query strategy performance

From phenoscape
Revision as of 18:18, 12 February 2009 by Crk18 (talk | contribs) (New page: This page discusses the requirements of the various data access modules of the Phenoscape application to be demonstrated at the ASIH workshop, and documents the performance of the various ...)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

This page discusses the requirements of the various data access modules of the Phenoscape application to be demonstrated at the ASIH workshop, and documents the performance of the various querying strategies being tested for these modules.

Anatomy data services

Given a specific anatomical entity A, the requirements of the anatomy data service module are as follows:

  1. Identify ALL unique phenotypes {P} that are composed of the entity A or any of its subtypes.
  2. For each identified phenotype P which is of the form inheres_in(Q, E)
    1. Identify the quality Q.
      1. In addition, identify the character C which subsumes Q. This requires navigating the PATO hierarchy recursively in some cases
    2. Identify the entity E. E may be the same as A, or any valid subtype of A.
    3. Identify the set of taxa {T} or genotypes {GT} that exhibit the phenotype P.
      1. For a genotype GT, identify the gene G that encodes it.

The result should be a list in the format shown below

<javascript> <Search-Anatomical-Entity> <Phenotype> <Entity> <Quality> <Character> <List of taxa>OR<List of genes> </javascript>


Strategy #1

This strategy is the simplest way to try and get all the necessary information related to the anatomical entity being searched for. Below is the actual query which attempts to do this. This query returns the unique phenotypes associated with the anatomical entity, and the qualities and specific anatomical entities associated with the phenotype, as well as the list of taxa that exhibit each of these phenotypes in one fell table join (uh, swoop).

Query

<sql> SELECT phenotype_node.uid AS phenotype, taxon_node.uid AS taxon, quality_node.uid AS quality, anatomy_node.uid AS entity FROM link AS inheres_in_link, link AS search_link, link AS is_a_link, link AS exhibits_link, node AS taxon_node, node AS exhibits_pred_node, node AS phenotype_node, node AS search_pred_node, node AS search_node, node AS inheres_in_pred_node, node AS anatomy_node, node AS is_a_pred_node, node AS quality_node WHERE exhibits_pred_node.uid = 'PHENOSCAPE:exhibits' AND is_a_pred_node.uid = 'OBO_REL:is_a' AND inheres_in_pred_node.uid = 'OBO_REL:inheres_in' AND search_pred_node.uid = 'OBO_REL:inheres_in' AND search_node.uid = 'TAO:0000108' AND inheres_in_link.is_inferred = 'f' AND exhibits_link.node_id IS NOT NULL AND search_link.node_id = phenotype_node.node_id AND search_link.predicate_id = search_pred_node.node_id AND search_link.object_id = search_node.node_id AND inheres_in_link.node_id = phenotype_node.node_id AND inheres_in_link.predicate_id = inheres_in_pred_node.node_id AND inheres_in_link.object_id = anatomy_node.node_id AND is_a_link.node_id = phenotype_node.node_id AND is_a_link.predicate_id = is_a_pred_node.node_id AND is_a_link.object_id = quality_node.node_id AND exhibits_link.node_id = taxon_node.node_id AND exhibits_link.predicate_id = exhibits_pred_node.node_id AND exhibits_link.object_id = phenotype_node.node_id; </sql>

Query Execution Plan

Execution Details

  • Rows returned: 4797
  • Phenoscape data revision: 449
  • Time: 0.4 ~ 108 s

Discussion

This query (referred to as QfS1, short for "Query for Strategy 1") returns each phenotype associated with the 'TAO:0000108' and also the specific entities and qualities associated with each phenotype. In addition, the taxa that exhibit the phenotype and the genotypes are also returned. If several taxa {T} or genotypes (GT} exhibit the same phenotype P, which is of the form, inheres_in(Q, E), the results are returned in rows in the format shown below.

<javascript> P1 E1 Q1 T1 P1 E1 Q1 T2 . . P1 E1 Q1 Tn P2 E2 Q2 GT1 . . P2 E2 Q2 GTn </javascript>

Note that for each quality Q in the result above, the character that subsumes it needs to be determined. This cannot be done directly from the query because a recursive traversal of the PATO hierarchy may be required in many cases. Recursive traversals are best implemented by stored procedures. This will definitely add to the execution times documented for this query. Further, note that only the genotypes associated with each phenotype are returned. To find the genes that encode these genotypes, another query needs to be executed. It is not possible to unilaterally join this gene query with QfS1 because both taxa as well as genotypes are returned by QfS1. What is required is a procedural attachment that invokes the gene query if the returned result is a genotype. Again, this procedural attachment can only be implemented as a stored procedure.

Finally, the performance of the query QfS1 varied widely from half a second (very good) to 108 seconds (intolerable) in a random set of executions issued from the command line. The query execution planning process is responsible for this. If instead of searching for 'TAO:0000108' (fin), a more general term (like TAO:0100000, the parent term for all term definitions in TAO) were to be searched for, the performance of this query would be much more unpredictable and much slower too.

Let us examine the performance of the very same query where specific stored procedures for identifying the characters subsuming the qualities and genes encoding the genotypes are invoked as part of QfS1.

Strategy #2

Strategy #3

Strategy #4

Strategy #5

Gene data services

Taxon data services

Publication data services