Difference between revisions of "Queries"
Revision as of 20:06, 28 January 2009
This section describes the queries that have been (or are to be) implemented for the Phenoscape data services, along with the execution details of each query on the PostgreSQL database on Darwin.
Status (Jan 20, 09)
The first iteration of the Web Services module for the Phenoscape project (the SICB prototype) was demonstrated at the SICB meeting in Boston, MA in January 2009. This module allowed database searches for Anatomical Entities (Anatomical Entity Services) and Genes (Gene Services). Searches for Taxa (Taxon Services) are to be implemented in the next iteration which will be a part of the next Phenoscape version to be demonstrated at the ASIH meeting in Portland, OR (the ASIH prototype) in July, 2009.
Testing by the Phenoscape project stakeholders (Paula, Todd, and Monte) at the SICB meeting revealed that the Anatomy and Gene Services were functional but very slow to execute. As a result, the data retrieval strategy used in the SICB prototype is being examined for bottlenecks, and the details are presented here.
Summary
Queries are assembled in a Java program and dispatched through a connection to the database, and executed at the database end. For brevity's sake, the Java program is called the client side and the database side is called the backend henceforth. The query modules on the client-side interface with the database in the backend to execute the queries. The data retrieved by these query executions are then processed at the client-side. There are two possible bottlenecks in this scheme: one at the client-side and the other at the backend.
The backend bottleneck is the more likely of the two. This is because the query has to be transmitted through the connection from the client side to the backend, then executed at the backend (a process in itself which is not discussed here), and the retrieved results sent back over the connection to the client side. Each of these steps takes time, and the costs add up. As a case in point, the query execution strategy implemented for the SICB prototype spawns a multitude of queries. Each of these queries takes time to connect, retrieve the results, and transfer them back to the client side. Therefore, a new strategy that tries to obtain all the required data in one query (or a very limited number of queries) is currently being tested. Details of both the old and new strategies can be found in the linked pages.
To test the efficiency of the new queries, more methods need to be added to the OBD Shard libraries, and the projects have to be compiled and linked prior to testing. This is to be done over the next two weeks (from Jan 20, 09). The details of these strategies can also be found here.
Database details
- Last updated: Jan 02, 2009
- Size: ~ 600 MB
Anatomical Entity Services
Queries on anatomical entities retrieve information on the qualities that inhere in them and the taxa that exhibit these entity-quality (or more correctly, character-state) combinations. Querying strategies to retrieve this information leverage a number of relation instances which are stored in the OBD database. These are detailed below.
Relations of interest
Post compositions of Entities and Qualities are used to relate taxa and phenotypes through the exhibits relation, as shown in (1).
<javascript>
Taxon exhibits inheres_in(Quality, Entity) -- (1)
</javascript>
In addition, the OBD database contains information relating post-composed phenotypes to both the Quality and the Entity by different relations, as shown in (2) and (3) respectively.
<javascript>
inheres_in(Quality, Entity) is_a Quality -- (2)
inheres_in(Quality, Entity) inheres_in Entity -- (3)
</javascript>
A Quality can be either a Value or an Attribute (besides other slims) and is related to these by the in_subset_of relation, as shown in (4).
<javascript>
Quality in_subset_of Slim -- (4)
</javascript>
Qualities and Anatomical Entities are subclassed in the PATO and TAO hierarchies respectively, as shown in (5) and (6).
<javascript>
Value is_a Attribute -- (5)
Sub Anatomical Feature is_a Anatomical Feature -- (6)
</javascript>
Querying strategy for the SICB prototype
The queries implemented for this iteration of the Phenoscape UI use the following strategy to retrieve taxa and qualities associated with an Anatomical Entity.
- Phenotypes containing the anatomical feature and the taxa exhibiting these phenotypes were extracted from the database using regular expression keyword matches. This was done with one query (Q1) that uses the relation in (1)
- Results from Q1 are parsed to extract the Anatomical Feature and the Quality that went into each Phenotype (again using regular expressions)
- The Quality extracted in the previous step is analyzed by running a query (Q2) on relation (4), to see if it is an attribute or value. If the Quality is a value, then a second query (Q3) is used to determine the attribute it is a value of. This query runs on the is_a relation in (5), and is invoked in sequence until an attribute higher in the quality branch is found
- The results from the previous step are used to group the qualities that an entity can take under specific attributes. Value qualities such as Distorted, Regular etc may be grouped under an attribute quality such as Shape
- In another direction, the taxa retrieved in Q1 are also collected
- Next, the anatomical features that are sub features of the search feature are collected. For example, if the search was for dorsal fin, all the sub features of dorsal fin, such as dorsal fin lepidotrichium, are retrieved by a query (Q4) over the relation shown in (6)
- For every sub anatomical feature retrieved by Q4, we repeat the previous steps
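The cost of the attribute lookup in the steps above (Q2 and Q3) can be illustrated with a minimal sketch. It assumes a hypothetical, simplified `link(subject, predicate, object)` table in an in-memory SQLite database; this is not the actual OBD schema, the slim names `attribute_slim` and `value_slim` are placeholders, and the quality "Bent" is invented for illustration. Each call to `run` stands in for one round trip between the client side and the backend.

```python
import sqlite3

# Hypothetical, simplified schema: one row per asserted relation.
# The real OBD schema is not reproduced here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        ("Shape", "in_subset_of", "attribute_slim"),
        ("Distorted", "in_subset_of", "value_slim"),
        ("Bent", "in_subset_of", "value_slim"),  # invented quality
        ("Bent", "is_a", "Distorted"),           # invented link
        ("Distorted", "is_a", "Shape"),
    ],
)

stats = {"queries": 0}


def run(sql, params):
    """Execute one query and count it as one client/backend round trip."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def is_attribute(quality):
    """Q2: check the slim membership of a quality, relation (4)."""
    rows = run(
        "SELECT object FROM link "
        "WHERE subject = ? AND predicate = 'in_subset_of'",
        (quality,),
    )
    return ("attribute_slim",) in rows


def find_attribute(quality):
    """Q3, repeated: climb is_a, relation (5), until an attribute is found."""
    current = quality
    while not is_attribute(current):
        rows = run(
            "SELECT object FROM link WHERE subject = ? AND predicate = 'is_a'",
            (current,),
        )
        if not rows:
            return None  # no attribute above this quality
        current = rows[0][0]
    return current


attribute = find_attribute("Bent")
print(attribute, stats["queries"])
```

In this toy data, resolving a single quality that sits two levels below its attribute already takes five queries. Repeating that for every phenotype and every sub anatomical feature is what multiplies the number of round trips.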
In summary, the relations (2) and (3) are not leveraged in this strategy. The transitive relations between the Attribute and Value Qualities in the PATO hierarchy and between the Anatomical Features in the TAO hierarchy (which are inferred by the OBD reasoner) are also not utilized. An assortment of queries is executed over the database backend and their results are fed into the Java methods implemented on the client side, which is a very time-consuming process. Some data structures, such as lookup tables for Attributes and Values, have been implemented to minimize database connections and query executions. However, the whole retrieval process is still very slow.
The details of these queries can be found here
Proposed querying strategy for the ASIH prototype
As the multiple query spawning strategy is now believed to be primarily responsible for the slow execution in the SICB prototype, initial attempts at improving performance focus on retrieving all the required data in one query. The proposed querying strategy traverses all the relations (1) through (6) described in the previous sections, using a combination of table joins, to find all the information pertinent to the anatomical entity being searched for. In contrast to the strategy used in the SICB prototype, this methodology makes use of the transitive relations derived by the OBD reasoner between Attributes and Values in the PATO hierarchy and between Anatomical Entities in the TAO hierarchy. The details of these queries can be found here.
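As a rough illustration of the single-query idea, the sketch below joins a hypothetical `link(subject, predicate, object)` table against itself to walk relations (1) through (6) in one statement. The table layout, the phenotype identifiers (`P1`, `P2`, `P3`), the taxon names, and the slim name `attribute_slim` are all assumptions made for this example; they are not the OBD schema or the queries actually proposed. The `is_a` rows stand in for links the reasoner would already have materialized, and the sketch assumes every phenotype quality is a Value with an `is_a` link to an Attribute.

```python
import sqlite3

# Hypothetical, simplified schema; not the actual OBD schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        # Relation (1): taxon exhibits phenotype
        ("Taxon A", "exhibits", "P1"),
        ("Taxon B", "exhibits", "P2"),
        ("Taxon C", "exhibits", "P3"),
        # Relations (2) and (3): phenotype to quality and to entity
        ("P1", "is_a", "Distorted"),
        ("P1", "inheres_in", "dorsal fin"),
        ("P2", "is_a", "Regular"),
        ("P2", "inheres_in", "dorsal fin lepidotrichium"),
        ("P3", "is_a", "Distorted"),
        ("P3", "inheres_in", "anal fin"),
        # Relations (4) and (5): slims and the value-to-attribute links
        ("Shape", "in_subset_of", "attribute_slim"),
        ("Distorted", "is_a", "Shape"),
        ("Regular", "is_a", "Shape"),
        # Relation (6): sub feature of the searched feature
        ("dorsal fin lepidotrichium", "is_a", "dorsal fin"),
    ],
)

SINGLE_QUERY = """
SELECT ex.subject  AS taxon,
       ent.object  AS entity,
       attr.object AS attribute,
       qual.object AS quality
FROM link AS ex
JOIN link AS qual ON qual.subject = ex.object AND qual.predicate = 'is_a'
JOIN link AS ent  ON ent.subject = ex.object AND ent.predicate = 'inheres_in'
JOIN link AS attr ON attr.subject = qual.object AND attr.predicate = 'is_a'
JOIN link AS slim ON slim.subject = attr.object
                 AND slim.predicate = 'in_subset_of'
                 AND slim.object = 'attribute_slim'
LEFT JOIN link AS sub ON sub.subject = ent.object
                     AND sub.predicate = 'is_a'
                     AND sub.object = :entity
WHERE ex.predicate = 'exhibits'
  AND (ent.object = :entity OR sub.subject IS NOT NULL)
ORDER BY taxon
"""


def search_entity(entity):
    """Return (taxon, entity, attribute, quality) rows in one round trip."""
    return conn.execute(SINGLE_QUERY, {"entity": entity}).fetchall()


rows = search_entity("dorsal fin")
for row in rows:
    print(row)
```

Here the grouping of values under attributes and the inclusion of sub features both happen inside the database, so the client side receives the finished rows from a single query instead of assembling them from many.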
Performance documentation plans
The data retrieval process can be broken down into 4 stages.
- Query formulation
- This happens on the client-side, which is the layer that lies between the Phenoscape UI and the OBD database backend. On the client-side the RESTlet resource opens a connection to the database and calls the Shard which assembles the query and forwards the query to the database for execution through the opened connection
- Query execution
- This happens on the OBD database side. Th
- Database side result processing After the results are retrieved, they need to be assembled in a format compatible with the
- Result processing On the client side, the retrieved results need t
Gene Services
The querying strategy for the Gene Services module of the SICB prototype is identical to the strategy for the Anatomy Services module. This strategy also involves the spawning of multiple queries, which adds to the backend bottleneck. The only difference is that this strategy leverages the relationships between genes and genotypes, and then between genotypes and phenotypes (as shown in (1) and (2) below), to retrieve the desired information.
<javascript>
Gene has_allele Genotype -- (1)
Genotype exhibits inheres_in(Quality, Entity) -- (2)
</javascript>
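The chain from gene to genotype to phenotype in relations (1) and (2) can be followed with a single join. The sketch below is only an illustration under assumptions: the `link(subject, predicate, object)` table, the identifiers `gene_1`, `genotype_1` and so on, and the phenotype strings are all hypothetical, and this is not the query used in either prototype.

```python
import sqlite3

# Hypothetical, simplified schema and placeholder identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        # Relation (1): Gene has_allele Genotype
        ("gene_1", "has_allele", "genotype_1"),
        ("gene_1", "has_allele", "genotype_2"),
        ("gene_2", "has_allele", "genotype_3"),
        # Relation (2): Genotype exhibits inheres_in(Quality, Entity)
        ("genotype_1", "exhibits", "inheres_in(Distorted, dorsal fin)"),
        ("genotype_2", "exhibits", "inheres_in(Regular, dorsal fin)"),
        ("genotype_3", "exhibits", "inheres_in(Distorted, anal fin)"),
    ],
)


def phenotypes_for_gene(gene):
    """Follow has_allele, then exhibits, in one query."""
    sql = """
        SELECT allele.object AS genotype, ex.object AS phenotype
        FROM link AS allele
        JOIN link AS ex
          ON ex.subject = allele.object AND ex.predicate = 'exhibits'
        WHERE allele.predicate = 'has_allele' AND allele.subject = ?
        ORDER BY genotype
    """
    return conn.execute(sql, (gene,)).fetchall()


gene_rows = phenotypes_for_gene("gene_1")
for genotype, phenotype in gene_rows:
    print(genotype, phenotype)
```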
Taxon Services
These will be implemented for the first time in the ASIH prototype.