Phenoscape data repository

From phenoscape
Revision as of 18:52, 5 August 2009 by Crk18 (talk | contribs) (Speeding up the queries: Data warehouse)

The Phenoscape data repository is a relational database, which holds phenotypic data from the model organism Danio Rerio (Zebrafish) and the evolutionary organisms belong to the clade of Ostariophysi. This page describes the schema of this data repository and outlines the methods to load and query this data repository.

Data Repository

The Phenoscape data repository has been implemented as a PostgreSQL relational database, and at present is housed on the development database server at NESCent.

Schema

The schema of the Phenoscape data repository is based upon the Open Biomedical Database (OBD) data format developed at the Berkeley Bioinformatics Open-source Projects (BBOP). OBD is based upon the Resource Description Framework (RDF) format for capturing metadata about Web (and Semantic Web) resources such as Web pages and Web services.

The philosophy of OBD is to represent every conceptual entity, be it a type or a token (synonymously a class or an object, or a concept or an instance) or a relation definition, as a Node. Instances of relations between these nodes are represented as Statements, specifically Link Statements. OBD also allows for reification, which is vital to the life sciences with their emphasis on evidence codes and attributions (provenance). For this purpose, OBD provides Literal Statements (and Annotation Statements) to capture metadata about Nodes and Link Statements, such as the source publication, evidence codes, specimens used, and so forth.

Tables

Two relational tables are central to the schema of the Phenoscape data repository. These are: LINK and NODE. The SQL commands for the creation of these tables (and the others) can be found at this Phenoscape Sourceforge page.

The NODE table

The NODE table contains information about every concept such as its unique identifier, label, and source ontology. The NODE table contains this information about concepts extracted from the source Ontologies. In addition, it also holds information about scientific publications (in a rudimentary format which will be improved upon soon), the ontologies themselves, and the representation of phenotypes from the ZFIN and NeXML databases. It will be augmented in the future to hold information about collection specimens. The NODE table adds a unique identifier (generated from a sequence) of its own to every row. An excerpt of the row from the NODE table for the Gymnotiformes term is shown below

<javascript>

node_id |   uid    |     label     | metatype | source_id

+----------+---------------+----------+-----------

  46050 | TTO:1390 | Gymnotiformes | C        |      9630

</javascript>

  • The NODE_ID column holds the unique identifier generated by the Phenoscape database for this term
  • The UID column holds the identifier of this term that is obtained from the Teleost Taxonomy Ontology (TTO). The 'TTO' is the namespace prefix
  • The LABEL column displays the label for this term
  • The METATYPE column shows term is a Class (C). Other metatypes are Relation (R) and Instance (I).
  • The SOURCE_ID column holds the NODE_ID of the ontology from which the term was extracted. In this case, the source ontology is the TTO

The LINK table

The LINK table contains rows which represent Statements which link the Nodes to one another, and also the metadata about these Nodes. The excerpt below shows some of the rows in the LINK table about the Gymnotiformes term

<javascript>

link_id | node_id | predicate_id | object_id


+---------+--------------+-----------

  23854 |    9637 |          102 |     46050
  59897 |   45723 |          102 |     46050
  60223 |   46050 |          102 |     46160
 501448 |    9932 |          102 |     46050

</javascript>

  • The LINK_ID column shows the database generated identifier for the link
  • The NODE_ID column shows the Subject of the Statement (in RDF parlance). This ID the is database generated identifier for the concept Eigenmanniidae (TTO:10000005)
  • The PREDICATE_ID column shows the Predicate of the Statement. This ID is the database generated identifier for the relation OBO_REL:is_a.
  • The OBJECT_ID column shows the Object of the Statement which is the ID generated by the database for concept Gymnotiformes

In simple terms, a sub species of Gymnotiformes is displayed by this Statement as shown in the triple below

<javascript>

     Eigenmanniidae                                is_a                        Gymnotiformes

</javascript>

Similarly, The third row in the display shows that Gymnotiformes is an Otophysian as shown below

<javascript>

     Gymnotiformes                                is_a                        Otophysi

</javascript>

Other important tables

  • The ALIAS table keeps track of the various aliases (alternate labels) of the concepts and relations, which are sourced from the ontologies
  • The DESCRIPTION table stores rich text descriptions of the concepts and relations, which are extracted from the ontologies
  • The OBD_SCHEMA_METADATA stores metadata about OBD such as the version of the OBD format in use, and also the last refresh date and time for the database

Views

The Phenoscape data repository also generates several views from the tables. These views are used in querying the database, some of which are part of the OBDAPI

Procedures

Stored procedures are used in populating the database with defined terms from the ontologies, and with phenotypic descriptions obtained from curators. In addition, they are also used in generating inferences from the asserted data. In the future, stored procedures may be used as necessary for speedier data retrieval.

Loading the data

The repository will be periodically refreshed to include the latest ontology definitions and curated data. At present, curated data is obtained from two different source which are:

  1. The ZFIN data repository (model organism database) containing descriptions of mutant phenotypes and the related genes and genotypes of zebrafish. This data exists primarily as tab delimited simple text files
  2. Annotations from a set of selected publications, which describe in rich-text unstructured format, the observed phenotypes of about 25000 different species of fish belonging to the clade of Ostariophysi. These annotations are entered by curators using the Phenex annotation tool and are saved in the [NeXML] data format.

The Phenoscape data loader is in its final stages of deployment on the application development server at NESCent. Following deployment, it can be invoked by the primary stakeholders of Phenoscape and the Phenoscape staff with suitable access privileges.

Querying the data

A rich library of methods has been inherited from the OBD project to execute various generic queries on the database. This is being extended currently to conform to the very specific requirements of the Phenoscape project.

Speeding up the queries: Data warehouse

Queries used in the Phenoscape data services module were found to be intolerably slow in returning, esp. when asked to retrieve and summarize annotation data for genes and teleost species. The slow times in query execution were primarily due to the large numbers of JOINs in them, and the extensive volume of data, which needed to be processed in various facets of the query execution plan.

To address this issue, it was decided to create summaries of the annotations in the database in simple data warehouse tables. New queries which were tested on these summary tables executed much faster, having dispensed with the numerous JOINs between the NODE and LINK tables, aliased several times over.