Edit this page on GitHub

Bigdata

For querying across the entire Knowledgebase dataset, Phenoscape is using the Bigdata RDF triplestore. We selected Bigdata for several reasons:

The data that go into Bigdata are assembled via the KB build process into a graph data model.

Phenoscape machine configurations

Querying with OWL semantics

While OWL can be stored as RDF, thinking at the OWL axiom level can be quite different from thinking at the RDF triple level. While OWL matches up nicely with RDF and SPARQL when dealing with object property assertions for individuals, querying using complex class descriptions is much more straightforward using DL queries to an OWL reasoner rather than via SPARQL, which is designed for matching triple patterns. Querying SPARQL patterns which match pieces of the OWL-to-RDF serialization is error prone and will likely provide incomplete results. It is best to consider that triples using predicates from the OWL namespace are a private implementation detail (e.g. owl:onProperty, owl:someValuesFrom, etc.).

Since Phenoscape annotations are composed largely of complex OWL class descriptions, we use two approaches to make use of OWL semantics within SPARQL queries:

Issues we’ve encountered

Future applications

Semantic similarity

For several applications we would like to query for data items which are semantically similar to each other, meaning they are each annotated with sets of class descriptions which share semantics to a greater degree than with other data items. For example, a gene may be associated with a set of phenotypic expressions such as round and inheres_in some tail; small and inheres_in some eye. What other genes are annotated most similarly? Perhaps another also affects the tail but with a different shape. These semantic profiles could potentially contain just a few or dozens of expressions.

Current implementations of semantic similarity compute scores for profile pairs and can run for several hours. It would be nice to develop a system which could query for similar entities on the fly, providing some sort of match score. Could MPGraph play a role here?