Bigdata

From phenoscape

Revision as of 13:47, 19 March 2014

For querying across the entire Knowledgebase dataset, Phenoscape is using the Bigdata RDF triplestore. We selected Bigdata for several reasons:

  • Top SPARQL query performance among open-source triplestores
  • Support for the SPARQL 1.1 query language. This is required for aggregates such as COUNT. Property paths also provide basic transitivity reasoning at query time.
  • Embedded full-text index available within SPARQL queries.
  • Concise bounded description mode for SPARQL DESCRIBE queries. When blank nodes are included in a DESCRIBE result, this mode recursively describes them until the graph terminates in named nodes in all directions. This is useful for retrieving exactly the RDF graph needed to reconstruct OWL class expressions.
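
To make the concise bounded description idea concrete, here is a minimal sketch in plain Python over an in-memory set of triples. It is not Bigdata's implementation: it only follows blank nodes in the object position (the forward direction), it ignores reification, and the prefixed names, the "_:" convention for blank nodes, and the toy data are all assumptions made for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    # Convention used only in this sketch: blank node labels start with "_:".
    return term.startswith("_:")


def concise_bounded_description(graph: Set[Triple], start: str) -> Set[Triple]:
    """Collect triples whose subject is `start`, then recurse into blank-node objects."""
    result: Set[Triple] = set()
    pending = [start]
    visited = set()
    while pending:
        node = pending.pop()
        if node in visited:
            continue
        visited.add(node)
        for triple in graph:
            subject, _predicate, obj = triple
            if subject == node:
                result.add(triple)
                # Named objects end the walk; blank objects are described in turn.
                if is_blank(obj):
                    pending.append(obj)
    return result


# Toy data (hypothetical): ex:fin_ray is a subclass of (part_of some fin).
GRAPH = {
    ("ex:fin_ray", "rdfs:subClassOf", "_:r1"),
    ("_:r1", "rdf:type", "owl:Restriction"),
    ("_:r1", "owl:onProperty", "ex:part_of"),
    ("_:r1", "owl:someValuesFrom", "ex:fin"),
    ("ex:fin", "rdfs:label", "fin"),
}

description = concise_bounded_description(GRAPH, "ex:fin_ray")
```

The result contains the subclass triple and the three triples of the restriction, which is what is needed to rebuild the class expression. The label of ex:fin is left out because ex:fin is a named node.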

Phenoscape machine configurations

  • SPARQL endpoint using the Bigdata NanoSparqlServer within Apache Tomcat. This machine runs Ubuntu Linux, with 4 CPUs and 8 GB RAM. Tomcat is allowed 6 GB RAM.
    • This database holds about 35 million triples.
  • owlet service which expands SPARQL queries using the ELK reasoner and submits them to the preceding Bigdata endpoint. This machine runs Ubuntu Linux, with 4 CPUs and 16 GB RAM. Tomcat is allowed 12 GB RAM.
    • The ELK instance holds about 750,000 logical axioms.
  • Direct loading of Bigdata to answer long-running, high-memory queries on the DSCR cluster. We have machines with many CPUs and over 100 GB RAM.

Querying with OWL semantics

Although OWL can be stored as RDF, thinking at the level of OWL axioms can be quite different from thinking at the level of RDF triples. OWL matches up nicely with RDF and SPARQL when dealing with object property assertions for individuals. Querying with complex class descriptions, however, is much more straightforward as a DL query to an OWL reasoner than as a SPARQL query, because SPARQL is designed for matching triple patterns. Embedding SPARQL patterns that match pieces of the OWL-to-RDF serialization is error-prone and will likely return incomplete results. It is best to treat triples that use predicates from the OWL namespace (e.g. owl:onProperty, owl:someValuesFrom) as a private implementation detail.

Since Phenoscape annotations are composed largely of complex OWL class descriptions, we use two approaches to make use of OWL semantics within SPARQL queries:

  • Generated sets of named classes for useful expression patterns. These include annotation property relations which make it easier to query these "class-level relationships" from SPARQL.
    • Existential restriction hierarchies, such as (part_of some X), where X is every anatomical structure
    • Absence classes
  • owlet, which allows embedding of arbitrary DL queries into SPARQL. This allows an unlimited number of possible expressions, but precludes using variables in the class expression.
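
The general idea of expanding an embedded DL query before the SPARQL query runs can be sketched as follows. This is not owlet's code or syntax: the `ex:dl` datatype marker, the `expand_query` function, and the lookup-table "reasoner" are hypothetical stand-ins chosen for illustration, and a real service would ask a reasoner such as ELK for the subclasses.

```python
import re
from typing import Callable, List

# Hypothetical marker for an embedded class expression; not owlet's real syntax.
DL_PATTERN = re.compile(r'(\?\w+)\s+rdfs:subClassOf\s+"([^"]+)"\^\^ex:dl\s*\.')


def expand_query(query: str, subclasses_of: Callable[[str], List[str]]) -> str:
    """Replace each embedded DL pattern with a VALUES block of named classes."""

    def replace(match: "re.Match[str]") -> str:
        variable, expression = match.group(1), match.group(2)
        iris = " ".join(f"<{iri}>" for iri in subclasses_of(expression))
        return f"VALUES {variable} {{ {iris} }}"

    return DL_PATTERN.sub(replace, query)


def fake_reasoner(expression: str) -> List[str]:
    # Stand-in for a real reasoner: a fixed table of made-up answers.
    answers = {
        "part_of some fin": [
            "http://example.org/fin_ray",
            "http://example.org/fin_spine",
        ]
    }
    return answers.get(expression, [])


QUERY = '''SELECT ?taxon WHERE {
  ?structure rdfs:subClassOf "part_of some fin"^^ex:dl .
  ?taxon ex:has_part ?structure .
}'''

expanded = expand_query(QUERY, fake_reasoner)
```

In this sketch the class expression is handed to the reasoner as a complete string before the triplestore sees the query, which suggests why a SPARQL variable cannot appear inside the expression: it has no binding yet at expansion time.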

Issues for consideration

  • Full-text searches are considerably slower when more than one property is considered. It would be very useful to be able to search with a pattern such as ?s rdfs:label|obo:related_synonym ?label, but this is not feasible in practice.
  • Some simple queries can be quite slow, for example counting the number of triples in the database. Virtuoso appears to have a special optimization for this, since it returns the answer instantly.
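
For reference, the triple-counting query mentioned above can be written with a SPARQL 1.1 aggregate and sent using the standard SPARQL protocol. The sketch below only builds the HTTP request with the Python standard library and does not send it. The endpoint URL is an assumption and should be replaced with the address of the actual endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint URL; substitute the address of your own SPARQL endpoint.
ENDPOINT = "http://localhost:8080/bigdata/sparql"

# Counts every triple in the default graph.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_count_request(endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL protocol POST request carrying the count query."""
    body = urlencode({"query": COUNT_QUERY}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
    )


request = build_count_request()
# Sending it would look like: urllib.request.urlopen(request).read()
```

Timing this request against different triplestores loaded with the same data is one simple way to compare how each handles the count.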