Difference between revisions of "Reasoning over homology statements"
|Line 6:||Line 6:|
Document status: '''in progress'''
Document status: '''in progress'''
== Introduction ==
== Introduction ==
Revision as of 22:12, 22 June 2010
This document provides a collection of inference rules for homology relations. The emphasis is on implementation in OBD, although the rules can be used in other inference systems.
The core iso_homologous_to relation is introduced together with its relation characteristics. This relation is defined in terms of the descended_from relation and its inverse.
Document status: in progress
- 1 test1
- 2 test2
- 3 test3
- 3.1 Introduction
- 3.2 = Mereological Sums
- 3.3 Classification and characteristics of relations
- 3.4 Core Asserted Homology Relations
- 3.5 Recording provenance on asserted statements
- 3.6 Negation
- 3.7 Homology and descent
- 3.8 Modeling homologous-as and ancestry information
- 3.9 Discussion
- 3.10 Recommended Strategy
- 3.11 Future Recommendations
- 3.12 Usage Guidelines
- 3.13 Appendix
Phenoscape is curating homology relationships between anatomical entities across species. It is imperative that the semantics of the homology relation is specified, to ensure annotation consistency and to allow for reasoning.
The background section covers key ontology concepts, in particular those of relevance to homoloy reasoning, such as the distinction between generic classes and mereological sums. A section describing the logical characteristics of relations follows.
The main part of this document presents the core relations intended to be used in phenoscape curation and reasoning. The core relation is the pairwise iso_homologous_to, a strict binary 1:1 homology relation that can be used to specify anatomical entities related by common descent.
I then describe the characteristics of these relations. The emphasis is on the syntax and semantics of OBD, although other ontology information systems or reasoners may be used.
The next section of the document explores ternary (3-argument) relations, and their formalization in terms of a descended_from relation. I explore ways of representing 3-ary relations in OBD on ontology formalisms, including modeling as instances or classes. Modeling in this way is shown to have strong correspondences with both the HOG and MRCA approaches.
Classes and Instances
The distinction between the classes such as "finger" and "left hand finger 1" and individual instances of (e.g. my left hand finger 1) is crucial. Classes are organized in a subsumption (isa) hierarchy.
When making statements about classes, the implications for instances must be explicit.
Named classes and class expressions
A class is essentially a set of instances. Classes can be named (e.g. pre-composed in the ontology) or class expressions (e.g. post-composed).
OBD uses a special ID syntax for class expressions, allowing them to be treated on an almost equal footing to named classes. When describing the basic axioms, we use the OBD ID syntax to denote class expressions (as is used through the phenoscape documentation).
The most common type of class expression used in the context of homology relations is a taxon-restricted anatomical entity class. For readability I will generally write these as:
<taxon label> <anatomical entity label>
These could also be written as an ID expression
<anatomical entity id>^in_taxon(<taxon id>)
cyprinidae vertebra 1 <--> TAO:nnnn^in_taxon(TTO:nnnn)
= Mereological Sums
The distinction between generic classes such as "vertebra" and collection classes such as "vertebral column" is crucial.
- finger 1 is a subclass of finger
- every finger 1 is an instance of a finger
- every finger 1 is part of a collection of fingers on a hand
The FMA uses the "Set of <X>" terminology, but this is confusing because classes can already be equated with sets. We use "collection of here", but we frequently substitute a more convenient synonym. E.g. maximal collection of vertebrae = vertebral column.
For convenience we sometimes use the term "msum" as a functor in a class expression denoting a mereologic sum. So "msum(v1...vn)" denotes a class representing the mereological sum of all vertebrae in an organism. "msum(finger2,finger3)[hand]" denotes the mereological sum of fingers on one hand.
We assume that in general all required mereological sums will be named classes in the ontology. This is already standard practice in TAO. Strictly speaking the axioms defining these are outside the scope of this document since they fall within the purview of the AO. However, more will be said of this in the appendix.
Classification and characteristics of relations
I now present some of the characteristics and means of classification of relations, with an emphasis on the expressive capabilites of OBD.
Binary vs ternary relations
Relations can be classified according to how many arguments they take. Typically relations are binary, but ternary relations may be required.
Homology relations are typically expressed as 3-ary (ternary) relations, read as "A is homologous to B as a C". For example:
|mammalian forelimb||avian wing||forelimb||TRUE|
|mammalian forelimb||teleost pectoral fin||appendage||TRUE|
|avian wing||bat wing||wing||FALSE|
|avian wing||bat wing||forelimb||TRUE|
Unfortunately most ontology information systems do not allow for direct representation of relations with arity greater than 2.
For now I assume the asserted statements are binary, and that the 3rd argument is unstated. I return to the subject of n-ary relations later in this paper.
Axes of classification
There needs to be multiple homology relations, not just one. These relations can be classified along multiple (not necessarily independent) axes:
- 1:1 vs 1:many vs many:many
- inter-species vs intra species
- orthology vs paralogy
- characters and character states: anatomical entities, proteins, DNA
- deep vs "superficial"
Some might include historical homology vs analagy as an additional axis, but here we take the term "homology" to strictly imply common descent.
The focus of this document is anatomical entities as character states. However, we do not want to reinvent the wheel for sequences, and where applicable the same abstractions can be used. This is covered more in the more detailed formalization.
We can derive a full relational lattice by materializing all valid non-redundant combinations of these relation facets. A full exploration of the full cross-product is outside the scope of this document. Instead we select a subset relevant to current Phenoscape use cases. In the future it would be useful to explore other relations in the lattice, particularly with a view to unifying sequence-level and structure-level homology in a knowledge base.
Homology Relations Characteristics
We can describe the properties of various relations according to the following relation characteristics:
- class or instance level
- relation hierarchy
- cardinality constraints
- relation chains
Note that many of these are only well-defined for binary relations.
Class or instance level
In the current implementation of OBD, there is no distinction between class and instance level relations - relations are treated as predicates with semantics supplied on a case by case basis.
In order to improve compatibility with OWL, and to make the semantics of OBD knowledgebases clearer, the following changes will be introduced.
- Relations will be explicitly declared to be class or instance level
- The semantics will be much clearer
See appendix for more details
If a relation R is class level, then a statement
X R Y
Is treated as a binary predicate connecting two classes. There are no default reasoning rules here, but rules for R can be introduced by the KB manager.
If a relation R is instance level, then a statement
X R Y
Is always quantified. The options are:
- all X R some Y
- all X R only Y
- some X R some Y
This quantification is always explicit in OBD (although in th past this has not been used).
Homology Relation Hierarchy
Given a lattic of relations in a relation hierarchy, we can infer more generic statements from more specific ones.
OBD implements the generic rules:
R1 is_a R2 all X R1 some Y ==> all X R2 some Y
(instance level relations)
R1 is_a R2 X R1 Y ==> X R2 Y
(class level relations)
note that obo syntax is followed and the label "is_a" is used for the subrelation_of relation.
OBD implements the generic rule:
symmetric R class_level R X R Y ==> Y R X
(not implemented as yet, but could be added easily)
OBD implements the generic rule:
transitive R X R Y Y R Z ==> Y R Z
transitive R all X R some Y all Y R some Z ==> all Y R some Z
human limb iso_homologous_to mouse limb mouse limb iso_homologous_to teleost fin ==> human limb iso_homologous_to teleost fin
Propagation over isa
OBD implements the generic rules:
W is_a X all X R some Y Y is_a Z ==> all W R some Z
(note that is_a is reflexive, so effectively the first or last is_a is optional)
1:1 homology relations are only 1:1 within taxa.
We can express this as:
A^in_taxon(T) 1:1 B^in_taxon(U) ==> NOT exists C : [ A^in_taxon(T) 1:1 C^in_taxon(U) AND C != B ]
Note that this is not a horn rule, as the head of the clause is negated. However, it can be expressed as a constraint using horn logic. At the moment, OBD does not have a generic mechanism for such constraints, but the constraint can be expressed as a SQL query and executed on demand. In future the OBD reasoner will have an option for automatically checking all specified constraints.
R <- R1 o R2 X R1 Y Y R2 Z ==> X R Z
R <- R1 o R2 all X R1 some Y all Y R2 some Z ==> all X R some Z
Core Asserted Homology Relations
The core relation that a phenoscape curator will be using in homology assertions is the iso_homologous_to relation.
I also consider the serially_homologous_to relation, but more research is required as to if and when this relation would be asserted.
There are a number of support relations that are used for inference and querying. The key relation here is the evolutionary relation descended_from. Even though we may never explicitly assert information about ancestors, this relation turns out to be key for formalization and reasoning.
|human limb||mouse limb|
|human hand digit||mouse hand digit|
|human hand digit 1||mouse hand digit 1|
|human left hand digit 1||mouse left hand digit 1|
|mammalian limb||teleost fin|
|vertebrate frontal bone||teleost parietal bone|
This is a 1:1 relation, which means that an anatomical entity class in one species cannot be related to two different classes in two different species.
See further on in this document for discussion of the homologous-as argument. For now we assume this is present but unstated.
Note that as a class level relation, this does not display propagation over/under is_a. This is desirable consider if we used propagation to infer:
- human hand digit 1 homologous_to human hand digit
- human hand digit 2 homologous_to human hand digit
because of the symmetry property we would infer:
- human hand digit homologous_to human hand digit 1
- human hand digit homologous_to human hand digit 2
this already violates the 1:1 property, and consider the addition of transitivity:
- human hand digit 1 homologous_to human hand digit 2
So the lack of such propagation is desirable. But some kind of propagation is desirable. I return to this issue later on, with the descendant relations
Note in the examples above I include homology statements at different levels of specificity. This is possible, and does not create any incorrect inferences under the semantics specified here.
|mouse vertebra 1||mouse vertebra 2|
|human hand digit 1||human hand digit 2|
|human hand digit 1||human foot digit 1|
|human hand digit 1||human foot digit 2|
This is not a 1:1 relation. This relation is also symmetric and transitive.
Marcotte et al
Recording provenance on asserted statements
Provenance is vital for homology assertions. In OBD these would be created as standard annotations using the built-in OBD reification mechanism. We write this as:
A iso_homologous_to B [S1] S1 has_source PMID:12345
This is trivially convertible to RDF, with subject, predicate and object triples generated for S1.
NOT[A iso_homologous_to B]
Conflict and contradiction
Homology and descent
Thus far we have been treating homology relations as binary. Many treatments of homology include a 3rd "homologous-as" argument.
In order to formalize this we need to introduce a descended_from relation. Later on in this document we attempt to fully define this relation, in terms of genetic specification. However, it is unlikely this definition will need to be used in the short term, so for now I present a high-level treatment, and state the characteristics of this relation.
This relation has the following characteristics:
- inverse_of: has_descendant
This relation obeys the "straight-line" rule:
a descended_from b, a descended_from c ==> b=c or a descended_from c or c descended_from a
This can be encoded in OBD as a constraint.
This relation holds between anatomical entity instances (for example, my left hand and my grandfather's left hand). Usually we would not make statements about instances, but instead would generalize to the class level (e.g. all human hand descended_from some ancestral primate hand).
A is descended from B if A is encoded in a genome GA that is inherited from a genome GB, and B is encoded by GB. This formalization is explored in more detail later on.
Homology as descent
We can provide a definition of the binary homology relation:
A iso_homologous_to B <==> all A descended_from some (has_descendant some B) all B descended_from some (has_descendant some A)
With OBD we can actually make a stronger rule:
A iso_homologous_to B <==> all A descended_from some anc(A,B)
Here anc(A,B) is just shorthand for the ID expression
We use anc(A,B) for convenience in this document, but in OBD it would be stored as the ID expression, which would automatically entail:
all anc(A,B) has_descendant some B all anc(A,B) has_descendant some A
(Note this feature of the OBD reasoner is inexpressible in many rule systems)
We define a new relation descended_from_ancestor_of
simple relation chain:
descended_from o has_descendant -> descended_from_ancestor_of
all X descended_from some (has_descendant some Y) ==> all X descended_from_ancestor_of some Y
Note we can see that this is entailed by iso_homologous_to
X iso_homologous_to Y ==> all X descended_from some anc(X,Y) all anc(X,Y) has_descendant some Y ==> all X descended_from_ancestor_of some Y
Note that this is *not* equivalent to homologous_to. It is an instance-level relation, and does not have the symmetry property.
Note also that because of restrictions on the descended_from relation, this is not a trivial relation that holds between all entities across species.
descended_from_ancestor_of is a little long winded. Polling needed to decide new name. Possibilities include:
Remember this is an instance level relation, so when applies to classes it should be read "all X R some Y"
- descended_from_ancestor_of_part_of <- descended_from_ancestor_of o part_of
- descended_from_ancestor_of_develops_from <- descended_from_ancestor_of o develops_from
Modeling homologous-as and ancestry information
Assuming we wish to represent pairwise homology with the addition of a homologous-as relation, we need some kind of ternary relation, with a 3rd homologous_to argument. We can give this semantics using the descended_from relation.
We have 2 modeling options here:
- Direct n-ary relations
- n-ary relations pattern; which breaks down into
- Homologous groupings
- Common Ancestor (CA) classes
More information on the n-ary relations pattern can be found on the [n-ary relations pattern W3C note http://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/],
direct n-ary relations
Typically a statement involving a 3-argument predicated would be written in FOL something like:
We introduce a syntactic sugar keyword "as" purely for documentation purposes, so we can write the 3-arg version as:
X iso_homologous_to Y as A
We treat the binary relation as equivalent but with an unspecified 3rd argument. Or more precisely:
X iso_homologous_to Y as A ==> X iso_homologous_to Y
X iso_homologous_to Y ==> exists A : X iso_homologous_to Y as A
(this can't be stated in OBD, as the head is an existential)
specifying a more generic A is allowed. i.e.
X iso_homologous_to Y as A A is_a B ==> X iso_homologous_to Y as B
This means we can translate:
X iso_homologous_to Y ==> X iso_homologous_to Y as anatomical entity
(which can be represented in OBD)
We need a slightly modified version of transitivity
X iso_homologous_to Y as A Y iso_homologous_to Z as B A is_a C B is_a C ==> X iso_homologous_to Z as C
X iso_homologous_to Y as A ==> Y iso_homologous_to Z as A
Finally, we can suuply a definitional rule:
X iso_homologous_to Y as A <==> all X descended_from some (A and has_descendant some Y) all Y descended_from some (A and has_descendant some X)
We can also use a skolem ID to name the descendant class:
X iso_homologous_to Y as A ==> all X descended_from some anc(X,Y) anc(X,Y) has_descendant some X anc(X,Y) is_a A
Unfortunately 3-ary predicates cannot be directly implemented easily in OBD, RDF or OWL.
One way is to use RDF reification:
X iso_homologous_to Y as A ==> X iso_homologous_to Y [S1] S1 homologous_as A
This is a little awkward.
OBD has some experimental support for more direct n-ary relations, this could be extended.
Another approach is to use an n-ary relations pattern. We present two forms of this, one involving creating a grouping entity or class that subsumes or collects the homology elements. The second involves creating a class representing an ancestral entity or set of entities.
In either case, the above rules could still be useful for a formalization that is not reasoned over directly.
I explore two similar variants of homology groupings. Both of these are compatible with the existing HOG ontology.
The first variant involves grouping entities. For documentation purposes I will name these as:
But these could have numeric IDs if required. These entities would be information artefacts.
OBD would contain triples following this template:
H instance_of HomologyGrouping H has_member TE1 H has_member TE2 H homologous_as AE
mammalian_forelimb_avian_wing_homology instance_of HomologyGrouping mammalian_forelimb_avian_wing_homology has_member mammalian_forelimb mammalian_forelimb_avian_wing_homology has_member avian_wing mammalian_forelimb_avian_wing_homology homologous_as forelimb
This translates to the direct n-ary form as:
H instance_of HomologyGrouping H has_member TE1 H has_member TE2 H homologous_as AE ==> TE1 homologous_to TE2 as AE
Additional provenance triples could be added:
mammalian_forelimb_avian_wing_homology has_source PMID:12345
Note that this communicates the same information as the 3-ary relation above. In fact there is a general pattern called the [n-ary relations pattern http://www.w3.org/TR/2006/NOTE-swbp-n-aryRelations-20060412/], and this can be seen as a straightforward application of the pattern. It is sometimes called reification (which can lead to confusion, as RDF uses reification in a more restricted sense, and OBD follows this usage).
The main rule here is a macro rule for rewriting the grouping as a pair of axioms:
H has_member X H has_member Y H homologous_as A ==> all X descended_from some (A and has_descendant some Y)
Note we do not need to add a second head clause for the symmetric case, because there is no ordering of the members in the has_member axioms
(we may optionally want a clause to avoid reflexivity)
Note that there is nothing that restricts us to pairwise homology here. The macro expansion rule will work even though it looks pairwise, as it applies to all pairs. However, we may wish to make a more efficient axiom:
H has_member X1 ... H has_member Xn H homologous_as A ==> all X1 descended_from some anc(X1..Xn) all anc(X1..Xn) has_descendant some X1
Homology grouping classes
One variant of this is to treat the homology groupings not as information entities, but as anatomical entity classes:
mammalian_forelimb is_a mammalian_forelimb_avian_wing_homology avian_wing is_a mammalian_forelimb_avian_wing_homology mammalian_forelimb_avian_wing_homology homologous_as forelimb
Note the explicit subsumption relationship.
We can still add provenance information:
mammalian_forelimb_avian_wing_homology has_source PMID:12345
X is_a H Y is_a H H homologous_as A ==> all X descended_from some (A and has_descendant some Y) all Y descended_from some (A and has_descendant some X)
Again, we can make this accept arbitrary numbers of members in the grouping, not just pairs:
X is_a H H homologous_as A ==> H is_a A all X descended_from some anc(H) all anc(H) has_descendant some X
Note that this can be treated as a formalization of the HOG ontology. Although HOG leaves out the "homologous_as" argument, this can be treated as a default "anatomical entity" or the ontology LCS of all grouped classes.
This turns out to be very similar to the homologous grouping class approach. The difference is that rather than represent a class that subsumes all homologs, we make a class that subsumes all ancestors of that class:
ALL mammalian_forelimb descended_from SOME mammalian_forelimb_avian_wing_ancestor ALL avian_wing descended_from SOME mammalian_forelimb_avian_wing_ancestor mammalian_forelimb_avian_wing_ancestor is_a forelimb
The general template for pairwise homology assertions is:
all TE1 descended_from some AE all TE2 descended_from some AE AE is_a GE
The macro rule becomes much simpler - simply adding class-level symmetry
all X descended_from some A ==> all A has_descendant some X
Note that additional biological information can be added to the descended_from statements. For example, if a temporal range is known, this can be added. Or if the ancestral class has a temporal range, this too can be added.
Comparison of approaches
With the rules provided, all above approaches are inter-convertible, although going is some directions may entail some information loss. In fact with a system such as OBD, all approaches can co-exist in the same KB (albeit with additional adminstrative burden, and potential loss of efficiency due to storing redundant data).
- All forms can be interconverted via OBD-expressable rules, but there may be info loss
- Simple binary associations contain the least information
- Both HOGs and CA axioms allow expressing something about the ancestor
- CA axioms explicitly represent the line of evolutionary descent and the ancestor
- Simple binary relations might be the simplest relations to work with
- Symmetry is implicit with HOGs and CAs
- HOGs and CAs naturally extend beyond pairwise homology
The (MR)CA approach was initially criticized on the grounds that homology information should not be stored in an ontology. But as far as OBD and every RDF or OWL system is concerned, everything is an ontology. We can trivially add the all-important provenance axioms to either the ancestor class or to the descended_from statements.
From an ontology management perspective, the CA axioms can be managed separately from the homology-neutral ontology. This is perhaps where the difference in strategy lies. A purist MRCA approach may avoid a homology-neutral ontology, and utilize purely taxonomic leaf anatomical entity classes and classes representing common ancestors.
The CA approach offers potential benefits - for example, temporal ranges and other metadata can be assigned to the descendant triples.
Consequences of omitting arg3
The 3rd argument can be regarded as optional for iso_homologous_to. This is because this relation is already heavily constrained by the 1:1 characteristic.
In contrast, arg3 is more useful for non 1:1 homology. Consider serial homology. With pairwise relations we have statements of the form
- hand digit 3 serially homologous to hand digit 4
- hand digit 3 serially homologous to foot digit 4
- hand digit 3 serially homologous to hand digit 4 as hand digit
- hand digit 3 serially homologous to foot digit 4 as digit
Phenoscape is currently collecting a small number of pairwise binary homology statements. On examination, these appear to be promotable to stronger iso_homologous_to statements, as they all have the 1:1 property.
It is not clear that there is an immediate benefit to recording the homologous_as argument, or if this is even possible in all cases.
If pairwise binary statements are asserted, the OBD reasoner can be instructed to materialize the implicit ancestral classes, and the descendant relations. For example:
forelimb_digit_1[mouse] iso_homologous_to forelimb_digit_1[human] ==> all forelimb_digit_1[mouse] descended_from some forelimb_digit_1[mouse+human] all forelimb_digit_1[mouse+human] has_descendant some forelimb_digit_1[mouse]
(we don't need to write a symmetric rule, as we already have the class-level symmetry rule for iso_homologous_to).
The expanded axioms are all-some statements and thus have the expected propagation over and under is_a.
We can also name the composition of descended_from and has_descendant. This will give us inferences such as:
all forelimb_digit_1[mouse] descended_from some forelimb_digit_1[mouse+human] all forelimb_digit_1[mouse+human] has_descendant some forelimb_digit_1[mouse] ==> all forelimb_digit_1[mouse] descended_from_ancestor_of some forelimb_digit_1[human]
Because of the all some rule we infer further statements such as:
all left_forelimb_digit_1[mouse] descended_from_ancestor_of some forelimb_digit_1[human] all forelimb_digit_1[mouse] descended_from_ancestor_of some forelimb_digit[human] all forelimb_digit_1[mouse] descended_from_ancestor_of some digit[human] all forelimb_digit_1[mouse] descended_from_ancestor_of some anatomical_entity[human] all forelimb_digit_1[mouse] descended_from_ancestor_of some digit[vertebrate] all left_forelimb_digit_1[mouse] descended_from_ancestor_of some digit[vertebrate]
As can be seen, this could potentially lead to a large number of inferred statements. The pattern here is one of "up-along-up" - we go up the is_a hierarchy, along the class-level homologous_to, and then up the is_a hierarchy.
If the named subset of the cross-product of taxon x anatomical entity is large, then the number of inferred statements will be large. If this becomes problematic, then some of the reasoning can be done by backward chaining (although this is not yet implemented in the OBD reasoner, non-recursive backward chaining horn rules can be implemented simply as SQL views that mimic statements).
Assert homology at the appropriate level
For example, if tax1 has a bone B that develops from a cartilage element C, and tax2 has a homologous element but is purely cartilaginious, then we can annotate either of the following:
- tax1-B iso_homologous_to tax2-C
- tax1-C iso_homologous_to tax2-C
But a better strategy might be name a skeletal element superclass E, such that B and C are subclasses of E, and then declare
- tax1-E iso_homologous_to tax2-E
From this we would infer the 2 axioms above
Assume that v1 and
Mereological Sums and Fusions
- human mc3 po.df.hd [some] cow cannon bone