Matching Phenotypes
This page discusses the method developed and implemented in 2010-2011 for searching for, and scoring, phenotype matches between taxa (Phenoscape) and zebrafish mutants (ZFIN).
Contents
- 1 Purpose
- 2 Selecting Phenotypes from Taxon Annotations
- 3 Using attributes to limit the scope of quality comparisons
- 4 Scoring matches of individual phenotypes
- 5 Scoring matches of sets of phenotypes (phenotypic profiles) associated with a gene or taxon node
- 6 Information Content as a metric of term relatedness
Purpose
An important goal for the Phenoscape project is to be able to suggest candidate genes that may have contributed to evolutionary change. The way that we have proposed to do this is to search for changes in phenotype that appear as the result of mutations in model organisms and also appear as phenotype changes on an evolutionary tree.
Selecting Phenotypes from Taxon Annotations
The matching process involves matching changes in phenotype, not directly matching phenotypes. For phenotypes associated with mutants of model organisms, it is understood that they vary with respect to the wild type. For taxa, however, this means looking for taxonomic nodes where variation in a phenotype is observed among the children of the node. For example, there are nine species within the genus Aspidoras with annotations for the shape of the opercle bone. Of these, eight exhibit opercle bones with round shape, but the ninth (A. pauciradiatus) is annotated with a triangular opercle. In contrast, all three annotated species of the related Hoplosternum are annotated with a triangular opercle. Thus there is detectable variation in opercle shape within the children of Aspidoras, but not within Hoplosternum, suggesting that a change in opercle shape has occurred somewhere among the descendants of Aspidoras. Once changes are identified, they are treated as variation in the affected entity at the level of the attribute parent of the qualities involved (e.g., shape).
In more detail, variation is inferred by dividing up the set of phenotypes for each child taxon into subsets that share entities and have qualities that are subsumed by the same attribute (a quality that comes from the set of attribute qualities discussed in the Guide_to_Character_Annotation). A child taxon may have phenotypes in multiple groups, in one group, or may have no phenotypes at all. Then, for each group represented in the phenotypes for any child taxon, both the union and the intersection of all the sets from child taxa that fall in the (Entity-Attribute) group are formed. If a child taxon has no phenotypes at all, it is ignored. If the child taxon has phenotypes in any group, then the absence of a phenotype in the current group is treated as an empty set in the union and intersection calculations. Variation within a group is determined by taking the set difference of the union and the intersection of the sets from child taxa. If the difference is empty, then the union and intersection are the same, which means all child taxa have the same phenotypes within the group, so there is no variation. Otherwise, the union and intersection differ and at least one child taxon has a different set of phenotypes. In this case, the parent taxon is flagged as having variation within the Entity-Attribute group.
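The union/intersection test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: phenotypes are represented as hypothetical (entity, attribute, quality) tuples, the function name find_variation is invented here, and the unnamed species labels are placeholders.

```python
from collections import defaultdict

def find_variation(children):
    """Return the (entity, attribute) groups that vary among child taxa.

    `children` maps a taxon name to a set of (entity, attribute, quality)
    tuples. This data layout is an assumption made for the sketch.
    """
    # Children with no phenotypes at all are ignored: nothing is known.
    annotated = {taxon: phens for taxon, phens in children.items() if phens}

    # Collect the qualities each annotated child has in each group.
    groups = defaultdict(dict)
    for taxon, phens in annotated.items():
        for entity, attribute, quality in phens:
            groups[(entity, attribute)].setdefault(taxon, set()).add(quality)

    varying = set()
    for group, by_taxon in groups.items():
        # An annotated child with nothing in this group contributes an empty set.
        sets = [by_taxon.get(taxon, set()) for taxon in annotated]
        union = set().union(*sets)
        intersection = set.intersection(*sets)
        if union - intersection:  # non-empty difference means variation
            varying.add(group)
    return varying

# Toy data modelled on the opercle example; unnamed species are placeholders.
aspidoras = {f"Aspidoras sp. {i}": {("opercle", "shape", "round")}
             for i in range(1, 9)}
aspidoras["A. pauciradiatus"] = {("opercle", "shape", "triangular")}
hoplosternum = {f"Hoplosternum sp. {i}": {("opercle", "shape", "triangular")}
                for i in range(1, 4)}

print(find_variation(aspidoras))     # the opercle/shape group varies
print(find_variation(hoplosternum))  # no varying groups
```

Because an annotated child that lacks a group contributes an empty set, a phenotype recorded in only one child is reported as variation, while a child with no annotations at all has no effect on the result.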
This approach to detecting variation within entity-attribute groups meshes well with the matching approach described in the next section. Note that ignoring taxa with no phenotypes reflects that nothing is known about such taxa, and the presence of such taxa in the taxonomic tree should not affect the process of detecting change. The case where a taxon is annotated, but not in a particular group, is taken, at present, to indicate variation that might not have been captured, either in the original publication matrix or in the curation process. For example, the absence of a basihyal bone is observed in Siluriformes, but there is no annotation regarding its presence or absence in sibling taxa, where it is, in fact, present. Thus, the noted absence in one child and the lack of mention elsewhere are used to infer that variation is present in this character.
However, identifying variation also requires propagation of phenotype sets up the hierarchy. Continuing with the previous example, what variation could be assigned to the family-level parent of Aspidoras, Callichthyidae? Variation will depend on the phenotypes assigned to each genus in the family and how those are propagated to the family. We initially propagated the union sets within each entity-attribute group to higher taxa and added any phenotypes (few if any) annotated directly to the higher taxa. This seems reasonable, as the union sets capture the full set of phenotypes found within the group. However, we found that this tended to inflate the amount of variation reported, to the point that the base of the tree showed variation in nearly every entity-attribute group that subsumed phenotypes annotated in the KB.
Changing to propagating intersection sets appears to reduce this inflationary effect, and we are currently using it as our propagation method. There are some issues related to monotypic outgroups that we are still investigating.
Using attributes to limit the scope of quality comparisons
Because the interest is in change in phenotype, phenotypes are 'reduced' to a change in an attribute of an anatomical entity, not a particular quality of a phenotype. Although subsuming qualities all the way to their attributes is not, in principle, necessary (for example, round and triangular are subsumed by 2-D shape), using a consistent set of subsumers simplifies the matching process.
Therefore, matching phenotypes is reduced to matching the entity component of changes that have a common attribute (so shape changes and color changes will never be matched).
ZFIN and Phenoscape use different (though partially overlapping) sets of qualities (from the PATO ontology) in constructing phenotype annotations.
Scoring matches of individual phenotypes
Each phenotype is linked to multiple entities via inheres_in_part_of relations. This relation links the phenotype to the entity it names and to that entity's is_a and part_of parents. The reasoner infers the closure of this relation, which means the KB contains inferred inheres_in_part_of relations between the phenotype and all the is_a and part_of parents of the entity in the phenotype.
Matching is based on building a set of class expressions, subsumed by quality, that subsume the phenotype. This set of class expressions includes all the phenotypes formed from an entity from the set of inheres_in_part_of parents and a quality that subsumes the quality specified in the phenotype. It also includes the quality terms that are subsumers (via is_a) of the quality in the phenotype. Because this set includes both phenotype expressions and qualities, it is best seen as a set of class expressions.
The actual matching consists of taking the sets of subsuming class expressions from each of the phenotypes to be matched and forming the intersection, which is the set of common subsumers. Each of these subsuming expressions can be assigned an information content (IC). The match is scored as the IC of the common subsumer with the highest information content.
Note that under this scheme, an exact match may still score poorly if the selected common subsumer refers to an entity that is high up in the anatomy hierarchy. For example, round inheres_in bone matched against itself will have a lower score than straight inheres_in neural spine 1 matched against curved inheres_in neural spine 2, because the latter two share the subsumer shape inheres_in neural spine, which will have a higher information content than round inheres_in bone.
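The scoring scheme in this section can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy entity and quality hierarchies, the eight-annotation corpus, and the function names are hypothetical, and IC is taken to be the usual negative log (base 2) of the fraction of annotations an expression subsumes, which this page does not itself spell out.

```python
import math
from itertools import product

# Hypothetical hierarchies, given as reflexive ancestor sets
# (is_a / part_of parents for entities, is_a parents for qualities).
ENTITY_ANCESTORS = {
    "bone": {"bone"},
    "opercle": {"opercle", "bone"},
    "neural spine": {"neural spine", "bone"},
    "neural spine 1": {"neural spine 1", "neural spine", "bone"},
    "neural spine 2": {"neural spine 2", "neural spine", "bone"},
}
QUALITY_ANCESTORS = {
    "quality": {"quality"},
    "shape": {"shape", "quality"},
    "round": {"round", "shape", "quality"},
    "straight": {"straight", "shape", "quality"},
    "curved": {"curved", "shape", "quality"},
}
# Hypothetical annotation corpus of (quality, entity) phenotypes.
ANNOTATIONS = (
    [("round", "opercle")] * 3
    + [("round", "bone")] * 3
    + [("straight", "neural spine 1"), ("curved", "neural spine 2")]
)

def subsumers(phenotype):
    """Class expressions subsuming a phenotype; (q, None) is a bare quality."""
    quality, entity = phenotype
    qualities = QUALITY_ANCESTORS[quality]
    entities = ENTITY_ANCESTORS[entity]
    return set(product(qualities, entities)) | {(q, None) for q in qualities}

def information_content(expression):
    """Assumed IC: -log2 of the fraction of annotations the expression subsumes.

    Assumes the expression subsumes at least one annotation, which holds
    for common subsumers of phenotypes drawn from the corpus.
    """
    hits = sum(expression in subsumers(a) for a in ANNOTATIONS)
    return -math.log2(hits / len(ANNOTATIONS))

def match_score(p1, p2):
    """IC of the most informative common subsumer of two phenotypes."""
    common = subsumers(p1) & subsumers(p2)
    return max(information_content(e) for e in common)

exact = match_score(("round", "bone"), ("round", "bone"))
specific = match_score(("straight", "neural spine 1"),
                       ("curved", "neural spine 2"))
print(exact, specific)
```

With this toy corpus the exact match on bone scores below the inexact match between the two neural spines, because the expressions the neural-spine phenotypes share subsume far fewer annotations.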
Extending matches above the level of Entity-Attribute
Scoring matches of sets of phenotypes (phenotypic profiles) associated with a gene or taxon node
The Washington et al. (2009) paper used four measures to assess similarity in phenotype profiles (sets of phenotypes associated with a gene). These are:
- maxIC - the greatest IC of any pair of phenotypes in the set of pairs of phenotypes where one is drawn from the taxon set and the other from the gene set.
- ICCS - each taxon phenotype is matched against all gene phenotypes and the highest IC value among the matches for each taxon phenotype is collected. The final score is the mean of these highest matches.
- simIC - the sum of the IC values for each pair of matching phenotypes (e.g., phenotypes that share attributes) is divided by the sum of the IC values for each phenotype (in either taxon or gene profile) individually (currently implemented as matching the phenotype against itself). C. Mungall (pers. comm.) indicates that this metric is symmetric.
- simJ - the number of shared phenotypes is divided by the sum of the number of phenotypes in each set. Note: the numerator could be either the number of exact matches (e.g., at the level of qualities) or matches at the level of attribute (exact match entity, qualities have common attribute).
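As a rough illustration of how three of these measures follow from pairwise match scores, here is a minimal Python sketch. The profiles, the score table, and the function names are hypothetical. simJ uses exact matches in the numerator, which is only one of the two options noted above. simIC is left out because the pairing rule given here leaves room for more than one reading.

```python
from statistics import mean

def max_ic(taxon, gene, score):
    """Greatest match score over all taxon/gene phenotype pairs."""
    return max(score(t, g) for t in taxon for g in gene)

def iccs(taxon, gene, score):
    """Mean, over taxon phenotypes, of the best match against the gene profile."""
    return mean(max(score(t, g) for g in gene) for t in taxon)

def sim_j(taxon, gene):
    """Shared phenotypes divided by the summed profile sizes (exact matches)."""
    return len(set(taxon) & set(gene)) / (len(set(taxon)) + len(set(gene)))

# Hypothetical profiles of (quality, entity) phenotypes.
taxon_profile = [("round", "opercle"), ("curved", "neural spine")]
gene_profile = [("round", "opercle"), ("absent", "basihyal")]

# Hypothetical pairwise match scores (IC of the best common subsumer).
TOY_SCORES = {
    (("round", "opercle"), ("round", "opercle")): 3.0,
    (("round", "opercle"), ("absent", "basihyal")): 0.5,
    (("curved", "neural spine"), ("round", "opercle")): 1.0,
    (("curved", "neural spine"), ("absent", "basihyal")): 0.0,
}

def toy_score(taxon_phenotype, gene_phenotype):
    return TOY_SCORES[(taxon_phenotype, gene_phenotype)]

print(max_ic(taxon_profile, gene_profile, toy_score))  # 3.0
print(iccs(taxon_profile, gene_profile, toy_score))    # 2.0
print(sim_j(taxon_profile, gene_profile))              # 0.25
```

Note that ICCS as written is computed from the taxon side only, so swapping the two profiles can change the result.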
Although these seem appropriate for scoring similarity of phenotype profiles associated with genes, it is less clear that they are appropriate for comparing phenotype profiles associated with changes between taxa (or phylogenetic change).
Alternatives
Because of its dependence on the set of annotations, there is a real concern that IC may not be the most appropriate measure of similarity between terms in an ontology with associated annotations.
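The dependence on the annotation set is easy to see in a small numerical sketch. It assumes the common definition of IC as the negative log (base 2) of the fraction of annotations subsumed by a term; the counts are invented for illustration.

```python
import math

def information_content(subsumed, total):
    """Assumed IC definition: -log2 of the fraction of annotations subsumed."""
    return -math.log2(subsumed / total)

# Hypothetical counts: a term subsumes 2 of 8 annotations.
before = information_content(2, 8)
# Eight further annotations to the same term are curated (10 of 16).
after = information_content(10, 16)

print(before, after)
```

The term and the ontology are unchanged between the two calls, yet the IC drops from 2.0 to roughly 0.68 simply because more annotations were added.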