Difference between revisions of "Data Jamboree 2/Notes"
Revision as of 14:42, 8 October 2008
- 1 User interface
- 2 Taxon Concepts subgroup meeting (Monday, Sept. 29, 2008)
- 3 Notes from Data Curation Sessions
- 4 Wrap-up discussion with curators
- 5 Curation consistency experiment: Review of Results
User interface demonstration
DATE? SUNDAY MEETING OR DO THESE NOTES ALSO APPLY TO MEETING WITH INDIVIDUALS IN PAIRS ON MONDAY/TUESDAY/WED?
- Jim Balhoff suggested enabling drill-down from higher taxa to lower ones. Typically, a query for annotations will yield more results at higher levels in the taxonomy. Drilling down into the lower levels will prune the results and narrow them to the user's exact requirements. Monte Westerfield and Paula Mabee seconded.
- Mark Sabaj suggested using a fish landscape with different predefined areas for visualization of results and guiding the search. Paula Mabee and John Lundberg approved.
- Todd Vision suggested displaying only those nodes of a taxonomy that have been annotated
- Judith Blake suggested using Cytoscape to browse through various nodes in the tree
- Monte Westerfield suggested linking phenotypes to genes
- Todd Vision suggested character correlations
- Todd Vision suggested using auto composition of search terms
- Monte Westerfield suggested using Boolean combinations of query parameters. Seconded by Judith Blake
- option for hierarchical indexing of results (taxonomy, phylogeny, but also anatomy ontology)
- mapping characters on a tree: multiple phenotypes may match any particular query
- map to different colors for indicators? use numbers as indexes?
- mapping phenotypes onto trees cannot typically reconstruct character state changes, and hence traditional visualizations may be misleading?
- ability to prune species with no data (values) for export
- search interface: ability to combine taxon/entity/quality specifications (and, or, not)
- graph navigation: Dbgraphnav, Cytoscape
- clickable fish image for starting navigation
- most common entry point is likely to be a simple one-field form for entering terms
- phenotype query prototype: how do I get from here to the genes?
- ability to see correlations between phenotypes
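The Boolean combination of taxon/entity/quality specifications (and, or, not) suggested in the list above could be sketched roughly as follows. This is a minimal illustration only: the `Annotation` record, the predicate helpers, and the sample data are hypothetical and do not come from any Phenoscape code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical flat record for one taxon/entity/quality annotation."""
    taxon: str
    entity: str
    quality: str


Predicate = Callable[[Annotation], bool]


def field_is(field: str, value: str) -> Predicate:
    """Match annotations whose given field equals the given value."""
    return lambda a: getattr(a, field) == value


def and_(*preds: Predicate) -> Predicate:
    return lambda a: all(p(a) for p in preds)


def or_(*preds: Predicate) -> Predicate:
    return lambda a: any(p(a) for p in preds)


def not_(pred: Predicate) -> Predicate:
    return lambda a: not pred(a)


def search(annotations: Iterable[Annotation], pred: Predicate) -> List[Annotation]:
    """Return the annotations that satisfy the combined predicate."""
    return [a for a in annotations if pred(a)]


# Made-up sample data, for illustration only.
SAMPLE = [
    Annotation("species_1", "dorsal fin", "shape"),
    Annotation("species_1", "maxilla", "present"),
    Annotation("species_2", "dorsal fin", "size"),
]

# entity = "dorsal fin" AND NOT quality = "size"
query = and_(field_is("entity", "dorsal fin"), not_(field_is("quality", "size")))
hits = search(SAMPLE, query)
```

Composing small predicates keeps a simple one-field search and a full Boolean query on the same code path, which fits the observation that most users start from a simple form.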
MGI batch query demo
- users don't use complex query forms
- auto-detect type of input tokens
- allow download in different formats
- computationally savvy users
- pre-written SQL as available from GO website
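The "auto-detect type of input tokens" idea from the batch query demo could look something like the sketch below. The regular expressions are simplified assumptions about what ZFIN identifiers, DOIs, and OBO-style term IDs look like, and the sample tokens are made up; this is not how MGI or ZFIN actually implement their batch forms.

```python
import re
from typing import List, Tuple

# Simplified, assumed patterns; real identifier schemes may be broader.
PATTERNS = [
    ("zfin_id", re.compile(r"^ZDB-[A-Z]+-\d{6}-\d+$")),
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("ontology_id", re.compile(r"^[A-Za-z_]+:\d+$")),
]


def detect_token_type(token: str) -> str:
    """Guess the kind of a single input token; fall back to free text."""
    token = token.strip()
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "free_text"


def classify_batch(text: str) -> List[Tuple[str, str]]:
    """Split pasted input on commas, tabs, and newlines, then classify each token."""
    tokens = [t.strip() for t in re.split(r"[,\n\t]+", text)]
    return [(t, detect_token_type(t)) for t in tokens if t]


# Made-up identifiers, not checked against any database.
batch = classify_batch("ZDB-GENE-000000-1, TAO:0000000\n10.1000/example, dorsal fin")
```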
User interface strategy
- Most common entry gateways: search by
- gene (from ZFIN)
- anatomical entity
- Use case: find evolutionary phenotypes that match a mutant ZFIN phenotype
- Query by phenotype, result is species and ZFIN mutants that have matching annotations
- Alternatively, query by phenotype profile (several phenotypes)
- This could be retrieved by ZFIN mutant rather than typing them all in
- Inverting this use case: find ZFIN mutants for a set of phenotypes that differs between two taxa
- Query by two taxa, pull out all phenotype annotations that aren't annotated to both taxa (each of which may be a clade)
- User can remove phenotype annotations from that result to create the search profile
- Use case: find matching taxa and/or genes for a phenotype or phenotype profile
- Query by anatomical entity to retrieve matching phenotypes ([Q]ualities, essentially)
- Build phenotype profile from that (choosing or removing phenotype annotations)
- Summarization of results:
- Number of matching taxa, ZFIN genes, publications
- Search by anatomy term
- obtain the qualities (and entities if query is higher-level) for the query term
- use this to build a query profile for obtaining matching ZFIN genes
- Search by taxon:
- results in phenotypes annotated to this taxon
- list of publications, possibly as a secondary result after narrowing down phenotype list by anatomy term
- should be able to see where in the tree the currently selected taxon is
- Search publications:
- Search by author, by taxon, by anatomical term
- Search by identifier (doi)
- Search by ZFIN gene (or mutant)
- Results in phenotypes annotated to this mutant
- Use these to retrieve publications describing such phenotypes (regardless of taxon or gene)
- Obtain the number of these phenotypes that are used for evolutionary annotation (i.e., annotated to - presumably normal - taxa)
- Summarizing data per publication
- number of taxa and/or anatomical entities or phenotypes matching the original query, compared to overall number of taxa or anatomical entities or phenotypes
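The "inverted" use case in the strategy list above (query by two taxa, pull out the phenotype annotations that are not annotated to both, then let the user trim the result into a search profile) amounts to a set difference. A minimal sketch, assuming annotations are available as sets of (entity, quality) pairs per species; all names and data here are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Phenotype = Tuple[str, str]  # (entity, quality); a simplification of an EQ annotation


def clade_phenotypes(members: Iterable[str],
                     annotations: Dict[str, Set[Phenotype]]) -> Set[Phenotype]:
    """Union of the phenotypes annotated to any member species of a clade."""
    result: Set[Phenotype] = set()
    for taxon in members:
        result |= annotations.get(taxon, set())
    return result


def differing_phenotypes(clade_a: Iterable[str], clade_b: Iterable[str],
                         annotations: Dict[str, Set[Phenotype]]) -> Set[Phenotype]:
    """Phenotypes annotated to exactly one of the two taxa (symmetric difference)."""
    return clade_phenotypes(clade_a, annotations) ^ clade_phenotypes(clade_b, annotations)


def build_profile(candidates: Set[Phenotype],
                  removed_by_user: Set[Phenotype]) -> Set[Phenotype]:
    """The search profile is whatever remains after the user removes phenotypes."""
    return set(candidates) - set(removed_by_user)


# Made-up annotations, for illustration only.
ANNOTATIONS = {
    "species_1": {("dorsal fin", "elongated"), ("maxilla", "present")},
    "species_2": {("maxilla", "present")},
    "species_3": {("maxilla", "present"), ("basihyal bone", "absent")},
}

diff = differing_phenotypes(["species_1", "species_2"], ["species_3"], ANNOTATIONS)
profile = build_profile(diff, {("basihyal bone", "absent")})
```

The resulting profile could then be used to look up ZFIN mutants with matching annotations, as described in the use case.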
Feedback and Suggestions on proposed Phenoscape UI
- Data-cube-like representation showing taxa, phenotypes, and genes on three dimensions. This should allow any combination of two of the three parameters to be displayed at a given time (Rick Mayden)
- While looking for gene-phenotype associations, display ALL phenotypes the gene is associated with IN ADDITION to the phenotypes of interest (Rick Mayden)
- Term Information area for a selected term can be used to display all synonyms (dbxrefs?) and misspellings (Rick Mayden)
- In publication search results, indicate whether the publication was curated or not. If curated, display the details of curations such as curator's info, date curated, and importantly, versions of the ontologies that were used in the curation process (Mark Sabaj, Rick Mayden)
- Provide external links to other morphological databases such as MorphBank, MorphoBank etc (Rick Mayden, Mark Sabaj, Terry Grande, Eric Hilton)
- Enable search for related entity to the phenotype (Eric Hilton)
- While displaying publications, include a link to Authors' page and if possible, links to related people (Terry Grande)
- Provide links to images (Terry Grande)
- Group phenotype query results by taxa. Higher taxa should be expandable to display lower taxa (Paula Mabee, John Lundberg)
- Display result distributions among taxa when displaying results from phenotype queries (Paula Mabee)
- Group results by author, publication date etc while displaying publications (Paula Mabee)
- Group annotations by author, annotation date (Paula Mabee)
- Allow taxon/gene/phenotype combinations for querying (Paula Mabee)
Taxon Concepts subgroup meeting (Monday, Sept. 29, 2008)
(Present: Peter Midford, Paula Mabee, Todd Vision, Wasila Dahdul, Hilmar Lapp, John Lundberg, Suzi Lewis, Judy Blake)
The synonym scanner is working well but will never be perfect because CoF doesn't list every synonym in existence (they may not detect every use in the literature, or may have deemed a name not worthy of addition as a synonym).
TJP: Have we checked requested taxa not in CoF against UBio? - that would provide an LSID that could provide a dbxref to anchor the taxon
Addition of synonyms not in TTO: currently Peter must add by hand.
Need to track - Association between synonym, person who requested and publication? - Not being tracked right now.
JGL: We don't want to give this to CoF because these synonyms are picked up in morphological or phylogenetic studies. These names that appear and we flag as synonyms to valid names in CoF are not names coming out of taxonomic research; these are mistakes: OCR, typographical mistakes of author, or author mistakes of species in wrong genus.
Seems we should track all of this in database?
We want: - author year of the reference (right now it's in comments) - want in a searchable context - does CoF have a database of publications?
PM: ed publication information (DOI, SCSI) -CoF has an internal reference - We also need to record the full citation ourselves (name, year, unpublished dissertation...)
Peter: add the dbxref from CoF to TTO; TTO would hold just the dbxref, not the whole ref (there is no structured place to hold it).
synonym reference - how should it be handled?
SL: synonym xref can be a person (initials)
To summarize: the full-text ref would be in the db; the ontology would have a pointer to that
Peter: need interface to the db; won't be visible to obo-edit - make the dbxref a url link to URI
TV: We need a universal place to resolve that URL
PM: synonym types necessary? misspellings; narrow etc...
Peter: Hoping we can scan stuff in from CoF
John: see ANSP collection search: type in a type specimen name and it pulls in the CoF ref need CoF identifier; CASspec?
PM: who should do this?
JGL: why can't curator do this?
PM: too much time; can be done in bulk
Peter: synonym types: - current vocab is exact, related, broader, narrower - all synonyms currently are related (the weakest relation) - we will want misspelling to be even weaker than related - distinguish between published synonymy vs. curator-made decision
HL: weighing how trustworthy the evidence vs. designation of weak/strong; to HL, misspelling is a strong exact synonym not weak
PM: are misspellings the only exact synonyms?
TV: not critical to have the types of synonyms; just use related
all in agreement (misspellings will be taken as exact, other taxonomic synonyms use related).
There are taxonomic names that are found in publications and are absent from TTO.
- These are usually synonyms of existing taxonomic definitions.
- Curators file TTO synonym term requests for those missing terms, indicating what the currently valid taxon is
- Based on a brief survey, some of these synonyms are in fact contained in the latest CoF update, but many are not.
- Resolution against uBio has not been attempted yet.
Evidence for synonym to current name assignment needs to be recorded.
- OBO format allows dbxref for the synonym, which is used for storing the reference, such as the curator (e.g., pers. comm.)
- Also need unique URLs and GUIDs for publications so that these can be referenced as dbxref.
- OBO edit allows typing in those references, but doesn't allow hyperlinking to a display of the source.
- Many of these publications could be imported from CoF, which maintains a database of publications with identifiers (CoF#).
- Synonym assignments will have relationship type 'RELATED' except for misspellings, for which it would be 'EXACT'.
- Peter will try UBIO for synonyms not in CAS (and report to Phenoscape group)
- Peter will request dbx prefix from OBO for specifying CAS publication ids (e.g., CASREF)
- Peter will propose scheme for adding the name of the requester to synonyms added to the TTO
- Wasila/Paula provide full refs for all curated pubs with CAS number (if cited in CAS); PMID or DOI if not cited in CAS
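The agreed scoping rule (misspellings are 'EXACT', all other taxonomic synonyms 'RELATED') and the use of a dbxref to carry the evidence could be sketched as below. The line layout follows the OBO flat-file convention of `synonym: "text" SCOPE [dbxrefs]`; the `CASREF` prefix had only been proposed at this point, and the identifier and the misspelled name used here are invented placeholders.

```python
from typing import Iterable


def synonym_scope(is_misspelling: bool) -> str:
    """Scope rule agreed in the meeting: misspellings EXACT, everything else RELATED."""
    return "EXACT" if is_misspelling else "RELATED"


def obo_synonym_line(text: str, is_misspelling: bool, dbxrefs: Iterable[str]) -> str:
    """Format one OBO synonym tag with its scope and supporting dbxrefs."""
    # Escape backslashes and double quotes so the quoted string stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return 'synonym: "{}" {} [{}]'.format(
        escaped, synonym_scope(is_misspelling), ", ".join(dbxrefs)
    )


# Placeholder reference id; no such CASREF identifier is implied to exist.
line = obo_synonym_line("Danio reiro", True, ["CASREF:0000"])
print(line)  # synonym: "Danio reiro" EXACT [CASREF:0000]
```

How the requester's name would be attached to the synonym was left for Peter's proposal, so it is not modeled here.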
Notes from Data Curation Sessions
Sylvan Lake Lodge Meeting Room - September 29, 2008 (Monday)
Wasila & Paula's notes
Comparative phenotypes: discussion of relative size and shape characters. Problem: descriptors comparing size or shape among taxa within a publication cannot be extended to taxa outside the study (e.g., Rick's size-of-bone example with three character states: large (0), small (1), extremely small (2)). How to do this with ontologies? Judy pointed out that there is a dynamic line in annotation between the depth of a structured vocabulary and free text, i.e., where do you stop using an ontology and begin using free text? Data specific to a study should be free text.
Judy Blake initially suggested simply annotating our complex anatomical characters to "shape". This first-pass indexing is useful for letting users aggregate the data (vs. full curation). Through further discussion, we agreed that annotating at a more granular level, i.e., "shape: width", would be better and more informative. The weakness of the current PATO comes in here, in that there need to be more nodes between shape and all the terminal nodes (the descriptors such as "narrow", "broad", etc.). This might allow more depth than just "shape: width".
We need to index our comparative systematic studies at a level that is useful for the field. It is like a library, "binning" index to multiple things.
John's idea for recording size comparison within a study: Use shape: width and ALSO apply an internal grading system for these characters such that the least/smallest value is given 1.
- graded series of lengths, widths, etc.. give 1, 2, 3 ...
- least/smallest value is given 1
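John's internal grading idea could be recorded along these lines. This is only a sketch: the function and field names are hypothetical, and the grades are deliberately tied to a publication because, as noted above, such comparisons do not extend to taxa outside the study.

```python
from typing import Dict, List


def grade_states(states_smallest_first: List[str]) -> Dict[str, int]:
    """Assign 1, 2, 3, ... to a graded series, with the least/smallest value given 1."""
    return {state: grade for grade, state in enumerate(states_smallest_first, start=1)}


def graded_annotation(publication: str, entity: str, quality: str,
                      state: str, grades: Dict[str, int]) -> Dict[str, object]:
    """Combine the ontology-level quality with the study-internal grade."""
    return {
        "publication": publication,
        "entity": entity,
        "quality": quality,      # e.g. the coarse "shape: width" index term
        "state": state,          # the author's own descriptor, kept as free text
        "grade": grades[state],  # only meaningful within this publication
    }


def comparable(a: Dict[str, object], b: Dict[str, object]) -> bool:
    """Grades may be compared only within the same study and character."""
    return (a["publication"], a["entity"], a["quality"]) == (
        b["publication"], b["entity"], b["quality"])


# Rick's three size states, ordered from smallest to largest.
grades = grade_states(["extremely small", "small", "large"])
```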
Eric's example of incomplete/complete scute series:
- E: scute series; Q: in contact RE: skull
- E: scute series; Q: in contact RE: dorsal fin
- E: scute series; Q: separated from RE: skull
- E: scute series; Q: separated from RE: dorsal fin
Judy: important to separate tasks of 1) anatomy ontology development and 2) annotation of publications
- suggested having ontology development workshops and curation workshops
- her suggestion: Curator needs to enter terms ahead of time and have experts fix the annotations (problem at our workshop currently is that people are not finding entities in ontology - need ontology work)
Mark submitted TAO request on Weberian vertebra - relationships that don't hold for all taxa
Wasila: batching vs. one term request: ok to submit related terms in one request
Suzi: suggested having a small PATO workshop with ichthyologists, like anatomy workshop
curators understand details of system
JB: put 2-3 people together to do anatomical subtree: send out invite to ontology development
Judy's observation notes
- Wasila and John talking about anatomical terms and what they should be
- Paula showing Mark how to log into sourceforge
- Rick looking at characters for his paper
- Wasila from Peter: question on batching or one term per request
- Jeff helps Eric remember how to log into sourceforge
- Terry working on her paper
- looking for "right" first ... low hanging fruit
- Rick annotating with Wasila's help
- Mark entering successful s.f. proposal
- Eric looking for term in his paper
- Terry asking Jeff how to enter post-composed terms
Wrap-up discussion with curators
October 1, 2008 (Wed morning)
Wasila + Paula's notes
Paula: suggestions for improving curation process?
- Terry: would like to go through a paper first to deal with needed terms
- also liked annotating easy terms to become familiar with ontology structure
- John: some things, like joints, can be added in bulk (e.g. in ontology development meeting like the one in Philadelphia)
- All: Had visualization issues - how to know what terms are there, and their relationships?
- Need large poster of all terms (Paula, Wasila, Jeff Engeman will make one and send file to all curators to print)
- Paula: PDF mark-up tool that highlights terms matching to TAO; characters not highlighted would likely require new term
- Todd and Paula: use Phenex to highlight ontology terms because character and state descriptions from pub are pre-entered as free text (Paula added this to Phenex tracker)
- Mark: how does a curator decide to precompose vs. postcompose?
- Wasila: if entity will be used repeatedly (in single or multiple pubs) then add to TAO; if not, post-compose
- Mark: when post-composing, would be helpful if Phenex would autocomplete to pre-composed cross-product term (if present in TAO) so that curator can be aware of similar term in ontology
- Suzi: tough anatomy stuff: you can discuss as a group of curators then submit and enter in ontology
- Paula: need to set up svn for Phenex files (Jim will do this) - WD will send instructions to new curators
- Rick: can we break down single character into multiple characters?
- Paula: make into multiple annotations - don't break down character
- Pubs with same characters
- Paula: curate it all separately
- Jim: duplicates will come together in database
Paula: Problem: Currently we are not curating entities that are always present (e.g. "maxilla" present in all Teleostei). Users may search for the distribution of a character and find variability in a very restricted taxon (genus) but have no information on the condition in the larger group (teleost fishes). We need to add these conserved characters. There are two approaches, either of which is fine for us; we need input from Jim regarding database feasibility.
- 1. Inherit: annotate 1-2 basal species within each ostariophysan subgroup whose character states are optimized to ancestral/internal nodes.
- 2. Propagate: annotate higher level group node (e.g. "Teleostei") with a character state (e.g. "presence") and flag with weak evidence code. This character state is propagated to leaf nodes/species/terminals (unless run into different character state).
- Jim: preference for 1 (inherit);
- Jim: relationship between EQ and taxon: "inferred to exhibit"; phenotypes are exhibited by taxa; congregating "exhibits" to higher levels seems to be ok to do (e.g., cypriniformes exhibit red and blue dorsal fin)
- Jim: might need more than one relationship between taxon and a phenotype; right now we have "exhibits" as the relationship
- John: We want biological principles incorporated here/ examples of optimization
- Paula: This relates to character mapping/optimization. May need new methods to optimize EQs?
- Paula: Will follow up with curators regarding completeness of literature surveyed
- Paula & John: Will work up optimization example for group that summarizes basic user mapping needs
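The second approach above ("Propagate") can be illustrated with a small tree walk: a state asserted on a higher-level node is pushed down to the leaves unless a different state is asserted lower in the tree. This is a sketch under assumed data structures (a child-list dictionary and a dictionary of asserted states); the clade and species names other than "Teleostei" are placeholders, and the states are not real data.

```python
from typing import Dict, List, Optional, Tuple


def propagate_states(children: Dict[str, List[str]],
                     asserted: Dict[str, str],
                     root: str) -> Dict[str, Tuple[str, str]]:
    """Return {leaf: (state, evidence)}, where evidence is 'asserted' or 'propagated'."""
    resolved: Dict[str, Tuple[str, str]] = {}

    def visit(node: str, inherited: Optional[str]) -> None:
        own = asserted.get(node)
        # A state asserted on this node overrides whatever came from above.
        state = own if own is not None else inherited
        kids = children.get(node, [])
        if not kids and state is not None:
            evidence = "asserted" if own is not None else "propagated"
            resolved[node] = (state, evidence)
        for kid in kids:
            visit(kid, state)

    visit(root, None)
    return resolved


# Placeholder tree and states, for illustration only.
CHILDREN = {
    "Teleostei": ["clade_A", "clade_B"],
    "clade_A": ["species_1", "species_2"],
    "clade_B": ["species_3"],
}
ASSERTED = {"Teleostei": "present", "species_2": "absent"}

leaf_states = propagate_states(CHILDREN, ASSERTED, "Teleostei")
```

Keeping the evidence label separate from the state corresponds to flagging propagated values with a weak evidence code, so they can be told apart from direct annotations.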
Curation consistency experiment: Review of Results
Five curators (Engeman, Grande, Hilton, Mayden, Sabaj) participated in the consistency experiment on September 30, 2008 (Wed.). The results of the experiment are discussed in more detail here.
Hilmar's notes on discussion of experiment with curators and advisors
- difference between depth, length, width
- some failed to post-compose (e.g., increased length)
- need to standardize shape, length, depth, and width definitions
- should then use higher level character (such as 'shape')
- some use 'increased count' which is-a count
- question of whether basihyal bone precludes presence (or implies absence) of cartilage
- the question of how to annotate this looks more difficult than it is in reality
- basihyal cartilage absent implies basihyal bone absent (because the latter develops from the former)
- in fact it can also be that the cartilage is absent b/c it has developed into the bone (completely ossified)
- hence need to add that basihyal is absent too
- graph view can be very helpful
- software should prevent filling in relative entities for qualities that aren't relational
- post-composed relative entity because of uncertainty of whether the full (existing) term would be compatible with the definition the author may have been using (for a different clade)
- difficulty finding 'bony projection' by auto-completion (because it doesn't pop up when typing the beginning of 'projection')
- problem of capturing the 'overlaps with' aspect in addition to 'anterior to'
- full text search of the definitions would be very useful
- hypural is also contained within the upper lobe of the caudal fin
- difficulty of finding and comparing definitions of relationship types
- Software should ensure that entity starts with an entity term if post-composed, not a quality
- sphenotic needs to go into the comment field
- too complex to express the exact nature of the orientation, so choose just 'orientation'
- triradiate versus tripartite
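Two of the software checks requested in the notes above (no relative entity for a quality that isn't relational, and a post-composed entity must start with an entity term rather than a quality) could be sketched as a simple validation step. The term sets and function below are hypothetical stand-ins; an actual tool such as Phenex would consult the ontologies instead of hard-coded sets.

```python
from typing import List, Optional, Set

# Hypothetical term sets; a real tool would look these up in the ontologies.
ENTITY_TERMS: Set[str] = {"scute series", "skull", "dorsal fin", "bony projection"}
RELATIONAL_QUALITIES: Set[str] = {"in contact", "separated from"}


def validate_annotation(entity_parts: List[str], quality: str,
                        related_entity: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means both checks passed.

    entity_parts holds the terms of the entity in order; more than one
    part means the entity was post-composed.
    """
    problems: List[str] = []
    if related_entity is not None and quality not in RELATIONAL_QUALITIES:
        problems.append(
            "related entity given, but quality '%s' is not relational" % quality)
    if len(entity_parts) > 1 and entity_parts[0] not in ENTITY_TERMS:
        problems.append(
            "post-composed entity must start with an entity term, got '%s'"
            % entity_parts[0])
    return problems


# Eric's scute series example: a relational quality with a related entity passes.
ok = validate_annotation(["scute series"], "in contact", "skull")
# A non-relational quality with a related entity is flagged.
bad_related = validate_annotation(["scute series"], "shape", "skull")
# A post-composition that starts with a quality term is flagged.
bad_start = validate_annotation(["orientation", "bony projection"], "shape")
```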