Difference between revisions of "EQ Editor"

From phenoscape
 
=EQ Editor requirements=
+
{{EventBox1|
===Minimum data entry capabilities to begin EQ curation===
+
[[Phenex]] is our production EQ Editor.  This is an early requirements-gathering document and is out of date.  Feature requests for Phenex should be directed to the Phenex tracker, linked from the [[Phenex|Phenex homepage]].
* species name (free text until taxonomic ontology is available?)
+
}}
* EQ statement for character (i.e. Q should be an attribute rather than value)
+
 
** E from fish anatomy (ontology ID)
+
=Overview=
** Q from PATO (ontology ID)
+
Currently, the EQ Editor is being developed by customizing [http://www.phenote.org/ Phenote] to our needs.  We are collaborating with the developers of Phenote at NCBO by contributing improvements to the core Phenote code, and we are also developing specialized components and configuration required for our annotation workflow.
** E2 from fish anatomy (if Q is descendant of "relational quality of continuant" or "relational quality of occurrent") (ontology ID)
+
 
* Q for value, either:
+
=Requirements=
** Q from PATO, descendant of Q in character (ontology ID)
+
 
** measurement (number followed by unit name)
+
These are the currently understood requirements - an earlier draft can be viewed at [[EQ Editor requirements draft 1]].
* original character description (free text)
+
 
* original state descriptions (free text)
+
The EQ Editor will be used by curators to annotate phenotypic descriptions of Ostariophysi fish using EQ syntax and ontologies.
* publication/citation (DOI? older publications don't have DOI, do they?)
+
 
* image or URL for image (image data or URL)
+
The curators will use the EQ Editor to code phenotypic data from an existing publication into the EQ format.  This data consists of descriptions of character state values for corresponding species (or, more precisely, specimens).  A published character state description may contain: the species or higher taxonomic specification, a textual description of the character state value, reference to a voucher specimen for this description, an image showing the character.
* voucher specimen ID (format?)
+
 
 +
==Data Model==
  
===Interface technological possibilities for EQ editor===
+
The following are essential data elements to be captured by the EQ Editor for each character state description (format is in parentheses).  EQ coding will be performed at the level of character states, but there may be a facility for dynamically viewing entries in a character matrix format.  See [[EQ_for_character_matrices|"EQ for character matrices"]] for a discussion - at the [[WG:PI_Meeting_5Jun07|June 5 PI meeting]] we decided to work with only character states.  Phenotype annotations should generally be made at the species taxonomic level.
  
''This list will need to be driven by further discussion of the EQ editing requirements - for now it's just an illustration of some possibilities.''
+
===Phenotype Annotation===
  
* Mesquite plug-in + extensions to NEXUS format
+
* species name (ontology ID)
** this would allow a curator to work locally and begin working with data before any database is created
+
* EQ statement for character state (Q should be a value)
** data would be stored in extended NEXUS format files
+
** E from fish anatomy (ontology ID) - may be post-composed from multiple ontology terms, especially in conjunction with a spatial modifier ontology
** would provide community value, since Mesquite is general and widely used
+
** Q from PATO (ontology ID)
 +
** measurement - used if Q is an attribute suitable for measurement (e.g. "length")
 +
*** number - should be able to put in single value, a range, or <, > ''- can all this be done within EQ formalism?''
 +
*** unit (units ontology ID)
 +
** E2 from fish anatomy (if Q is descendant of "relational quality of continuant" or "relational quality of occurrent") (ontology ID)
 +
* textual description (free text)
 +
* publication/citation (DOI? older publications don't have DOI, do they? Another possibility is [http://iphylo.blogspot.com/2007/05/openurl-and-spiders.html SICI])
 +
* Reference to section, figure, or image from publication (free text)
 +
* Evidence code (''the list of evidence codes requires discussion'')
 +
** Specimen catalog number(s) supporting evidence code when required (direct observation)
  
* Custom web application
+
===Additional data per publication===
** could have a more customized interface
 
** interface will not depend on integrating into Mesquite; this might allow faster development
 
** would a central database need to be set up to store the data?
 
  
* Specialized additions to Phenote
+
Some data from the publications may be useful to store in our database, independent of phenotype annotations.
** already has lots of development behind it
+
* Entire specimen list (catalog numbers) in publication materials.  This could be useful to users of the site who may want to see what specimens the author was using, even if the annotated phenotypes were not explicitly tied to a catalog number.  Further, we could use the catalog numbers to access museum specimen data services for various future applications.
** does not work well with a matrix mindset (Phenote works with a list of value descriptions)
 
** development for this purpose might not mesh well with more central uses of the application
 
  
 +
==Workflow==
  
 +
===Sources of data===
  
==EQ Editor requirements (February discussion)==
+
The publication being coded may contain data in one of a few different formats.  The given data format may suggest its own style of workflow.  These publication types include four main forms:
  
''These requirements are a first stab taken at the PI meeting at NESCent on Feb 26-27, 2007.''
+
# Description of many characters for many species and higher taxonomic levels; no character-by-taxon matrix published.
 +
# A data matrix with multiple species and multiple characters.  There is a character state value for each cell in this matrix.
 +
# A single species description of values for many characters.
 +
# Description of a single character for many species (perhaps less common than the other formats?).  If focusing on a single character the data may come from multiple publications.
  
===Morphologist Workflow===
+
It seems like scenarios 2 and 3 can be treated as special cases of scenario 1.  For each character, an interface is required for choosing the Entity and Quality from their respective ontologies, and entering free text such as the original character descriptions, as well as other [[#Data_Model|relevant data]].
  
# One reference publication, many species, several characters
+
===Detailed steps===
#* Have reference publication about taxonomic group, with figures, for skeletal characters
 
#* May proceed section by section; need to specify section, or figure, or generally part of a reference
 
#* Need to denote species, choose anatomical entity, choose quality, such as anterior margin, specify value
 
#* May have questions, or need to input free text comments, e.g., about uncertainties
 
# Single species, single publication, multiple characters
 
#* Might also have a paper describing a single species
 
#* Curator would use a specimen to confirm accuracy of annotation
 
# Many species, many publications, single character
 
#* May also use a character survey
 
#* Would use many different papers
 
#* Would span many different species
 
# Specimen may be a fossil record
 
#* Need to record geological time
 
#* Will do that later
 
# Specimen-based annotation is not part of the project
 
  
* Need to reference "traditional" character: should be able to verbatim quote original character description, also give publication reference; there are often differing, even conflicting, definitions for the same character
+
The standard workflow would be a curator dealing with a single publication during a session of working with the EQ Editor.  Steps might be:
* Need to be able to see what is already present about a particular character; may also need to look at "similar" characters (as defined by, e.g., characters using sibling terms and sibling qualities)
 
* Need to see the values that have been assigned already for a character
 
  
* There may be conflicting character states reported in different publications; the data curator will decide whether these conflicts need to be kept or can be reconciled.
+
# Create a new EQ Editor document.
* Verification of character descriptions and state values by Data Curator or even Morphologist, e.g., using actual specimen(s), and attributing the verification
+
# Enter document-wide data:
 +
## publication information (author, title, journal)
 +
## list of species - choose taxon from taxonomic ontology for each one
 +
##* For each species, input list of specimen catalog numbers used in the publication, choosing museum institutions from a pick list
 +
# Begin making character state annotations:
 +
## Select a species or multiple species to which this character state applies (there should be facilities for efficiently choosing sets of taxa)
 +
##* If the author makes a statement about a higher taxon (e.g. a family), make a phenotype annotation for every species in the materials list for that publication which is a member of the taxon.
 +
## Create a new EQ statement
 +
## Choose Entity from anatomy ontology
 +
## Choose Quality from PATO (usually a Value term)
 +
## If the Quality is an Attribute term (such as "length"), enter a measurement and its units
 +
## If the Quality is relational, enter a second Entity from the anatomy ontology
 +
## Enter the evidence code appropriate for the published statement - if the phenotype is clearly based on one or more specific specimen numbers, enter them into the catalog number(s) field to support the appropriate evidence code.
 +
## Enter any information for the figure or section within the paper showing the character.
 +
# At any point, the curator can save the current work to a document (or database).
  
* Want all annotations to be associated with voucher specimens (may only be a photograph though)
+
===Working with EQ statements===
 +
* A particular EQ statement (specific combination of Entity and Quality) will often be applied to multiple species within one document.  Species are likely to share EQ values as a result of phylogenetic history, so facilities for efficiently selecting groups of related species should be available.  Entries with this EQ statement would be generated for each selected species.
 +
** An EQ entry panel could allow the user to choose a taxon from the taxonomic ontology, either by directly browsing the ontology or through an autocomplete text search field.  All species within that taxon would be selected.
 +
** A phylogenetic tree view could be provided, which allows the user to select a node or nodes to which to apply an EQ statement.  All species descending from that node would be selected.  This tree could be initialized by using the taxonomic ontology.  The user could manually edit the tree to provide additional resolution (perhaps by following a tree in the paper).  Alternatively, the software could allow the specification in Newick format of a tree created by another application.
 +
** It would be useful to be able to invert selections generated by either of the preceding methods.
 +
* It should be possible to view the EQ statements entered into the document in several ways, so that data entry progress can be checked.  Desired views:
 +
** Flat list of all entered EQ statements.
 +
** Filtered flat list - a search text field can allow quick filtering by any of the data fields.
 +
** Character-by-taxon matrix - generated from EQ statements by grouping via shared PATO Attributes.
 +
** Filtered matrix view - what capabilities are needed?  filter by taxon, entity, quality: if one of these is a higher term in the ontology, show all descendant matches
  
===UI requirements===
+
===Questions===
  
For example, the Fink & Fink paper
+
* What should the user be able to do if there is not an appropriate term in the ontology?
 +
* Need to add a button to say "Need a new term"
 +
* What will be done (in the immediate term) with the annotations produced by the EQ Editor?  Will they be stored in separate documents, or compiled into a central repository?
  
* start by setting the reference we will be working with
 
* define a set of species we are going to work on
 
* select skeletal region as a focus, e.g. the gill arch region, or tail fin
 
* look at what has already been annotated for this region, as a character-by-taxon matrix
 
** expect several hundred taxa, and between 50 and 200 characters, depending on how feature-rich the region is
 
** a source paper may not give the character at the species level, so the taxon may be a higher-level taxon
 
* if characters are already present, just add the reference
 
* otherwise define new character
 
** choose existing entity term, initially this will be an anatomy term; term may not exist yet in which case we need to work with a provisional term
 
** choose attribute term from PATO; term may not exist yet in which case we need to work with a provisional term
 
** denote original character description, with reference (which will probably be the paper we are working with)
 
* edit/view character: will see the images that have been used for the different states (values) that have been assigned
 
* assign/edit character states using a table with only the set of species chosen earlier, and one or more characters that correspond to the original character definition
 
** denote original character state description, with reference (which will probably be the paper we are working with)
 
  
* Taxonomic naming challenges: need to map original names to current classification; should never have two distinct rows for what is currently considered (as defined by the taxonomic ontology) the same species
+
=Roadmap=
  
===Database requirements===
+
See the [[software roadmap]] for further plans.
  
* Need to have references to digital information, such as specimen record and image
+
[[Category:Informatics]]

Latest revision as of 17:54, 5 November 2008

Phenex is our production EQ Editor. This is an early requirements-gathering document and is out of date. Feature requests for Phenex should be directed to the Phenex tracker, linked from the Phenex homepage.

Overview

Currently, the EQ Editor is being developed by customizing Phenote to our needs. We are collaborating with the developers of Phenote at NCBO by contributing improvements to the core Phenote code, and we are also developing specialized components and configuration required for our annotation workflow.

Requirements

These are the currently understood requirements - an earlier draft can be viewed at EQ Editor requirements draft 1.

The EQ Editor will be used by curators to annotate phenotypic descriptions of Ostariophysi fish using EQ syntax and ontologies.

The curators will use the EQ Editor to code phenotypic data from an existing publication into the EQ format. This data consists of descriptions of character state values for corresponding species (or, more precisely, specimens). A published character state description may contain: the species or higher taxonomic specification, a textual description of the character state value, reference to a voucher specimen for this description, an image showing the character.

Data Model

The following are essential data elements to be captured by the EQ Editor for each character state description (format is in parentheses). EQ coding will be performed at the level of character states, but there may be a facility for dynamically viewing entries in a character matrix format. See "EQ for character matrices" for a discussion - at the June 5 PI meeting we decided to work with only character states. Phenotype annotations should generally be made at the species taxonomic level.

Phenotype Annotation

  • species name (ontology ID)
  • EQ statement for character state (Q should be a value)
    • E from fish anatomy (ontology ID) - may be post-composed from multiple ontology terms, especially in conjunction with a spatial modifier ontology
    • Q from PATO (ontology ID)
    • measurement - used if Q is an attribute suitable for measurement (e.g. "length")
      • number - should be able to put in single value, a range, or <, > - can all this be done within EQ formalism?
      • unit (units ontology ID)
    • E2 from fish anatomy (if Q is descendant of "relational quality of continuant" or "relational quality of occurrent") (ontology ID)
  • textual description (free text)
  • publication/citation (DOI? older publications don't have DOI, do they? Another possibility is SICI)
  • Reference to section, figure, or image from publication (free text)
  • Evidence code (the list of evidence codes requires discussion)
    • Specimen catalog number(s) supporting evidence code when required (direct observation)
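
The data elements above could be represented roughly as follows. This is only a sketch under stated assumptions, not the Phenote or Phenex data model: the class names, field names, and the text format accepted by `Measurement.parse` are hypothetical, and ontology IDs are treated as opaque strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Measurement:
    """A single value, a range, or an open bound, with a unit term."""
    unit_id: str                   # units ontology ID (opaque string here)
    low: Optional[float] = None    # lower bound; None means unbounded
    high: Optional[float] = None   # upper bound; None means unbounded

    @classmethod
    def parse(cls, text: str, unit_id: str) -> "Measurement":
        """Parse '5', '3-7', '<4' or '>2' (assumed format, non-negative numbers).

        Whether a bound is strict is not recorded in this sketch.
        """
        text = text.strip()
        if text.startswith("<"):
            return cls(unit_id, high=float(text[1:]))
        if text.startswith(">"):
            return cls(unit_id, low=float(text[1:]))
        if "-" in text:
            low, high = text.split("-", 1)
            return cls(unit_id, low=float(low), high=float(high))
        value = float(text)
        return cls(unit_id, low=value, high=value)


@dataclass
class PhenotypeAnnotation:
    """One character state description for one species."""
    taxon_id: str                             # species (taxonomic ontology ID)
    entity_id: str                            # E from the anatomy ontology
    quality_id: str                           # Q from PATO
    related_entity_id: Optional[str] = None   # E2, only for relational qualities
    measurement: Optional[Measurement] = None # only for measurable attributes
    description: str = ""                     # original free-text description
    publication: str = ""                     # citation, or DOI when available
    figure_or_section: str = ""               # reference within the publication
    evidence_code: str = ""
    specimen_catalog_numbers: List[str] = field(default_factory=list)
```

Storing a measurement as a pair of optional bounds is one way to cover a single value, a range, and "<" or ">" in the same structure; whether this can be expressed within the EQ formalism remains the open question noted in the list.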

Additional data per publication

Some data from the publications may be useful to store in our database, independent of phenotype annotations.

  • Entire specimen list (catalog numbers) in publication materials. This could be useful to users of the site who may want to see what specimens the author was using, even if the annotated phenotypes were not explicitly tied to a catalog number. Further, we could use the catalog numbers to access museum specimen data services for various future applications.

Workflow

Sources of data

The publication being coded may contain data in one of a few different formats. The given data format may suggest its own style of workflow. These publication types include four main forms:

  1. Description of many characters for many species and higher taxonomic levels; no character-by-taxon matrix published.
  2. A data matrix with multiple species and multiple characters. There is a character state value for each cell in this matrix.
  3. A single species description of values for many characters.
  4. Description of a single character for many species (perhaps less common than the other formats?). If focusing on a single character the data may come from multiple publications.

It seems like scenarios 2 and 3 can be treated as special cases of scenario 1. For each character, an interface is required for choosing the Entity and Quality from their respective ontologies, and entering free text such as the original character descriptions, as well as other relevant data.
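
To illustrate why a data matrix or a single-species description can be handled like a list of character descriptions, the sketch below flattens a character-by-taxon matrix into one record per cell. The names `StateRecord` and `flatten_matrix` are hypothetical, and the matrix is assumed to arrive as a list of character labels plus one row of state values per taxon.

```python
from typing import Dict, List, NamedTuple


class StateRecord(NamedTuple):
    """One character state value for one taxon."""
    taxon: str
    character: str
    state: str


def flatten_matrix(characters: List[str],
                   rows: Dict[str, List[str]]) -> List[StateRecord]:
    """Turn each matrix cell into a separate character state record."""
    records = []
    for taxon, states in rows.items():
        if len(states) != len(characters):
            raise ValueError(
                f"row for {taxon!r} has {len(states)} cells, "
                f"expected {len(characters)}"
            )
        for character, state in zip(characters, states):
            records.append(StateRecord(taxon, character, state))
    return records
```

A single-species description is then simply a matrix with one row, so both formats feed the same annotation interface.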

Detailed steps

The standard workflow would be a curator dealing with a single publication during a session of working with the EQ Editor. Steps might be:

  1. Create a new EQ Editor document.
  2. Enter document-wide data:
    1. publication information (author, title, journal)
    2. list of species - choose taxon from taxonomic ontology for each one
      • For each species, input list of specimen catalog numbers used in the publication, choosing museum institutions from a pick list
  3. Begin making character state annotations:
    1. Select a species or multiple species to which this character state applies (there should be facilities for efficiently choosing sets of taxa)
      • If the author makes a statement about a higher taxon (e.g. a family), make a phenotype annotation for every species in the materials list for that publication which is a member of the taxon.
    2. Create a new EQ statement
    3. Choose Entity from anatomy ontology
    4. Choose Quality from PATO (usually a Value term)
    5. If the Quality is an Attribute term (such as "length"), enter a measurement and its units
    6. If the Quality is relational, enter a second Entity from the anatomy ontology
    7. Enter the evidence code appropriate for the published statement - if the phenotype is clearly based on one or more specific specimen numbers, enter them into the catalog number(s) field to support the appropriate evidence code.
    8. Enter any information for the figure or section within the paper showing the character.
  4. At any point, the curator can save the current work to a document (or database).
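
The rule in step 3.1 for statements about a higher taxon could be sketched as below. This assumes the taxonomic ontology is available as a simple, acyclic child-to-parent mapping (`parent_of`); that mapping, the function names, and the taxon labels used in any call are hypothetical. `invert_selection` shows the selection inversion discussed under "Working with EQ statements".

```python
from typing import Dict, Iterable, List, Optional


def is_member(taxon: str, ancestor: str,
              parent_of: Dict[str, Optional[str]]) -> bool:
    """True if `taxon` is `ancestor` itself or lies below it in the taxonomy."""
    current: Optional[str] = taxon
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)  # None at the root or if unknown
    return False


def species_for_statement(stated_taxon: str,
                          materials: Iterable[str],
                          parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Species from the publication's materials list that belong to the taxon."""
    return [s for s in materials if is_member(s, stated_taxon, parent_of)]


def invert_selection(selected: Iterable[str],
                     materials: Iterable[str]) -> List[str]:
    """Species in the materials list that are not in the selection."""
    chosen = set(selected)
    return [s for s in materials if s not in chosen]
```

Restricting the expansion to the materials list, rather than to every species in the ontology, follows the wording of the step: an annotation is made only for species the publication actually examined.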

Working with EQ statements

  • A particular EQ statement (specific combination of Entity and Quality) will often be applied to multiple species within one document. Species are likely to share EQ values as a result of phylogenetic history, so facilities for efficiently selecting groups of related species should be available. Entries with this EQ statement would be generated for each selected species.
    • An EQ entry panel could allow the user to choose a taxon from the taxonomic ontology, either by directly browsing the ontology or through an autocomplete text search field. All species within that taxon would be selected.
    • A phylogenetic tree view could be provided, which allows the user to select a node or nodes to which to apply an EQ statement. All species descending from that node would be selected. This tree could be initialized by using the taxonomic ontology. The user could manually edit the tree to provide additional resolution (perhaps by following a tree in the paper). Alternatively, the software could allow the specification in Newick format of a tree created by another application.
    • It would be useful to be able to invert selections generated by either of the preceding methods.
  • It should be possible to view the EQ statements entered into the document in several ways, so that data entry progress can be checked. Desired views:
    • Flat list of all entered EQ statements.
    • Filtered flat list - a search text field can allow quick filtering by any of the data fields.
    • Character-by-taxon matrix - generated from EQ statements by grouping via shared PATO Attributes.
    • Filtered matrix view - what capabilities are needed? filter by taxon, entity, quality: if one of these is a higher term in the ontology, show all descendant matches
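
One possible reading of the character-by-taxon matrix view is sketched below: EQ statements that share an Entity and whose Quality values fall under the same PATO Attribute are grouped into one character. The `attribute_of` mapping stands in for a lookup in the PATO hierarchy and is an assumption, as are the function name and the tuple layout of an annotation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (taxon, entity, quality value) - a reduced, hypothetical annotation shape
Annotation = Tuple[str, str, str]
# A character is identified here by (entity, attribute)
Character = Tuple[str, str]


def build_matrix(annotations: List[Annotation],
                 attribute_of: Dict[str, str]
                 ) -> Dict[Character, Dict[str, List[str]]]:
    """Group annotations into characters, then index state values by taxon."""
    matrix: Dict[Character, Dict[str, List[str]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for taxon, entity, quality in annotations:
        character = (entity, attribute_of[quality])
        matrix[character][taxon].append(quality)
    # Convert nested defaultdicts to plain dicts for the caller
    return {c: dict(cells) for c, cells in matrix.items()}
```

Each cell holds a list rather than a single value so that conflicting or polymorphic states for the same taxon stay visible instead of being silently overwritten.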

Questions

  • What should the user be able to do if there is not an appropriate term in the ontology?
  • Need to add a button to say "Need a new term"
  • What will be done (in the immediate term) with the annotations produced by the EQ Editor? Will they be stored in separate documents, or compiled into a central repository?


Roadmap

See the software roadmap for further plans.