Matrix annotation workflow

From phenoscape
Revision as of 18:08, 22 May 2008 by Jpb15 (talk | contribs) (Character/character state editing)

The first Phenoscape data jamboree revealed that our curation workflow implemented in Phenote could be quite cumbersome, particularly in managing the huge numbers of rows containing annotations for every taxon. Discussion after the jamboree settled on a new workflow which allows more division of effort, allowing experienced curators to focus on the description of phenotypes using EQ, while input of specimen lists and character matrices can be done separately, possibly by assistants.

Workflow activities

This workflow is divided into 3 main activities, each of which can be performed independently and in parallel.

EQ character state coding

This activity involves the translation of published character states into ontology-based EQ descriptions. This requires deep domain knowledge providing full comprehension of the published descriptions and thorough understanding of anatomy and quality ontologies. This activity can be performed independent of annotations to particular taxa - the curator focuses only on describing the character states. As such, the curator will need to deal with far fewer rows of data when working in Phenote, as compared to simultaneous taxon annotation.

Taxon and specimen lists

This activity includes entering the list of species as named in the publication ("Publication Taxon") and choosing the matching taxon term from the TTO for each. Also at this step the list of specimens studied is entered for each taxon. This activity can be easily performed by an assistant with minimal training.

Character matrix

A character matrix provides the annotation of each species with a particular character state for each described character. Many studies publish the data already in matrix form. If a matrix is not available, the curator will need to prepare one representing the data for that publication. The states in the matrix are linked to those described in "EQ character state coding" by using corresponding character and character state numbers. The taxon names in the character matrix must match Publication Taxon entries in the taxon list. This activity may also include choosing an evidence code, from the evidence code ontology, for each state assignment (each cell). If a matrix is available for a publication, this activity requires little work on the part of an experience curator.

Output from each of these activities can be merged using a non-interactive script, producing resultant taxon-specific EQ annotations.

Implementation proposal

Overview

"EQ character state coding" can continue to be performed within Phenote, being Phenote's core capability. A revised Phenoscape configuration will be created, eliminating all taxon-related columns from the annotation table, but adding a column for "character state". Curators will enter 1 or more annotation rows for each character/character state combination, with no reference to any particular taxon. The curator can enter multiple EQ's to represent any particular character state. The search filter field at the bottom of the table will be enhanced such that the curator can filter the list down to one character at a time, viewing only the EQ annotations for the states of that character. The curator will save these EQ annotations to a file for later processing.

For the time being, "Taxon and specimen lists" can continue to be entered using the current Phenote configuration. As described below, entering these will likely be incorporated into the "Character matrix" activity within Mesquite. Phenote-based taxon files will be later processed into a to-be-developed NEXUS block.

Mesquite provides a familiar and standard interface for editing evolutionary character matrices. Thus, we will use it instead of Phenote for entry of character matrices. Character columns should be ordered the same way they are numbered in the "EQ character state coding" step. Taxon names should be unabbreviated genus and species as used in the publication. Since the "Publication Taxon" is entered as part of the matrix, it would be efficient to use the Mesquite taxa list directly instead of entering a separate list as in "Taxon and specimen lists". In order to accomplish this, we will need to add some ontology capability to Mesquite, in the form of a plugin. The user could open a panel showing the taxa in the TAXA block, and choose a "Valid Taxon" from the TTO for each one. Adding such ontology capabilities will also facilitate creating an interface within Mesquite for choosing the evidence code for each matrix cell, picking from the evidence code ontology. We can also create a panel for entering specimen lists within Mesquite, eliminating a separate Taxon List step entirely.

After the matrix and taxon list information are entered in Mesquite, all ontology information (Valid Taxon, specimens, evidence codes) would be saved in one or more custom blocks in the matrix NEXUS file. This NEXUS file would be combined with the Phenote-generated EQ list to create per-taxon EQ annotations. This could happen in an external script or perhaps by loading the EQ file into Mesquite.

Requirements for Mesquite plugin

  • Need to be aware of editing of TAXA block in order to keep Valid Taxon list up-to-date.
  • Save and read ontology data in custom block(s).
  • Make use of cell selection of matrix editor within separate evidence code panel.

Detailed Workflow 1

Detailed steps for obtaining a Publication Taxon List in NEXUS file format (creating a NEXUS file from results of Findit)

  1. If starting from scratch, copy taxa from selectable PDF (or re-type, otherwise: Jim why would you do this vs. just type directly into a Nexus file?) into FindIt>NEXUS web page (to be developed)
  2. If starting with NEXUS file, paste into FindIt>NEXUS web page (to be developed)
  3. FindIt>NEXUS web page will return a NEXUS file with corrected taxon names (Jim: what are "corrected taxon names"?) Some of the results from Findit will be garbage, e.g. river names, etc. that the curator will need to edit out.

Detailed steps for Character Matrix entry

  1. Open Mesquite or MacClade
  2. Open NEXUS file (this file will have Publication Taxon name list as per above)
  3. Enter matrix data (0,1,2, etc. for each character); save in NEXUS file

Detailed steps for Specimen entry

  1. Import NEXUS file into Phenote specimen list editor (NEXUS file will come from above or directly from author; if from author directly, some publication names may need to be edited (e.g. if genus is abbreviated))
  2. Phenote will attempt to choose a TTO taxon for each publication taxon
  3. Correct Phenote TTO choices as necessary
  4. Enter specimens for each taxon

Detailed steps for EQ entry which can happen in parallel to Taxon entry above

  1. Curator enters EQ phenotypes for each character state in Phenote; add free text character descriptions at this point; add Evidence codes (this will remain a Phenote function; if there is a matrix in the paper, we will assign IVS to all species, characters, and character states).
  2. Merge NEXUS file containing matrix with Phenote EQ file, using standalone tool (this is where we have the creation of the Phenote Character Matrix; this will be editable and we can add more taxa or characters as required)

Detailed Workflow 2

Creation of Taxon List

  1. Taxon List is created using the current Phenote interface and file format
  2. Specimen information (vouchers) are entered at this step

Creation of NEXUS file and Character Matrix entry

  1. Phenote exports NEXUS file with Publication Taxon names from the taxon list
  2. Open file in Mesquite, MacClade, or excel (possible?)
  3. Enter matrix data (0,1,2, etc. for each character); save in NEXUS file

Detailed steps for character matrix + specimen list merge

  1. Import NEXUS file into Phenote specimen list editor (NEXUS file will come from above or directly from author; if from author directly, some publication names may need to be edited (e.g. if genus is abbreviated))
  2. Phenote will attempt to choose a TTO taxon for each publication taxon
  3. Correct Phenote TTO choices as necessary
  4. Previously entered specimen information should appear at this step

Detailed steps for EQ entry which can happen in parallel to Taxon entry above

  1. Curator enters EQ phenotypes for each character state in Phenote; add free text character descriptions at this point; add Evidence codes (this will remain a Phenote function; if there is a matrix in the paper, we will assign IVS to all species, characters, and character states in bulk).
  2. Merge NEXUS file containing matrix with Phenote EQ file, using standalone tool (this is where we have the creation of the Phenote Character Matrix; this will be editable and we can add more taxa or characters as required)

Integrated workflow via NEXml

Ideally we could have a single file containing the character matrix and phenotype annotations, which the user could edit as needed in both Phenote and Mesquite. This should be feasible using the new standard NEXml. With this approach tasks could be performed in almost any order. Some tasks could optionally be accomplished in both Phenote and Mesquite, while some are possible in only one or the other of the programs.

File creation

The user should be able to create a new NEXml file using either Mesquite or Phenote. If an existing NEXUS matrix is available to start from, the user can open the NEXUS file in Mesquite and re-save it as NEXml.

Mesquite tasks

Taxon list editing

  • Enter list of taxa, with free-text names as used in publication.

Character/character state editing

  • Create and edit list of characters, with free-text for each
  • Create and edit list of states for each character, with free-text for each

Matrix editing

Tree editing

  • The user can create and edit a phylogeny containing taxa from the taxon list. This tree can be saved within the NEXml file. Note, the new data-entry workflow contains no use cases involving phylogenies.

Phenote tasks

Taxon list editing

  • Enter list of taxa, with free-text names as used in publication.
  • Choose TTO term for each taxon.

Character/character state editing