Building the Hymenoptera Anatomy Ontology through exploration of the Journal of Hymenoptera Research
 


The Hymenoptera Anatomy Ontology (HAO) project aims to capture the complex lexica used to describe Hymenoptera anatomy. Our core data are extracted from the corpus of published works, particularly descriptions of new taxa. We reviewed the Journal of Hymenoptera Research (JHR) to extract new labels and ontological classes, explored the completeness of the present version of the HAO, and reflected upon community language trends. Three hundred and fifty-three (353) Journal of Hymenoptera Research articles, accessed through the Biodiversity Heritage Library, were parsed and vetted against the present ontology. New labels (2121) were collected during this process, including about 650 adjectives used to qualify morphological features. The process also revealed language trends: the frequency of anatomical labels in the literature possibly reflects the character systems and qualifiers we most often use to describe novel taxa. Additionally, the novel software used for text extraction is reviewed, outlining possible improvements and useful tools resulting from this effort.

  • Presentation for the International Society of Hymenopterists Congress, Kőszeg, Hungary, 2010
  • We had an opportunity with the Hymenoptera Anatomy Ontology project to look at how we (hymenopterists) use text in our research. The availability of the Journal of Hymenoptera Research processed using optical character recognition software (OCR) through the Biodiversity Heritage Library provided us with a discrete dataset to explore our use of language.
  • We began with two questions. The first is whether natural language text extraction is usable for finding terms and concepts to develop and expand the Hymenoptera Anatomy Ontology (HAO). The second is whether some of our assumptions about how we use language in descriptions are true, particularly the assumption that people who work on different taxon groups use different terms. Is it true that our use of terminology follows phylogeny?
  • Capturing terms from Journal of Hymenoptera Research text was a multi-step process. This is a simplified outline of that process. The Journal of Hymenoptera Research volumes were downloaded from the Biodiversity Heritage Library and then manually split into articles. The resulting text was put in a database field created specifically for OCR-captured text from references. We already had an advanced ontology with many hymenoptera terms in the database. We used these terms to match against the OCR text. Strings of words that were not matched were then presented to the person reviewing the text for possible inclusion in the database. We modified this process as we continued through the 353 articles to make more exact string matches for terms that already exist and to reduce the number of potential terms presented to the reviewer. The important thing to remember from this slide is that YES, the software did allow us to parse vast amounts of legacy hymenoptera text relatively quickly for new terms for the HAO, but it would be impossible to remove the authoritative human component from this process.
  • From the 353 articles we collected 2121 morphological terms and 643 qualitative terms. Qualitative terms are words used to describe the position or physical quality of a morphological feature. Qualitative terms are not included in the Hymenoptera Anatomy Ontology but are important to collect for another ontology we collaborate with and for the creation of future descriptive tools. Interestingly, 2065 of these collected terms are presently not tied to a concept. They are floating in the database without a definition. This calls into question, at least at the moment, how useful en masse natural language text parsing of legacy literature is for building ontologies. Substantially more effort is needed to tie these terms to a concept (which is a definition) and supply the reference for that association.
  • But what can we say about the morphological terms we collected so far….
  • In the list, the first number is the number of times that term occurred in the 353 publications. The second number is the number of JHR articles in which that term occurred. From these numbers we see that hymenopterists like to talk about carinae, wings and setae.
  • The numbers may reflect other trends, such as the complexity of a body region or the number of morphological characters drawn from a region. There were more instances of the term propodeum (1617) than of mesosoma (1137), and more of mesosoma than of metasoma (960). Perhaps this reflects the relative complexity of these structures?
  • And what can we say regarding the qualitative terms we found….
  • Hymenopterists like small, short, smooth, rounded and shiny things. The majority of the quality terms are spatial, reflecting the position of a morphological feature in relation to another feature or body region. Also it is revealed that specific comparative words such as slightly, similar and nearly are commonly used. It would be interesting to map these terms on the tree to see if there is a trend in particular taxon groups to use comparative terms in descriptions.
  • Well, how about another way to approach these data? To get at some potentially deeper questions regarding hymenopterists' use of terminology in association with phylogeny, we quickly reviewed each of the 353 articles and placed it into one of two categories: description or non-description. An article was considered a description if its title made this clear because the authors used the words “description of” or similar. Also, for each article, the taxon being described was captured in the database. A script was then created to produce a TNT file using terms as characters and articles as terminals. The terms used as characters were limited to morphological terms (1162 total) and 181 articles were used as terminals. A matrix was created based on the presence/absence of terms (characters) in a given article (terminal). The resulting matrix was run in TNT, producing a strict (nelsen) consensus tree.
  • The resulting tree, with taxa clumping to the superfamily level, tells us that we use morphological terminology differently for different taxa. We cannot say anything regarding the concepts or meaning behind these terms, only that the authors thought them important enough to use them in the descriptions for the different groups. The use of different terms could reflect the focus on particular character systems, or learned protocol for describing organisms (either from professors or from previously published literature).
  • If we focus in on one area of the tree…
  • The resolution is even finer in many areas of the tree, with taxa clumping at the family level.
  • But what is happening here?     
  • When I looked up these two publications, I found they were: Bohart, R. M. 1992. The genus Oxybelus in Chile (Hymenoptera: Sphecidae, Crabroninae). Journal of Hymenoptera Research 1:157-163 (id: 68187) and Grissell, E. E., and S. A. Cameron. 2002. A new Leucospis Fabricius (Hymenoptera: Leucospidae), the first reported gregarious species. Journal of Hymenoptera Research 11:271-278 (id: 68482). Eric Grissell is Richard Bohart’s former student. When we looked at aberrant groupings throughout the tree, student/teacher relationships were commonly discovered between the authors. This potentially reveals another trend we instinctively already knew: the terminology used by one professor is reflected in his/her students' subsequent publications. This trend may be particularly strong in some taxon groups that have a historically strong leader in the field who taught or influenced a large number of students. Since people tend to work on a limited subset of taxa for a career, their influence may be extremely strong in filtering the morphological terms and characters used within those groups.
  • If we expand our view of the tree a little further….
  • And map one morphological term on the tree, we can start to reveal some other trends. First, we tend to talk about a morphological term only because of its presence in a group, or in comparison to another group in which the condition is present. Case in point: the term “petiole”. Thynninae tiphiids, formicids, and pemphredonine crabronids all possess the petiolate condition of the first metasomal segment. Paper 68361 described a new genus of Mutillidae that has this condition. Thus, the term appears in the publications in which the taxa being described possess that character. Secondly, mapping terms on a tree can help us (the Hymenoptera Anatomy Ontology group) discover homonyms. Both the leucospid (Grissell) and sphecid (Bohart) papers use the term petiole to reference metasomal tergum 1, although it is not particularly petiolate in shape. This definition varies from the “narrowly constricted” definition of petiole inferred in the other publications. Discovering and cataloging homonyms is an important part of the HAO project. The potential to find homonyms becomes obvious when we look at the last grouping that contains the term petiole. Why did the authors of the pergid and argid descriptions use the term? On finding the term in the publications, we discover that “petiole” in the pergid papers refers to the location on the leaf in which these sawflies oviposit, and in the argid paper it is a larval term.
  • So what do these findings mean for the International Society of Hymenopterists? And the future of our work? In the next session we will address moving to an open access journal. There are many potential benefits to having publications digital and freely available. In the context of the Hymenoptera Anatomy Ontology and this talk, we can see another use for our publications: discovering trends, novel hypotheses and new associations between the groups we work with, based on the language we use to describe those groups. Natural language alone, because it carries no reference to the concept or definition we intend for a word, strongly limits the conclusions we can draw from our work. However, by adding annotations to our publications, we can increase the value of all of our efforts. One of the goals of the Hymenoptera Anatomy Ontology project is to produce tools to annotate our publications, increasing their value now and for future hymenopterists.
  • The resulting goal is to create less work, with greater benefit, for those stewards in charge of describing our biodiversity.
  • Thanks…
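
The term-capture step described in the notes above (match article text against known terms, hand everything unmatched to a reviewer, and report each term as an occurrence count plus an article count) can be sketched roughly as follows. This is a minimal illustration and not the project's actual software: the function names `scan_article` and `term_statistics` are hypothetical, the sample texts are made up, and only single lowercase words are handled, whereas the real process also dealt with multi-word strings.

```python
import re
from collections import Counter


def scan_article(text, known_terms):
    """Split OCR text into lowercase words and sort them into
    matched terms and unmatched candidates (both with counts)."""
    words = re.findall(r"[a-z]+", text.lower())
    known = {t.lower() for t in known_terms}
    matched = Counter(w for w in words if w in known)
    # Unmatched words are only candidates; a person decides which are terms.
    candidates = Counter(w for w in words if w not in known)
    return matched, candidates


def term_statistics(articles, known_terms):
    """Return {term: (total occurrences, number of articles containing it)}."""
    totals, article_counts = Counter(), Counter()
    for text in articles.values():
        matched, _ = scan_article(text, known_terms)
        totals.update(matched)                 # add occurrence counts
        article_counts.update(matched.keys())  # one per article per term
    return {t: (totals[t], article_counts[t]) for t in totals}


# Made-up sample texts, for illustration only.
articles = {
    "a1": "Propodeum with median carina; carina distinct.",
    "a2": "Fore wing hyaline, carina absent.",
}
known = {"carina", "wing", "propodeum"}
stats = term_statistics(articles, known)
_, candidates = scan_article(articles["a1"], known)
```

In this toy run `stats["carina"]` pairs a total count with an article count, in the same way as the "(3638, 160)" style pairs on the slides, and `candidates` holds the words a reviewer would still have to vet.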

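
The presence/absence matrix described in the notes (terms as characters, articles as terminals, keeping only terms found in more than one article) could be assembled along these lines. This is a sketch under stated assumptions rather than the script the authors used: `build_matrix` and `to_tnt` are hypothetical names, the article texts are invented, terminal names are assumed to contain no spaces, and the output follows the commonly used `xread` layout (character count, then taxon count, then one row per terminal), which should be checked against the TNT documentation before use.

```python
import re


def build_matrix(articles, terms, min_articles=2):
    """Score each term as present (1) or absent (0) in each article.
    Terms found in fewer than `min_articles` articles are dropped."""
    words = {name: set(re.findall(r"[a-z]+", text.lower()))
             for name, text in articles.items()}
    kept = sorted(t for t in terms
                  if sum(t in w for w in words.values()) >= min_articles)
    rows = {name: "".join("1" if t in w else "0" for t in kept)
            for name, w in words.items()}
    return kept, rows


def to_tnt(kept, rows, title="term presence/absence"):
    """Format the matrix as text in an xread-style block (assumed layout)."""
    lines = ["xread", f"'{title}'", f"{len(kept)} {len(rows)}"]
    lines += [f"{name} {row}" for name, row in rows.items()]
    lines += [";", "proc /;"]
    return "\n".join(lines) + "\n"


# Invented article texts and placeholder terminal names, for illustration only.
articles = {
    "art_a": "Petiole short; carina present.",
    "art_b": "Petiole long, wing clear.",
    "art_c": "Carina and wing.",
}
terms = {"petiole", "carina", "wing", "ovipositor"}
kept, rows = build_matrix(articles, terms)
tnt_text = to_tnt(kept, rows)
```

Here "ovipositor" is dropped because it occurs in no article, so three characters remain. Sorting the kept terms fixes the column order, which makes it possible to trace a character in the resulting tree back to its term.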
Presentation Transcript

  • Building the Hymenoptera Anatomy Ontology through exploration of the Journal of Hymenoptera Research. Katja Seltmann, Matthew Bertone, Matthew J. Yoder, István Mikó, Elizabeth Macleod, Andrew Ernst, Andrew R. Deans
  • Volumes: 1-16 Years: 1992-2007 The opportunity…
    • Database (infrastructure)
    • Terms used in hymenoptera morphology
    • JHR Volumes 1-16 are online and processed using optical character recognition (OCR) software through the Biodiversity Heritage Library
  • Volumes: 1-16 Years: 1992-2007 We were wondering…
    • Can we find new terms for the HAO by text extraction?
    • Look for ways we as a community do things. Is it really true that terminology follows phylogeny?
  • How we captured terms from the Journal of Hymenoptera Research…
    • Download articles from Biodiversity Heritage Library (http://www.biodiversitylibrary.org)
    • Put text in database (MX)
    • Match the article text to the words we know are terms (also cataloged in the same database)
    • Add new terms based on what is NOT matched
    • People made decisions
    353 articles
  • As of June 1, 2010, from 353 articles: 2121 morphological terms; 643 qualitative terms; 2065 terms from JHR are not defined as concepts. Floating without definition!
  • carina (3638, 160) wing (3297, 194) setae (3294, 171) vein (2891, 141) cell (2855, 202) seta (2545, 55) eye (2438, 186) segment (2415, 159) tergum (2381, 137) hind (2209, 172) larva (1751, 113) propodeum (1617, 184 ) tooth (1604, 110) punctures (1490, 96) clypeus (1482, 175) segments (1422, 159) flagellomere (1392, 91) tergite (1371, 87) mandible (1369, 143) antenna (1365, 164) body (1359, 244) region (1289, 214) tibia (1261, 129) leg (1244, 101) ovipositor (1230, 127) ocellus (1218, 116) larvae (1214, 161) scutellum (1201, 159) line (1166, 147) lobe (1160, 133) mesosoma (1137, 159) longitudinal (1131, 161) scape (1127, 133) legs (1072, 202) carinae (1014, 118) pronotum (1011, 162) terga (1002, 122) forewing (988, 132) antennal (966, 168) metasoma (960, 168)
  • Qualifying terms: spatial, adjectives, comparative. posterior (2694, 216) dorsal (2654, 216) anterior (2475, 221) slightly (2247, 227) small (2048, 284) short (1930, 249) apex (1894, 192) smooth (1817, 174) large (1629, 266) distinct (1487, 201) transverse (1486, 173) similar (1476, 276) base (1471, 200) broad (1394, 178) half (1357, 207) separated (1217, 182) single (1097, 243) rounded (1037, 158) dorsally (1017, 146) nearly (990, 185) shiny (980, 83) inner (950, 158) shorter (938, 177) few (874, 239) elongate (859, 147) lower (834, 188)
  • Look at the data a different way…
    • Terminals are taxa discussed in articles
      • Use only articles that have the words “description of” in the title
      • Holes: Ichneumonoidea (49), Chalcidoidea (38), Vespoidea (36), Apoidea (36), Symphyta (9), Cynipoidea (7), Chrysidoidea (4), Stephanidae (1), Mymarommatidae (1)
    • Characters are presence or absence of a term
      • Use only terms that occurred in more than one article
    • Created a matrix excluding spatial and qualifying words
      • (1162 terms, 181 terminals)
    • TNT analysis
      • xmult /level 7 replications 5 hits 5
      • nelsen
  • http://tiny.cc/p0aan
  • http://tiny.cc/p0aan student
  • Petiole: http://tiny.cc/p0aan
  • What does this mean to ISH…
    • Next session addresses this…moving to open access journal
    • Things we can do in our publications (in the form of annotations) that can make data synthesis easier and reduce the need to repeat work.
  •  
    • funding:
      • Advances in Biological Informatics ( NSF DBI-0850223 )
      • NESCent ( NSF EF-0423641 ) 
      • Morphbank ( NSF DBI-0446224 )
      • HymAToL ( NSF EF-0337220 )
      • PEET: Monographic research on parasitic Hymenoptera ( NSF DEB-0328922 )
    • intellect and enthusiasm:
      • Biodiversity Heritage Library, Rick Prelinger
      • International Society of Hymenopterists
    • NESCent
    • Other ontology projects
      • Deans Lab (Barb Sharanowski, Trish Mullins, Bob Blinn, Rinchhuanawma, Lydia Abernethy)
    Acknowledgments http://tiny.cc/p0aan