Improving Interoperability of Text Mining Tools with BioC

Published in: Technology

  1. Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
     National Center for Biotechnology Information (NCBI)
     National Institutes of Health
  2.
     • Motivation
     • Our Text Mining Tools
     • Building BioC Compatible Tools
     • Results and Conclusions
  3.
     • Building complex text mining applications requires combining different tools developed by different groups
     • Each tool is developed independently
        - Group conventions: data representation, programming, execution environments
     • Heterogeneity in data/text representations limits and slows down
        - tool interoperability, application development, and research and innovation.
  4. EXISTING SOLUTIONS
     • Unstructured Information Management Architecture (UIMA), 2004
     • General Architecture for Text Engineering (GATE), 2009
     • Steep learning curve
     • Substantial development and re-development time
     BIOC
     • Minimal change requirement to existing applications and datasets
     • BioC family
        - XML formats to present text documents and annotations
        - Functions (C++, Java) to read/write documents in BioC format
  5.
     • Motivation
     • Our Text Mining Tools
     • Building BioC Compatible Tools
     • Results and Conclusions
  6. Concept Recognition and Annotation Toolkit
     Input: PubMed abstracts or full-text articles
     • DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
     • tmVar: mutation mentions (F-measure = 91.39%)
     • SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
     • tmChem: chemical mentions (F-measure = 88.27%)
     • GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
     Output: annotations with various bioconcepts
  7. Format columns: PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format
     NER tool             Programming language   Method        Formats
     tmChem (Chemical)    Java, Perl, C++        CRF           √ √
     DNorm (Disease)      Java                   CRF           √ √
     tmVar (Mutation)     Perl, C++              CRF           √ √ √
     SR4GN (Species)      Perl                   Rule-based    √ √ √
     GenNorm (Gene)       Perl                   Statistical   √ √ √
     PubTator             Perl, JavaScript       Web server    √ √
  8.
  9.
     • Official corpus for BioCreative IV GO Task
     • 200 full-text articles along with their gene ontology (GO) annotations
        - evidence sentences
        - gene/protein entities, GO terms, GO evidence codes
     • Developed by expert GO curators via a web-based annotation tool.
  10.
      • Motivation
      • The NCBI Text Mining Toolkit
      • Building BioC Compatible Tools
      • Results and Conclusions
  11.
      • The BioC family
         - XML DTD
            ▪ how to present text documents and annotations (higher-level semantics)
         - C++ and Java libraries
            ▪ functions/classes to read/write documents in BioC format
      • BioC recommendations
         - Full-text articles and annotations
            ▪ Present in BioC XML format
            ▪ Keep in separate files
         - Key file
            ▪ describes how data should be interpreted in the annotation file (lower-level semantics)
            ▪ needs to be created for a specific type of data.
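The BioC XML layout described above can be sketched with Python's standard library. This is a minimal illustration, not the official C++ or Java BioC library: the function name, the PMID, the passage texts, the key file name, and the one-character gap between passages are all assumptions made for the example. The element names (collection, source, date, key, document, id, passage, infon, offset, text) follow the BioC DTD.

```python
import xml.etree.ElementTree as ET


def build_bioc_collection(pmid, title, abstract, date="", key_file="example.key"):
    """Build a minimal BioC-style collection holding one document.

    Hypothetical helper: the name and the default arguments are
    illustrative and do not come from the BioC libraries.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = key_file  # names the key file

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid

    offset = 0
    for passage_type, text in (("title", title), ("abstract", abstract)):
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", {"key": "type"}).text = passage_type
        ET.SubElement(passage, "offset").text = str(offset)
        ET.SubElement(passage, "text").text = text
        # Assumption: passages are separated by one character in the document.
        offset += len(text) + 1
    return collection


# Made-up PMID and text, used only to exercise the helper.
collection = build_bioc_collection("12345", "A title.", "An abstract.")
xml_string = ET.tostring(collection, encoding="unicode")
```

Annotations would go in a separate file with the same skeleton, in line with the recommendation to keep articles and annotations apart.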
  12.
      • Steps taken to make our tools BioC-compliant
         - Created the key file
         - Modified the input/output formats of the tools
            ▪ Added the BioC format as a new option for input/output
      • Challenges
         - Defining an appropriate key file
         - Offset calculation
         - Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
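Two of the challenges above, offset calculation and Unicode to ASCII conversion, interact: if the conversion changes the length of the text, offsets computed beforehand no longer line up. The sketch below is one possible approach and not the method used in the tools. Both function names are hypothetical, the "?" placeholder is an arbitrary choice, and offsets are assumed to be counted from the start of the document.

```python
import unicodedata


def to_ascii(text):
    """Approximate Unicode text in ASCII.

    Accented letters lose their accents; characters with no ASCII
    decomposition become '?'. Note that some characters (ligatures,
    for example) expand to several ASCII characters, so the length
    of the text can change.
    """
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        out.append(ascii_part if ascii_part else "?")
    return "".join(out)


def mention_offsets(passage_text, passage_offset, mention):
    """Return the document-level offset of every occurrence of a mention."""
    offsets = []
    start = passage_text.find(mention)
    while start != -1:
        offsets.append(passage_offset + start)
        start = passage_text.find(mention, start + 1)
    return offsets


# Convert first, then compute offsets on the converted text.
ascii_text = to_ascii("naïve β-catenin")
found = mention_offsets("BRCA1 and BRCA1 mutations", 100, "BRCA1")
```

Running the conversion before computing offsets keeps the stored offsets consistent with the text that is actually written to the BioC file.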
  13.
      • Motivation
      • Our Text Mining Tools
      • Building BioC Compatible Tools
      • Results and Conclusions
  14.
      • Common key file for all tools, since they are designed for similar types of data
         - id: PubMed ID
         - passage: e.g., title, abstract
         - Offset of the passage
         - ID of the bioconcept
         - Offset of the bioconcept
         - Length of the bioconcept
         - Mention of the bioconcept
         - date: the time the annotation was created
  15. Format columns: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm
      NER tool   Bioconcept   Formats
      tmChem     Chemical     √ √ √
      DNorm      Disease      √ √ √
      tmVar      Mutation     √ √ √ √
      SR4GN      Species      √ √ √ √
      GenNorm    Gene         √ √ √ √
      PubTator   N/A          √ √ √
      Our Text Mining Toolkit is available for public access:
      http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/
  16. BioC Article File → tools → BioC Annotation File
      • DNorm: identifying disease
      • tmVar: identifying mutation
      • tmChem: identifying chemical
      • SR4GN: identifying species
      • GenNorm: identifying gene
  17.
      • id: PubMed ID
      • passage: title
      • date: the time the file was downloaded
      • passage: abstract
  18.
      • ID of the bioconcept
      • Offset of the bioconcept
      • Length of the bioconcept
      • Mention of the bioconcept
      • Type of the bioconcept
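The annotation fields listed above (ID, offset, length, mention, and type of the bioconcept) can be read back from a BioC XML file with a few lines of Python. This is a sketch under assumptions: the sample document, its PMID, and the function name are invented for illustration, and the code uses only the standard library, not the BioC C++ or Java classes.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the BioC layout; the PMID and text are illustrative only.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date></date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Mutations in BRCA1 and breast cancer.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="13" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation with the fields listed on the slide."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        pmid = document.findtext("id")
        for annotation in document.iter("annotation"):
            location = annotation.find("location")
            rows.append({
                "pmid": pmid,
                "id": annotation.get("id"),
                "type": annotation.findtext("infon[@key='type']"),
                "offset": int(location.get("offset")),
                "length": int(location.get("length")),
                "mention": annotation.findtext("text"),
            })
    return rows


rows = read_annotations(SAMPLE)
```

Because every tool in the toolkit shares one key file, a reader like this would not need per-tool branches; only the value of the type field differs between tools.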
  19.
      • Time: the time the annotation was created
      • ID: PMID of the article
      • GO term: e.g., receptor-mediated endocytosis
      • GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)
      • Curatable entity: i.e., gene or gene product
      • Text: GO evidence text
  20.
      • Our experience with BioC
         - Minimal changes required to prepare BioC versions
         - Easy to learn and use
         - Improved interoperability within the toolkit
      • Implications
         - Improved interoperability
            ▪ With other tools to build sophisticated applications
         - The key file could evolve as a standard for concept recognition and normalization tasks
         - Anticipate broader usage of our tools as BioC gains popularity
  21.
      • BioC Developers
         - W. John Wilbur
         - Rezarta Islamaj Doğan
         - Donald Comeau
      • Intramural Research Program of the NIH, National Library of Medicine
  22.
      • Chih-Hsuan Wei
         - weic4@ncbi.nlm.nih.gov
         - +1 301-594-5290
