Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
Transcript

  • 1. Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
      BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  • 2. Outline of the talk
      • Motivation
      • Experiments
      • Results & inferences
      • Discussion
      • Current work
  • 3. Motivation
      • In the study of symbiotic relationships, host-microbe interactions play an important role
      • To date, there is no comprehensive database of microbe-habitat relations, but there is an explosion in the number of taxa
      • With this, there is an urgent need for automated host-microbe relation extraction
  • 4. Experiments: relevant work
      • Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Chikashi]
      • Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  • 5. Experiments: our approach
      [Workflow diagram: text processing (free-text articles, pdf) → named entity recognition (habitats & organisms) → relation mining]
      Employ text-mining workflows consisting of
      • a text/pdf processor
      • a named entity recognizer to identify microorganisms and their habitats
      • a relation mining component to extract sentences which express this relation
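
The three stages above can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the dictionaries, the regex sentence splitter and the co-occurrence rule are hypothetical stand-ins, not the CRF-based components described in the talk.

```python
import re

# Hypothetical toy dictionaries; the dictionaries used in the talk are not shown.
ORGANISMS = {"helicobacter pylori", "h. pylori"}
HABITATS = {"stomach", "lung", "abscess"}


def split_sentences(text):
    """Rough sentence splitter standing in for the text/pdf processor."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


def find_mentions(sentence, dictionary):
    """Dictionary lookup standing in for the named entity recognizer."""
    lowered = sentence.lower()
    return sorted(term for term in dictionary if term in lowered)


def extract_relations(text):
    """Keep sentences that mention both an organism and a habitat.

    Plain co-occurrence is a simplification; the talk trains a CRF for this step.
    """
    relations = []
    for sentence in split_sentences(text):
        organisms = find_mentions(sentence, ORGANISMS)
        habitats = find_mentions(sentence, HABITATS)
        if organisms and habitats:
            relations.append((sentence, organisms, habitats))
    return relations


text = "Helicobacter pylori colonizes the human stomach. The corpus contained full papers."
for sentence, organisms, habitats in extract_relations(text):
    print(organisms, habitats, sentence)
```

Each stage is a separate function so that one component can be swapped (for example, a trained tagger in place of `find_mentions`) without touching the others.
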
  • 6. Experiments: our approach
      • The named entity recognizer used a hybrid dictionary and machine-learning based approach
          – It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
          – The CRFs used a linear-chain model and were trained on a corpus consisting of 32 full papers
  • 7. Experiments: our approach
          – The feature set included
              • Lexical information about the word, e.g. the word itself, its POS tag, etc.
              • Orthographic information, e.g. any uppercase letters or numbers
              • Contextual information: information about the two words preceding and succeeding the word
      • For the relation mining component, a linear-chain CRF was trained using
          – Occurrence of organisms and habitats
          – Contextual information about all the entities in a sentence
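
A minimal sketch of how such per-token features could be assembled before being handed to a CRF toolkit. The function name, feature keys, padding symbol and the idea of passing POS tags in from an external tagger are all assumptions made for illustration; the talk does not show its actual feature encoding.

```python
def token_features(tokens, pos_tags, index, dictionary):
    """Build a feature dict for tokens[index] (hypothetical feature names).

    pos_tags is assumed to come from an external POS tagger, and
    dictionary is assumed to be a set of lowercased entity names.
    """
    word = tokens[index]
    features = {
        # Lexical information
        "word": word.lower(),
        "pos": pos_tags[index],
        # Orthographic information
        "has_upper": any(ch.isupper() for ch in word),
        "has_digit": any(ch.isdigit() for ch in word),
        # Dictionary information (the "hybrid" part of the approach)
        "in_dictionary": word.lower() in dictionary,
    }
    # Contextual information: two words before and two words after
    for offset in (-2, -1, 1, 2):
        neighbour = index + offset
        if 0 <= neighbour < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[neighbour].lower()
        else:
            features[f"word[{offset:+d}]"] = "<PAD>"
    return features


tokens = ["H.", "pylori", "colonizes", "the", "stomach"]
pos_tags = ["NNP", "NNP", "VBZ", "DT", "NN"]
print(token_features(tokens, pos_tags, 4, {"stomach"}))
```

Note that a habitat word such as "stomach" yields no orthographic signal here (no uppercase, no digits), so in this sketch only the dictionary and context features carry information for it.
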
  • 8. Results and inference
      Performance of our named entity recognizer on a 9-fold cross-validation

      Class of entities   Precision (%)   Recall (%)   F-score (%), 2PR/(P+R)
      Organisms           84              79           81
      Habitats            68              55           61

      Improved results from the time of submission
      • Microorganisms have been recognized quite well
      • Habitat recognition is modest
      • One of the observations is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens, etc.
      • Instances such as "abscess" and "lung" were typical misses
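
The F-score column can be checked against the formula 2PR/(P+R) quoted in the table header. The short sketch below only re-derives the reported figures from the reported precision and recall; the function name is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) as reported in the table above.
for name, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
    print(f"{name}: F = {f_score(p, r):.1f}%")
```

This gives about 81.4% for organisms and 60.8% for habitats, which round to the 81 and 61 shown in the table.
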
  • 9. Results and inference
      Relation mining results
      • For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
      • Most of the false negatives (sentences which should have been picked up, but were not) were due to the noise in pdf-to-text conversion
      • Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  • 10. Discussion
      • The workflows we have developed bring together pdf conversion, machine learning and dictionaries
          – The performance of individual components obviously has an impact on the overall performance
          – Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  • 11. Discussion
      • Pdf-to-text sentence examples
          These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome.
          [Clearly, data from a table and figure corrupted the sentence]
          airborne pigs
          [noisy conversion of a table discussing airborne diseases in pigs]
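
One simple way to flag sentences like the first example is to measure how many tokens are bare numbers. This is a hypothetical heuristic sketched for illustration, not a technique from the talk, and the 0.1 threshold is an arbitrary assumption.

```python
import re

# A token counts as "numeric" if it starts with a digit and contains only
# digits, commas and dots (e.g. "100,000" or "26695").
NUMBER_LIKE = re.compile(r"^\d[\d,.]*$")


def numeric_token_ratio(sentence):
    """Fraction of whitespace-separated tokens that look like numbers."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for token in tokens if NUMBER_LIKE.match(token))
    return numeric / len(tokens)


def looks_noisy(sentence, threshold=0.1):
    """Flag a sentence as likely table/figure residue (assumed threshold)."""
    return numeric_token_ratio(sentence) > threshold


corrupted = (
    "These mechanisms may have evolved in bacterial pathogens to increase "
    "the frequency of phenotypic variation in genes involved in 1 100,000 "
    "200,000 300,000 1,600,00 Figure 2 Circular representation of the "
    "H. pylori 26695 chromosome."
)
print(looks_noisy(corrupted))
print(looks_noisy("Host-microbe interactions play an important role."))
```

A check like this would not catch the second example ("airborne pigs"), which contains no numbers at all, so it could at best complement a cleaner conversion step.
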
  • 12. Discussion
      • The CRF model for habitats is evidently weak
          – There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
          – Results reflect initial success
      • Relation mining is a hyper-classification task and perhaps it is prone to cascading errors
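
As an illustration of what character-level n-gram features might look like for a habitat word, here is a minimal sketch. The boundary markers `^` and `$` and the choice of 2- and 3-grams are assumptions; the talk does not specify them.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return character n-grams of a lowercased word with boundary markers."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


print(char_ngrams("lung"))
```

Features like these give the model sub-word evidence for words that carry no uppercase letters, digits or hyphens.
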
  • 13. Current work
      • Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
      • We are also working on less noisy conversion techniques for pdf-to-text
      • Export the workflows to the public domain so that scientists across the spectrum can use them
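
A rough sketch of a sentence-level feature map that combines bag-of-words counts with character trigram counts. The `bow=` and `ng=` key prefixes, the tokenization regex and the trigram size are hypothetical choices made for illustration only.

```python
import re
from collections import Counter


def sentence_features(sentence, n=3):
    """Count word features and within-word character n-grams for a sentence."""
    features = Counter()
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    for word in words:
        features[f"bow={word}"] += 1
        for i in range(len(word) - n + 1):
            features[f"ng={word[i:i + n]}"] += 1
    return features


features = sentence_features("H. pylori colonizes the stomach")
print(features["bow=stomach"], features["ng=sto"])
```

Keeping both feature families in one sparse counter makes it easy to add them alongside the entity-occurrence features already used for relation mining.
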
  • 14. Snapshot of relation miner
      [Screenshot of the relation miner]
      References
      • Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
      • Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
      • Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
      • Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.
