Automatic extraction of microorganisms and their habitats from free text using text-mining workflows

  • Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
    BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  • Outline of the talk
    • Motivation
    • Experiments
    • Results & inferences
    • Discussion
    • Current work
  • Motivation
    • In the study of symbiotic relationships, host-microbe interactions play an important role
    • To date, there is no comprehensive database regarding microbe-habitat relations, but there is an explosion in the numbers of taxa
    • With this, there is an urgent need for automated host-microbe relation extraction
  • Experiments: relevant work
    • Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Nobata]
    • Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  • Experiments: our approach
    [Diagram: Text processing (free-text articles, pdf) → Named entity recognition (habitats & organisms) → Relation mining]
    Employ text mining workflows consisting of
    • a text/pdf processor
    • a named entity recognizer to identify microorganisms and their habitats
    • a relation mining component to extract sentences which express this relation
  • Experiments: our approach
    • The named entity recognizer used a hybrid dictionary and machine learning based approach
      - It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
      - The CRFs used a linear chain model and were trained on a corpus consisting of 32 full papers
  • Experiments: our approach
      - The feature set included
        • Lexical information about the word, e.g. the word itself, its POS tag etc.
        • Orthographic information, e.g. any uppercase letters, numbers
        • Contextual information: information about the two words preceding and the two words succeeding the word
    • For the relation mining component, a linear chain CRF was trained using
      - Occurrence of organisms and habitats
      - Contextual information of all the entities in a sentence
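The feature set described on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<PAD>` marker and the dictionary-lookup feature are assumptions made for illustration, and the POS tag is omitted because it would require an external tagger. The output is a list of per-token feature dictionaries of the kind a linear-chain CRF toolkit could consume.

```python
def token_features(tokens, i, dictionary=frozenset()):
    """Build a feature dict for tokens[i] (hypothetical helper, not from the talk)."""
    word = tokens[i]
    feats = {
        # Lexical information (POS tag left out: it needs an external tagger)
        "word": word.lower(),
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
        # Dictionary evidence, assumed here to be a simple membership test
        "in_dict": word.lower() in dictionary,
    }
    # Contextual information: two words before and two words after
    for offset in (-2, -1, 1, 2):
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"w{offset:+d}"] = tokens[j].lower() if in_range else "<PAD>"
    return feats


def sentence_features(tokens, dictionary=frozenset()):
    """Feature dicts for every token of one sentence."""
    return [token_features(tokens, i, dictionary) for i in range(len(tokens))]


if __name__ == "__main__":
    sentence = ["H.", "pylori", "colonizes", "the", "stomach"]
    for feats in sentence_features(sentence, dictionary={"pylori"}):
        print(feats)
```

A word such as "stomach" produces no positive orthographic feature in this scheme, so only the lexical, dictionary and context features can separate it from ordinary vocabulary.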
  • Results and inference
    Performance of our named entity recognizer on a 9-fold cross-validation

    Class of entities   Precision (%)   Recall (%)   F-score (%), 2PR/(P+R)
    Organisms           84              79           81
    Habitats            68              55           61

    (Improved results since the time of submission.)
    • Microorganisms have been recognized quite well
    • Habitat recognition is modest
    • One of the observations is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters, hyphens etc.
    • Instances such as "abscess" and "lung" were typical misses
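As a quick check of the F-score column, the formula 2PR/(P+R) given in the table header can be applied to the reported precision and recall. The function name is an assumption; the numbers are the ones on the slide.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for name, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
        print(f"{name}: F = {f_score(p, r):.1f}%")
```

Rounded to whole percentages, these give 81 for organisms and 61 for habitats, matching the table.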
  • Results and inference: relation mining results
    • For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
    • Most of the false negatives (sentences which should have been picked up, but were not) were due to the noise in pdf-to-text conversion
    • Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  • Discussion
    • The workflows we have developed bring together pdf conversion, machine learning and dictionaries
      - The performance of the individual components obviously has an impact on the overall performance
      - Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  • Discussion
    • Pdf-to-text sentence examples
      - "These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome." [Clearly, data from a table and figure corrupted the sentence]
      - "airborne pigs" [noisy conversion of a table discussing airborne diseases in pigs]
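To make the kind of corruption in the first example concrete, here is a rough heuristic sketch for flagging sentences that look contaminated by table or figure residue. It is not a technique from the talk: the function name, the 10% threshold and both regular expressions are assumptions chosen purely for illustration.

```python
import re

# Tokens made only of digits, commas and periods, e.g. "100,000" or "26695"
NUMERIC = re.compile(r"^[\d.,]+$")
# A caption marker such as "Figure 2" or "Table 1" anywhere in the sentence
CAPTION = re.compile(r"\b(Figure|Table)\s+\d+\b")


def looks_noisy(sentence, max_numeric_ratio=0.1):
    """Return True if the sentence looks polluted by figure or table residue."""
    tokens = sentence.split()
    if not tokens:
        return True
    numeric = sum(
        1 for t in tokens
        if NUMERIC.match(t) and any(c.isdigit() for c in t)
    )
    too_many_numbers = numeric / len(tokens) > max_numeric_ratio
    return too_many_numbers or bool(CAPTION.search(sentence))


if __name__ == "__main__":
    corrupted = (
        "These mechanisms may have evolved in bacterial pathogens to increase "
        "the frequency of phenotypic variation in genes involved in 1 100,000 "
        "200,000 300,000 1,600,00 Figure 2 Circular representation of the "
        "H. pylori 26695 chromosome."
    )
    print(looks_noisy(corrupted))
    print(looks_noisy("Helicobacter pylori colonizes the human stomach."))
```

A surface check like this would not catch the second example ("airborne pigs"), which contains no numbers or caption markers, and it would also flag clean sentences that legitimately cite a figure. That is consistent with the point that pdf conversion is the limiting factor.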
  • Discussion
    • The CRF model for habitats is evidently weak
      - There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
      - Results reflect initial success
    • Relation mining is a hyper-classification task and perhaps it is prone to cascading errors
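Character-level n-grams, mentioned above as a planned feature, can be sketched as follows. The function name, the boundary markers `^` and `$`, and the default n-gram sizes are assumptions for illustration, not details from the talk.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Character n-grams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


if __name__ == "__main__":
    print(char_ngrams("lung"))
    print(char_ngrams("abscess"))
```

Sub-word features of this kind give a classifier something to learn from for words such as "lung" or "abscess", which carry no uppercase letters, digits or hyphens.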
  • Current work
    • Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
    • We are also working on less noisy conversion techniques for pdf-to-text
    • Export the workflows to the public domain so that scientists across the spectrum can use our workflows
  • Snapshot of relation miner
  • References
    • Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. Embo Workshop, Granada, Spain, 2004.
    • Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
    • Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
    • Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.