Automatic extraction of microorganisms and their habitats from free text using text-mining workflows

BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
  
Outline of the talk
•  Motivation
•  Experiments
•  Results & inferences
•  Discussion
•  Current work
  
Motivation
•  In the study of symbiotic relationships, host-microbe interactions play an important role
•  To date, there is no comprehensive database of microbe-habitat relations, even though the number of known taxa is growing explosively
•  There is therefore an urgent need for automated host-microbe relation extraction
  
Experiments: relevant work
•  Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community at large [Sasaki, Hanisch, Nobata]
•  Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
  
Experiments: our approach
[Figure: workflow diagram. Input: free-text articles, pdf. Stages: text processing, named entity recognition, relation mining. Output: habitats & organisms.]

We employ text-mining workflows consisting of
•  A text/pdf processor
•  A named entity recognizer to identify microorganisms and their habitats
•  A relation mining component to extract sentences which express this relation
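
The three stages listed above can be sketched end to end as follows. This is only an illustrative sketch: the toy dictionaries, the function names and the naive sentence splitter are assumptions, and plain dictionary lookup plus sentence-level co-occurrence stand in for the CRF-based recognizer and relation miner used in the actual workflow.

```python
import re

# Hypothetical toy dictionaries; a real workflow would use far larger resources.
ORGANISMS = {"helicobacter pylori", "escherichia coli"}
HABITATS = {"stomach", "gut", "lung"}


def split_sentences(text):
    """Naive text processor: split after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_entities(sentence, dictionary):
    """Stand-in entity recognizer: case-insensitive substring lookup."""
    lowered = sentence.lower()
    return sorted(term for term in dictionary if term in lowered)


def mine_relations(text):
    """Stand-in relation miner: keep sentences mentioning both entity types."""
    hits = []
    for sentence in split_sentences(text):
        organisms = find_entities(sentence, ORGANISMS)
        habitats = find_entities(sentence, HABITATS)
        if organisms and habitats:
            hits.append((sentence, organisms, habitats))
    return hits
```

Calling `mine_relations` on a short passage returns only the sentences in which an organism and a habitat co-occur, together with the matched terms.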
  	
  
Experiments: our approach
•  The named entity recognizer used a hybrid approach combining dictionaries and machine learning
   –  It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
   –  The CRFs used a linear-chain model and were trained on a corpus of 32 full papers
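
For reference, the standard textbook form of a linear-chain CRF is given below. The notation is generic and not taken from the slides: $x$ is the token sequence of length $T$, $y$ the label sequence, $f_k$ the feature functions and $\lambda_k$ their learned weights.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Dictionary information enters such a model simply as additional feature functions, for example one that fires when the token at position $t$ matches a dictionary entry.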
  
Experiments: our approach
   –  The feature set included
      •  Lexical information about the word, e.g. the word itself, POS tag
      •  Orthographic information, e.g. any uppercase letters, numbers
      •  Contextual information: information about the two words preceding and succeeding the word
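
A minimal sketch of how such a per-token feature set might be assembled is shown below. The function name, the feature keys and the input format (parallel lists of tokens and POS tags) are assumptions for illustration; the slides do not specify the exact encoding passed to the CRF.

```python
def token_features(tokens, pos_tags, i, window=2):
    """Build a feature dict for tokens[i] (feature names are hypothetical)."""
    word = tokens[i]
    features = {
        # Lexical information
        "word": word.lower(),
        "pos": pos_tags[i],
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }
    # Contextual information: up to `window` words on each side
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[j].lower()
    return features
```

Positions that fall outside the sentence are simply skipped, so tokens near a sentence boundary get fewer context features.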
  	
  

•  For the relation mining component, a linear-chain CRF was trained using
   –  Occurrence of organisms and habitats
   –  Contextual information of all the entities in a sentence
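
The two kinds of relation-mining evidence listed above could be encoded roughly as follows. This is a sketch under stated assumptions: the label names "ORG", "HAB" and "O", the feature keys and the one-word context window are all hypothetical choices, not details given in the slides.

```python
def relation_features(tokens, labels, window=1):
    """Sentence-level features from per-token entity labels.

    `labels` uses the hypothetical tags "ORG", "HAB" and "O" (outside).
    """
    features = {
        # Occurrence of organisms and habitats
        "has_organism": "ORG" in labels,
        "has_habitat": "HAB" in labels,
    }
    for i, label in enumerate(labels):
        if label == "O":
            continue
        # Contextual information: non-entity words around each entity token
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if labels[j] == "O":
                features[f"context[{label}]={tokens[j].lower()}"] = True
    return features
```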
  	
  	
  
Results and inference
Performance of our named entity recognizer on a 9-fold cross-validation

Class of entities   Precision (%)   Recall (%)   F-score (%), 2PR/(P+R)
Organisms           84              79           81
Habitats            68              55           61

(improved results from the time of submission)
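
As a quick check of the F-score column, the formula 2PR/(P+R) can be applied to the reported precision and recall values. The function name is an assumption; the numbers are those in the table.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (%) as reported for the named entity recognizer.
for name, p, r in [("Organisms", 84, 79), ("Habitats", 68, 55)]:
    print(f"{name}: F = {f_score(p, r):.1f}")
# prints:
# Organisms: F = 81.4
# Habitats: F = 60.8
```

Rounded to whole percentages, these agree with the reported F-scores of 81 and 61.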
  
•  Microorganisms were recognized quite well
•  Habitat recognition is modest
•  One observation is that, in free text, descriptions of habitats/hosts are devoid of any salient features such as uppercase letters or hyphens
•  Instances such as "abscess" and "lung" were typical misses
  	
  
Results and inference
Relation mining results
•  For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
•  Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
•  Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
  
Discussion
•  The workflows we have developed bring together pdf conversion, machine learning and dictionaries
   –  The performance of individual components obviously has an impact on the overall performance
   –  Pdf conversion is not trivial by any means, and this component is the most limiting factor for any sentence-based classification task
  
Discussion
•  Pdf-to-text sentence examples
   –  These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome.
      [Clearly, data from a table and figure corrupted the sentence]
   –  airborne pigs
      [noisy conversion of a table discussing airborne diseases in pigs]
  
Discussion
•  The CRF model for habitats is evidently weak
   –  There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
   –  Results reflect initial success
•  Relation mining is a hyper-classification task and perhaps it is prone to cascading errors
  
Current work
•  Work is underway to improve the relation mining component, using bag-of-words and character-level n-grams to augment the feature space
•  We are also working on less noisy conversion techniques for pdf-to-text
•  We plan to export the workflows to the public domain so that scientists across the spectrum can use them
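
The bag-of-words and character-level n-gram features mentioned above can be sketched as follows. The function names, the default n of 3 and the "<" and ">" word-boundary markers are assumptions for illustration; the slides do not give these details.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return all character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_of_words(tokens):
    """Count lowercased tokens, ignoring their order."""
    return Counter(token.lower() for token in tokens)
```

Character n-grams give a model sub-word evidence for terms that lack surface cues: `char_ngrams("lung")` yields `["<lu", "lun", "ung", "ng>"]`, which can be shared with unseen words of similar spelling.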
  
Snapshot of relation miner
[Figure: screenshot of the relation miner]
  




References
•  Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
•  Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
•  Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
•  Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.
  	
  
