POLBASE HARVESTER
Machine Learning Approaches to Find More DNA Polymerase Papers
polbase.neb.com

Ashwin Natarajan
Brad Langhorst
  
	
  
Polbase repository
•  The DNA Polymerase Database (Polbase) intends to serve as an open resource for information about existing DNA polymerases.
•  This information is sourced from several public and private databases.

[Figure: number of reference papers discovered per year. x-axis: Year (1955 to 2015); y-axis: Paper count (0 to 350)]
  
Objective
Expand the Polbase reference repository by identifying and extracting more scientific papers related to DNA polymerases.

•  Target features:
   •  Automatic discovery of new relevant papers
   •  A human confirms imported papers
   •  Minimize the import of irrelevant papers
   •  The system should self-learn and respond to expert classification as well
  
A simple binary classification problem
A computer tries to classify cars and boats.

Properties        Car   Boat
1.) Wheel         Y     N
2.) Hull          N     Y
3.) Mainsail      N     Y
4.) Headlights    Y     N

Filter: classification rule
Classes: cars, boats

Well..! Odd ones..
How can I classify this vehicle...???
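The car and boat rule can be sketched in a few lines of Python. This is only an illustration of a property-based classification rule; the function name, the feature sets, and the tie-handling are assumptions made here and do not come from the slides. A vehicle with properties from both columns shows why a fixed rule struggles with the "odd ones".

```python
# Hypothetical property sets mirroring the Car/Boat table on the slide.
CAR_PROPERTIES = {"wheel", "headlights"}
BOAT_PROPERTIES = {"hull", "mainsail"}

def classify_vehicle(properties):
    """Return 'car', 'boat', or 'unknown' by counting matching properties."""
    observed = set(properties)
    car_score = len(observed & CAR_PROPERTIES)
    boat_score = len(observed & BOAT_PROPERTIES)
    if car_score > boat_score:
        return "car"
    if boat_score > car_score:
        return "boat"
    return "unknown"  # tie: the fixed rule cannot decide

print(classify_vehicle(["wheel", "headlights"]))  # car
print(classify_vehicle(["hull", "mainsail"]))     # boat
print(classify_vehicle(["wheel", "hull"]))        # unknown (an "odd one")
```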
  
Defining the target
•  Find the key indicators (for the classification rule) that give the highest likelihood of correctly classifying all the papers.
•  Different approaches can be used to identify key indicators:
   •  Text search
   •  Statistical modeling
   •  Have a subject matter expert read and classify the papers

•  Key indicators are generated by statistical and machine learning algorithms
  
Different approaches for defining the classification rule
[Diagram: PubMed papers pass through one of three filters and are split into DNA-polymerase papers and non-DNA-polymerase papers]
•  Approach 1: Text search for the presence of MeSH terms (currently used by Polbase)
•  Approach 2: Have a subject matter expert read and classify the papers
•  Approach 3: Statistical modeling, a machine-learning-based classifier (proposed system)
  
Approach 1: Existing infrastructure
•  A simple query-based data retrieval system is part of Polbase.
•  The system retrieves papers from PubMed that match the query criteria.
•  Problems:
   •  Some papers are found by the query but are not relevant
   •  Many (?) relevant papers are missed by this simple query
   •  The query system cannot respond to changing nomenclature

[Diagram: PubMed XML feed → text search for the presence of MeSH terms → polbase.neb.com. The search narrows all PubMed literature to DNA-related literature and then to DNA polymerase literature.]
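A MeSH-term text search of this kind can be sketched as a simple set lookup. The slides do not give the actual Polbase query, so the query term, the function name, and the three toy papers below are all assumptions for illustration. The toy papers are chosen to show the problems listed above: an irrelevant paper that matches, and a relevant paper that the fixed query misses.

```python
# Hypothetical query terms; the slides do not list the MeSH terms Polbase uses.
QUERY_MESH_TERMS = {"dna-directed dna polymerase"}

def matches_mesh_query(paper_mesh_terms, query_terms=QUERY_MESH_TERMS):
    """Return True if any MeSH term attached to the paper is in the query set."""
    normalized = {term.strip().lower() for term in paper_mesh_terms}
    return not normalized.isdisjoint(query_terms)

# Toy papers (invented), keyed by what a human curator would say about them.
papers = {
    "relevant and found": ["DNA-Directed DNA Polymerase", "Kinetics"],
    "irrelevant but found": ["DNA-Directed DNA Polymerase", "Polymerase Chain Reaction"],
    "relevant but missed": ["DNA Replication", "Mutagenesis"],
}
for label, mesh_terms in papers.items():
    print(label, matches_mesh_query(mesh_terms))
```

A fixed term list like this has no way to adapt when authors or indexers start using different terms, which is the nomenclature problem named on the slide.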
  
Approach 3: Proposed classifier
[Diagram: PubMed papers → statistical modeling (a machine-learning-based classifier) → DNA-polymerase papers / non-DNA-polymerase papers]
  
Job flow in the classifier
[Diagram: XML data feeds from PubMed → crawler and data management → preprocessing → modeling → filter: scores. PubMed papers are split into DNA-polymerase papers and non-DNA-polymerase papers.]
  
Data transformation in each component of the classifier
[Diagram: data passed along the pipeline]
•  XML data feeds: structured XML files (source files)
•  Data management: relevant data (PubMed ID, title, abstract) for the literature
•  Preprocessing: tokenized list and token frequency (text strings)
•  Modeling: ranks and scores (text strings)
•  Filter (scores): classified and labeled papers

Components in the classifier
Let us take a closer look at each component of the classifier.
  	
  
XML data feed from PubMed
[Diagram: pipeline with the XML data feeds stage highlighted. Sub-component: cron job at a fixed frequency. Data structure: structured XML files (source files).]
•  XML data feeds are downloaded from PubMed at fixed intervals.
•  Each XML source file contains more than 10,000 papers on average.
•  Each XML file is given a unique ID and saved in the repository.
  
 
	
  
Data management
[Diagram: pipeline with the data management stage highlighted. Sub-components: PostgreSQL database; Python crawler using xml.etree.ElementTree and psycopg2. Data structure: relevant data (PubMed ID, title, abstract) for the literature.]
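The extraction step of the crawler can be sketched with the standard-library xml.etree.ElementTree module. This is a minimal sketch, not the Polbase crawler: the sample XML is invented, the function name is hypothetical, and the element names (MedlineCitation, PMID, ArticleTitle, AbstractText) are assumed to follow the MEDLINE/PubMed citation format. Writing the records to PostgreSQL with psycopg2 is left out so that the example runs without a database.

```python
import xml.etree.ElementTree as ET

# Invented two-record sample in the assumed MEDLINE citation layout.
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>Fidelity of a DNA polymerase</ArticleTitle>
      <Abstract><AbstractText>We measure error rates.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article><ArticleTitle>A paper without an abstract</ArticleTitle></Article>
  </MedlineCitation>
</MedlineCitationSet>"""

def extract_records(xml_text):
    """Return one dict per citation with its PubMed ID, title, and abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for citation in root.iter("MedlineCitation"):
        # An abstract may have several AbstractText sections; join them.
        sections = [
            "".join(node.itertext())
            for node in citation.findall("Article/Abstract/AbstractText")
        ]
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle", default=""),
            "abstract": " ".join(sections),
        })
    return records

for record in extract_records(SAMPLE_XML):
    print(record)
```

For source files holding more than 10,000 papers each, a streaming parser such as ET.iterparse may be a better fit than loading the whole file, but that depends on file size and available memory.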
  
 
	
  
Preprocessing
Convert text strings into quantifiable data
[Diagram: pipeline with the preprocessing stage highlighted. Sub-component: feature extraction, quantifying target data using a word-frequency measure and NLP. Data structure: tokenized list and token frequency (text strings).]
•  Look for important terms (similar to the properties of cars and boats)
•  Count the frequency of important terms
  
NDS approach for preprocessing
•  In the Numerical DataSets (NDS) approach, we transform textual information into quantifiable data based on word frequency (term frequency).

For example, let's consider a document:

Document1: “this is a sample of a sentence”
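The term-frequency transformation of Document1 can be sketched as follows. The whitespace tokenizer and the function name are assumptions; the slides do not say how the text was tokenized.

```python
from collections import Counter

def term_frequencies(text):
    """Split text on whitespace and count how often each token occurs."""
    tokens = text.lower().split()
    return Counter(tokens)

frequencies = term_frequencies("this is a sample of a sentence")
for term, count in sorted(frequencies.items()):
    print(term, count)
```

Here "a" occurs twice and every other term once, so the seven-word sentence becomes a vector of six term counts.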
  
 
	
  
Modeling
[Diagram: pipeline with the modeling stage highlighted. Sub-component: logistic regression classifier using scikit-learn. Data structure: ranks and scores (text strings).]
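The slides name a logistic regression classifier built with scikit-learn. The sketch below is not that implementation: it is a plain-Python gradient-descent version, written only to show how term counts turn into a score between 0 and 1. The toy features, labels, learning rate, and epoch count are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias

def score(weights, bias, features):
    """Return the estimated probability that a paper is a DNA polymerase paper."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

# Toy training data (invented): [count of "polymerase", count of "pcr"].
X = [[3, 0], [2, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]  # 1 = DNA polymerase paper, 0 = not
weights, bias = train_logistic_regression(X, y)
print(score(weights, bias, [3, 0]))
print(score(weights, bias, [0, 2]))
```

Thresholding the score (for example at 0.5) gives the two classes, and sorting by score gives a ranking that a curator can review from the top.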
  
Elements in modeling
[Diagram: preprocessing component → modeling component]
Different transactions take place between the preprocessing and modeling components. It is important to understand these transactions in order to understand the output.
  
Elements in modeling
The modeling component generates the classification filter.
[Diagram: training dataset → preprocessing component → preprocessed training dataset → modeling component → filter: scores]
•  The training dataset contains papers whose classes are already known.
•  The training dataset has both DNA-Pol and non-DNA-Pol papers.
  
Elements in modeling
The training dataset is a part of the reference data.
[Diagram: reference data → training dataset and testing dataset; training dataset → preprocessing component → preprocessed training dataset → modeling component → filter: scores]
•  The reference data are a pre-classified set of data.
•  The training data is a subset of the reference data.
•  The testing data is a subset of the reference data.
•  The testing data is used only for assessing the self-learning capacity of the model over time.
Elements in modeling
This transaction explains the flow of unclassified papers through the modeling component.
[Diagram: unclassified data → preprocessing component → preprocessed unclassified dataset → modeling component (filter: scores) → DNA-polymerase papers / non-DNA-polymerase papers]
The validation dataset consists of randomly chosen unclassified papers that are curated manually (approach 2).
  
Elements in modeling
[Diagram: the reference data supplies the training and testing datasets; the unclassified data supplies the validation data. Training and unclassified data both pass through the preprocessing component; the modeling component produces the filter (scores), which splits unclassified papers into DNA-polymerase papers and non-DNA-polymerase papers.]
For choosing the model: validation data and training dataset.
  
Results of classifying validation files
[Figure: bar chart for validation files medline #933, #937, #938 and #780 (y-axis 0 to 12). Series: DNA polymerase papers correctly classified (true positives); actual count of DNA polymerase papers found in the XML source file.]
Many relevant papers are not identified.
Initial result: wrongly classified papers
[Figure: bar chart for validation files medline #933, #937, #938 and #780 (y-axis 0 to 350). Series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file.]
Many irrelevant papers
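The three quantities plotted for each validation file (true positives, false negatives, false positives) can be counted from curated labels and classifier output as sketched below. The function name and the toy labels are assumptions; they are not the numbers behind the charts.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false negatives, and false positives.

    actual[i] is True if paper i is a DNA polymerase paper per the curator;
    predicted[i] is True if the classifier included it.
    """
    pairs = list(zip(actual, predicted))
    return {
        "true_positives": sum(1 for a, p in pairs if a and p),
        "false_negatives": sum(1 for a, p in pairs if a and not p),
        "false_positives": sum(1 for a, p in pairs if not a and p),
    }

# Toy labels (invented) for seven papers in one validation file.
actual = [True, True, True, False, False, False, False]
predicted = [True, False, False, True, True, False, False]
print(confusion_counts(actual, predicted))
```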
  
 
	
  
Revisiting preprocessing
[Diagram: pipeline with the preprocessing stage highlighted. Sub-component: quantifying target data with the tf-idf measure. Data structure: tokenized list and token frequency (text strings).]
Working of tf-idf
tf means term frequency, while tf-idf means term frequency times inverse document frequency. This was originally a term-weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

For example, let's consider two documents:

Document1: “this is a sample of a sentence”

Document2: “this example is another example of another example”
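The two example documents can be weighted with tf-idf as sketched below. The slides do not state which tf-idf variant was used, so this sketch assumes the textbook form: tf is the term count divided by the document length, and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. scikit-learn's implementation uses a smoothed variant, so its numbers would differ.

```python
import math
from collections import Counter

DOCUMENTS = {
    "Document1": "this is a sample of a sentence",
    "Document2": "this example is another example of another example",
}

def tf_idf(documents):
    """Return {document name: {term: tf-idf weight}} for whitespace tokens."""
    tokenized = {name: text.split() for name, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    weights = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(DOCUMENTS)
for name, terms in weights.items():
    print(name, {term: round(value, 3) for term, value in terms.items()})
```

Terms that occur in both documents ("this", "is", "of") get a weight of 0 because log(2/2) = 0, while "example", which occurs three times in Document2 only, gets the largest weight. This is how tf-idf pushes down common words that plain term frequency would overcount.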
  
	
  
Results of classifying validation files using the tf-idf approach to preprocessing
[Figure: bar chart for validation files medline #933, #937, #938 and #780 (y-axis 0 to 12). Series: DNA polymerase papers correctly classified (true positives); actual count of DNA polymerase papers found in the XML source file.]
All relevant papers are identified.
  
Wrongly classified files
[Figure: bar chart for validation files medline #933, #937, #938 and #780 (y-axis 0 to 250). Series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file.]
Irrelevant papers are still incorrectly classified. The false positive rate looks bad.
  
 
	
  
Revisiting modeling
[Diagram: pipeline with the modeling stage highlighted. Sub-components: trying different classifiers using scikit-learn (modeling); quantifying target data with the tf-idf measure (preprocessing). Data structure: ranks and scores (text strings).]
Finding the classifier that gives a better false positive count
•  We decided to work on two additional classifiers:
   1. Bagging with a logistic regression estimator
   2. Boosting with decision stumps
•  We also designed a grid search experiment to find the best combination of training data to feed into these classifiers.
•  Parameters varied:
   1. Number of included papers
   2. Number of “close” papers (e.g. use of PCR, but not studying polymerases)
   3. Number of excluded papers
   4. Target data (title, abstract, or both)

Grid search for optimal parameters
Wrongly classified files
[Figure: bar chart for validation files medline #933, #937, #938 and #780 (y-axis 0 to 120). Series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file.]
The count of irrelevant papers has come down considerably.
  
Lessons learnt from the project
•  More preprocessing and model alternatives need to be considered in all stages of the project.
•  Validation infrastructure should be built simultaneously, which will help improve the results in the later stages.
  
Future development
•  Moving into production
•  Multiple classification
•  Can we expand this method to other topic areas? (e.g. ligases, synthetic biology, etc.)
  
Acknowledgements
•  Polbase creators
   •  Brad Langhorst
   •  Nicole Nichols
   •  Bill Jack
•  Polbase external contributors
   •  Linda Reha-Krantz
   •  Cathy Joyce
   •  Stu Linn
   •  Stefan Sarafianos
   •  Sam Wilson
   •  Roger Woodgate
•  NEB
   •  Yanhong Tong
   •  Eric Peterson
   •  Janos Posfai
   •  Ellen Zaglakas
   •  Mehmet Karaca
•  IT
   •  Servers and network connection to PubMed
  

Harvester_presentaion

  • 1. POLBASE   HARVESTER   Machine  Learning  Approaches  to  Find  More  DNA   Polymerase  Papers     polbase.neb.com     Ashwin  Natarajan   Brad  Langhorst    
  • 2. Polbase  repository   •  The  DNA  Polymerase  Database  (Polbase)  intends  to   serve  as  an  open  resource  for  informaBon  about  exisBng   DNA  polymerase   •  This  informaBon  is  sourced  from  several  public  and   private  database.       0   50   100   150   200   250   300   350   1955   1960   1965   1970   1975   1980   1985   1990   1995   2000   2005   2010   2015   Paper  count   Year   #  Reference  papers  discovered   paper  count   polbase.neb.com  
  • 3. Objective   Expand  the  Polbase  reference  repository   idenBfying  and  extracBng  more  scienBfic   papers  related  to  DNA  polymerases.     •  Target  Features:   •  AutomaBc  discovery  of  new  relevant  papers   •  Human  confirms  imported  papers   •  Minimize  the  import  of  irrelevant  papers   •  System  should  self-­‐learn  and  respond  to  expert   classificaBon  as  well  
  • 4. A  simple  binary  classiDication  problem    A  computer  tries  to  classify  cars  and  boat  
  • 5. A  simple  binary  classiDication  problem    A  computer  tries  to  classify  cars  and  boat   Proper9es   Car   Boat   1.)  Wheel   Y   N   2.)  Hull   N   Y   3.)  Mainsail   N   Y   4.)  Headlights   Y   N   Filter:  Classifica9on  rule  
  • 6. A  simple  binary  classiDication  problem    A  computer  tries  to  classify  cars  and  boat   Proper9es   Car   Boat   1.)  Wheel   Y   N   2.)  Hull   N   Y   3.)  Mainsail   N   Y   4.)  Headlights   Y   N   Filter:  Classifica9on  rule   Class-­‐  cars   Class-­‐  boats  
  • 7. Well..!  odd  ones..   How  can  I  classify  this  vehicle...???  
  • 8. Defining the target • Find the key indicators (for the classification rule) that give the highest likelihood of correctly classifying all the papers. • Different approaches can be used to identify key indicators: • Text search • Statistical modeling • Have a Subject Matter Expert read and classify the papers • Key indicators are generated by statistical and machine learning algorithms
  • 9. Different approaches for defining the classification rule [Diagram: PubMed papers are separated into DNA-polymerase papers and non-DNA-polymerase papers] Approach 1: text search for the presence of MeSH terms, or Approach 2: have a Subject Matter Expert read and classify the papers, or Approach 3: statistical modeling
  • 10. Different approaches for defining the classification rule [Diagram: PubMed papers are separated into DNA-polymerase papers and non-DNA-polymerase papers] Approach 1: text search for the presence of MeSH terms (currently used by Polbase), or Approach 3: statistical modeling with a machine learning based classifier (proposed system)
  • 11. Approach 1: Existing infrastructure • A simple query-based data retrieval system is part of Polbase. • The system retrieves papers from PubMed that match the query criteria. • Problems: • Some papers are found by the query but are not relevant • Many (?) relevant papers are missed by this simple query • The query system cannot respond to changing nomenclature [Diagram: XML feed from PubMed to polbase.neb.com, filtered by text search for the presence of MeSH terms; nested sets: all PubMed literature, DNA-related literature, DNA polymerase literature]
  • 12. Approach 3: Proposed classifier [Diagram: PubMed papers are separated into DNA-polymerase papers and non-DNA-polymerase papers by statistical modeling (a machine learning based classifier)]
  • 13. Job flow in classifier [Diagram: components: XML data feeds from PubMed, crawler and data management, preprocessing, modeling, filter: scores; PubMed papers are separated into DNA-polymerase papers and non-DNA-polymerase papers]
  • 14. Data transformation in each component in classifier [Diagram: components: XML data feeds, data management, preprocessing, modeling, filter: scores; data passed along: source files (structured XML files); literature (relevant data: PubMed ID, title, abstract); text strings (tokenized list and token frequency); text strings (ranks and scores); classified and labeled papers]
  • 15. Components in classifier: Let us take a closer look at each component of the classifier [Diagram: XML data feeds, data management, preprocessing, modeling, filter: scores]
  • 16. XML data feed from PubMed • XML data feeds are downloaded from PubMed at a fixed frequency by a cron job. • Each XML source file contains more than 10,000 papers on average. • Each XML file is given a unique ID and saved in the repository. [Diagram: component: XML data feeds; sub-component: cron job at a fixed frequency; data structure: source files (structured XML files)]
  • 17. Data management [Diagram: component: data management; sub-components: PostgreSQL database, Python crawler using xml ElementTree and psycopg2; data structure: literature (relevant data: PubMed ID, title, abstract), built from the source files (structured XML files)]
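The crawler step on slide 17 can be pictured with a minimal sketch. It assumes the feed follows PubMed's `PubmedArticleSet` / `PubmedArticle` / `MedlineCitation` layout; the sample XML and the `extract_records` helper are invented for illustration and are not the Polbase code. Writing the rows to PostgreSQL with psycopg2 is left out so the sketch runs without a database.

```python
import xml.etree.ElementTree as ET

# Invented sample in the PubMed article layout; real source files hold thousands of records.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Fidelity of a DNA polymerase</ArticleTitle>
        <Abstract><AbstractText>We measure error rates.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>A paper without an abstract</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_records(xml_text):
    """Return one dict (pmid, title, abstract) per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle", default="")
        # An abstract may be split into several AbstractText sections, or be absent.
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        abstract = " ".join("".join(p.itertext()).strip() for p in parts)
        records.append({"pmid": pmid, "title": title, "abstract": abstract})
    return records


records = extract_records(SAMPLE_XML)
for record in records:
    print(record)
```

Each dict corresponds to the "relevant data" named on the slide (PubMed ID, title, abstract) and could then be inserted into a database table.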
  • 18. Preprocessing: Convert text strings into quantifiable data • Looking for important terms (similar to the properties of cars and boats) • Counting the frequency of important terms [Diagram: component: preprocessing; sub-component: feature extraction, quantifying target data using a word frequency measure and NLP; data structure: text strings (tokenized list and token frequency), built from the literature records]
  • 19. NDS approach for preprocessing • In the Numerical DataSets (NDS) approach, we transform textual information into quantifiable data based on word frequency (term frequency). For example, consider a document, Document1: “this is a sample of a sentence”
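For the Document1 example on slide 19, raw term counts can be sketched as follows. Lowercasing and splitting on whitespace is an assumed tokenization; the slides do not say which tokenizer was used.

```python
from collections import Counter


def term_counts(text):
    """Count how often each whitespace-separated, lowercased token occurs."""
    return Counter(text.lower().split())


counts = term_counts("this is a sample of a sentence")
for term, count in sorted(counts.items()):
    print(term, count)
```

Here "a" is counted twice and every other term once, giving seven tokens in total.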
  • 20. Modeling [Diagram: component: modeling; sub-component: logistic regression classifier using scikit-learn; data structure: text strings (ranks and scores), built from the tokenized list and token frequency produced by preprocessing]
  • 21. Elements in modeling: Different transactions take place between the preprocessing and modeling components. It is important to understand these transactions in order to understand the output. [Diagram: preprocessing component, modeling component]
  • 22. Elements in modeling: The modeling component generates the classification filter • The training dataset contains papers whose classes are already known • The training dataset has both DNA-Pol and non-DNA-Pol papers [Diagram: training dataset, preprocessing component, preprocessed training dataset, modeling component, filter: scores]
  • 23. Elements in modeling: The training dataset is a part of the reference data • Reference data are a pre-classified set of data • Testing data is a subset of the reference data • Testing data is used only for assessing the self-learning capacity of the model over time. [Diagram: reference data (training dataset, testing dataset), preprocessing component, preprocessed training dataset, modeling component, filter: scores]
  • 24. Elements in modeling: The training dataset is a part of the reference data • Reference data are a pre-classified set of data • Training data is a subset of the reference data [Diagram: reference data (training dataset), preprocessing component, preprocessed training dataset, modeling component, filter: scores]
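Slides 23 and 24 describe training and testing data as subsets of the pre-classified reference data. A minimal sketch of such a split is shown below; the 20% test fraction, the fixed seed, and the `split_reference` helper are assumptions for illustration, since the slides do not state how the subsets were drawn.

```python
import random


def split_reference(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the reference records and split it into (training, testing)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical reference data: (PubMed ID, label) pairs.
reference = [(pmid, pmid % 2 == 0) for pmid in range(1, 51)]
training, testing = split_reference(reference)
print(len(training), len(testing))
```

Keeping the testing records out of training is what lets them measure how the model changes over time.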
  • 25. Elements in modeling: This transaction explains the flow of unclassified papers through the modeling component [Diagram: unclassified data, preprocessing component, preprocessed unclassified dataset, modeling component with filter: scores, output split into DNA-polymerase papers and non-DNA-polymerase papers]
  • 26. Elements in modeling: The validation dataset consists of randomly chosen unclassified papers that are curated manually (Approach 2). [Diagram: unclassified data (including validation data), preprocessing component, preprocessed unclassified dataset, modeling component with filter: scores, output split into DNA-polymerase papers and non-DNA-polymerase papers]
  • 27. Elements in modeling [Diagram: reference data (training dataset) and unclassified data (validation data) both pass through the preprocessing component; the preprocessed training dataset produces filter: scores in the modeling component; the preprocessed unclassified dataset is split into DNA-polymerase papers and non-DNA-polymerase papers]
  • 28. Elements in modeling [Diagram: as on slide 27, with the testing dataset added to the reference data]
  • 29. For choosing the model [Diagram: reference data (training dataset) and unclassified data (validation data)]
  • 30. Results of classifying validation files [Chart: bars per source file (medline #933, #937, #938, #780), y-axis 0 to 12; series: DNA polymerase papers correctly classified (true positives); actual count of DNA polymerase papers found in the XML source file] Many relevant papers are not identified
  • 31. Initial result: Wrongly classified papers [Chart: bars per source file (medline #933, #937, #938, #780), y-axis 0 to 350; series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file] Many irrelevant papers
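The quantities plotted on slides 30 and 31 (true positives, false positives, false negatives) can be computed from two sets of PubMed IDs: the papers the classifier included and the papers a curator marked as relevant. The sketch below uses made-up IDs and a hypothetical `confusion_counts` helper; it does not reproduce the numbers in the charts.

```python
def confusion_counts(predicted_relevant, actually_relevant, all_ids):
    """Count the four outcomes for one validation file."""
    predicted = set(predicted_relevant)
    actual = set(actually_relevant)
    universe = set(all_ids)
    return {
        "true_positives": len(predicted & actual),    # relevant and included
        "false_positives": len(predicted - actual),   # irrelevant but included
        "false_negatives": len(actual - predicted),   # relevant but excluded
        "true_negatives": len(universe - predicted - actual),
    }


# Made-up example: 10 papers, 3 truly relevant, 4 flagged by the classifier.
all_ids = range(1, 11)
actual = {1, 2, 3}
predicted = {2, 3, 4, 5}
counts = confusion_counts(predicted, actual, all_ids)
print(counts)
```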
  • 32. Revisiting preprocessing [Diagram: component: preprocessing; sub-component changed to: quantifying target data with the tf-idf measure; data structure: text strings (tokenized list and token frequency); other components as on slide 18]
  • 33. Working of tf-idf: tf means term frequency, while tf-idf means term frequency times inverse document frequency. This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering. For example, consider two documents, Document1: “this is a sample of a sentence” and Document2: “this example is another example of another example”
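A worked version of the two-document example on slide 33 is sketched below, using the textbook definitions tf = count / document length and idf = ln(N / document frequency). These definitions are an assumption: the slides do not give the formula, and scikit-learn's `TfidfVectorizer` applies a smoothed idf by default, so its numbers differ.

```python
import math
from collections import Counter


def tf(tokens):
    """Term frequency: share of the document's tokens taken by each term."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def idf(corpus):
    """Inverse document frequency: ln(number of documents / documents containing the term)."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(corpus):
    """Return one {term: weight} dict per document."""
    weights = idf(corpus)
    return [{term: freq * weights[term] for term, freq in tf(tokens).items()}
            for tokens in corpus]


docs = [
    "this is a sample of a sentence",
    "this example is another example of another example",
]
corpus = [doc.split() for doc in docs]
scores = tf_idf(corpus)
print(scores[0]["this"], scores[1]["example"])
```

Terms that occur in both documents, such as "this", get weight 0, while "example" (three of the eight tokens in Document2 and absent from Document1) gets the largest weight, which is why tf-idf highlights terms that distinguish one document from another.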
  • 34. Results of classifying validation files using the tf-idf approach to preprocessing [Chart: bars per source file (medline #933, #937, #938, #780), y-axis 0 to 12; series: DNA polymerase papers correctly classified (true positives); actual count of DNA polymerase papers found in the XML source file] All relevant papers are identified
  • 35. Wrongly classified files [Chart: bars per source file (medline #933, #937, #938, #780), y-axis 0 to 250; series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file] Irrelevant papers are still incorrectly classified. The false positive rate looks bad.
  • 36. Revisiting modeling [Diagram: component: modeling; sub-component changed to: trying different classifiers using scikit-learn; preprocessing quantifies target data with the tf-idf measure; data structure: text strings (ranks and scores); other components as on slide 20]
  • 37. Finding the classifier that gives a better false positive count • We decided to work on two additional classifiers: 1. bagging with a logistic regression estimator, 2. boosting with decision stumps. • We also designed a grid search experiment to find the best combination of training data to feed into these classifiers. • Parameters varied: • 1. number of included papers • 2. number of “close” papers (e.g. use of PCR, but not studying polymerases) • 3. number of excluded papers • 4. target data (title / abstract / both)
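The grid search over the four parameters on slide 37 can be sketched as an exhaustive loop over every combination. The parameter values and the `toy_false_positive_score` function below are placeholders invented so the sketch runs; in the real experiment the score would come from training a classifier on that mix of papers and counting false positives on the validation files.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if best_score is None or score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values actually searched are not given on the slides.
grid = {
    "n_included": [100, 200],
    "n_close": [0, 50],
    "n_excluded": [500, 1000],
    "target": ["title", "abstract", "both"],
}


def toy_false_positive_score(params):
    """Placeholder score, NOT real results: stands in for train-and-validate."""
    return (abs(params["n_included"] - 200) / 100
            + abs(params["n_close"] - 50) / 50
            + abs(params["n_excluded"] - 1000) / 500
            + (0 if params["target"] == "both" else 1))


best_params, best_score = grid_search(grid, toy_false_positive_score)
print(best_params, best_score)
```

With 2 x 2 x 2 x 3 values this evaluates 24 combinations; the cost grows multiplicatively with each added parameter value.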
  • 38. Grid search for optimal parameters
  • 39. Grid search for optimal parameters
  • 40. Wrongly classified files [Chart: bars per source file (medline #933, #937, #938, #780), y-axis 0 to 120; series: DNA polymerase papers wrongly excluded (false negatives); non-DNA polymerase papers wrongly included (false positives); actual count of DNA polymerase papers found in the XML source file] The count of irrelevant papers has come down considerably.
  • 41. Lessons learnt from the project • More preprocessing and model alternatives need to be considered at all stages of the project. • Validation infrastructure should be built at the same time, which will help improve the results at later stages.
  • 42. Future development • Moving into production • Multiple classification • Can we expand this method to other topic areas? (e.g. ligases, synthetic biology, etc.)
  • 43. Acknowledgements • Polbase creators: Brad Langhorst, Nicole Nichols, Bill Jack • Polbase external contributors: Linda Reha-Krantz, Cathy Joyce, Stu Linn, Stefan Sarafianos, Sam Wilson, Roger Woodgate • NEB: Yanhong Tong, Eric Peterson, Janos Posfai, Ellen Zaglakas, Mehmet Karaca • IT: servers and network connection to PubMed