1. POLBASE HARVESTER
Machine Learning Approaches to Find More DNA Polymerase Papers
polbase.neb.com
Ashwin Natarajan, Brad Langhorst
2. Polbase repository
• The DNA Polymerase Database (Polbase) intends to serve as an open resource for information about existing DNA polymerases.
• This information is sourced from several public and private databases.
[Chart: number of reference papers discovered per year, 1955–2015; paper counts range from 0 to about 350.]
3. Objective
Expand the Polbase reference repository by identifying and extracting more scientific papers related to DNA polymerases.
• Target features:
• Automatic discovery of new relevant papers
• Human confirms imported papers
• Minimize the import of irrelevant papers
• System should self-learn and respond to expert classification as well
4. A simple binary classification problem
A computer tries to classify cars and boats.
5. A simple binary classification problem
A computer tries to classify cars and boats.
Properties      Car  Boat
1.) Wheels      Y    N
2.) Hull        N    Y
3.) Mainsail    N    Y
4.) Headlights  Y    N
Filter: classification rule
6. A simple binary classification problem
A computer tries to classify cars and boats.
Properties      Car  Boat
1.) Wheels      Y    N
2.) Hull        N    Y
3.) Mainsail    N    Y
4.) Headlights  Y    N
Filter: classification rule → Class: cars / Class: boats
8. Defining the target
• Finding the key indicators (for the classification rule) that give the highest likelihood of correctly classifying all the papers.
• Different approaches can be used to identify key indicators:
• Text search
• Statistical modeling
• Have a Subject Matter Expert read and classify the papers
• Key indicators are generated by statistical and machine learning algorithms
9. Different approaches for defining the classification rule
PubMed papers →
Approach-1: text search for presence of MeSH terms, or
Approach-2: have a Subject Matter Expert read and classify the papers, or
Approach-3: statistical modeling
→ DNA-polymerase papers / non-DNA-polymerase papers
10. Different approaches for defining the classification rule
PubMed papers →
Approach-1: text search for presence of MeSH terms (currently used by Polbase), or
Approach-3: statistical modeling (a machine-learning-based classifier; the proposed system)
→ DNA-polymerase papers / non-DNA-polymerase papers
11. Approach-1: existing infrastructure
• A simple query-based data-retrieval system is part of Polbase.
• The system retrieves papers from PubMed that match the query criteria.
• Problems:
• Some papers are found by the query, but are not relevant
• Many (?) relevant papers are missed by this simple query
• The query system cannot respond to changing nomenclature
[Diagram: PubMed XML feed → text search for presence of MeSH terms → all PubMed literature → DNA-related literature → DNA polymerase literature]
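The MeSH-term query of Approach-1 can be sketched against NCBI's E-utilities `esearch` endpoint. The concrete search term and `retmax` value below are illustrative assumptions, not the query Polbase actually uses.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_query(mesh_term: str, retmax: int = 100) -> str:
    """Build an esearch URL restricting the search to a MeSH term."""
    params = {
        "db": "pubmed",
        "term": f"{mesh_term}[MeSH Terms]",
        "retmax": retmax,
        "retmode": "xml",
    }
    return f"{BASE}?{urlencode(params)}"

# Hypothetical query term for illustration only.
url = build_pubmed_query("DNA-Directed DNA Polymerase")
```

Fetching this URL returns an XML list of matching PubMed IDs, which is the kind of result the existing query system imports.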
12. Approach-3: proposed classifier
PubMed papers → statistical modeling (a machine-learning-based classifier) → DNA-polymerase papers / non-DNA-polymerase papers
13. Job flow in the classifier
XML data feeds from PubMed → crawler and data management → preprocessing → modeling → filter: scores → DNA-polymerase papers / non-DNA-polymerase papers
14. Data transformation in each component of the classifier
Component        Data structure
XML data feeds   Source files: structured XML files
Data management  Literature: relevant data (PubMed ID, title, abstract)
Preprocessing    Text strings → tokenized list and token frequency
Modeling         Text strings → ranks and scores
Filter: scores   Classified and labeled papers
15. Components in the classifier
Let us take a closer look at each component of the classifier: XML data feeds → data management → preprocessing → modeling → filter: scores.
16. XML data feed from PubMed
Sub-component: cron job at a fixed frequency. Data structure: source files (structured XML files).
• XML data feeds are downloaded from PubMed at fixed intervals.
• Each XML source file contains more than 10,000 papers on average.
• Each XML file is given a unique ID and saved in the repository.
17. Data management
Sub-components: a Python crawler using xml.etree.ElementTree and psycopg2, backed by a PostgreSQL database. Data structure: literature records holding the relevant data (PubMed ID, title, abstract).
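A minimal sketch of the crawler's parsing step, assuming the standard PubMed/MEDLINE XML layout (PMID, ArticleTitle, AbstractText); the sample snippet and table name are illustrative, and the psycopg2 insert is shown only as a comment.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for one downloaded PubMed XML source file.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>A DNA polymerase study</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def extract_records(xml_text):
    """Yield (pmid, title, abstract) for each article in a feed."""
    root = ET.fromstring(xml_text)
    for cite in root.iter("MedlineCitation"):
        pmid = cite.findtext("PMID")
        title = cite.findtext("Article/ArticleTitle")
        abstract = cite.findtext("Article/Abstract/AbstractText") or ""
        yield pmid, title, abstract

records = list(extract_records(SAMPLE))
# The real crawler would then store each record with psycopg2, e.g.:
# cur.execute("INSERT INTO literature (pmid, title, abstract) VALUES (%s, %s, %s)", rec)
```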
18. Preprocessing
Convert text strings into quantifiable data.
Sub-component: feature extraction, quantifying the target data using a word-frequency measure and NLP. Data structure: text strings → tokenized list and token frequency.
• Looking for important terms (similar to the properties of cars and boats)
• Counting the frequency of important terms
19. NDS approach for preprocessing
• In the Numerical DataSets (NDS) approach, we transform textual information into quantifiable data based on word/term frequency.
For example, let's consider a document,
Document1: “this is a sample of a sentence”
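The word-frequency step above can be sketched in a few lines of Python on the slide's example document: tokenize, then count.

```python
from collections import Counter

# The slide's example document.
document1 = "this is a sample of a sentence"

tokens = document1.split()      # tokenized list
frequencies = Counter(tokens)   # token frequency
# Counter({'a': 2, 'this': 1, 'is': 1, 'sample': 1, 'of': 1, 'sentence': 1})
```

Each document becomes a vector of term counts, which is the quantifiable data the modeling component consumes.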
20. Modeling
Sub-component: a logistic regression classifier using scikit-learn. Data structure: text strings → ranks and scores.
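The modeling sub-component can be sketched as a scikit-learn pipeline of token counts feeding a logistic regression classifier, as the slides describe. The toy titles and labels below are illustrative assumptions, not Polbase training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training set: four polymerase titles, four unrelated ones.
titles = [
    "fidelity of DNA polymerase I",
    "thermostable DNA polymerase from Thermus aquaticus",
    "processivity of phage T4 DNA polymerase",
    "mutant DNA polymerase proofreading activity",
    "membrane protein folding kinetics",
    "statistics of bird migration patterns",
    "crystal structure of a potassium channel",
    "soil microbiome diversity survey",
]
labels = ["dna-pol"] * 4 + ["other"] * 4

# Token-frequency features → logistic regression, fit in one pipeline.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(titles, labels)

prediction = model.predict(["a novel DNA polymerase variant"])[0]
```

`predict_proba` on the same pipeline yields the per-paper scores that the filter component thresholds.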
21. Elements in modeling
Different transactions take place between the preprocessing and modeling components. It is important to understand these transactions in order to understand the output.
22. Elements in modeling
The modeling component generates the classification filter.
Training dataset → preprocessing component → preprocessed training dataset → modeling component → filter: scores
• The training dataset contains papers whose classes are already known.
• The training dataset has both DNA-Pol and non-DNA-Pol papers.
23. Elements in modeling
The training dataset is a part of the reference data; the reference data also contain a testing dataset.
• Reference data are a pre-classified set of papers.
• Testing data is a subset of the reference data.
• Testing data is used only for assessing the self-learning capacity of the model over time.
24. Elements in modeling
The training dataset is a part of the reference data.
• Reference data are a pre-classified set of papers.
• Training data is a subset of the reference data.
25. Elements in modeling
This transaction explains the flow of unclassified papers through the modeling component:
unclassified data → preprocessing component → preprocessed unclassified dataset → modeling component → filter: scores → DNA-polymerase papers / non-DNA-polymerase papers
26. Elements in modeling
Validation data: validation datasets are randomly chosen unclassified papers that are curated manually by approach-2 (expert classification).
27. Elements in modeling
[Diagram: the reference data supply the training dataset; the training dataset and the unclassified data pass through the preprocessing component into the modeling component, which produces the filter (scores); the filter separates DNA-polymerase papers from non-DNA-polymerase papers; validation data are drawn from the unclassified papers.]
28. Elements in modeling
[Diagram: the reference data supply both the training and testing datasets; the training dataset and the unclassified data flow through preprocessing and modeling into the filter (scores), which separates DNA-polymerase papers from non-DNA-polymerase papers; validation data are drawn from the unclassified papers.]
30. Results of classifying validation files
[Chart: DNA polymerase papers correctly classified (true positives) vs. the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to 12.]
Many relevant papers are not identified.
31. Initial result: wrongly classified papers
[Chart: DNA polymerase papers wrongly excluded (false negatives), non-DNA polymerase papers wrongly included (false positives), and the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to about 350.]
Many irrelevant papers are included.
32. Revisiting preprocessing
The feature-extraction sub-component now quantifies the target data with the tf-idf measure instead of raw word frequency. The rest of the pipeline (cron-scheduled XML feeds, Python crawler with xml.etree.ElementTree and psycopg2, PostgreSQL database) is unchanged.
33. How tf-idf works
Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is originally a term-weighting scheme developed for information retrieval (as a ranking function for search-engine results) that has also found good use in document classification and clustering.
For example, let's consider two documents,
Document1: “this is a sample of a sentence”
Document2: “this example is another example of another example”
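The tf-idf weights for the two example documents can be computed by hand. This sketch uses the textbook convention idf(t) = ln(N / df(t)); scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers differ slightly.

```python
from collections import Counter
from math import log

# The slide's two example documents, tokenized.
docs = {
    "Document1": "this is a sample of a sentence".split(),
    "Document2": "this example is another example of another example".split(),
}

N = len(docs)
df = Counter()                  # document frequency per term
for tokens in docs.values():
    df.update(set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)        # raw term frequency
    return {t: tf[t] * log(N / df[t]) for t in tf}

w1 = tfidf(docs["Document1"])
w2 = tfidf(docs["Document2"])
# Terms appearing in both documents ("this", "is", "of") get weight 0;
# terms unique to one document get weight tf * ln(2).
```

This is why tf-idf helps here: ubiquitous terms contribute nothing, while terms concentrated in a few papers carry the weight.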
34. Results of classifying validation files using the tf-idf approach to preprocessing
[Chart: DNA polymerase papers correctly classified (true positives) now match the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to 12.]
All relevant papers are identified.
35. Wrongly classified files
[Chart: false negatives, false positives, and the actual count of DNA polymerase papers per XML source file (medline #933, #937, #938, #780); counts range from 0 to about 250.]
Irrelevant papers are still incorrectly classified; the false positive rate looks bad.
36. Revisiting modeling
The modeling sub-component now tries different classifiers using scikit-learn; preprocessing continues to quantify the target data with the tf-idf measure.
37. Finding the classifier that gives a better false positive count
• We decided to work on two additional classifiers: 1. bagging with a logistic regression estimator, 2. boosting with decision stumps.
• We also designed a grid-search experiment to find the best combination of training data to feed into these classifiers.
• Parameters varied:
• 1. Number of included papers
• 2. Number of “close” papers (e.g., use of PCR, but not studying polymerases)
• 3. Number of excluded papers
• 4. Target data (title/abstract/both)
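The grid-search experiment above enumerates training-set compositions rather than classifier hyperparameters, so it can be sketched as a plain Cartesian product. The concrete values in each grid below are illustrative assumptions; only the four varied parameters come from the slide.

```python
from itertools import product

# One grid per varied parameter; values are hypothetical examples.
grid = {
    "n_included": [100, 500, 1000],   # included (DNA-pol) papers
    "n_close": [0, 100, 500],         # "close" papers (e.g. PCR use)
    "n_excluded": [100, 500, 1000],   # excluded papers
    "target": ["title", "abstract", "both"],
}

# Every combination of training-set composition to try.
combinations = [
    dict(zip(grid, values)) for values in product(*grid.values())
]
# Each combination defines one training set to assemble, one classifier
# (bagging or boosting) to fit, and one false-positive count to record.
```

Enumerating the compositions explicitly keeps the experiment reproducible: each false-positive count can be traced back to exactly one training-set recipe.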
40. Wrongly classified files
[Chart: false negatives, false positives, and the actual count of DNA polymerase papers per XML source file (medline #933, #937, #938, #780); counts range from 0 to about 120.]
The count of irrelevant papers has come down considerably.
41. Lessons learnt from the project
• More preprocessing and model alternatives need to be considered at all stages of the project.
• Validation infrastructure should be built simultaneously, which will help improve the results in the later stages.
42. Future development
• Moving into production
• Multiple classification
• Can we expand this method to other topic areas? (e.g., ligases, synthetic biology, etc.)
43. Acknowledgements
• Polbase creators: Brad Langhorst, Nicole Nichols, Bill Jack
• Polbase external contributors: Linda Reha-Krantz, Cathy Joyce, Stu Linn, Stefan Sarafianos, Sam Wilson, Roger Woodgate
• NEB: Yanhong Tong, Eric Peterson, Janos Posfai, Ellen Zaglakas, Mehmet Karaca
• IT: servers and network connection to PubMed