1. POLBASE HARVESTER
Machine Learning Approaches to Find More DNA Polymerase Papers
polbase.neb.com
Ashwin Natarajan, Brad Langhorst
2. Polbase repository
• The DNA Polymerase Database (Polbase) intends to serve as an open resource for information about existing DNA polymerases.
• This information is sourced from several public and private databases.
[Chart: number of reference papers discovered per year, 1955–2015; paper counts range from 0 to about 350.]
3. Objective
Expand the Polbase reference repository by identifying and extracting more scientific papers related to DNA polymerases.
• Target features:
• Automatic discovery of new relevant papers
• Human confirms imported papers
• Minimize the import of irrelevant papers
• System should self-learn and respond to expert classification as well
4. A simple binary classification problem
A computer tries to classify cars and boats.
5. A simple binary classification problem
A computer tries to classify cars and boats.
Properties      Car  Boat
1.) Wheels      Y    N
2.) Hull        N    Y
3.) Mainsail    N    Y
4.) Headlights  Y    N
Filter: classification rule
6. A simple binary classification problem
A computer tries to classify cars and boats.
Properties      Car  Boat
1.) Wheels      Y    N
2.) Hull        N    Y
3.) Mainsail    N    Y
4.) Headlights  Y    N
Filter: classification rule → Class: cars / Class: boats
8. Defining the target
• Finding the key indicators (for the classification rule) that give the highest likelihood of correctly classifying all the papers.
• Different approaches can be used to identify key indicators:
• Text search
• Statistical modeling
• Have a Subject Matter Expert read and classify the papers
• Key indicators are generated by statistical and machine learning algorithms
9. Different approaches for defining the classification rule
PubMed papers →
Approach-1: text search for presence of MeSH terms, or
Approach-2: have a Subject Matter Expert read and classify the papers, or
Approach-3: statistical modeling
→ DNA-polymerase papers / non-DNA-polymerase papers
10. Different approaches for defining the classification rule
PubMed papers →
Approach-1: text search for presence of MeSH terms (currently used by Polbase), or
Approach-3: statistical modeling (a machine-learning-based classifier; the proposed system)
→ DNA-polymerase papers / non-DNA-polymerase papers
11. Approach-1: existing infrastructure
• A simple query-based data-retrieval system is part of Polbase.
• The system retrieves papers from PubMed that match the query criteria.
• Problems:
• Some papers are found by the query, but are not relevant
• Many (?) relevant papers are missed by this simple query
• The query system cannot respond to changing nomenclature
[Diagram: PubMed XML feed → text search for presence of MeSH terms → all PubMed literature → DNA-related literature → DNA polymerase literature]
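The MeSH-term query of Approach-1 can be sketched against NCBI's E-utilities `esearch` endpoint. The concrete search term and `retmax` value below are illustrative assumptions, not the query Polbase actually uses.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for PubMed.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_query(mesh_term: str, retmax: int = 100) -> str:
    """Build an esearch URL restricting the search to a MeSH term."""
    params = {
        "db": "pubmed",
        "term": f"{mesh_term}[MeSH Terms]",
        "retmax": retmax,
        "retmode": "xml",
    }
    return f"{BASE}?{urlencode(params)}"

# Hypothetical query term for illustration only.
url = build_pubmed_query("DNA-Directed DNA Polymerase")
```

Fetching this URL returns an XML list of matching PubMed IDs, which is the kind of result the existing query system imports.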
12. Approach-3: proposed classifier
PubMed papers → statistical modeling (a machine-learning-based classifier) → DNA-polymerase papers / non-DNA-polymerase papers
13. Job flow in the classifier
XML data feeds from PubMed → crawler and data management → preprocessing → modeling → filter: scores → DNA-polymerase papers / non-DNA-polymerase papers
14. Data transformation in each component of the classifier
Component        Data structure
XML data feeds   Source files: structured XML files
Data management  Literature: relevant data (PubMed ID, title, abstract)
Preprocessing    Text strings → tokenized list and token frequency
Modeling         Text strings → ranks and scores
Filter: scores   Classified and labeled papers
15. Components in the classifier
Let us take a closer look at each component of the classifier: XML data feeds → data management → preprocessing → modeling → filter: scores.
16. XML data feed from PubMed
Sub-component: cron job at a fixed frequency. Data structure: source files (structured XML files).
• XML data feeds are downloaded from PubMed at fixed intervals.
• Each XML source file contains more than 10,000 papers on average.
• Each XML file is given a unique ID and saved in the repository.
17. Data management
Sub-components: a Python crawler using xml.etree.ElementTree and psycopg2, backed by a PostgreSQL database. Data structure: literature records holding the relevant data (PubMed ID, title, abstract).
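A minimal sketch of the crawler's parsing step, assuming the standard PubMed/MEDLINE XML layout (PMID, ArticleTitle, AbstractText); the sample snippet and table name are illustrative, and the psycopg2 insert is shown only as a comment.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for one downloaded PubMed XML source file.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>A DNA polymerase study</ArticleTitle>
        <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def extract_records(xml_text):
    """Yield (pmid, title, abstract) for each article in a feed."""
    root = ET.fromstring(xml_text)
    for cite in root.iter("MedlineCitation"):
        pmid = cite.findtext("PMID")
        title = cite.findtext("Article/ArticleTitle")
        abstract = cite.findtext("Article/Abstract/AbstractText") or ""
        yield pmid, title, abstract

records = list(extract_records(SAMPLE))
# The real crawler would then store each record with psycopg2, e.g.:
# cur.execute("INSERT INTO literature (pmid, title, abstract) VALUES (%s, %s, %s)", rec)
```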
18. Preprocessing
Convert text strings into quantifiable data.
Sub-component: feature extraction, quantifying the target data using a word-frequency measure and NLP. Data structure: text strings → tokenized list and token frequency.
• Looking for important terms (similar to the properties of cars and boats)
• Counting the frequency of important terms
19. NDS approach for preprocessing
• In the Numerical DataSets (NDS) approach, we transform textual information into quantifiable data based on word/term frequency.
For example, let's consider a document,
Document1: “this is a sample of a sentence”
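The word-frequency step above can be sketched in a few lines of Python on the slide's example document: tokenize, then count.

```python
from collections import Counter

# The slide's example document.
document1 = "this is a sample of a sentence"

tokens = document1.split()      # tokenized list
frequencies = Counter(tokens)   # token frequency
# Counter({'a': 2, 'this': 1, 'is': 1, 'sample': 1, 'of': 1, 'sentence': 1})
```

Each document becomes a vector of term counts, which is the quantifiable data the modeling component consumes.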
20. Modeling
Sub-component: a logistic regression classifier using scikit-learn. Data structure: text strings → ranks and scores.
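The modeling sub-component can be sketched as a scikit-learn pipeline of token counts feeding a logistic regression classifier, as the slides describe. The toy titles and labels below are illustrative assumptions, not Polbase training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training set: four polymerase titles, four unrelated ones.
titles = [
    "fidelity of DNA polymerase I",
    "thermostable DNA polymerase from Thermus aquaticus",
    "processivity of phage T4 DNA polymerase",
    "mutant DNA polymerase proofreading activity",
    "membrane protein folding kinetics",
    "statistics of bird migration patterns",
    "crystal structure of a potassium channel",
    "soil microbiome diversity survey",
]
labels = ["dna-pol"] * 4 + ["other"] * 4

# Token-frequency features → logistic regression, fit in one pipeline.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(titles, labels)

prediction = model.predict(["a novel DNA polymerase variant"])[0]
```

`predict_proba` on the same pipeline yields the per-paper scores that the filter component thresholds.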
21. Elements in modeling
Different transactions take place between the preprocessing and modeling components. It is important to understand these transactions in order to understand the output.
22. Elements in modeling
The modeling component generates the classification filter.
Training dataset → preprocessing component → preprocessed training dataset → modeling component → filter: scores
• The training dataset contains papers whose classes are already known.
• The training dataset has both DNA-Pol and non-DNA-Pol papers.
23. Elements in modeling
The training dataset is a part of the reference data; the reference data also contain a testing dataset.
• Reference data are a pre-classified set of papers.
• Testing data is a subset of the reference data.
• Testing data is used only for assessing the self-learning capacity of the model over time.
24. Elements in modeling
The training dataset is a part of the reference data.
• Reference data are a pre-classified set of papers.
• Training data is a subset of the reference data.
25. Elements in modeling
This transaction explains the flow of unclassified papers through the modeling component:
unclassified data → preprocessing component → preprocessed unclassified dataset → modeling component → filter: scores → DNA-polymerase papers / non-DNA-polymerase papers
26. Elements in modeling
Validation data: validation datasets are randomly chosen unclassified papers that are curated manually by approach-2 (expert classification).
27. Elements in modeling
[Diagram: the reference data supply the training dataset; the training dataset and the unclassified data pass through the preprocessing component into the modeling component, which produces the filter (scores); the filter separates DNA-polymerase papers from non-DNA-polymerase papers; validation data are drawn from the unclassified papers.]
28. Elements in modeling
[Diagram: the reference data supply both the training and testing datasets; the training dataset and the unclassified data flow through preprocessing and modeling into the filter (scores), which separates DNA-polymerase papers from non-DNA-polymerase papers; validation data are drawn from the unclassified papers.]
30. Results of classifying validation files
[Chart: DNA polymerase papers correctly classified (true positives) vs. the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to 12.]
Many relevant papers are not identified.
31. Initial result: wrongly classified papers
[Chart: DNA polymerase papers wrongly excluded (false negatives), non-DNA polymerase papers wrongly included (false positives), and the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to about 350.]
Many irrelevant papers are included.
32. Revisiting preprocessing
The feature-extraction sub-component now quantifies the target data with the tf-idf measure instead of raw word frequency. The rest of the pipeline (cron-scheduled XML feeds, Python crawler with xml.etree.ElementTree and psycopg2, PostgreSQL database) is unchanged.
33. How tf-idf works
Tf means term frequency, while tf-idf means term frequency times inverse document frequency. This is originally a term-weighting scheme developed for information retrieval (as a ranking function for search-engine results) that has also found good use in document classification and clustering.
For example, let's consider two documents,
Document1: “this is a sample of a sentence”
Document2: “this example is another example of another example”
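The tf-idf weights for the two example documents can be computed by hand. This sketch uses the textbook convention idf(t) = ln(N / df(t)); scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers differ slightly.

```python
from collections import Counter
from math import log

# The slide's two example documents, tokenized.
docs = {
    "Document1": "this is a sample of a sentence".split(),
    "Document2": "this example is another example of another example".split(),
}

N = len(docs)
df = Counter()                  # document frequency per term
for tokens in docs.values():
    df.update(set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)        # raw term frequency
    return {t: tf[t] * log(N / df[t]) for t in tf}

w1 = tfidf(docs["Document1"])
w2 = tfidf(docs["Document2"])
# Terms appearing in both documents ("this", "is", "of") get weight 0;
# terms unique to one document get weight tf * ln(2).
```

This is why tf-idf helps here: ubiquitous terms contribute nothing, while terms concentrated in a few papers carry the weight.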
34. Results of classifying validation files using the tf-idf approach to preprocessing
[Chart: DNA polymerase papers correctly classified (true positives) now match the actual count of DNA polymerase papers found in each XML source file (medline #933, #937, #938, #780); counts range from 0 to 12.]
All relevant papers are identified.
35. Wrongly classified files
[Chart: false negatives, false positives, and the actual count of DNA polymerase papers per XML source file (medline #933, #937, #938, #780); counts range from 0 to about 250.]
Irrelevant papers are still incorrectly classified; the false positive rate looks bad.
36. Revisiting modeling
The modeling sub-component now tries different classifiers using scikit-learn; preprocessing continues to quantify the target data with the tf-idf measure.
37. Finding the classifier that gives a better false positive count
• We decided to work on two additional classifiers: 1. bagging with a logistic regression estimator, 2. boosting with decision stumps.
• We also designed a grid-search experiment to find the best combination of training data to feed into these classifiers.
• Parameters varied:
• 1. Number of included papers
• 2. Number of “close” papers (e.g., use of PCR, but not studying polymerases)
• 3. Number of excluded papers
• 4. Target data (title/abstract/both)
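The grid-search experiment above enumerates training-set compositions rather than classifier hyperparameters, so it can be sketched as a plain Cartesian product. The concrete values in each grid below are illustrative assumptions; only the four varied parameters come from the slide.

```python
from itertools import product

# One grid per varied parameter; values are hypothetical examples.
grid = {
    "n_included": [100, 500, 1000],   # included (DNA-pol) papers
    "n_close": [0, 100, 500],         # "close" papers (e.g. PCR use)
    "n_excluded": [100, 500, 1000],   # excluded papers
    "target": ["title", "abstract", "both"],
}

# Every combination of training-set composition to try.
combinations = [
    dict(zip(grid, values)) for values in product(*grid.values())
]
# Each combination defines one training set to assemble, one classifier
# (bagging or boosting) to fit, and one false-positive count to record.
```

Enumerating the compositions explicitly keeps the experiment reproducible: each false-positive count can be traced back to exactly one training-set recipe.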
40. Wrongly classified files
[Chart: false negatives, false positives, and the actual count of DNA polymerase papers per XML source file (medline #933, #937, #938, #780); counts range from 0 to about 120.]
The count of irrelevant papers has come down considerably.
41. Lessons learnt from the project
• More preprocessing and model alternatives need to be considered at all stages of the project.
• Validation infrastructure should be built simultaneously, which will help improve the results in the later stages.
42. Future development
• Moving into production
• Multiple classification
• Can we expand this method to other topic areas? (e.g., ligases, synthetic biology, etc.)
43. Acknowledgements
• Polbase creators: Brad Langhorst, Nicole Nichols, Bill Jack
• Polbase external contributors: Linda Reha-Krantz, Cathy Joyce, Stu Linn, Stefan Sarafianos, Sam Wilson, Roger Woodgate
• NEB: Yanhong Tong, Eric Peterson, Janos Posfai, Ellen Zaglakas, Mehmet Karaca
• IT: servers and network connection to PubMed