AnalyticOps - Chicago PAW 2016

Best Practices for Deploying Analytic Models into Operations

Robert L. Grossman
Open Data Group and University of Chicago

Predictive Analytics World Chicago
June 21, 2016

rgrossman.com
@bobgrossman

Life Cycle of an Analytic Model

[Figure: life cycle diagram with two phases, analytic modeling (Model Env) and analytic operations (Deployment Env)]

Analytic modeling (Model Env):
•  Select analytic problem & approach
•  Get and clean the data
•  Exploratory Data Analysis
•  Build model in dev/modeling environment

Deploy model

Analytic operations (Deployment Env):
•  Deploy model in operational systems with scoring application
•  Scale up deployment
•  Monitor performance and employ champion-challenger methodology
•  Retire model and deploy improved model

Perf. data
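The champion-challenger step in the life cycle can be sketched as follows. This is a minimal illustration, not a method taken from the slides: the `accuracy` metric, the `margin` parameter, the two threshold models, and the performance data are all assumptions made up for the example.

```python
from typing import Callable, Sequence

Model = Callable[[float], int]  # hypothetical scoring signature: feature -> 0/1 label


def accuracy(model: Model, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of records where the model's label matches the observed label."""
    correct = sum(1 for x, y in zip(xs, ys) if model(x) == y)
    return correct / len(xs)


def pick_champion(champion: Model, challenger: Model,
                  xs: Sequence[float], ys: Sequence[int],
                  margin: float = 0.0) -> Model:
    """Keep the champion unless the challenger beats it by more than `margin`."""
    champ_acc = accuracy(champion, xs, ys)
    chall_acc = accuracy(challenger, xs, ys)
    return challenger if chall_acc > champ_acc + margin else champion


# Illustrative models and performance data (all values invented for the sketch).
champion = lambda x: int(x > 0.8)
challenger = lambda x: int(x > 0.5)
xs = [0.1, 0.4, 0.6, 0.7, 0.9]
ys = [0, 0, 1, 1, 1]

winner = pick_champion(champion, challenger, xs, ys)
```

In practice the comparison metric and the promotion rule depend on the business problem; accuracy with a fixed margin is only the simplest choice.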

Differences Between the Modeling and Deployment Environments

•  Typically, modelers use specialized languages such as SAS, SPSS or R.
•  Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
•  This can result in significant effort and significant delays moving the model from the modeling environment to the deployment environment.

[Figure: speech bubbles for Alice (Data Scientist), Bob (Data Scientist), and Joe (IT)]
"Would you mind writing all your models in Java?"
"I write all my models in R, why don't you do the same?"
"I write all my models in scikit-learn, why don't you do the same?"

Ways to Deploy Models into Products/Services/Operations

•  Push code.
•  Embed a model into a product or service.
•  Export and import tables of scores.
•  Export and import tables of parameters.
•  Have the product/service interact with the model as a web or message service.
•  Import the models into a database.

How quickly can the model be updated?
•  Model parameters?
•  New features?
•  New pre- & post-processing?

The not-for-profit DMG develops and supports standards.
www.dmg.org

•  PMML
•  PFA

What is an Analytic Engine?

•  An analytic engine is a component that is integrated into products or enterprise IT that deploys analytic models in operational workflows for products and services.
•  A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application.
•  Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
•  Analytic engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
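The "integrated once, updated by reading a file" idea can be sketched with a toy engine. Everything here is an assumption for illustration: the `ToyEngine` class, the `export_model` helper, and the JSON layout are invented and are neither PMML nor PFA.

```python
import json
import os
import tempfile


class ToyEngine:
    """Hypothetical engine: integrated once, re-reads a model file to update."""

    def __init__(self, path):
        self.path = path
        self.load()

    def load(self):
        # Updating the model is just re-reading the interchange file.
        with open(self.path) as f:
            self.model = json.load(f)

    def score(self, record):
        # Linear model: intercept + sum(coefficient * feature value).
        coefs = self.model["coefficients"]
        return self.model["intercept"] + sum(coefs[k] * record[k] for k in coefs)


def export_model(path, intercept, coefficients):
    """Producer side: write the model in the made-up JSON format."""
    with open(path, "w") as f:
        json.dump({"intercept": intercept, "coefficients": coefficients}, f)


path = os.path.join(tempfile.mkdtemp(), "model.json")
export_model(path, 1.0, {"x": 2.0})
engine = ToyEngine(path)
first = engine.score({"x": 3.0})     # scored with the first model

export_model(path, 0.0, {"x": 0.5})  # the modeler ships a new model
engine.load()                        # the engine picks it up without redeploying code
second = engine.score({"x": 3.0})    # scored with the updated model
```

The scoring code never changes between the two calls; only the file does, which is the property the last bullet describes.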

Deploying analytic models

[Figure: Model Producer -> Export model -> PMML -> Import model -> Model Consumer]

PMML Philosophy

•  PMML is an XML specification of a model, not an implementation of a model.
•  PMML provides a simple means of binding parameters to values for an agreed upon set of data mining models & transformations in a safe way.
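"Binding parameters to values" can be illustrated with a consumer that reads coefficients from XML and applies a model it already knows how to compute. The fragment below is a simplified sketch in the style of a PMML RegressionModel; a real PMML document also carries a namespace, a DataDictionary, and a MiningSchema, which are omitted, and the field names and coefficient values are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative fragment; not a complete or validated PMML document.
PMML_FRAGMENT = """
<RegressionModel functionName="regression">
  <RegressionTable intercept="1.5">
    <NumericPredictor name="age" coefficient="0.1"/>
    <NumericPredictor name="income" coefficient="0.002"/>
  </RegressionTable>
</RegressionModel>
"""


def score(xml_text, record):
    """Apply the linear regression whose parameters are given in the XML."""
    table = ET.fromstring(xml_text).find("RegressionTable")
    total = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        # The XML only supplies the values; the arithmetic lives in the consumer.
        total += float(pred.get("coefficient")) * record[pred.get("name")]
    return total


result = score(PMML_FRAGMENT, {"age": 40, "income": 1000})
```

The file carries no executable logic, which is one reading of why the slide calls this approach safe.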

Deploying analytic models and workflows

[Figure: Analytic Workflow Producers -> Export analytic workflows -> PFA -> Import analytic workflows -> Analytic Engines]

PFA Philosophy

•  Define primitives for data transformations, data aggregations, and statistical and analytic models.
•  Support composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
•  Be extensible.
•  Designed to be "safe" to deploy in enterprise IT operational environments.
•  This is a philosophy that is different and complementary to Predictive Model Markup Language (PMML).

PFA Case Study 1

•  20+ person data science group developing models in R, Python, Scikit-learn, MATLAB, ...
•  All the data scientists export their models in PFA.
•  The company's product imports models in PFA and runs on their customers' data as required.

[Figure: Export PFA -> Import PFA; Widget records -> Widget scores]

PFA Functionality

•  PFA codes arbitrary mathematical algorithms in a tightly controlled environment.
•  PFA has all the standard flow control of a programming language: if/then/else & for/while loops.
•  PFA has function calls and function callbacks.
•  PFA has algebraic data types.
•  PFA is encoded as function calls in JSON:
   {function: [arg 1, arg 2, …, arg n]}
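The {function: [arg 1, …, arg n]} encoding can be made concrete with a toy evaluator. This is a sketch of the idea only, not the PFA reference implementation: the three-entry `FUNCTIONS` table and the `evaluate` helper are assumptions, and real PFA defines a far larger function library plus type checking.

```python
import json
import operator

# Tiny, hypothetical function table for the sketch.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, env):
    """Evaluate a {function: [arg 1, ..., arg n]} tree against variables in env."""
    if isinstance(expr, dict):
        (name, args), = expr.items()        # exactly one function per JSON object
        values = [evaluate(arg, env) for arg in args]
        return FUNCTIONS[name](*values)
    if isinstance(expr, str):
        return env[expr]                    # a bare string names a variable
    return expr                             # numbers evaluate to themselves


doc = json.loads('{"+": [{"*": ["input", 2]}, 1]}')
result = evaluate(doc, {"input": 3})        # computes input * 2 + 1
```

Because the document is plain JSON data, an engine can inspect and restrict which functions a model may call before running it.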

Benefits of PFA

•  PFA is based upon JSON and Avro and integrates easily into modern big data environments.
•  PFA allows models to be easily chained and composed.
•  PFA allows developers and users of analytic systems to pre-process inputs and to post-process outputs to models.
•  PFA is easily integrated with Hadoop, Spark, etc.
•  PFA is easily integrated with Kafka, Storm, Akka and other streaming environments.
•  PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
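Chaining pre-processing, a model, and post-processing amounts to function composition. The sketch below shows that shape in plain Python; the `chain` helper, the three steps, and all numbers are hypothetical and only stand in for what a PFA document would express.

```python
from functools import reduce


def chain(*steps):
    """Compose steps left to right: the output of one is the input of the next."""
    return lambda record: reduce(lambda value, step: step(value), steps, record)


# Hypothetical steps; the names and numbers are illustrative only.
def pre_process(record):
    return (record["temp_f"] - 32.0) * 5.0 / 9.0   # Fahrenheit -> Celsius


def model(temp_c):
    return 0.5 * temp_c + 1.0                      # toy linear model


def post_process(score):
    return "alert" if score > 10.0 else "ok"       # turn a score into a decision


pipeline = chain(pre_process, model, post_process)
hot = pipeline({"temp_f": 212.0})
cold = pipeline({"temp_f": 32.0})
```

Each step can be replaced independently, which is what makes updating only the pre- or post-processing cheap.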

Example: Scoring Clusters

[Figures omitted]
Source: dmg.org/pfa

PFA Case Study 2

•  Two teams of data scientists develop analytic models for an adversarial analytics project.
•  Models developed in Hadoop and exported in PFA every 4 weeks.
•  Models updated in client systems every 2 weeks.
•  It's not quite this simple, but that's the general idea.

[Figure: Export PFA -> Import PFA; Event records -> Event scores]
Timeline labels: Weeks 1-4, 5-8, …; Weeks 3-6, 7-10, …; Weeks 4, 6, 8, 10, …

Case Study 3: Gaussian Process Model (1 of 5)

Gaussian Process Model (2 of 5)

input: {type: array, items: double}
output: {type: array, items: double}
cells:
  table:
    type:
      {type: array, items: {type: record, name: GP, fields: [
        {name: x, type: {type: array, items: double}},
        {name: to, type: {type: array, items: double}},
        {name: sigma, type: {type: array, items: double}}]}}
    init:
      - {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      ...
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

Highlighted: input and output of scoring engine expressed as Avro schemas.

Source: dmg.org/pfa

Gaussian Process Model (3 of 5)

cells:
  table:
    type:
      {type: array, items: {type: record, name: GP, fields: [
        {name: x, type: {type: array, items: double}},
        {name: to, type: {type: array, items: double}},
        {name: sigma, type: {type: array, items: double}}]}}
    init:
      - {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      ...
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}

Highlighted: Gaussian Process model parameters, given as a type (also Avro) and a value (as JSON, truncated).

Source: dmg.org/pfa

Gaussian Process Model (4 of 5)

action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

Highlighted: calling method, with parameters expressed as JSON.
•  input: get interpolation point from input
•  {cell: table}: get parameters from table
•  null: no explicit Kriging weight (universal)
•  {fcn: …}: kernel function

Source: dmg.org/pfa

Gaussian Process Model (5 of 5)

action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}

•  Appears declarative, but this is a function call.
   –  Fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
   –  m.kernel.rbf was intended for SVM, but is reusable anywhere.
   –  One argument (gamma) preapplied so that it fits the signature for model.reg.gaussianProcess.
•  Any kernel function could be used, including user-defined functions written with PFA "code."
•  The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.

Source: dmg.org/pfa
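Pre-applying gamma so that a three-argument kernel fits a two-argument slot is ordinary partial application. The sketch below mirrors that with `functools.partial`; `rbf` uses the usual radial basis definition exp(-gamma * ||x - y||^2), and `weighted_average` is a hypothetical stand-in consumer (a kernel-weighted average with scalar targets), not Gaussian process regression and not the PFA library function.

```python
import math
from functools import partial


def rbf(x, y, gamma):
    """Radial basis kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def weighted_average(point, table, kernel):
    """Toy consumer that expects a two-argument kernel(x, y).

    Used only to show the calling convention; it is a kernel-weighted
    average, not a Gaussian process.
    """
    weights = [kernel(point, row["x"]) for row in table]
    return sum(w * row["to"] for w, row in zip(weights, table)) / sum(weights)


kernel = partial(rbf, gamma=2.0)   # gamma pre-applied, in the spirit of fill: {gamma: 2.0}

# Made-up two-row table; the slide's table has array-valued "to" fields.
table = [{"x": [0.0, 0.0], "to": 1.0}, {"x": [1.0, 0.0], "to": 3.0}]
estimate = weighted_average([0.5, 0.0], table, kernel)
```

Any other two-argument function could be passed as `kernel`, which is the reuse the bullets describe.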

Case Study 4: AnalyticOps for the Genomic Data Commons*

Genomics dataset: 2.5+ PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.

2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.

*Source: Based in part on "The Genomic Data Commons", the GDC team, ms in preparation.

DevOps
•  Virtualization and the requirement for massive scale out spawned infrastructure automation ("infrastructure as code").
•  The requirement to reduce the time to deploy code created tools for continuous integration and testing.

ModelDev AnalyticOps
•  Use virtualization / containers, infrastructure automation and scale out to support large scale analytics.
•  Requirement: reduce the time and cost to do high quality analytics over large amounts of data.

DevOps

[Figure: diagram relating Software Development, Quality Assurance, and Operations to DevOps]

The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*

*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.

AnalyticOps

[Figure: diagram relating Analytic Workflows, Quality Assurance, and Analytic Operations to AnalyticOps]

The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.

•  Software
•  Model
•  Data

Building the right analytic model.
Is the analytic model running right?

Source: Robert L. Grossman, A Quick Introduction to AnalyticOps.

Ten Factors Affecting AnalyticOps

•  Model quality (confusion matrix)
•  Data quality (six dimensions)
•  Lack of ground truth
•  Software errors
•  Monitoring system
•  Quality of workflow and scheduling
•  Bottlenecks, stragglers, hot spots, etc.
•  Analytic configuration problems
•  System failures
•  Human errors

Source: Based in part on "The Genomic Data Commons", the GDC team, ms in preparation.
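The model quality factor mentions a confusion matrix; a minimal sketch for a binary classifier is shown here. The `confusion_matrix` helper and the label vectors are invented for illustration, and it assumes ground-truth labels are available, which the list itself notes is not always the case.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a binary classifier."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],   # actual 1, predicted 1
        "fn": counts[(1, 0)],   # actual 1, predicted 0
        "fp": counts[(0, 1)],   # actual 0, predicted 1
        "tn": counts[(0, 0)],   # actual 0, predicted 0
    }


# Made-up labels for the sketch.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
```

Tracking these counts over time on deployed scores is one simple way to notice when model quality starts to drift.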

Summary

•  Deploying analytic models is a core technical competency.
•  The Portable Format for Analytics (PFA) is a model interchange format for building analytic models in one environment and deploying them in another one.
•  PFA is based upon data mining primitives & supports pre-processing, common analytic models, post-processing, & composition of primitives and models.
•  It is easy to add your own PFA functions and models.
•  A reference implementation & compliance tests are available.
•  PFA is being developed by the not-for-profit DMG.
•  A discipline of AnalyticOps is emerging that supports more complex analytic flows at greater scale.

Questions?

For more information about PFA, see: dmg.org/pfa

rgrossman.com
@BobGrossman
