Best Practices for Deploying Analytic Models into Operations
Robert L. Grossman
Open Data Group and University of Chicago
Predictive Analytic World Chicago
June 21, 2016
rgrossman.com
@bobgrossman
  
Life Cycle of an Analytic Model
Analytic modeling (Model Env)
•  Select analytic problem & approach
•  Get and clean the data
•  Exploratory Data Analysis
•  Build model in dev/modeling environment
Deploy model
Analytic operations (Deployment Env)
•  Deploy model in operational systems with scoring application
•  Scale up deployment
•  Monitor performance and employ champion-challenger methodology
•  Retire model and deploy improved model
Perf. data
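The champion-challenger step in the life cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in the talk: the `Candidate` type, the `(feature, label)` record layout, the accuracy metric, and the `margin` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Hypothetical record layout for performance data: (feature value, observed label).
Record = Tuple[float, int]


@dataclass
class Candidate:
    name: str
    score: Callable[[float], int]  # maps a feature value to a predicted label


def accuracy(model: Candidate, perf_data: Sequence[Record]) -> float:
    """Fraction of performance-data records the model labels correctly."""
    hits = sum(1 for x, y in perf_data if model.score(x) == y)
    return hits / len(perf_data)


def pick_champion(champion: Candidate, challenger: Candidate,
                  perf_data: Sequence[Record], margin: float = 0.02) -> Candidate:
    """Keep the champion unless the challenger beats it by at least `margin`."""
    if accuracy(challenger, perf_data) >= accuracy(champion, perf_data) + margin:
        return challenger
    return champion


# Two toy threshold models compared on five labeled records.
champion = Candidate("threshold-0.5", lambda x: int(x > 0.5))
challenger = Candidate("threshold-0.3", lambda x: int(x > 0.3))
perf_data = [(0.2, 0), (0.4, 1), (0.6, 1), (0.35, 1), (0.1, 0)]
winner = pick_champion(champion, challenger, perf_data)
```

The margin guards against retiring a deployed model over a difference that could be noise; in practice the comparison metric would be whatever the operational team monitors.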
  
Differences Between the Modeling and Deployment Environments
•  Typically modelers use specialized languages such as SAS, SPSS or R.
•  Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
•  This can result in significant effort and significant delays moving the model from the modeling environment to the deployment environment.
  
Joe, IT: “Would you mind writing all your models in Java?”
Alice, Data Scientist: “I write all my models in R, why don’t you do the same?”
Bob, Data Scientist: “I write all my models in scikit-learn, why don’t you do the same?”
Ways to Deploy Models into Products/Services/Operations
•  Push code.
•  Embed a model into a product or service.
•  Export and import tables of scores.
•  Export and import tables of parameters.
•  Have the product/service interact with the model as a web or message service.
•  Import the models into a database.
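One of the options above, exporting and importing a table of parameters, might look like the sketch below. The JSON layout, the linear model, and the names `export_parameters` and `make_scorer` are assumptions made for illustration; the slide does not prescribe a format.

```python
import json


def export_parameters(intercept, coefficients):
    """Modeling side: serialize fitted parameters to a JSON string."""
    return json.dumps({"intercept": intercept, "coefficients": coefficients},
                      sort_keys=True)


def make_scorer(parameter_json):
    """Deployment side: rebuild a linear scoring function from the exported table."""
    params = json.loads(parameter_json)
    intercept = params["intercept"]
    coefficients = params["coefficients"]

    def score(record):
        # Weighted sum of the named fields plus the intercept.
        return intercept + sum(w * record[name] for name, w in coefficients.items())

    return score


exported = export_parameters(1.0, {"age": 0.5, "visits": 2.0})
score = make_scorer(exported)
result = score({"age": 40, "visits": 3})
```

Only the parameter values cross the boundary here, so the scoring logic has to be implemented and kept in sync on the deployment side; that trade-off is what the later slides on model interchange formats address.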
  
How quickly can the model be updated?
•  Model parameters?
•  New features?
•  New pre- & post-processing?
  
The not-for-profit DMG develops and supports standards.
www.dmg.org
PMML
PFA
  
What is an Analytic Engine?
•  An analytic engine is a component that is integrated into products or enterprise IT that deploys analytic models in operational workflows for products and services.
•  A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application.
•  Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
•  Analytic engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
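The "integrated once, updated by reading a file" idea can be illustrated with a toy engine. This is only a sketch: the `AnalyticEngine` class and the JSON threshold format are hypothetical stand-ins for a real engine and a real interchange format such as PMML or PFA.

```python
import json
import os
import tempfile


class AnalyticEngine:
    """Toy engine: integrated once, re-reads a model file to update the model."""

    def __init__(self, model_path):
        self.model_path = model_path
        self.reload()

    def reload(self):
        # Hypothetical in-house format: {"version": ..., "threshold": ...}
        with open(self.model_path) as f:
            self.model = json.load(f)

    def score(self, value):
        # The application always calls score(); only the model file changes.
        return value > self.model["threshold"]


def write_model(path, version, threshold):
    """Stand-in for the modeling environment exporting a model file."""
    with open(path, "w") as f:
        json.dump({"version": version, "threshold": threshold}, f)


model_path = os.path.join(tempfile.mkdtemp(), "model.json")
write_model(model_path, 1, 10.0)
engine = AnalyticEngine(model_path)
before = engine.score(7.0)       # scored with version 1

write_model(model_path, 2, 5.0)  # modeler exports an updated model
engine.reload()                  # no change to the calling application
after = engine.score(7.0)        # scored with version 2
```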
  
Deploying analytic models
Model Producer → Export model → PMML → Import model → Model Consumer
  
PMML Philosophy
•  PMML is an XML specification of a model, not an implementation of a model.
•  PMML provides a simple means of binding parameters to values for an agreed upon set of data mining models & transformations in a safe way.
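To make "a specification, not an implementation" concrete, the sketch below reads regression parameters from a small XML fragment and leaves the scoring arithmetic to the consumer. The fragment is a simplified, namespace-free illustration in the style of a PMML RegressionModel; a complete PMML document also carries a namespace, a header, a data dictionary and a mining schema, all omitted here. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment for illustration only; not a complete, valid PMML document.
PMML_FRAGMENT = """
<PMML>
  <RegressionModel functionName="regression">
    <RegressionTable intercept="2.0">
      <NumericPredictor name="x1" coefficient="3.0"/>
      <NumericPredictor name="x2" coefficient="-1.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_regression(xml_text):
    """Read the parameter bindings; the XML itself contains no executable code."""
    table = ET.fromstring(xml_text).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.findall("NumericPredictor")}
    return intercept, coefficients


def score(record, intercept, coefficients):
    """The consumer supplies the implementation of the agreed-upon model type."""
    return intercept + sum(c * record[name] for name, c in coefficients.items())


intercept, coefficients = load_regression(PMML_FRAGMENT)
prediction = score({"x1": 1.0, "x2": 4.0}, intercept, coefficients)
```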
  
Deploying analytic models and workflows
Analytic Workflow Producers → Export analytic workflows → PFA → Import analytic workflows → Analytic Engines
  
PFA Philosophy
•  Define primitives for data transformations, data aggregations, and statistical and analytic models.
•  Support composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
•  Be extensible.
•  Designed to be “safe” to deploy in enterprise IT operational environments.
•  This is a philosophy that is different from and complementary to that of the Predictive Model Markup Language (PMML).
  
PFA Case Study 1
•  20+ person data science group developing models in R, Python, Scikit-learn, MATLAB, ...
•  All the data scientists export their models in PFA.
•  The company’s product imports models in PFA and runs on their customers’ data as required.
Export PFA → Import PFA
Widget records → Widget scores
  
PFA Functionality
•  PFA codes arbitrary mathematical algorithms in a tightly controlled environment.
•  PFA has all the standard flow control of a programming language: if/then/else & for/while loops.
•  PFA has function calls and function callbacks.
•  PFA has algebraic data types.
•  PFA is encoded as function calls in JSON:
    {function: [arg 1, arg 2, …, arg n]}
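The `{function: [arg 1, …, arg n]}` encoding can be illustrated with a toy evaluator. This is not a PFA implementation: the three-entry `FUNCTIONS` table and the `evaluate` helper are hypothetical, and real PFA adds type checking, its own function library and much more.

```python
import json
import operator

# Hypothetical function table for this sketch; real PFA defines its own library.
FUNCTIONS = {"+": operator.add, "*": operator.mul, "max": max}


def evaluate(expr, env):
    """Evaluate a nested {function: [args]} expression against named inputs."""
    if isinstance(expr, dict):
        # A one-key object is a function call: the key names the function.
        name, args = next(iter(expr.items()))
        return FUNCTIONS[name](*[evaluate(a, env) for a in args])
    if isinstance(expr, str):
        # A bare string is a variable reference in this sketch.
        return env[expr]
    return expr  # numbers evaluate to themselves


expression = json.loads('{"+": [{"*": ["input", 2]}, {"max": [1, 3]}]}')
value = evaluate(expression, {"input": 5})
```

Because the document is data rather than code, the engine decides which functions exist, which is one way to read the "tightly controlled environment" bullet above.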
  
Benefits of PFA
•  PFA is based upon JSON and Avro and integrates easily into modern big data environments.
•  PFA allows models to be easily chained and composed.
•  PFA allows developers and users of analytic systems to pre-process inputs to models and to post-process their outputs.
•  PFA is easily integrated with Hadoop, Spark, etc.
•  PFA is easily integrated with Kafka, Storm, Akka and other streaming environments.
•  PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
  
Example: Scoring Clusters
Source: dmg.org/pfa
  
PFA Case Study 2
•  Two teams of data scientists develop analytic models for an adversarial analytics project.
•  Models developed in Hadoop and exported in PFA every 4 weeks.
•  Models updated in client systems every 2 weeks.
•  It’s not quite this simple, but that’s the general idea.
Export PFA → Import PFA
Event records → Event scores
Weeks 1-4, 5-8, … | Weeks 3-6, 7-10, … | Weeks 4, 6, 8, 10, …
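The timeline labels suggest how a 4-week export cycle yields 2-week client updates: two teams on staggered cycles. The short calculation below works through that reading. It is an interpretation of the slide, and the `export_weeks` helper and the 12-week horizon are hypothetical.

```python
def export_weeks(first_export, period, horizon):
    """Weeks on which a team exports a model, up to and including `horizon`."""
    return list(range(first_export, horizon + 1, period))


# Team A builds over weeks 1-4, 5-8, ... and exports at the end of each cycle.
team_a = export_weeks(4, 4, 12)
# Team B builds over weeks 3-6, 7-10, ... and is offset from team A by 2 weeks.
team_b = export_weeks(6, 4, 12)

# The client system receives whichever export arrives next.
client_updates = sorted(team_a + team_b)
gaps = [later - earlier for earlier, later in zip(client_updates, client_updates[1:])]
```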
  
Case Study 3: Gaussian Process Model (1 of 5)
  
Gaussian Process Model (2 of 5)
input: {type: array, items: double}
output: {type: array, items: double}
cells:
  table:
    type:
      type: array
      items:
        type: record
        name: GP
        fields:
          - {name: x, type: {type: array, items: double}}
          - {name: to, type: {type: array, items: double}}
          - {name: sigma, type: {type: array, items: double}}
    init:
      - {x: [0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      # ... (rows truncated on the slide)
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Callout on the input and output lines: input and output of the scoring engine are expressed as Avro schemas.
Source: dmg.org/pfa
  
Gaussian Process Model (3 of 5)
Same PFA document as on slide 2 of 5; the callouts annotate the cells section:
cells:
  table:
    type:
      type: array
      items:
        type: record
        name: GP
        fields:
          - {name: x, type: {type: array, items: double}}
          - {name: to, type: {type: array, items: double}}
          - {name: sigma, type: {type: array, items: double}}
    init:
      - {x: [0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
      - {x: [0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
      - {x: [0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
      # ... (rows truncated on the slide)
      - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
Callouts: Gaussian Process model parameters; type (also Avro) and value (as JSON, truncated).
Source: dmg.org/pfa
  
Gaussian Process Model (4 of 5)
Same PFA document as on slide 2 of 5; the callouts annotate the action section:
action:
  model.reg.gaussianProcess:
    - input
    - {cell: table}
    - null
    - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
calling method: parameters expressed as JSON
•  input: get interpolation point from input
•  {cell: table}: get parameters from table
•  null: no explicit Kriging weight (universal)
•  {fcn: …}: kernel function
Source: dmg.org/pfa
  
Gaussian Process Model (5 of 5)
•  Appears declarative, but this is a function call.
    –  Fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
    –  m.kernel.rbf was intended for SVM, but is reusable anywhere.
    –  One argument (gamma) preapplied so that it fits the signature for model.reg.gaussianProcess.
•  Any kernel function could be used, including user-defined functions written with PFA “code.”
•  The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
model.reg.gaussianProcess:
  - input
  - {cell: table}
  - null
  - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
Source: dmg.org/pfa
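The "one argument preapplied" point is partial application, which can be sketched in Python with `functools.partial`. The `rbf` function below uses the common form exp(-gamma * ||x - y||^2) and is assumed, not confirmed, to match m.kernel.rbf. The `kernel_weighted_average` consumer is a hypothetical stand-in that merely expects a two-argument kernel; it is not a Gaussian process.

```python
import math
from functools import partial


def rbf(x, y, gamma):
    """Radial basis kernel in its common form (assumed to match m.kernel.rbf)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def kernel_weighted_average(point, table, kernel):
    """Stand-in consumer that expects kernel(x, y); NOT a Gaussian process."""
    weights = [kernel(point, row["x"]) for row in table]
    return sum(w * row["to"] for w, row in zip(weights, table)) / sum(weights)


# Hypothetical two-row table with scalar targets, for illustration only.
table = [{"x": [0.0, 0.0], "to": 1.0}, {"x": [1.0, 0.0], "to": 3.0}]

# Pre-apply gamma so the three-argument kernel fits the two-argument signature,
# analogous to {fcn: m.kernel.rbf, fill: {gamma: 2.0}}.
kernel = partial(rbf, gamma=2.0)
estimate = kernel_weighted_average([0.5, 0.0], table, kernel)
```

Any other two-argument function could be passed in place of `kernel`, which mirrors the slide's remark that user-defined kernels are allowed.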
  
Case Study 4: AnalyticOps for the Genomic Data Commons*
Genomics dataset: 2.5+ PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community-developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
*Source: Based in part on “The Genomic Data Commons”, the GDC team, ms in preparation.
  
DevOps
•  Virtualization and the requirement for massive scale-out spawned infrastructure automation (“infrastructure as code”).
•  The requirement to reduce the time to deploy code created tools for continuous integration and testing.
ModelDev AnalyticOps
•  Use virtualization / containers, infrastructure automation and scale-out to support large-scale analytics.
•  Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
  
DevOps
Diagram labels: Software Development, Quality Assurance, Operations.
The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
*Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
  
AnalyticOps
Diagram labels: Analytic Workflows, Quality Assurance, Analytic Operations.
The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
•  Software
•  Model
•  Data
Building the right analytic model.
Is the analytic model running right?
Source: Robert L. Grossman, A Quick Introduction to AnalyticOps.
  
Ten Factors Affecting AnalyticOps
•  Model quality (confusion matrix)
•  Data quality (six dimensions)
•  Lack of ground truth
•  Software errors
•  Monitoring system
•  Quality of workflow and scheduling
•  Bottlenecks, stragglers, hot spots, etc.
•  Analytic configuration problems
•  System failures
•  Human errors
Source: Based in part on “The Genomic Data Commons”, the GDC team, ms in preparation.
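For the first factor, model quality measured with a confusion matrix, a minimal sketch for a binary classifier follows. The labels and predictions are made-up illustration data, and `confusion_matrix` is a hypothetical helper rather than anything from the talk.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],
        "fp": counts[(0, 1)],
        "fn": counts[(1, 0)],
        "tn": counts[(0, 0)],
    }


# Made-up labels and predictions for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
```

Computing this in operations depends on having ground truth for the scored records, which is why "lack of ground truth" appears as its own factor in the list.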
  
Summary
•  Deploying analytic models is a core technical competency.
•  The Portable Format for Analytics (PFA) is a model interchange format for building analytic models in one environment and deploying them in another one.
•  PFA is based upon data mining primitives & supports pre-processing, common analytic models, post-processing, & composition of primitives and models.
•  It is easy to add your own PFA functions and models.
•  There is a reference implementation & compliance tests.
•  PFA is being developed by the not-for-profit DMG.
•  A discipline of AnalyticOps is emerging that supports more complex analytic flows at greater scale.
  
Questions?
For more information about PFA, see: dmg.org/pfa
rgrossman.com
@BobGrossman
  

More Related Content

What's hot

Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
Databricks
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
Revolution Analytics
 

What's hot (20)

AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014Yarn spark next_gen_hadoop_8_jan_2014
Yarn spark next_gen_hadoop_8_jan_2014
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
R and Data Science
R and Data ScienceR and Data Science
R and Data Science
 
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
運用CNTK 實作深度學習物件辨識 Deep Learning based Object Detection with Microsoft Cogniti...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
Next generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph labNext generation analytics with yarn, spark and graph lab
Next generation analytics with yarn, spark and graph lab
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
Streaming analytics state of the art
Streaming analytics state of the artStreaming analytics state of the art
Streaming analytics state of the art
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...
Building A Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and ...
 
Ai use cases
Ai use casesAi use cases
Ai use cases
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 

Viewers also liked

Viewers also liked (20)

Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data Keynote on 2015 Yale Day of Data
Keynote on 2015 Yale Day of Data
 
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
Clouds and Commons for the Data Intensive Science Community (June 8, 2015)
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)Processing Big Data (Chapter 3, SC 11 Tutorial)
Processing Big Data (Chapter 3, SC 11 Tutorial)
 
What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care? What is a Data Commons and Why Should You Care?
What is a Data Commons and Why Should You Care?
 
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
Introduction to Big Data and Science Clouds (Chapter 1, SC 11 Tutorial)
 
Practical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large DatasetsPractical Methods for Identifying Anomalies That Matter in Large Datasets
Practical Methods for Identifying Anomalies That Matter in Large Datasets
 
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
Bionimbus: Towards One Million Genomes (XLDB 2012 Lecture)
 
The Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of ScienceThe Open Science Data Cloud: Empowering the Long Tail of Science
The Open Science Data Cloud: Empowering the Long Tail of Science
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)Managing Big Data (Chapter 2, SC 11 Tutorial)
Managing Big Data (Chapter 2, SC 11 Tutorial)
 
Adversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World TalkAdversarial Analytics - 2013 Strata & Hadoop World Talk
Adversarial Analytics - 2013 Strata & Hadoop World Talk
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
 
On the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) modelsOn the representation and reuse of machine learning (ML) models
On the representation and reuse of machine learning (ML) models
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Language
 
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery DataThe Matsu Project - Open Source Software for Processing Satellite Imagery Data
The Matsu Project - Open Source Software for Processing Satellite Imagery Data
 
What Are Science Clouds?
What Are Science Clouds?What Are Science Clouds?
What Are Science Clouds?
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
NGS - QC & Dataformat
NGS - QC & Dataformat NGS - QC & Dataformat
NGS - QC & Dataformat
 

Similar to AnalyticOps - Chicago PAW 2016

Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
Stepan Pushkarev
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
DataWorks Summit
 

Similar to AnalyticOps - Chicago PAW 2016 (20)

How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Utilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learningUtilisation de MLflow pour le cycle de vie des projet Machine learning
Utilisation de MLflow pour le cycle de vie des projet Machine learning
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Hydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations AutomationHydrosphere.io Platform for AI/ML Operations Automation
Hydrosphere.io Platform for AI/ML Operations Automation
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML Models
 
Monitoring AI with AI
Monitoring AI with AIMonitoring AI with AI
Monitoring AI with AI
 
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” — Stepan Pushkarev, CTO ...
 
Scaling AI in production using PyTorch
Scaling AI in production using PyTorchScaling AI in production using PyTorch
Scaling AI in production using PyTorch
 
World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018World Artificial Intelligence Conference Shanghai 2018
World Artificial Intelligence Conference Shanghai 2018
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create
 

More from Robert Grossman

More from Robert Grossman (11)

Some Frameworks for Improving Analytic Operations at Your Company
Some Frameworks for Improving Analytic Operations at Your CompanySome Frameworks for Improving Analytic Operations at Your Company
AnalyticOps - Chicago PAW 2016

  • 1. Best Practices for Deploying Analytic Models into Operations. Robert L. Grossman, Open Data Group and University of Chicago. Predictive Analytic World Chicago, June 21, 2016. rgrossman.com, @bobgrossman
  • 2. Life Cycle of an Analytic Model (diagram). Analytic modeling (Model Env): Select analytic problem & approach; Get and clean the data; Exploratory Data Analysis; Build model in dev/modeling environment. Deploy model. Analytic operations (Deployment Env): Deploy model in operational systems with scoring application; Scale up deployment; Monitor performance and employ champion-challenger methodology; Retire model and deploy improved model. Perf. data.
  • 3. Differences Between the Modeling and Deployment Environments • Typically modelers use specialized languages such as SAS, SPSS or R. • Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc. • This can result in significant effort and significant delays moving the model from the modeling environment to the deployment environment.
  • 4. Cartoon with three characters: Alice, Data Scientist; Bob, Data Scientist; Joe, IT. Speech bubbles: “Would you mind writing all your models in Java?” “I write all my models in R, why don’t you do the same?” “I write all my models in scikit-learn, why don’t you do the same?”
  • 5. Ways to Deploy Models into Products/Services/Operations • Push code. • Embed a model into a product or service. • Export and import tables of scores. • Export and import tables of parameters. • Have the product/service interact with the model as a web or message service. • Import the models into a database. How quickly can the model be updated? • Model parameters? • New features? • New pre- & post-processing?
  • 6. The not-for-profit DMG develops and supports standards: PMML and PFA. www.dmg.org
  • 7. What is an Analytic Engine? • An analytic engine is a component that is integrated into products or enterprise IT and that deploys analytic models in operational workflows for products and services. • A Model Interchange Format is a format that supports the exporting of a model by one application and the importing of a model by another application. • Model Interchange Formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats. • Analytic engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
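The "integrate once, update by reading a file" idea on slide 7 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `AnalyticEngine` class, the `load_model` and `score` methods, and the tiny JSON layout (`kind`, `intercept`, `weights`) stand in for an in-house interchange format and are not PMML or PFA.

```python
import json


class AnalyticEngine:
    """Toy engine: wired into the application once, then fed new model documents."""

    def __init__(self):
        self._model = None

    def load_model(self, model_text):
        """Replace the current model by parsing a (hypothetical) JSON model document."""
        doc = json.loads(model_text)
        if doc.get("kind") != "linear":
            raise ValueError("unsupported model kind: %r" % doc.get("kind"))
        self._model = doc

    def score(self, record):
        """Score one record given as a dict of feature name -> number."""
        if self._model is None:
            raise RuntimeError("no model loaded")
        total = self._model["intercept"]
        for name, weight in self._model["weights"].items():
            total += weight * record[name]
        return total


engine = AnalyticEngine()
engine.load_model('{"kind": "linear", "intercept": 1.0, "weights": {"x": 2.0}}')
first = engine.score({"x": 3.0})   # 1.0 + 2.0 * 3.0

# Updating the model only requires reading a new document; the calling code is unchanged.
engine.load_model('{"kind": "linear", "intercept": 0.0, "weights": {"x": 0.5}}')
second = engine.score({"x": 3.0})  # 0.0 + 0.5 * 3.0
```

The application keeps calling `engine.score` throughout; only the model document changes between the two calls.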
  • 8. Deploying analytic models (diagram): a Model Producer exports a model as PMML; a Model Consumer imports the model.
  • 9. PMML Philosophy • PMML is an XML specification of a model, not an implementation of a model. • PMML provides a simple means of binding parameters to values for an agreed upon set of data mining models & transformations in a safe way.
  • 10. Deploying analytic models and workflows (diagram): Analytic Workflow Producers export analytic workflows as PFA; Analytic Engines import the analytic workflows.
  • 11. PFA Philosophy • Define primitives for data transformations, data aggregations, and statistical and analytic models. • Support composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data). • Be extensible. • Designed to be “safe” to deploy in enterprise IT operational environments. • This is a philosophy that is different and complementary to Predictive Model Markup Language (PMML).
  • 12. PFA Case Study 1 • 20+ person data science group developing models in R, Python, Scikit-learn, MATLAB, ... • All the data scientists export their models in PFA. • The company’s product imports models in PFA and runs on their customers’ data as required. Diagram: Export PFA, Import PFA; widget records in, widget scores out.
  • 13. PFA Functionality • PFA codes arbitrary mathematical algorithms in a tightly controlled environment. • PFA has all the standard flow control of a programming language: if/then/else & for/while loops. • PFA has function calls and function callbacks. • PFA has algebraic data types. • PFA is encoded as function calls in JSON: {function: [arg 1, arg 2, …, arg n]}
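To make the `{function: [arg 1, arg 2, …, arg n]}` encoding from slide 13 concrete, here is a toy Python evaluator for nested JSON function calls. It is a sketch of the encoding idea only, not a PFA implementation: the `FUNCTIONS` table and the `evaluate` helper are hypothetical, and the typing, flow control and safety guarantees of the real format are left out.

```python
import json
import operator

# Hypothetical function table; the real format defines a much larger, typed library.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "max": max,
}


def evaluate(expr, env):
    """Evaluate a nested {function: [args]} expression against variables in env."""
    if isinstance(expr, dict):
        if len(expr) != 1:
            raise ValueError("expected exactly one function name per call")
        (name, args), = expr.items()
        values = [evaluate(arg, env) for arg in args]
        return FUNCTIONS[name](*values)
    if isinstance(expr, str):
        return env[expr]   # a bare string names a variable
    return expr            # numbers are literals


# (input * 2) + 100, written as nested function calls in JSON.
call = json.loads('{"+": [{"*": ["input", 2]}, 100]}')
result = evaluate(call, {"input": 3})
```

Because the whole expression is plain JSON, any tool that can read or write JSON can produce or inspect it, which is what makes this style of encoding suitable as an interchange format.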
  • 14. Benefits of PFA • PFA is based upon JSON and Avro and integrates easily into modern big data environments. • PFA allows models to be easily chained and composed. • PFA allows developers and users of analytic systems to pre-process inputs to models and to post-process their outputs. • PFA is easily integrated with Hadoop, Spark, etc. • PFA is easily integrated with Kafka, Storm, Akka and other streaming environments. • PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
  • 15. Example: Scoring Clusters. Source: dmg.org/pfa
  • 17. PFA Case Study 2 • Two teams of data scientists develop analytic models for an adversarial analytics project. • Models developed in Hadoop and exported in PFA every 4 weeks. • Models updated in client systems every 2 weeks. • It’s not quite this simple, but that’s the general idea. Diagram: Export PFA, Import PFA; event records in, event scores out. Timeline labels: Weeks 1-4, 5-8, …; Weeks 3-6, 7-10, …; Weeks 4, 6, 8, 10, …
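One reading of the timeline on slide 17 is that the two teams work in 4-week windows offset by two weeks, so an export lands in the client systems every two weeks. The Python sketch below checks that arithmetic. The `windows` helper and the assignment of the week ranges to a "team A" and "team B" are assumptions made for illustration; the slide only gives the week labels.

```python
def windows(first_week, length, count):
    """Return `count` back-to-back (start, end) week windows."""
    return [
        (first_week + i * length, first_week + (i + 1) * length - 1)
        for i in range(count)
    ]


team_a = windows(first_week=1, length=4, count=2)   # weeks 1-4, 5-8
team_b = windows(first_week=3, length=4, count=2)   # weeks 3-6, 7-10

# A model is exported at the end of each window; merge both teams' exports.
update_weeks = sorted(end for _, end in team_a + team_b)
```

Under this reading, each team still exports only every four weeks, while the client sees a new model in weeks 4, 6, 8 and 10.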
  • 18. Case Study 3: Gaussian Process Model (1 of 5)
  • 19. Gaussian Process Model (2 of 5)
      input: {type: array, items: double}
      output: {type: array, items: double}
      cells:
        table:
          type: {type: array, items: {type: record, name: GP, fields: [
            - {name: x, type: {type: array, items: double}}
            - {name: to, type: {type: array, items: double}}
            - {name: sigma, type: {type: array, items: double}}]}}
          init:
            - {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
            - {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
            - {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
            ...
            - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
      action:
        model.reg.gaussianProcess:
          - input
          - {cell: table}
          - null
          - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
      Callout on input/output: input and output of scoring engine expressed as Avro schemas. Source: dmg.org/pfa
  • 20. Gaussian Process Model (3 of 5)
      input: {type: array, items: double}
      output: {type: array, items: double}
      cells:
        table:
          type: {type: array, items: {type: record, name: GP, fields: [
            - {name: x, type: {type: array, items: double}}
            - {name: to, type: {type: array, items: double}}
            - {name: sigma, type: {type: array, items: double}}]}}
          init:
            - {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
            - {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
            - {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
            ...
            - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
      action:
        model.reg.gaussianProcess:
          - input
          - {cell: table}
          - null
          - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
      Callout on cells: Gaussian Process model parameters, given as a type (also Avro) and a value (as JSON, truncated). Source: dmg.org/pfa
  • 21. Gaussian Process Model (4 of 5)
      input: {type: array, items: double}
      output: {type: array, items: double}
      cells:
        table:
          type: {type: array, items: {type: record, name: GP, fields: [
            - {name: x, type: {type: array, items: double}}
            - {name: to, type: {type: array, items: double}}
            - {name: sigma, type: {type: array, items: double}}]}}
          init:
            - {x: [ 0, 0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
            - {x: [ 0, 36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
            - {x: [ 0, 72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
            ...
            - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
      action:
        model.reg.gaussianProcess:
          - input
          - {cell: table}
          - null
          - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
      Callout on action: calling method, with parameters expressed as JSON. input: get interpolation point from input. {cell: table}: get parameters from table. null: no explicit Kriging weight (universal). {fcn: …}: kernel function. Source: dmg.org/pfa
  • 22. Gaussian Process Model (5 of 5) • Appears declarative, but this is a function call. – Fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential). – m.kernel.rbf was intended for SVM, but is reusable anywhere. – One argument (gamma) preapplied so that it fits the signature for model.reg.gaussianProcess. • Any kernel function could be used, including user-defined functions written with PFA “code.” • The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
      model.reg.gaussianProcess:
        - input
        - {cell: table}
        - null
        - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
      Source: dmg.org/pfa
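The "one argument preapplied" point on slide 22 is ordinary partial application, and a short Python sketch may help. Assumptions are labeled here and in the comments: the `rbf` function uses the common form exp(-gamma * ||x - y||^2), and `kernel_smoother` is a deliberately simplified kernel-weighted average standing in for the consumer of the kernel. It is not Gaussian Process regression and not the behavior of `model.reg.gaussianProcess`. The two table rows are taken from the slide.

```python
import math
from functools import partial


def rbf(x, y, gamma):
    """Radial basis function kernel, assumed form: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_smoother(point, table, kernel):
    """Kernel-weighted average of the table outputs.

    `kernel` must accept exactly two vectors. This is a simplified stand-in,
    not Gaussian Process regression.
    """
    weights = [kernel(point, row["x"]) for row in table]
    total = sum(weights)
    n_out = len(table[0]["to"])
    return [
        sum(w * row["to"][j] for w, row in zip(weights, table)) / total
        for j in range(n_out)
    ]


# gamma is pre-applied, leaving the two-argument signature the caller expects.
kernel = partial(rbf, gamma=2.0)

table = [
    {"x": [0, 0], "to": [0.01870587, 0.96812508]},
    {"x": [0, 36], "to": [0.00242101, 0.95369720]},
]
prediction = kernel_smoother([0, 0], table, kernel)
```

Any other two-argument kernel could be passed in the same way, which mirrors the slide's point that the kernel is just another function handed to the model.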
  • 23. Case Study 4: AnalyticOps for the Genomic Data Commons* Genomics dataset: 2.5+ PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites. 2.5+ PB of cancer genomics data + Bionimbus data commons technology running multiple community developed variant calling pipelines. Over 12,000 cores and 10 PB of raw storage in 18+ racks running for months. Source: Based in part on “The Genomic Data Commons”, the GDC team, ms in preparation.
  • 24. DevOps • Virtualization and the requirement for massive scale out spawned infrastructure automation (“infrastructure as code”). • The requirement to reduce the time to deploy code created tools for continuous integration and testing.
  • 25. ModelDev AnalyticOps • Use virtualization / containers, infrastructure automation and scale out to support large scale analytics. • Requirement: reduce the time and cost to do high quality analytics over large amounts of data.
  • 26. DevOps (diagram): Software Development, Quality Assurance, Operations. The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.* *Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
  • 27. AnalyticOps (diagram): Analytic Workflows, Quality Assurance, Analytic Operations. The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably. • Software • Model • Data. Building the right analytic model. Is the analytic model running right? Source: Robert L. Grossman, A Quick Introduction to AnalyticOps.
  • 28. Ten Factors Affecting AnalyticOps • Model quality (confusion matrix) • Data quality (six dimensions) • Lack of ground truth • Software errors • Monitoring system • Quality of workflow and scheduling • Bottlenecks, stragglers, hot spots, etc. • Analytic configuration problems • System failures • Human errors. Source: Based in part on “The Genomic Data Commons”, the GDC team, ms in preparation.
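Slide 28 names the confusion matrix as the measure of model quality. As a minimal sketch of what monitoring that factor could involve, the Python below counts the four cells for binary labels and derives precision and recall. The `confusion_matrix` helper, the 0/1 label encoding and the sample labels are all made up for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for binary labels 0/1."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],   # actual 1, predicted 1
        "fp": counts[(0, 1)],   # actual 0, predicted 1
        "fn": counts[(1, 0)],   # actual 1, predicted 0
        "tn": counts[(0, 0)],   # actual 0, predicted 0
    }


# Made-up labels, purely for illustration.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
```

Computing this requires ground-truth labels, which ties it to the "lack of ground truth" factor on the same slide: when labels arrive late or never, the matrix cannot be filled in.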
  • 29. Summary • Deploying analytic models is a core technical competency. • The Portable Format for Analytics (PFA) is a model interchange format for building analytic models in one environment and deploying them in another one. • PFA is based upon data mining primitives & supports pre-processing, common analytic models, post-processing, & composition of primitives and models. • It is easy to add your own PFA functions and models. • There is a reference implementation & compliance tests. • PFA is being developed by the not-for-profit DMG. • A discipline of AnalyticOps is emerging that supports more complex analytic flows at greater scale.
  • 30. Questions? For more information about PFA, see: dmg.org/pfa. rgrossman.com, @BobGrossman