
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments

A talk that I gave at Strata on AnalyticOps on March 30, 2016.


1. How to Make Analytic Operations Look More Like DevOps: Lessons Learned Moving Machine-Learning Algorithms to Production Environments
   Robert L. Grossman, University of Chicago and Open Data Group
   O'Reilly Strata Conference, March 30, 2016
   rgrossman.com | @bobgrossman
2. Introduction to AnalyticOps
3. DevOps: Software Development, Quality Assurance, Operations
   The goal of DevOps is to establish a culture and an environment where building, testing, releasing, and operating software can happen rapidly, frequently, and more reliably.*
   *Adapted from Wikipedia, en.wikipedia.org/wiki/DevOps.
4. AnalyticOps: Analytic Modeling, Quality Assurance, Analytic Operations
   The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
5. AnalyticOps: Analytic Modeling, Quality Assurance, Analytic Operations
   The goal of AnalyticOps is to establish a culture and an environment where building, validating, deploying, and running analytic models happen rapidly, frequently, and reliably.
   • Software
   • Model
   • Data
6. Analytic strategy and planning / Analytic models & algorithms / Analytic operations / Analytic infrastructure*
   *Source: Robert L. Grossman, The Strategy and Practice of Analytics, O'Reilly, 2016, to appear.
7. A Problem
   There are platforms and tools for managing and processing big data (Hadoop) and for building analytics (SAS, SPSS, R, Statistica, Spark, Skytree, Mahout), but few options for deploying analytics into operations or for embedding analytics into products and services.
   [Diagram: data scientists developing analytic models & algorithms; enterprise IT deploying analytics into products, services and operations; analytic infrastructure; deploying analytics.]
8. More Problems
   [Diagram: data scientists developing analytic models & algorithms; enterprise IT deploying analytics into products, services and operations; analytic infrastructure; deploying analytics; monitoring operational analytics; ETL and datamarts for the modelers.]
9. Case Study 1: Scoring Engines for Critical Systems
10. Life Cycle of a Predictive Model
    Analytic modeling: select the analytic problem & approach; get and clean the data; exploratory data analysis; build the model in the dev/modeling environment.
    Analytic operations: deploy the model in operational systems with a scoring application; scale up the deployment; monitor performance and employ a champion-challenger methodology to develop an improved model; retire the model and deploy the improved model; feed performance data back to modeling.
11. The same life cycle, with the analytic modeling phase labeled ModelDev and the analytic operations phase labeled AnalyticOps.
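The champion-challenger step in the life cycle above can be made concrete with a small routing sketch, shown below in Python (not from the talk; the champion and challenger scoring functions and the log are hypothetical placeholders). A fraction of live traffic is also scored by the challenger, and both scores are recorded so the two models can be compared on the same events.

    import random

    CHALLENGER_FRACTION = 0.1  # route 10% of traffic to the challenger

    def score_event(event, champion, challenger, log):
        """Champion-challenger routing: the champion's score drives the action,
        while a sample of events is also scored by the challenger so the two
        models can be compared on live data."""
        champion_score = champion(event)          # hypothetical scoring function
        if random.random() < CHALLENGER_FRACTION:
            challenger_score = challenger(event)  # hypothetical scoring function
            log.append({"event": event,
                        "champion": champion_score,
                        "challenger": challenger_score})
        return champion_score  # only the champion's score is acted on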
12. Differences Between the Modeling and Deployment Environments
    • Typically modelers use specialized languages such as SAS, SPSS or R.
    • Usually, developers responsible for products and services use languages such as Java, JavaScript, Python, C++, etc.
    • This can result in significant effort moving the model from the modeling environment to the deployment environment.
13. Ways to Deploy Models into Products/Services/Operations
    • Export and import tables of scores.
    • Export and import tables of parameters.
    • Have the product/service interact with the model as a web or message service.
    • Import the models into a database.
    • Embed the model into a product or service.
    • Push code.
    How quickly can the model be updated?
    • Model parameters?
    • New features?
    • New pre- & post-processing?
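One of the options listed above, interacting with the model as a web service, can be sketched in a few lines of Python (not from the talk). The sketch assumes Flask is installed and uses a hypothetical predict() function as a stand-in for the deployed model.

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def predict(features):
        # Hypothetical stand-in for the deployed model.
        return sum(features)

    @app.route("/score", methods=["POST"])
    def score():
        """Accept a JSON payload of features and return the model's score."""
        features = request.get_json()["features"]
        return jsonify({"score": predict(features)})

    if __name__ == "__main__":
        app.run(port=8080)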
14. What is a Scoring Engine?
    • A scoring engine is a component, integrated into products or enterprise IT, that deploys analytic models in operational workflows for products and services.
    • A model interchange format is a format that supports the export of a model by one application and the import of that model by another application.
    • Model interchange formats include the Predictive Model Markup Language (PMML), the Portable Format for Analytics (PFA), and various in-house or custom formats.
    • Scoring engines are integrated once, but allow applications to update models as quickly as reading a model interchange format file.
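The last point, that a scoring engine is integrated once and then picks up new models as quickly as it can read a model interchange file, is illustrated by the hedged Python sketch below (not from the talk). The load_model callable is a hypothetical parser for a PFA, PMML, or custom file; the engine reloads the model whenever the file changes on disk.

    import os

    class ScoringEngine:
        """Reload the model whenever the interchange file changes on disk,
        so operations can swap models without redeploying code."""

        def __init__(self, model_path, load_model):
            self.model_path = model_path
            self.load_model = load_model  # hypothetical PFA/PMML/custom parser
            self.mtime = None
            self.model = None
            self._maybe_reload()

        def _maybe_reload(self):
            mtime = os.path.getmtime(self.model_path)
            if mtime != self.mtime:
                self.model = self.load_model(self.model_path)
                self.mtime = mtime

        def score(self, record):
            self._maybe_reload()
            return self.model(record)  # assumes the loaded model is callable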
15. Deploying Analytic Models
    [Diagram: a model producer in the analytic algorithms & models layer exports a model in PMML or PFA; a model consumer in analytic operations imports it; both sit on top of the analytic infrastructure.]
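As a concrete example of the model-producer side, the sketch below exports a scikit-learn model to PMML. It is not from the talk and assumes the sklearn2pmml package (which relies on the JPMML converter and a Java runtime) is installed; any exporter that writes a model interchange format would play the same role.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = load_iris(return_X_y=True)

    # The model producer: fit in the modeling environment...
    pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier(max_depth=3))])
    pipeline.fit(X, y)

    # ...and export to a model interchange format for the model consumer.
    sklearn2pmml(pipeline, "iris_tree.pmml")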
16. Case Study 2: Scaling Bioinformatics Pipelines for the Genomic Data Commons*
    *This case study describes work by the NCI Genomic Data Commons Project and the University of Chicago Center for Data Intensive Science.
17. AnalyticOps for the Genomic Data Commons
    • TCGA dataset: 1.54 PB consisting of 577,878 files about 14,052 cases (patients), in 42 cancer types, across 29 primary sites.
    • 2.5+ PB of cancer genomics data.
    • Bionimbus data commons technology running multiple community-developed variant calling pipelines: over 12,000 cores and 10 PB of raw storage in 18+ racks running for months.
18. DevOps
    • Virtualization and the requirement for massive scale-out spawned infrastructure automation ("infrastructure as code").
    • The requirement to reduce the time to deploy code created tools for continuous integration and testing.
19. ModelDev / AnalyticOps
    • Use virtualization/containers, infrastructure automation, and scale-out to support large-scale analytics.
    • Requirement: reduce the time and cost to do high-quality analytics over large amounts of data.
20. Genomic Data Commons (GDC) Files Vary Over 9 Orders of Magnitude in Size
21. GDC Pipelines Are Complex and Are Mostly Written by Others
22. Computations for a Single Genome Can Take Over a Week
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
23. System Loads Vary Significantly
24. Ten Factors Affecting AnalyticOps
    • Model quality (confusion matrix)
    • Data quality (six dimensions)
    • Lack of ground truth
    • Software errors
    • Workflow with monitoring
    • Scheduling
    • Bottlenecks, stragglers, hot spots, etc.
    • Analytic configuration problems*
    • System failures
    • Human errors
    *DMS = data-model-system
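The first factor, model quality summarized by a confusion matrix, is straightforward to monitor continuously. The minimal Python sketch below (not from the talk) tallies a binary confusion matrix from streams of true and predicted labels, the kind of counts an AnalyticOps dashboard would track over time.

    from collections import Counter

    def confusion_matrix(y_true, y_pred):
        """Tally TP/FP/FN/TN for a binary classifier; these counts feed
        dashboard metrics such as precision and recall."""
        counts = Counter(zip(y_true, y_pred))
        tp, fp = counts[(1, 1)], counts[(0, 1)]
        fn, tn = counts[(1, 0)], counts[(0, 0)]
        return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
                "precision": tp / (tp + fp) if tp + fp else 0.0,
                "recall": tp / (tp + fn) if tp + fn else 0.0}

    print(confusion_matrix([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))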
25. Monitor Data Quality and Model Performance and Summarize With Dashboards
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
26. AnalyticOps Dashboard
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
27. Data Quality: Batch Effects Can Be Significant
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
28. Model Quality: Differences in Three Somatic Mutation Detection Algorithms
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
29. Often Software Must Be Written so that It Can Be Run Efficiently in Automated Environments
    • Generally, community software in bioinformatics is designed to be run manually over local clusters.
    • Example: we patched one piece of software over 400 times so that it could run over 12,000 genomes. Although only 3.3% of genomes had problems, handling them required significant manual effort.
    • AnalyticOps requires operating the software in automated environments.
30. Decide What Not to Compute
    [Histogram: VarScan processing rate (GB/hour) vs. frequency.]
    Manage these cases carefully.
31. Model Expected Performance
    [Plot: processing time vs. tumor BAM size (GB).]
    Source: University of Chicago Center for Data Intensive Science Bioinformatics Group.
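A simple way to model expected performance, in the spirit of the plot above, is to fit a regression of processing time against input size and flag runs that fall far from the prediction. The sketch below (not from the talk) uses made-up numbers rather than GDC data and assumes numpy is available.

    import numpy as np

    # Hypothetical historical runs: (tumor BAM size in GB, processing time in hours).
    sizes = np.array([5, 20, 80, 150, 300], dtype=float)
    hours = np.array([2, 7, 30, 55, 120], dtype=float)

    # Fit a linear model of processing time vs. input size.
    slope, intercept = np.polyfit(sizes, hours, deg=1)

    def expected_hours(size_gb):
        return slope * size_gb + intercept

    def is_straggler(size_gb, actual_hours, tolerance=2.0):
        """Flag runs that take more than `tolerance` times the expected time."""
        return actual_hours > tolerance * expected_hours(size_gb)

    print(expected_hours(100.0), is_straggler(100.0, 90.0))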
32. Case Study 3: Deploying Gaussian Process Models to the Industrial Internet*
    *Thanks to the DMG PMML and PFA Working Groups.
33. Portable Format for Analytics (PFA) Standard
    www.dmg.org
34. PFA is Based Upon Defining Primitives for Analytic Models
    • What would a standard look like that...
      – Defines primitives for data transformations, data aggregations, and statistical and analytic models.
      – Supports composition of data mining primitives (which makes it easy to specify machine learning algorithms and pre-/post-processing of data).
      – Is extensible.
      – Is "safe" to deploy in enterprise IT operational environments.
    • This philosophy is different from, and complementary to, that of the Predictive Model Markup Language (PMML).
35. Benefits of PFA
    • PFA is based upon JSON and Avro and integrates easily into modern big data environments.
    • PFA allows models to be easily chained and composed.
    • PFA allows developers and users of analytic systems to pre-process the inputs and post-process the outputs of models.
    • PFA is easily integrated with Storm, Akka and other streaming environments.
    • PFA can be used to integrate multiple tools and applications within an analytic ecosystem.
36. Gaussian Process Model
37. Example of a PFA Model
        input: {type: array, items: double}
        output: {type: array, items: double}
        cells:
          table:
            type: {type: array, items: {type: record, name: GP, fields: [
              - {name: x, type: {type: array, items: double}}
              - {name: to, type: {type: array, items: double}}
              - {name: sigma, type: {type: array, items: double}}]}}
            init:
              - {x: [  0,   0], to: [0.01870587, 0.96812508], sigma: [0.2, 0.2]}
              - {x: [  0,  36], to: [0.00242101, 0.95369720], sigma: [0.2, 0.2]}
              - {x: [  0,  72], to: [0.13131668, 0.53822666], sigma: [0.2, 0.2]}
              ...
              - {x: [324, 324], to: [-0.6815587, 0.82271760], sigma: [0.2, 0.2]}
        action:
          model.reg.gaussianProcess:
            - input
            - {cell: table}
            - null
            - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
    The input and output of the scoring engine are expressed as Avro schemas.
38. Example of a PFA Model (continued)
    [Same PFA document as the previous slide.] The cell's type (also an Avro schema) and its value (given as JSON, truncated here) hold the Gaussian Process model parameters.
39. Example of a PFA Model (continued)
    [Same PFA document as the previous slides.] The calling method's parameters are expressed as JSON:
    • input: get the interpolation point from the input.
    • {cell: table}: get the parameters from the table.
    • null: no explicit Kriging weight (universal).
    • {fcn: ...}: the kernel function.
40. Example of a PFA Model (continued)
        model.reg.gaussianProcess:
          - input
          - {cell: table}
          - null
          - {fcn: m.kernel.rbf, fill: {gamma: 2.0}}
    • Appears declarative, but this is a function call.
      – The fourth parameter is another function: m.kernel.rbf (radial basis kernel, a.k.a. squared exponential).
      – m.kernel.rbf was intended for SVMs, but is reusable anywhere.
      – One argument (gamma) is pre-applied so that it fits the signature for model.reg.gaussianProcess.
    • Any kernel function could be used, including user-defined functions written with PFA "code."
    • The Gaussian Process could be used anywhere, even as a pre-processing or post-processing step.
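To make the deployment side concrete, the sketch below scores a tiny PFA document with Titus, the Python reference implementation of PFA from the Hadrian project. It is not from the slides; it assumes Titus is installed (originally Python 2, with later community ports) and uses a trivial add-ten model rather than the Gaussian Process example above.

    from titus.genpy import PFAEngine

    # A trivial PFA document (not the GP model above): add 10 to each input.
    pfa_yaml = """
    input: double
    output: double
    action:
      - {+: [input, 10]}
    """

    # fromYaml returns a list of engine instances; take the first.
    engine, = PFAEngine.fromYaml(pfa_yaml)
    print(engine.action(3.0))  # expected output: 13.0

The same engine object could be wrapped behind the web service or hot-reloading scoring engine sketched earlier, which is the sense in which a scoring engine is integrated once and models are updated by shipping new PFA files.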
41. Summary
42. Ten AnalyticOps Rules
    1. Team a modeler, a software engineer, and a systems engineer.
    2. Instrument and monitor analytics, software and systems, and populate an AnalyticOps dashboard.
    3. Use an automated testing and deployment environment to improve model quality.
    4. Use scoring engines with languages such as PFA & PMML.
    5. Put in place a data quality program.
    6. For complex workloads, use workflows and schedulers (even if you think you don't need them initially) and model the scale-up.
    7. Optimize the end-to-end performance of AnalyticOps, not individual analytics.
    8. Distinguish scores from actions.
    9. Identify and eliminate performance hot spots, system stragglers, etc.
    10. Invest in root cause analysis of AnalyticOps problems.
43. Questions?
    rgrossman.com | @bobgrossman
