
Best practices for building and deploying predictive models over big data presentation


Published on

O'Reilly Strata Conference

Published in: Technology

  1. Best Practices for Building and Deploying Predictive Models Over Big Data. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  2. The tutorial is divided into 12 modules:
      1.  Introduction
      2.  Building Predictive Models – EDA, Features
      3.  Case Study: MalStone
      4.  Deploying Predictive Models
      5.  Working with Multiple Models
      6.  Case Study: CTA
      7.  Building Models over Hadoop Using R
      8.  Case Study: Building Trees over Big Data
      9.  Analytic Operations – SAMS
      10. Case Study: AdReady
      11. Building Predictive Models – Life of a Model
      12. Case Study: Matsu
      You can download all the modules and related materials from
  3. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 1: Introduction. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  4. What Is Small Data?
      • 100 million movie ratings
      • 480 thousand customers
      • 17,000 movies
      • From 1998 to 2005
      • Less than 2 GB of data
      • Fits into memory, but very sophisticated models are required to win.
  5. What is Big Data?
      • The leaders in big data measure data in megawatts.
        – As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
        – Facebook's new Prineville data center is 30 MW.
      • At the other end of the spectrum, databases can manage the data required for many projects.
  6. An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) allows you to compute over more data in the same time.
  7. What is Predictive Analytics?
      Short definition: using data to make decisions.
      Longer definition: using data to take actions and make decisions using models that are statistically valid and empirically derived.
  8. Computationally Intensive Statistics (1984) → Data Mining & KDD (1993) → Predictive Analytics (2004).
  9. Key Modeling Activities
      § Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
      § Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
      § Building models – goal is to derive statistically valid models from a learning set
      § Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another
  10. The Analytic Diamond: analytic strategy & governance; analytic algorithms & models; analytic operations, security & compliance; analytic infrastructure.
  11. IT Organization: analytic infrastructure.
  12. What Is Analytic Infrastructure?
      Analytic infrastructure refers to the software components, software services, applications and platforms for managing data, processing data, producing analytic models, and using analytic models to generate alerts, take actions and make decisions.
  13. Modeling Group: builds models (analytic models & algorithms).
  14. Building models: Data → Model Producer (CART, SVM, k-means, etc.) → PMML
  15. Analytic Operations (operations, product): deployed models.
  16. A deployed model (SAM): Models → Scores → Measures, Dashboards, Actions.
  17. End-to-end flow: Analytic Infrastructure (operational data → data warehouse → ETL) → Analytic Modeling (cleaning data & exploratory data analysis → analytic datamart → build features → candidate model → validated model) → Analytic Operations (deployment → scores → measures → actions).
  18. Life Cycle of a Predictive Model:
      • Exploratory data analysis (R)
      • Process the data (Hadoop)
      • Build the model in a dev/modeling environment
      • Export as PMML
      • Deploy the model in operational systems with a scoring application
      • Refine the model using log files (Hadoop, streams & Augustus)
      • Retire the model and deploy an improved model
  19. Key Abstractions (columns: Credit Card Fraud / Online Advertising / Change Detection)
      • Events: credit card transactions / impressions and clicks / status updates
      • Entities: credit card accounts / cookies, user IDs / computers, devices, etc.
      • Features (examples): # transactions past n minutes / # impressions per category past n minutes, RFM / # events in a category in the previous time period, etc.
      • Models: CART / clusters, trees, recommendation, etc. / baseline models
      • Scores: likelihood of a fraudulent account / likelihood of clicking / likelihood that a change has taken place
  20. Event-Based Updating of Feature Vectors (FVs). Event loop:
      1. Take the next event
      2. Update one or more associated FVs
      3. Rescore the updated FVs to compute alerts
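The event loop on this slide can be sketched in Python. The event fields (entity, amount), the feature-vector contents, and the mean-based scoring rule below are illustrative stand-ins, not the tutorial's models.

```python
from collections import defaultdict

def make_feature_vector():
    # FV tracks a running count and sum for one entity (a stand-in FV)
    return {"count": 0, "total": 0.0}

def update_fv(fv, event):
    # step 2: fold the event into the entity's feature vector
    fv["count"] += 1
    fv["total"] += event["amount"]

def score_fv(fv):
    # step 3: rescore the updated FV (stand-in score: mean amount)
    return fv["total"] / fv["count"]

def event_loop(events, threshold):
    fvs = defaultdict(make_feature_vector)
    alerts = []
    for event in events:                 # step 1: take the next event
        fv = fvs[event["entity"]]
        update_fv(fv, event)             # step 2: update the associated FV
        if score_fv(fv) > threshold:     # step 3: rescore and alert
            alerts.append(event["entity"])
    return fvs, alerts
```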
  21. Example Architecture
      • Data lives in the Hadoop Distributed File System (HDFS) or other distributed file systems.
      • Hadoop streams and MapReduce are used to process the data.
      • R is used for exploratory data analysis (EDA), and models are exported as PMML.
      • PMML models are imported, and Hadoop, Python streams and Augustus are used to score data in production.
  22. Questions? For the most current version of these slides, please see
  23. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 2: Building Models – Data Exploration and Feature Generation. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  24. Key Modeling Activities
      § Exploratory data analysis – goal is to understand the data so that features can be generated and models evaluated
      § Generating features – goal is to shape events into meaningful features that enable meaningful prediction of the dependent variable
      § Building models – goal is to derive statistically valid models from a learning set
      § Evaluating models – goal is to develop meaningful reports so that the performance of one model can be compared to another
  25. Top algorithms by year:
      ?      Naïve Bayes
      ?      K nearest neighbor
      1957   K-means
      1977   EM algorithm
      1984   CART
      1993   C4.5
      1994   A Priori
      1995   SVM
      1997   AdaBoost
      1998   PageRank
      Beware of any vendor whose unique value proposition includes some "secret analytic sauce."
      Source: X. Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems, Volume 14, 2008, pages 1-37.
  26. Pessimistic View: We Get a Significantly New Algorithm Every Decade or So
      1970s: neural networks
      1980s: classification and regression trees
      1990s: support vector machines
      2000s: graph algorithms
  27. In general, understanding the data through exploratory analysis and generating good features is much more important than the type of predictive model you use.
  28. Questions About Datasets
      • Number of records
      • Number of data fields
      • Size – does it fit into memory? Does it fit on a single disk or disk array?
      • How many missing values?
      • How many duplicates?
      • Are there labels (for the dependent variable)?
      • Are there keys?
  29. Questions About Data Fields
      • Individual data fields – what is the mean, variance, …? What is the distribution? Plot the histograms. What are the extreme values, missing values, unusual values?
      • Pairs of data fields – what is the correlation? Plot the scatter plots.
      • Visualize the data – histograms, scatter plots, etc.
  31. Computing Count Distributions

      def valueDistrib(b, field, select=None):
          # given a dataframe b and an attribute field,
          # returns the class distribution of the field
          ccdict = {}
          if select:
              for i in select:
                  count = ccdict.get(b[i][field], 0)
                  ccdict[b[i][field]] = count + 1
          else:
              for i in range(len(b)):
                  count = ccdict.get(b[i][field], 0)
                  ccdict[b[i][field]] = count + 1
          return ccdict
  32. Features
      • Normalize to [0, 1] or [-1, 1]
      • Aggregate
      • Discretize or bin
      • Transform continuous to continuous or discrete to discrete
      • Compute ratios
      • Use percentiles
      • Use indicator variables
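A few of these transforms, sketched in Python. Function names and edge handling are mine, not from the tutorial.

```python
def normalize(xs, lo=0.0, hi=1.0):
    # rescale values linearly into [lo, hi]
    mn, mx = min(xs), max(xs)
    return [lo + (hi - lo) * (x - mn) / (mx - mn) for x in xs]

def bin_value(x, edges):
    # discretize: index of the first bin edge that x falls below
    for i, edge in enumerate(edges):
        if x < edge:
            return i
    return len(edges)

def ratio(a, b):
    # compute a ratio feature, guarding against a zero denominator
    return a / b if b else 0.0

def indicator(x, predicate):
    # indicator variable: 1 if the predicate holds, else 0
    return 1 if predicate(x) else 0
```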
  33. Predicting Office Building Renewal Using Text-Based Features (classification term, average R):
      imports 5.17, clothing 4.17, van 3.18, fashion 2.50, investment 2.42, marketing 2.38, oil 2.11, air 2.09, system 2.06, foundation 2.04, technology 2.03, apparel 2.03, law 2.00, casualty 1.91, bank 1.88, technologies 1.84, trading 1.82, associates 1.81, staffing 1.77, securities 1.77
  34. Twitter Intention & Prediction
      • Use machine learning classification techniques: support vector machines, trees, naïve Bayes.
      • Models require clean data and lots of hand-scored examples.
  35. What is the Size of Your Data?
      • Small – fits into memory
      • Medium – too large for memory, but fits into a database
      • Large – too large for a database, but you can use a NoSQL database (HBase, etc.) or a DFS (Hadoop DFS)
  36. If You Can Ask For Something
      • Ask for more data
      • Ask for "orthogonal data"
  37. Questions? For the most current version of these slides, please see
  38. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 3: Case Study – MalStone. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  39. Part 1. Statistics over Log Files
      • Log files are everywhere
      • Advertising systems
      • Network and system logs for intrusion detection
      • Health and status monitoring
  40. What are the Common Elements?
      • Time stamps
      • Sites – e.g. web sites, computers, network devices
      • Entities – e.g. visitors, users, flows
      • Log files fill disks, many, many disks
      • Behavior occurs at all scales
      • Want to identify phenomena at all scales
      • Need to group "similar behavior"
      • Need to do statistics (not just sorting)
  41. Abstract the Problem Using Site-Entity Logs
      • Measuring online advertising: sites = web sites; entities = consumers
      • Drive-by exploits: sites = web sites; entities = computers (identified by cookies or IP)
      • Compromised systems: sites = compromised computers; entities = user accounts
  42. MalStone Schema
      • Event ID
      • Time stamp
      • Site ID
      • Entity ID
      • Mark (categorical variable)
      • Each record fits into 100 bytes
  43. Toy Example (map/shuffle/reduce): events are collected by device or processor in time order; the map phase keys events by site; the reduce phase computes, for each site, counts and ratios of events by type.
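The toy example can be mimicked in a single process. The driver below simulates map, shuffle and reduce in plain Python; event tuples of (site id, event type) are an assumed encoding for illustration.

```python
from collections import defaultdict

def map_event(event):
    # map phase: key each event by its site
    site_id, event_type = event
    return site_id, event_type

def reduce_site(event_types):
    # reduce phase: counts and ratios of events by type for one site
    counts = defaultdict(int)
    for t in event_types:
        counts[t] += 1
    total = len(event_types)
    return {t: (n, n / total) for t, n in counts.items()}

def run(events):
    shuffled = defaultdict(list)          # shuffle: group values by key
    for event in events:
        key, value = map_event(event)
        shuffled[key].append(value)
    return {site: reduce_site(vals) for site, vals in shuffled.items()}
```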
  44. Distributions
      • Tens of millions of sites
      • Hundreds of millions of entities
      • Billions of events
      • Most sites have few events; some sites have many events
      • Most entities visit a few sites; some entities visit many sites
  45. MalStone B: sites and entities over the time windows dk-2, dk-1, dk.
  46. The Mark Model
      • Some sites are marked (the percent marked is a parameter, and the type of sites marked is a draw from a distribution)
      • Some entities become marked after visiting a marked site (a draw from a distribution)
      • There is a delay between the visit and when the entity becomes marked (a draw from a distribution)
      • There is a background process that marks some entities independent of visits (this adds noise to the problem)
  47. Exposure window and monitor window over the time periods dk-2, dk-1, dk.
  48. Notation
      • Fix a site s[j]
      • Let A[j] be the entities that transact during ExpWin; if an entity is marked, then the visit occurs before the mark
      • Let B[j] be all entities in A[j] that become marked sometime during MonWin
      • The subsequent proportion of marks is r[j] = |B[j]| / |A[j]|
  49. With an exposure window ExpWin and monitor windows MonWin 1, MonWin 2, …: B[j, t] are the entities that become marked during MonWin[j], and r[j, t] = |B[j, t]| / |A[j]|.
  50. Part 2. MalStone Benchmarks. The MalGen and MalStone implementations are open source.
  51. MalStone Benchmark
      • Benchmark developed by the Open Cloud Consortium for clouds supporting data intensive computing.
      • Code to generate the required synthetic data is available from
      • A stylized analytic computation that is easy to implement in MapReduce and its generalizations.
  52. MalStone A & B Benchmarks (statistic, # records, size):
      • MalStone A-10:   r[j],    10 billion,  1 TB
      • MalStone A-100:  r[j],    100 billion, 10 TB
      • MalStone A-1000: r[j],    1 trillion,  100 TB
      • MalStone B-10:   r[j, t], 10 billion,  1 TB
      • MalStone B-100:  r[j, t], 100 billion, 10 TB
      • MalStone B-1000: r[j, t], 1 trillion,  100 TB
  53. MalStone B results:
      • MalStone B running on 10 billion 100-byte records
      • Hadoop version 0.18.3
      • 20 nodes in the Open Cloud Testbed
      • MapReduce required 799 minutes
      • Hadoop streams required 142 minutes
  54. Importance of Hadoop Streams (MalStone B; 20 nodes; 10 billion records; 1 TB dataset):
      • Hadoop v0.18.3: 799 min
      • Hadoop Streaming v0.18.3: 142 min
      • Sector/Sphere: 44 min
  55. Questions? For the most current version of these slides, please see
  56. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 4: Deploying Analytics. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  57. Deploying analytic models (the Analytic Diamond: analytic strategy & governance; analytic algorithms & models; analytic operations, security & compliance; analytic infrastructure).
  58. Problems with Current Techniques
      • Analytics are built in the modeling group and deployed by IT.
      • The time required to deploy models and to integrate models with other applications can be long.
      • Models are deployed in proprietary formats.
      • Models are application dependent.
      • Models are system dependent.
      • Models are architecture dependent.
  59. Model Independence
      • A model should be independent of the environment:
        – Production / staging / development
        – MapReduce / Elastic MapReduce / standalone
        – Developer's desktop / analyst's laptop
        – Production workflow / historical data store
      • A model should be independent of coding language.
  60. (Very Simplified) Architectural View: Data → Model Producer (CART, SVM, k-means, etc.) → PMML Model
      • The Predictive Model Markup Language (PMML) is an XML language for statistical and data mining models.
      • With PMML, it is easy to move models between applications and platforms.
  61. Predictive Model Markup Language (PMML)
      • Based on XML
      • Benefits of PMML:
        – An open standard for data mining & statistical models
        – Provides independence from application, platform, and operating system
      • PMML producers – some applications create predictive models
      • PMML consumers – other applications read or consume models
  62. PMML
      • PMML is the leading standard for statistical and data mining models and is supported by over 20 vendors and organizations.
      • With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.
      • Applications exist in R, Python, Java, SAS, SPSS, …
  63. Vendor Support: SPSS, SAS, IBM, MicroStrategy, Salford, Tibco, Zementis, and other vendors; R, Augustus, KNIME, and other open source applications.
  64. Philosophy
      • It is very important to understand what PMML is not concerned with.
      • PMML is a specification of a model, not an implementation of a model.
      • PMML provides a simple means of binding parameters to values for an agreed-upon set of data mining models & transformations.
      • PMML also includes the metadata required to deploy models.
  65. PMML Document Components
      • Data dictionary
      • Mining schema
      • Transformation dictionary
      • Multiple models, including segments and ensembles
      • Model verification, …
      • Univariate statistics (ModelStats)
      • Optional extensions
  66. PMML Models: trees, associations, neural nets, naïve Bayes, sequences, text models, support vector machines, rulesets, polynomial regression, logistic regression, general regression, center-based clusters, density-based clusters.
  67. PMML Data Flow
      • The data dictionary defines the data
      • The mining schema defines the specific inputs (MiningFields) required by the model
      • The transformation dictionary defines optional additional derived fields
      • Attributes of feature vectors are of two types:
        – attributes defined by the mining schema
        – derived attributes defined via transformations
      • Models themselves can also support certain transformations
  68. (Simplified) Architectural View: Data → Data Preprocessing (features) → Model Producer (algorithms to estimate models) → PMML Model
      • PMML also supports XML elements to describe data preprocessing.
  69. Three Important Interfaces
      • Modeling environment: data → data preprocessing → model producer → PMML model
      • Deployment environment: PMML model → model consumer → post-processing; data in, scores and actions out
      • The three interfaces: (1) data, (2) the PMML model, (3) scores and actions
  70. Single-Model PMML. Order of evaluation: DD → MS → TD → LT → Model (data dictionary, mining schema, transformation dictionary, local transformations). Data fields map 1-to-1 to MiningFields; derived fields (DF), targets, outputs (OF), and model verification (MV) complete the model element. This processing of elements applies to PMML-supported models such as trees, regression, naïve Bayes, baselines, etc.
  71. Augustus PMML Producer: various data formats (including csv files, http streams, and databases) feed the Model Producer, which uses a producer schema (part of PMML) and a producer configuration file to emit a PMML model. Augustus:
  72. Augustus PMML Consumer: a PMML model and a consumer configuration file feed the Model Consumer, which scores various data formats (including csv files, http streams, and databases); post-processing turns scores into actions, alerts, reports, etc. Augustus:
  73. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 5: Multiple Models. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  74. 1. Trees. For CART trees: L. Breiman, J. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, 1984, Chapman & Hall.
  75. Classification Trees
      Petal Len.  Petal Width  Sepal Len.  Sepal Width  Species
      02          14           33          50           A
      24          56           31          67           C
      23          51           31          69           C
      13          45           28          57           B
      • We want a function Y = g(X) that predicts the species Y using one or more of the variables X[1], …, X[4].
      • Assume each row is classified A, B, or C.
  76. Trees Partition the Feature Space
      • Petal Length < 2.45 → class A
      • Otherwise, Petal Width ≥ 1.75 → class C
      • Otherwise, Petal Length < 4.95 → class B; else class C
      • Trees partition the feature space into regions by asking whether an attribute is less than a threshold.
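The partition on this slide can be written directly as nested threshold tests; the thresholds (2.45, 1.75, 4.95) and classes (A, B, C) are the ones on the slide.

```python
def classify(petal_len, petal_width):
    # each test splits the feature space at a threshold, as in the slide
    if petal_len < 2.45:
        return "A"
    if petal_width >= 1.75:
        return "C"
    if petal_len < 4.95:
        return "B"
    return "C"
```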
  77. 2. Multiple Models: Ensembles
  78. Key Idea: Combine Weak Learners. Average several weak models (Model 1, Model 2, Model 3), rather than build one complex model.
  79. Combining Weak Learners (accuracy of 1 classifier vs. a majority vote of 3 and 5 independent classifiers):
      • 55% → 57.40% (3 classifiers) → 59.30% (5 classifiers)
      • 60% → 64.0% → 68.20%
      • 65% → 71.00% → 76.50%
      (The binomial coefficients of Pascal's triangle govern the vote counts.)
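The figures above come from the binomial distribution: if each of n independent classifiers is correct with probability p, a majority vote is correct with probability Σ over k > n/2 of C(n, k) p^k (1−p)^(n−k). A quick check in Python; note a few printed table entries differ slightly from the exact values.

```python
from math import comb

def majority_vote_accuracy(p, n):
    # probability that more than half of n independent classifiers,
    # each correct with probability p, vote for the right answer
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))
```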
  80. Ensembles
      • Ensembles are collections of classifiers with a rule for combining the results.
      • The simplest combining rules: use the average for regression trees; use majority voting for classification.
      • Ensembles are unreasonably effective at building models.
  81. Building ensembles over clusters of commodity workstations has been used since the 1990s.
  82. Ensemble Models
      1. Partition the data and scatter the chunks
      2. Build a model per chunk (e.g. a tree-based model)
      3. Gather the models
      4. Form the collection of models into an ensemble (e.g. majority vote for classification, model averaging for regression)
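A toy walk-through of the four steps, with the simplest possible base "model" (a constant predictor fit as the partition mean) standing in for the tree learner, and averaging as the combining rule. All names here are mine.

```python
from statistics import mean

def partition(data, k):
    # 1. partition the data and scatter the chunks (round-robin here)
    return [data[i::k] for i in range(k)]

def build_model(chunk):
    # 2. build a model per chunk (stand-in: predict the chunk mean)
    m = mean(chunk)
    return lambda x: m

def form_ensemble(models):
    # 3-4. gather the models and combine by averaging (regression rule)
    return lambda x: mean(m(x) for m in models)

data = [1.0, 2.0, 3.0, 4.0]
models = [build_model(c) for c in partition(data, 2)]
predict = form_ensemble(models)
```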
  83. Example: Fraud
      • Ensembles of trees have proved remarkably effective in detecting fraud
      • Trees can be very large
      • Very fast to score
      • Lots of variants: random forests, random trees
      • Sometimes you need to reverse engineer reason codes
  84. 3. CUSUM
      • f0(x) = baseline density; f1(x) = observed density
      • g(x) = log(f1(x)/f0(x))
      • Z0 = 0; Zn+1 = max{Zn + g(x), 0}
      • Alert if Zn+1 > threshold
      • Assume that we have a mean and standard deviation for both distributions.
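The CUSUM recursion above is short enough to sketch directly. The Gaussian choice for f0 and f1 and the sample data in the usage note are illustrative assumptions, not the slide's application.

```python
from math import exp, log, pi, sqrt

def gaussian(mu, sigma):
    # density of N(mu, sigma^2); an illustrative choice for f0 and f1
    return lambda x: exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def cusum(xs, f0, f1, threshold):
    # Z0 = 0; Z_{n+1} = max(Z_n + g(x), 0) with g(x) = log(f1(x)/f0(x))
    z, alerts = 0.0, []
    for n, x in enumerate(xs):
        z = max(z + log(f1(x) / f0(x)), 0.0)
        if z > threshold:        # alert if Z_{n+1} > threshold
            alerts.append(n)
    return alerts
```

For example, with baseline N(0, 1) and observed N(3, 1), a run of values near 3 drives Z over the threshold and triggers alerts.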
  85. 4. Cubes of Models
  86. Example 1: Cubes of Models for Change Detection in a Payments System
      • Build a separate segmented model for every entity of interest
      • Divide & conquer (segment) the data using multidimensional data cubes (e.g. geospatial region × type of transaction × entity (bank, etc.))
      • For each distinct cube, establish separate baselines for each quantity of interest
      • Detect changes from the baselines
  87. Example 2: Cubes of Models for Detecting Anomalies in Traffic
      • Highway traffic data: each day (7) × each hour (24) × each sensor (hundreds) × each weather condition (5) × each special event (dozens)
      • 50,000 baseline models are used in the current testbed
  88. 5. Multiple Models in Hadoop: Trees in the Mapper
      • Build lots of small trees (over historical data in Hadoop)
      • Load all of them into the mapper
      • Mapper: for each event, score against each tree; emit key = event id, value = (tree id, score)
      • Reducer: aggregate the scores for each event and select a "winner" for each event
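An in-process sketch of this pattern: every mapper holds all the trees, emits (event id, (tree id, score)) pairs, and the reducer picks a winner per event. The trees are stubbed as plain scoring functions, and the winner rule (highest score) is an assumption for illustration.

```python
from collections import defaultdict

def mapper(event, trees):
    # score each event against every loaded tree
    event_id, x = event
    for tree_id, tree in trees.items():
        yield event_id, (tree_id, tree(x))

def reducer(event_id, scored):
    # aggregate: the winner is the tree with the highest score for this event
    tree_id, score = max(scored, key=lambda pair: pair[1])
    return event_id, tree_id

def run(events, trees):
    shuffled = defaultdict(list)          # shuffle by event id
    for event in events:
        for key, value in mapper(event, trees):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in shuffled.items())
```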
  89. Questions? For the most current version of these slides, please see
  90. Best Practices for Building and Deploying Predictive Models Over Big Data. Module 6: Case Study – CTA. Robert Grossman, Open Data Group & Univ. of Chicago; Collin Bennett, Open Data Group. October 23, 2012
  91. Chicago Transit Authority
      • The Chicago Transit Authority (CTA) provides service to Chicago and the surrounding suburbs
      • CTA operates 24 hours each day and on an average weekday provides 1.7 million rides on buses and trains
      • It has 1,800 buses operating over 140 routes, along 2,230 route miles
      • Buses provide approximately 1 million rides a day and serve more than 12,000 posted bus stops
  92. CTA Data
      • The City of Chicago has a public data portal: https://
      • Some of the available categories are: Administration & Finance; Community & Economic Development; Education; Facilities & Geographic Boundaries; Transportation
  93. Moving to Hadoop
      • As an example, we model ridership by bus route. Data set: CTA – Ridership – Bus Routes – Daily Totals by Route
      • MapReduce:
        – Map input: route number, day type, number of riders
        – Map output / reduce input: key = route; value = day, date, riders (tab-separated key and value)
        – Reduce output: a model for each route
  94. Example Data Set
      • Data goes back to 2001: route number, date, day-of-week type, number of riders
      • Demo size: 11 MB, 534,208 records
      • Saved as a csv file in HDFS
  95–97. Workflow (diagrams)
  98. Mapper
      • Using Hadoop's streaming interface, we can use a language appropriate for each step
      • The mapper needs to read the records quickly
      • We use Python
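A minimal streaming mapper along these lines might look as follows. The csv column order (route, date, day type, rides) is assumed from the dataset description, and the function names are mine.

```python
import sys

def map_line(line):
    # emit "route \t daytype,date,rides" for one csv record
    route, date, daytype, rides = line.strip().split(",")
    return "%s\t%s,%s,%s" % (route, daytype, date, rides)

def run_mapper(stream=sys.stdin, out=sys.stdout):
    # Hadoop streaming feeds the csv on stdin, one record per line
    for line in stream:
        if line.strip():
            out.write(map_line(line) + "\n")
```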
  99. Reducer
      • For this example, we calculate a simple statistic: a Gaussian distribution of riders for (route, day-of-week) pairs, described by the mean and variance
      • We use the Baseline Model in PMML to describe this
      • The Baseline Model is often useful for change detection: do the records matching a segment fit the expected distribution describing the segment?
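The reducer's statistic can be sketched like this, assuming the tab-separated mapper output described above ("route \t daytype,date,rides", sorted by key); the grouping helper and names are mine.

```python
from itertools import groupby

def mean_variance(values):
    # sample mean and (N-1)-corrected variance for one segment
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
    return m, var

def reduce_lines(lines):
    # group lines by (route, day-of-week) key, then compute the statistic
    def key(line):
        route, rest = line.split("\t")
        return route, rest.split(",")[0]
    out = {}
    for k, group in groupby(sorted(lines, key=key), key=key):
        rides = [float(line.rsplit(",", 1)[1]) for line in group]
        out[k] = mean_variance(rides)
    return out
```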
  100. Describing a Segment
      • Baseline Model

      <BaselineModel functionName="regression">
        <MiningSchema>
          <MiningField usageType="active" name="rides" />
        </MiningSchema>
        <TestDistributions field="rides" testStatistic="zValue">
          <Baseline>
            <GaussianDistribution variance="64308.1625973" mean="1083.88950936">
            </GaussianDistribution>
          </Baseline>
        </TestDistributions>
      </BaselineModel>
  101. Building PMML in the Reducer
      • You can embed a PMML engine in the reducer if you are using Java, R, or Python
      • We present a light-weight solution which builds valid models, but also allows you to construct invalid ones
  102. Using a Valid Model Template
  103. Template in R: similar, but a little less elegant, in R.
  104. Populating the Template (Python): this is done in an XML / XPath way, rather than as PMML.
  105. Populating the Template (R)
  106. Reducer Trade-offs
      • The standard trade-offs apply: holding events in the reducer (easier, but uses RAM) vs. computing running statistics (more coding, but uses CPU)
      • We present two techniques using R:
        – Holding all events for a given key in a dataframe and calculating the statistic when a key transition is identified
        – Holding the minimum data needed to calculate running statistics, computed as each event is seen
  107. Two Approaches
      • Holding events requires extra memory and may get slow as the number of events shuffled to the reducer for a given key grows large
        – But there are a lot of statistical and data mining methods available in R that operate on a dataframe
      • Tracking running statistics requires more coding
        – But it may be the only way to process a batch without spilling to disk
  108. Dataframe
      • Accumulate each mapped record: for each event, add it to a temporary table

      doAny <- function(v, date, rides) {
          # add a row to the temporary table
          v[["rows"]][[length(v[["rows"]]) + 1]] <- list(date, rides)
      }

      • Create the dataframe when we have them all
  109. Dataframe continued
      • When you have seen all of the events, process them and compute the statistic

      doLast <- function(v) {
          # make a data frame out of the temporary table
          dataframe <-"rbind", v[["rows"]])
          colnames(dataframe) <- c("date", "rides")
          # pull a column out and call the function on it
          rides <- unlist(dataframe[, "rides"])
          result <- ewup2(rides, alpha)
          count <- result[["count"]][[length(rides)]]
          mean <- result[["mean"]][[length(rides)]]
          variance <- result[["variance"]][[length(rides)]]
          varn <- result[["varn"]][[length(rides)]]
      }
110. Running Totals
• Update statistics as each event is encountered

  doNonFirst <- function(v, x) {
    # update the counters
    v[["count"]] <- v[["count"]] + 1
    diff <- v[["prevx"]] - v[["mean"]]
    incr <- alpha * diff
    v[["mean"]] <- v[["mean"]] + incr
    v[["varn"]] <- (1.0 - alpha) * (v[["varn"]] + incr * diff)
    v[["prevx"]] <- x
  }
111. Running Totals continued
• Finalize the variance after the last event has been seen

  doLast <- function(v) {
    if (v[["count"]] >= 2) {
      # when N >= 2, correct for 1/(N - 1)
      variance <- v[["varn"]] / (v[["count"]] - 1.0)
    } else {
      variance <- v[["varn"]]
    }
  }
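The same running-totals idea can be sketched in Python using the standard exponentially weighted mean/variance recurrences (note this is a common formulation with our own variable names; the R code above updates from the previously held event via `prevx`, so its numbers can differ slightly):

```python
def ew_stats(xs, alpha):
    """Exponentially weighted running mean and variance over a stream.

    Holds only three numbers, so memory stays constant no matter how
    many events are shuffled to the reducer for a key.
    """
    count, mean, varn = 0, 0.0, 0.0
    for x in xs:
        count += 1
        if count == 1:
            mean = x          # first event initializes the mean
        else:
            diff = x - mean
            incr = alpha * diff
            mean += incr
            varn = (1.0 - alpha) * (varn + diff * incr)
    # as in doLast: correct for 1/(N - 1) when at least two events were seen
    variance = varn / (count - 1.0) if count >= 2 else varn
    return count, mean, variance
```

A constant stream yields zero variance, and a two-event stream lands the mean halfway between the events when `alpha = 0.5`, which is a quick sanity check on the update.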
112. Results
• As new data are seen, models can be updated
• New routes added as new segments
• Model is not bound to Hadoop
[Diagram: a separate model box for each route segment]
113. EDA
• Write plots to HDFS as SVG files (text, XML)
• Graphical view of the reduce calculation or statistic

  <?xml version="1.0" standalone="no"?>
  <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
    "">
  <svg style="stroke-linejoin:miter; stroke:black; stroke-width:2.5;
      text-anchor:middle; fill:none"
    xmlns=""
    font-family="Helvetica, Arial, FreeSans, Sans, sans, sans-serif"
    width="2000px" height="500px" version="1.1"
    xmlns:xlink=""
    viewBox="0 0 2000 500">
  <defs>
    <clipPath id="Legend_0_clip">
      <rect x="1601.2" y="25" width="298.8" height="215" />
    </clipPath>
    <clipPath id="TimeSeries_0_clip">
      <rect x="240" y="25" width="1660" height="435" />
    </clipPath>
    <clipPath id="TimeSeries_10_clip">
      <rect x="240" y="25" width="1660" height="435" />
    </clipPath>
    <clipPath id="TimeSeries_11_clip">
      <rect x="240" y="25" width="1660" height="435" />
    </clipPath>
    …
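Because SVG is plain text, a plot can be generated with nothing but string formatting and then written to HDFS like any other text file (e.g. with `hadoop fs -put`). A minimal sketch, using our own illustrative markup rather than the talk's plotting code:

```python
def sparkline_svg(values, width=2000, height=500):
    """Render a list of numbers as a single-polyline SVG (a tiny EDA plot)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0           # avoid division by zero on flat data
    step = width / float(len(values) - 1)
    # scale each value into the viewBox; SVG's y axis points down
    points = " ".join(
        "%.1f,%.1f" % (i * step, height - (v - lo) / span * height)
        for i, v in enumerate(values)
    )
    return (
        '<svg xmlns="" '
        'width="%dpx" height="%dpx" viewBox="0 0 %d %d">'
        '<polyline fill="none" stroke="black" points="%s"/></svg>'
        % (width, height, width, height, points)
    )

svg = sparkline_svg([3, 1, 4, 1, 5, 9, 2, 6])
```

Since the output is just text, the reducer can emit one SVG per key and inspect the statistics visually without pulling raw data off the cluster.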
114. EDA
[Screenshot: plot rendered from the SVG output]
115. Questions?
For the most current version of these slides, please see
116. About Robert Grossman
Robert Grossman is the Founder and a Partner of Open Data Group, which specializes in building predictive models over big data. He is the Director of Informatics at the Institute for Genomics and Systems Biology (IGSB) and a faculty member in the Computation Institute (CI) at the University of Chicago. His areas of research include: big data, predictive analytics, bioinformatics, data intensive computing, and analytic infrastructure. He has led the development of new open source software tools for analyzing big data, cloud computing, data mining, distributed computing, and high performance networking. Prior to starting Open Data Group, he founded Magnify, Inc. in 1996, which provides data mining solutions to the insurance industry. Grossman was Magnify's CEO until 2001 and its Chairman until it was sold to ChoicePoint in 2005. He blogs about big data, data science, and data engineering at
117. About Robert Grossman (cont'd)
• You can find more information about big data and related areas on my blog:
• Some of my technical papers are available online at
• I just finished a book for the non-technical reader called The Structure of Digital Computing: From Mainframes to Big Data, which is available from Amazon.
118. About Collin Bennett
Collin Bennett is a principal at Open Data Group. In three and a half years with the company, Collin has worked on the open source Augustus scoring engine and a cloud-based environment for rapid analytic prototyping called RAP. Additionally, he has released open source projects for the Open Cloud Consortium. One of these, MalGen, has been used to benchmark several parallel computation frameworks. Previously, he led software development for the Product Development Team at Acquity Group, an IT consulting firm headquartered in Chicago. He also worked at startups Orbitz (when it still was one) and Business Logic Corporation. He has co-authored papers on Weyl tensors, large data clouds, and high performance wide area cloud testbeds. He holds degrees in English, Mathematics, and Computer Science.
119. About Open Data
• Open Data began operations in 2001 and has built predictive models for companies for over ten years
• Open Data provides management consulting, outsourced analytic services, & analytic staffing
• For more information
120. About Open Data Group (cont'd)
Open Data Group has been building predictive models over big data for its clients for over ten years. Open Data Group is one of the pioneers using technologies such as Hadoop and NoSQL databases so that companies can build predictive models efficiently over all of their data. Open Data has also participated in the development of the Predictive Model Markup Language (PMML) so that companies can more quickly deploy analytic models into their operational systems. For the past several years, Open Data has also helped companies build predictive models and score data in real time using elastic clouds, such as those provided with Amazon's AWS. Open Data's consulting philosophy is nicely summed up by Marvin Bower, one of McKinsey & Company's founders. Bower led his firm using three rules: (1) put the interests of the client ahead of revenues; (2) tell the truth and don't be afraid to challenge a client's opinion; and (3) only agree to perform work that is necessary and something [you] can do well.