Rethinking classical approaches to analysis and predictive modeling



The speaker will address the need to rethink classical approaches to analysis and predictive modeling. He will examine "iterative analytics" and extremely fine-grained segmentation, down to a single customer: ultimately, building one model per customer, or millions of predictive models, to deliver on the promise of a "segment of one". The speaker will also address the speed at which all of this has to work for innovative businesses to maintain a competitive advantage.

Afshin Goodarzi, Chief Analyst, 1010data

A veteran of analytics, Goodarzi has led several teams in designing, building and delivering predictive analytics and business analytical products for a diverse set of industries. Prior to joining 1010data, Goodarzi was the Managing Director of Mortgage at Equifax, responsible for the creation of new data products and supporting analytics for the financial industry. Previously, he led the development of various classes of predictive models aimed at the mortgage industry during his tenure at Loan Performance (Core Logic). Earlier, he worked at BlackRock, the research center for NYNEX (present-day Verizon) and Norkom Technologies. Goodarzi's publications span the fields of data mining, data visualization, optimization and artificial intelligence.

1010Data
Microsoft NERD
Cognizeus

Published in: Technology


  1. Predictive Analytics on a Big Data Scale! Afshin Goodarzi, April 2014
  2. About 1010data • Founded in 2000 • Based in NYC • Big Data analytics platform in the cloud • Library of pre-built analytical applications • Speed, power and flexibility second to none
  3. We Host/Analyze 14+ Trillion Rows of Data • All Quotes and Trades since 2003 on NYSE are done on 1010data • All mortgages ever issued are analyzed on 1010data • Nearly all real-estate transactions are completed on 1010data • Big Data - Granular Data - Time series Data • All data for ~35,000 Retail outlets across the US are analyzed on 1010data
  4. A Typical BI Technology Stack: Administrators • Data Sources • ETL • Inter-Enterprise Users • EDW • Data Cubes/Marts • Reporting/Visualization • Analysis/Modeling
  5. The Stack Has Fallen!
  6. The Analytics Continuum & A Single Version of the Truth
  7. Intuitive Access to Unlimited Amounts of Data: Partner Data • 3rd Party Data • 1010data Cloud • Corporate Data • 425,369,127,325 Rows!
  8. The code: Chart 1
     <layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
       <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
         <sel value="between(date;'{@startdate}';'{@enddate}')"/>
         <sel value="(symbol='{@symbol}')"/>
         <tabu label="Candle Stick" breaks="date">
           <break col="date" sort="up"/>
           <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
           <tcol source="prc" fun="hi" name="high" label="High"/>
           <tcol source="prc" fun="lo" name="low" label="Low"/>
           <tcol source="prc" fun="first" name="open" label="Open"/>
           <tcol source="prc" fun="last" name="close" label="Close"/>
         </tabu>
         <graphspec>
           <chart type="candlestick" title="Candlestick Chart for {@symbol}">
             <axes xlabel="Date" ylabel="Trading Price"/>
           </chart>
         </graphspec>
       </widget>
       <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
       <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
     </layout>
     Slide labels: Query, Chart Spec
  9. Predictive Analytics on a Big Data Scale! Big Data mandated Analytics and predictive modeling - an example: The larger data sets have mandated more rigorous sampling strategies as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data. • Can we use all but a small holdout set in predictive modeling? • What are the challenges? • What is an approach that works? • Are the results any good? • Is this solution only applicable to one industry?
  10. Common Predictive Modeling Approach • CPU intensive & error prone steps: » Data selection » IV to DV relationship » Transformations » Sampling and validation » Model estimation » Model testing » Repeat. Source: http://-22/v2chapter5.html
  11. “One Segment” => “A Segment of One”. “Any customer can have a car painted any color that he wants so long as it is black.” re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p. 71)
  12. Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
  13. Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
      Shopper     SKU      Probability of purchase in the next 30 days
      A. Smith    12345    90%
      A. Smith    23567    85%
      A. Smith    ….
      A. Smith    87996    30%
      Data sources: POS • Loyalty • Econ • House prices • Mortgage Rates • BLS - Unemployment • Inventory
      With Permission from A&P
  14. If The Shopper Bought “It” Before Will They Buy “It” Again? • Classical modeling: variables as either positively or negatively correlated with target • Shoppers don’t behave the same! • The demographic attributes have distributions for each variable!
  15. Subscribers are “A Segment Of One”!
  16. All sources of Prepay as analyzed in 1989 • Diagram labels: D, R, M, I • Interest Rates • House prices • Unemployment • Loan Age • Cost of option • Regional economy • Source: http://-states/unemployment-rate
  17. Quality Measures: Lift => AUC
  18. Fine vs. Coarse: Cash flows
  19. InQuery analytics: User Defined Group Functions • User defined: KNN, Naïve Bayes, ARCH/AR, PCA, Kernel, Decision Tree, Logistics trees, FFT, Etc.
  20. Questions?
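
The query on slide 8 groups trades by date and computes a volume-weighted average price (VWAP) plus high, low, open and close for a candlestick chart. The following is a minimal Python sketch of that same aggregation, not 1010data code: the function name `daily_candles`, the tuple layout and the sample trades are all hypothetical, and it assumes trades arrive in time order with positive volume.

```python
def daily_candles(trades):
    """Aggregate (date, price, volume) trades into one candle per date.

    Assumes trades are in time order within each date and that the total
    volume per date is positive (both are assumptions of this sketch).
    """
    by_date = {}
    for date, price, volume in trades:
        by_date.setdefault(date, []).append((price, volume))

    candles = []
    for date in sorted(by_date):  # ISO date strings sort chronologically
        rows = by_date[date]
        prices = [p for p, _ in rows]
        total_volume = sum(v for _, v in rows)
        candles.append({
            "date": date,
            # Volume-weighted average price, like fun="wavg" weight="vol"
            "vwap": sum(p * v for p, v in rows) / total_volume,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],    # first trade of the day
            "close": prices[-1],  # last trade of the day
        })
    return candles


# Made-up trades, for illustration only.
trades = [
    ("2014-04-01", 10.0, 100),
    ("2014-04-01", 12.0, 300),
    ("2014-04-01", 11.0, 100),
    ("2014-04-02", 11.5, 200),
]
candles = daily_candles(trades)
```

The first candle weights the 12.0 trade three times as heavily as the other two, which is why its VWAP sits above the simple average price.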
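
Slide 13 asks for a per-shopper list of SKUs ranked by probability of purchase in the next 30 days. The slides do not say how those probabilities were modeled, so the sketch below swaps in a deliberately simple stand-in: a smoothed purchase frequency over past 30-day windows, computed separately for each shopper. The function `shopping_list`, the window indexing and the sample history are hypothetical.

```python
from collections import defaultdict


def shopping_list(purchases, n_windows, top_k=3):
    """Rank SKUs per shopper by a smoothed historical purchase frequency.

    purchases: iterable of (shopper, sku, window_index), where window_index
    identifies one of n_windows past 30-day windows. This frequency estimate
    is an assumption of the sketch, not the model used in the talk.
    """
    windows_seen = defaultdict(set)
    for shopper, sku, window in purchases:
        windows_seen[(shopper, sku)].add(window)

    per_shopper = defaultdict(list)
    for (shopper, sku), windows in windows_seen.items():
        # Laplace smoothing keeps estimates away from exactly 0 or 1.
        probability = (len(windows) + 1) / (n_windows + 2)
        per_shopper[shopper].append((sku, probability))

    # Highest probability first; ties broken by SKU name for stable output.
    return {
        shopper: sorted(items, key=lambda item: (-item[1], item[0]))[:top_k]
        for shopper, items in per_shopper.items()
    }


# Made-up history: shopper s1 buys milk in all 8 windows, eggs in 2 of them.
history = [("s1", "milk", w) for w in range(8)]
history += [("s1", "eggs", 2), ("s1", "eggs", 5), ("s2", "eggs", 0)]
lists = shopping_list(history, n_windows=8)
```

Because each shopper's estimates come only from that shopper's own history, this is a crude instance of the "segment of one" idea; a production model would presumably also use the loyalty, economic and inventory data the slide names.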
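
Slide 17 names two quality measures, lift and AUC, without defining them. As a rough illustration under the standard definitions: lift at a given fraction compares the positive rate among the top-scored cases to the overall positive rate, and AUC is the probability that a randomly chosen positive case outscores a randomly chosen negative one. The function names and sample data below are hypothetical, and the pairwise AUC loop is only suitable for small inputs.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in input size, so this is a
    teaching sketch rather than something to run on large data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def lift_at(labels, scores, fraction=0.1):
    """Positive rate in the top `fraction` of scores over the base rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, int(round(fraction * len(ranked))))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


# Made-up labels (1 = event happened) and model scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]
model_auc = auc(labels, scores)
top_lift = lift_at(labels, scores, fraction=0.2)
```

Lift depends on the cutoff chosen, whereas AUC summarizes ranking quality across all cutoffs, which may be one reason to move from the former to the latter.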
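
Slide 19 lists user-defined group functions, AR models among them, that run inside a query. The sketch below illustrates the general pattern in plain Python: group rows by a key, then apply a user-supplied function to each group, here a least-squares AR(1) coefficient without intercept. The helpers `group_apply` and `ar1_coefficient` and the sample rows are hypothetical and say nothing about how 1010data implements this.

```python
from collections import defaultdict


def ar1_coefficient(series):
    """Least-squares estimate of phi in x_t = phi * x_(t-1), no intercept.

    Assumes at least two observations and a non-zero sum of squares.
    """
    numerator = sum(cur * prev for prev, cur in zip(series, series[1:]))
    denominator = sum(prev * prev for prev in series[:-1])
    return numerator / denominator


def group_apply(rows, func):
    """Group (key, value) rows by key, keep row order, apply func per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}


# Made-up, interleaved observations for two customers.
rows = [
    ("a", 1.0), ("b", 2.0),
    ("a", 0.5), ("b", 4.0),
    ("a", 0.25), ("b", 8.0),
    ("a", 0.125),
]
coefficients = group_apply(rows, ar1_coefficient)
```

Each key gets its own fitted coefficient, so with one key per customer the same pattern yields one model per customer.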