1
Predictive Analytics on a Big Data Scale!
Afshin Goodarzi
afshin@1010data.com
April, 2014
2
About 1010data
•  Founded in 2000
•  Based in NYC
•  Big Data analytics platform in the cloud
•  Library of pre-built analytical applications
•  Speed, power and flexibility second to none
  
3
We Host/Analyze 14+ Trillion Rows of Data
  
All Quotes and Trades since 2003 on NYSE are done on 1010data
All mortgages ever issued are analyzed on 1010data
Nearly all real-estate transactions are completed on 1010data
Big Data - Granular Data - Time series Data	
  
All data for ~35,000 Retail outlets across the US are analyzed on 1010data
4
A Typical BI Technology Stack
Administrators
Data Sources
ETL
Inter-Enterprise Users
EDW
Data Cubes / Marts
Reporting / Visualization
Analysis / Modeling
  
5
The Stack Has Fallen!
  
6
The Analytics Continuum & A Single Version of the Truth
  
7
Intuitive Access to Unlimited Amounts of Data
Partner Data
3rd Party Data
1010data Cloud
Corporate Data
425,369,127,325 Rows!
  
8
The code: Chart 1
<layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
  <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
    <sel value="between(date;'{@startdate}';'{@enddate}')"/>
    <sel value="(symbol='{@symbol}')"/>
    <tabu label="Candle Stick" breaks="date">
      <break col="date" sort="up"/>
      <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
      <tcol source="prc" fun="hi" name="high" label="High"/>
      <tcol source="prc" fun="lo" name="low" label="Low"/>
      <tcol source="prc" fun="first" name="open" label="Open"/>
      <tcol source="prc" fun="last" name="close" label="Close"/>
    </tabu>
    <graphspec>
      <chart type="candlestick" title="Candlestick Chart for {@symbol}">
        <axes xlabel="Date" ylabel="Trading Price"/>
      </chart>
    </graphspec>
  </widget>
  <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
  <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
</layout>
Query
Chart Spec
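For readers who do not know the 1010data macro language, the query part of the slide can be read as a per-date aggregation of trades. The following is a minimal Python sketch of that reading, not 1010data code. It assumes, from the attribute names only, that fun="wavg" is a volume-weighted average of prc, that "hi" and "lo" are the maximum and minimum, and that "first" and "last" take the first and last trade of each date in input order. The function name candlestick_rows and the (date, prc, vol) tuple layout are hypothetical.

```python
def candlestick_rows(trades):
    """Summarize trades into one row per date.

    trades: iterable of (date, prc, vol) tuples, assumed to be in time
    order within each date. Dates are assumed to be ISO strings
    (YYYY-MM-DD), so sorting them as text sorts them chronologically.
    """
    by_date = {}
    for date, prc, vol in trades:
        by_date.setdefault(date, []).append((prc, vol))

    rows = []
    for date in sorted(by_date):  # mirrors <break col="date" sort="up"/>
        ticks = by_date[date]
        prices = [p for p, _ in ticks]
        total_vol = sum(v for _, v in ticks)
        # Volume-weighted average price; undefined when no volume traded.
        vwap = sum(p * v for p, v in ticks) / total_vol if total_vol else None
        rows.append({
            "date": date,
            "vwap": vwap,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],    # first trade of the date
            "close": prices[-1],  # last trade of the date
        })
    return rows


if __name__ == "__main__":
    sample = [
        ("2014-04-01", 10.0, 100),
        ("2014-04-01", 12.0, 300),
        ("2014-04-01", 11.0, 100),
    ]
    for row in candlestick_rows(sample):
        print(row)
```

The two <sel> filters on the slide (date range and symbol) would be applied before this step; they are left out here to keep the sketch short.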
  
9
Predictive Analytics on a Big Data Scale!
Big Data mandated Analytics and predictive modeling - an example:
The larger data sets have mandated more rigorous sampling strategies as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.
•  Can we use all but a small holdout set in predictive modeling?
•  What are the challenges?
•  What is an approach that works?
•  Are the results any good?
•  Is this solution only applicable to one industry?
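The first question, training on all but a small holdout set, can be sketched in a few lines. The slides do not say how the holdout is chosen, so the hash-based assignment below is an assumption: it needs no shuffle and no stored row list, which is one plausible way to split very large tables. The names in_holdout and split, the salt string, and the 1% default are all hypothetical.

```python
import hashlib


def in_holdout(key, holdout_fraction=0.01, salt="holdout-v1"):
    """Decide deterministically whether the row with this key is held out.

    The key is hashed together with a salt and the first 8 bytes of the
    digest are read as an integer in [0, 2**64). The row is held out when
    that integer falls in the lowest holdout_fraction of the range.
    """
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < holdout_fraction * 2**64


def split(rows, key_fn, holdout_fraction=0.01):
    """Return (train, holdout); every row lands in exactly one of them."""
    train, holdout = [], []
    for row in rows:
        target = holdout if in_holdout(key_fn(row), holdout_fraction) else train
        target.append(row)
    return train, holdout


if __name__ == "__main__":
    train, holdout = split(range(100000), key_fn=str, holdout_fraction=0.01)
    print(len(train), len(holdout))
```

Because the assignment depends only on the key and the salt, the same row goes to the same side on every run, so the holdout stays untouched when the model is re-estimated.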
  	
  
10
Common Predictive Modeling Approach
•  CPU intensive & error prone steps:
»  Data selection
»  IV to DV relationship
»  Transformations
»  Sampling and validation
»  Model estimation
»  Model testing
»  Repeat
http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
  
11
“One Segment” => “A Segment of One”
“Any customer can have a car painted any color that he wants so long as it is black.”
re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p.71)
  
12
Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
  
13
Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
Shopper     SKU      Probability of purchase in the next 30 days
A. Smith    12345    90%
A. Smith    23567    85%
A. Smith    ….
A. Smith    87996    30%
POS
Loyalty
Econ House prices
Mortgage Rates
BLS - Unemployment
Inventory
With Permission from A&P
  	
  
14
If The Shopper Bought “It” Before Will They Buy “It” Again?
•  Classical modeling: variables as either positively or negatively correlated with target
•  Shoppers don’t behave the same!
•  The demographic attributes have distributions for each variable!
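The point that shoppers do not behave the same can be illustrated with a per-shopper, per-SKU estimate instead of one pooled coefficient. The sketch below is a simple illustrative baseline, not the model behind the slides: it assumes history has already been cut into consecutive 30-day windows and estimates the purchase probability as the share of past windows in which that shopper bought that SKU. The names repurchase_rates and shopping_list and the (shopper, sku, window) layout are hypothetical.

```python
from collections import defaultdict


def repurchase_rates(purchases, n_windows):
    """Estimate P(purchase in a 30-day window) per (shopper, sku).

    purchases: iterable of (shopper, sku, window) where window is the
    index of the 30-day period, in range(n_windows), in which the
    purchase happened. Repeat purchases in one window count once.
    """
    windows_seen = defaultdict(set)
    for shopper, sku, window in purchases:
        if not 0 <= window < n_windows:
            raise ValueError(f"window {window} outside 0..{n_windows - 1}")
        windows_seen[(shopper, sku)].add(window)
    return {key: len(w) / n_windows for key, w in windows_seen.items()}


def shopping_list(rates, shopper):
    """Return the shopper's SKUs, most likely purchase first."""
    items = [(sku, p) for (s, sku), p in rates.items() if s == shopper]
    return sorted(items, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = [
        ("A. Smith", "12345", 0), ("A. Smith", "12345", 1),
        ("A. Smith", "12345", 2), ("A. Smith", "87996", 0),
    ]
    rates = repurchase_rates(history, n_windows=4)
    print(shopping_list(rates, "A. Smith"))
```

Each shopper gets an individual rate for each SKU, which is the "segment of one" idea in its simplest form. A real model would also use the other sources listed for this use case, such as inventory and economic series.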
  
15
Subscribers are “A Segment Of One”!
  
16
All sources of Prepay as analyzed in 1989
Interest Rates
House prices
Unemployment
Loan Age
Cost of option
Regional economy
http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
http://www.tradingeconomics.com/united-states/unemployment-rate
http://www.wfa.gov/
http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf
  
17
Quality Measures: Lift => AUC
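The slide names the two measures without defining them. As a reminder of the standard definitions, here is a minimal Python sketch: lift compares the positive rate among the top-scored fraction with the overall positive rate, and AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function names and the 0/1 label encoding are assumptions for illustration.

```python
def lift_at(scores, labels, fraction=0.1):
    """Positive rate in the top-scored fraction divided by the base rate."""
    n = len(scores)
    positives = sum(labels)
    if n == 0 or positives == 0:
        raise ValueError("need at least one row and one positive label")
    k = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / (positives / n)


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative labels")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores pairs[i:j] and give each the mean rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # ranks are 1-based: i+1 .. j
        rank_sum += mean_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    s = [0.9, 0.8, 0.3, 0.1]
    y = [1, 0, 1, 0]
    print(lift_at(s, y, fraction=0.25), auc(s, y))
```

Lift depends on the chosen cutoff fraction, while AUC summarizes the ranking over all cutoffs, which is one common reason to report AUC alongside lift.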
  
18
Fine vs. Coarse: Cash flows
  
19
InQuery analytics – User Defined Group Functions
•  User defined
−  KNN
−  Naïve Bayes
−  ARCH/AR
−  PCA
−  Kernel
−  Decision Tree
−  Logistic trees
−  FFT
−  Etc.
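The idea of a user-defined group function can be sketched generically: rows are partitioned by a key and a user-supplied function is run once per group, so a small model is fitted inside the query for every group. The Python below is only an illustration of that pattern and does not show 1010data's API. The helper group_apply is hypothetical, and the example function is an AR(1) coefficient (one of the listed families) fitted by least squares without an intercept, assuming rows arrive in time order within each group.

```python
from collections import defaultdict


def group_apply(rows, key_fn, value_fn, group_fn):
    """Run group_fn once per group and return {group key: result}.

    Values keep the order in which their rows were seen.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(value_fn(row))
    return {key: group_fn(values) for key, values in groups.items()}


def ar1_coefficient(series):
    """Least-squares slope of x[t] on x[t-1], with no intercept term.

    Returns None when the series is too short or the lagged values are
    all zero, because the slope is undefined in those cases.
    """
    prev, curr = series[:-1], series[1:]
    denom = sum(p * p for p in prev)
    if denom == 0:
        return None
    return sum(p * c for p, c in zip(prev, curr)) / denom


if __name__ == "__main__":
    rows = [("a", 1.0), ("a", 2.0), ("a", 4.0), ("b", 3.0), ("b", 3.0)]
    print(group_apply(rows, lambda r: r[0], lambda r: r[1], ar1_coefficient))
```

Any of the other listed methods could be passed as group_fn in the same way, as long as it takes the values of one group and returns one result.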
  
20
Questions?
  

Rethinking classical approaches to analysis and predictive modeling

  • 1. Predictive Analytics on a Big Data Scale! Afshin Goodarzi, afshin@1010data.com, April 2014
  • 2. About 1010data
      • Founded in 2000
      • Based in NYC
      • Big Data analytics platform in the cloud
      • Library of pre-built analytical applications
      • Speed, power and flexibility second to none
  • 3. We Host/Analyze 14+ Trillion Rows of Data
      • All Quotes and Trades since 2003 on NYSE are done on 1010data
      • All mortgages ever issued are analyzed on 1010data
      • Nearly all real-estate transactions are completed on 1010data
      • All data for ~35,000 Retail outlets across the US are analyzed on 1010data
      • Big Data - Granular Data - Time series Data
  • 4. A Typical BI Technology Stack (diagram labels: Administrators; Data Sources; ETL; Inter-Enterprise Users; EDW; Data Cubes/Marts; Reporting / Visualization; Analysis / Modeling)
  • 5. The Stack Has Fallen!
  • 6. The Analytics Continuum & A Single Version of the Truth
  • 7. Intuitive Access to Unlimited Amounts of Data (diagram labels: Partner Data; 3rd Party Data; Corporate Data; 1010data Cloud; 425,369,127,325 Rows!)
  • 8. The code: Chart 1 (slide callouts: Query, Chart Spec)
      <layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
        <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
          <sel value="between(date;'{@startdate}';'{@enddate}')"/>
          <sel value="(symbol='{@symbol}')"/>
          <tabu label="Candle Stick" breaks="date">
            <break col="date" sort="up"/>
            <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
            <tcol source="prc" fun="hi" name="high" label="High"/>
            <tcol source="prc" fun="lo" name="low" label="Low"/>
            <tcol source="prc" fun="first" name="open" label="Open"/>
            <tcol source="prc" fun="last" name="close" label="Close"/>
          </tabu>
          <graphspec>
            <chart type="candlestick" title="Candlestick Chart for {@symbol}">
              <axes xlabel="Date" ylabel="Trading Price"/>
            </chart>
          </graphspec>
        </widget>
        <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
        <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
      </layout>
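For readers unfamiliar with the query markup on slide 8, the per-date aggregation in its tabulation step (VWAP, high, low, open, close) can be sketched in plain Python. This is an illustrative equivalent only, not 1010data code: the function name `daily_candles` and the sample trades are hypothetical, and trades are assumed to arrive in time order within each date and to have positive total volume.

```python
def daily_candles(trades):
    """Aggregate (date, price, volume) trades into one candlestick row per date.

    Assumption: trades are in time order within each date, so the first and
    last prices stand in for open and close.
    """
    groups = {}
    for date, prc, vol in trades:
        groups.setdefault(date, []).append((prc, vol))

    rows = []
    for date in sorted(groups):  # ascending dates, like sort="up" on the break column
        ticks = groups[date]
        prices = [p for p, _ in ticks]
        total_vol = sum(v for _, v in ticks)
        rows.append({
            "date": date,
            "vwap": sum(p * v for p, v in ticks) / total_vol,  # volume-weighted average price
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],
            "close": prices[-1],
        })
    return rows


# Hypothetical sample data, not taken from the deck.
sample = [
    ("2014-04-01", 10.0, 100),
    ("2014-04-01", 12.0, 300),
    ("2014-04-01", 11.0, 100),
    ("2014-04-02", 11.5, 200),
]
candles = daily_candles(sample)
```

Each resulting row carries the five values a candlestick chart needs for one date.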
  • 9. Predictive Analytics on a Big Data Scale!
      Big Data mandated Analytics and predictive modeling - an example:
      The larger data sets have mandated more rigorous sampling strategies, as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.
      • Can we use all but a small holdout set in predictive modeling?
      • What are the challenges?
      • What is an approach that works?
      • Are the results any good?
      • Is this solution only applicable to one industry?
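One way to picture "all but a small holdout set" is a hash-based split, sketched below under stated assumptions. The deck does not say how the holdout was chosen; the function name `in_holdout`, the 1% share, and the integer record keys are all hypothetical. Hashing the key gives a repeatable assignment without shuffling or copying the full data set.

```python
import hashlib


def in_holdout(key, holdout_pct=1):
    """Return True if the record key falls in the holdout bucket.

    The key is hashed into one of 100 buckets; the lowest `holdout_pct`
    buckets form the holdout set. The assignment is stable across runs.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_pct


# Hypothetical record keys; in practice this would be a loan, shopper, or trade ID.
record_ids = range(100_000)
holdout = [r for r in record_ids if in_holdout(r)]
train_count = 100_000 - len(holdout)
```

With this scheme roughly 99% of records remain available for model estimation, and the same records land in the holdout set every time the split is recomputed.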
  • 10. Common Predictive Modeling Approach
      CPU intensive & error prone steps:
      » Data selection
      » IV to DV relationship
      » Transformations
      » Sampling and validation
      » Model estimation
      » Model testing
      » Repeat
      http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
  • 11. “One Segment” => “A Segment of One”
      “Any customer can have a car painted any color that he wants so long as it is black.”
      re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p.71)
  • 12. Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
  • 13. Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
      Shopper     SKU      Probability of purchase in the next 30 days
      A. Smith    12345    90%
      A. Smith    23567    85%
      A. Smith    ….
      A. Smith    87996    30%
      Data sources: POS; Loyalty; Econ; House prices; Mortgage Rates; BLS - Unemployment; Inventory
      With Permission from A&P
  • 14. If The Shopper Bought “It” Before Will They Buy “It” Again?
      • Classical modeling: variables as either positively or negatively correlated with target
      • Shoppers don’t behave the same!
      • The demographics attributes have distributions for each variable!
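The shopper-by-SKU table on slide 13 can be illustrated with a deliberately naive per-shopper baseline: the share of past 30-day windows in which a shopper bought a given SKU. This is not the model used in the deck, which also draws on loyalty, economic, and inventory data. The function names, the window indexing, and the sample history below are hypothetical.

```python
from collections import defaultdict


def purchase_rates(purchases, n_windows):
    """Naive per-shopper, per-SKU estimate of repeat purchase.

    purchases: iterable of (shopper, sku, window_index), where window_index
               identifies a past 30-day window, 0 <= window_index < n_windows.
    Returns {(shopper, sku): share of windows with at least one purchase}.
    """
    seen = defaultdict(set)
    for shopper, sku, window in purchases:
        seen[(shopper, sku)].add(window)
    return {key: len(windows) / n_windows for key, windows in seen.items()}


def shopping_list(rates, shopper):
    """Rank one shopper's SKUs by estimated purchase probability, highest first."""
    items = [(sku, p) for (s, sku), p in rates.items() if s == shopper]
    return sorted(items, key=lambda item: item[1], reverse=True)


# Hypothetical purchase history over 10 past windows.
history = [("A. Smith", "12345", w) for w in range(9)]
history += [("A. Smith", "87996", w) for w in (0, 4, 8)]
history += [("B. Jones", "12345", 2)]

rates = purchase_rates(history, n_windows=10)
smith_list = shopping_list(rates, "A. Smith")
```

Because the rates are keyed by shopper and SKU, the same SKU gets a different estimate for each shopper, which is the "segment of one" idea in its simplest form.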
  • 15. Subscribers are “A Segment Of One”!
  • 16. All sources of Prepay as analyzed in 1989
      Diagram labels: D; R; M; I; Interest Rates; House prices; Unemployment; Loan Age; Cost of option; Regional economy
      http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
      http://www.tradingeconomics.com/united-states/unemployment-rate
      http://www.wfa.gov/
      http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf
  • 17. Quality Measures: Lift => AUC
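The two quality measures named on slide 17 can be computed as follows. This is a minimal sketch using the standard definitions, not code from the deck: AUC is computed by pairwise comparison of positive and negative scores, which is quadratic and suitable only for small illustrations, and lift is the response rate in the top-scored fraction divided by the overall response rate. The labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(labels, scores, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)]
    k = max(1, int(len(ranked) * fraction))
    return (sum(ranked[:k]) / k) / (sum(ranked) / len(ranked))


# Hypothetical outcomes (1 = event occurred) and model scores.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

model_auc = auc(labels, scores)
top_quarter_lift = lift_at(labels, scores, 0.25)
```

Lift depends on the cutoff fraction chosen, while AUC summarizes ranking quality across all cutoffs in a single number.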
  • 18. Fine vs. Coarse: Cash flows
  • 19. InQuery analytics: User Defined Group Functions
      • User defined
          − KNN
          − Naïve Bayes
          − ARCH/AR
          − PCA
          − Kernel
          − Decision Tree
          − Logistics trees
          − FFT
          − Etc……..
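The general idea of a user-defined group function can be sketched as a function applied independently to each group's values. The example below is generic Python, not the 1010data interface: `group_apply` and `ar1_coefficient` are hypothetical names, and the AR(1) fit is a simple no-intercept least-squares estimate chosen because AR appears in the list above.

```python
from collections import defaultdict


def group_apply(rows, key, value, fn):
    """Collect `value` per distinct `key` (in row order) and apply `fn` to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: fn(v) for k, v in groups.items()}


def ar1_coefficient(series):
    """Least-squares AR(1) coefficient without intercept: x[t] ~ phi * x[t-1]."""
    num = sum(x1 * x0 for x0, x1 in zip(series, series[1:]))
    den = sum(x0 * x0 for x0 in series[:-1])
    return num / den


# Hypothetical time-ordered rows for two groups.
rows = [
    {"id": "A", "x": 1.0}, {"id": "A", "x": 2.0}, {"id": "A", "x": 4.0}, {"id": "A", "x": 8.0},
    {"id": "B", "x": 8.0}, {"id": "B", "x": 4.0}, {"id": "B", "x": 2.0}, {"id": "B", "x": 1.0},
]
phi_by_group = group_apply(rows, "id", "x", ar1_coefficient)
```

Any function that maps a list of values to a result could be passed in place of `ar1_coefficient`, so one model is fitted per group rather than one for the whole table.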