How Predictive Modelers Should Think about Big Data
Dean Abbott
Co-Founder and Chief Data Scientist, SmarterHQ
dabbott@smarterhq.com
Twitter: @deanabb
© Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
From Olap.com
The Usual Big Data Talk Track
http://whatis.techtarget.com/definition/3Vs
Big Data: Google Trends
[Figure: Google Trends chart; date markers at Jan 1, 2004, Aug 1, 2008, and Mar 1, 2013]
What is Big Data?
https://www.pinterest.com/pin/30962316158410859/
How Much Data is Big?
More data than you can process efficiently
ISBN-13: 978-1118824825
Big Data Contains….
“70 percent of US millennials say they would appreciate a brand or retailer using AI technology to show more interesting products. And 72 percent believe that as the technology develops, brands using AI will be able to accurately predict what they want.”
https://venturebeat.com/2016/09/26/how-a-i-is-helping-retailers/
“The future of retail technology lies in solutions that are powered by machine learning, which can provide fast and intelligent automation as well as dynamic scalability. Machine learning unleashes powerful self-adapting algorithms to uncover latent patterns of behavior that are difficult or impossible for decision-makers to discover on their own.”
“Extreme Personalization”
“…modern commerce continues to evolve from ‘what’s new’ to the ‘next-new’ player on the block. To compete, every company — brick-and-mortar, e-commerce, and modern commerce — needs to perpetually innovate on every front.”
“Engagement, not reach: AI and machine learning is advancing engagement tools to scale cross-channel, personalized messaging in the moments that matter in the channel customers prefer.”
Big Data Means Integrating Lots of Sources
[Diagram labels: Database, CRM, Flat Files, IoT, ETL]
“The vast majority of the challenges companies struggle as they operationalize Big Data are related to people, not technology: issues like organizational alignment, business process and adoption, and change management.”
https://hbr.org/2016/02/just-using-big-data-isnt-enough-anymore
Big Data is….
Big Data Contains….
Big Data Can Cause RAM Problems
Rows          Columns    GB
250,000       100        0.19
250,000       1,000      1.87
1,000,000     1,000      7.48
10,000,000    1,000      74.77
10,000,000    10,000     747.66
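The slide does not state how the GB column was derived. As a rough check, the sketch below assumes every cell is stored as an 8-byte floating-point value (an assumption, not something the slides say). Under that assumption the estimates land close to, though not exactly on, the figures in the table. `estimated_gib` is a hypothetical helper name.

```python
def estimated_gib(rows, columns, bytes_per_value=8):
    """Estimate in-memory size of a dense numeric table, in GiB.

    Assumes one fixed-width value per cell (8 bytes by default);
    real tools add overhead for indexes, strings, and copies.
    """
    return rows * columns * bytes_per_value / 2**30


# The row/column combinations shown on the slide.
shapes = [
    (250_000, 100),
    (250_000, 1_000),
    (1_000_000, 1_000),
    (10_000_000, 1_000),
    (10_000_000, 10_000),
]

for rows, columns in shapes:
    print(f"{rows:>12,} x {columns:>6,}: {estimated_gib(rows, columns):8.2f} GiB")
```

The useful point is the scaling: size grows with the product of rows and columns, so a 10x increase in either dimension is a 10x increase in RAM.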
Big Data Can Overwhelm -> Width
• Adding features & interactions makes big data bigger (and computationally worse!)
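To put a number on how quickly width grows, here is a small sketch that counts the columns after adding every interaction up to a given order. It assumes each interaction is a product of distinct original columns; `widened_columns` is a hypothetical name used only for illustration.

```python
from math import comb


def widened_columns(p, max_order=2):
    """Count columns after adding all interactions up to max_order.

    Order 1 is the p original columns; order k adds C(p, k)
    interactions among k distinct columns.
    """
    return sum(comb(p, k) for k in range(1, max_order + 1))


for p in (100, 1_000):
    print(f"{p:>5} inputs -> {widened_columns(p):>9,} columns with 2-way interactions")
```

With 1,000 inputs, two-way interactions alone add 499,500 columns, which is why feature creation has to be selective.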
Big Data Can Mislead
[Figure labels: 2X, 8X]
The Answer is…
Be Judicious
Leverage Scalable Environments
Teradata
Amazon AWS
Azure
Google
http://www.kdnuggets.com/2015/08/big-data-question-hadoop-spark.html
Parallelize Record Operations
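One way to read this slide: split the rows into chunks, let each worker compute a partial result, then combine the partials. The sketch below is an illustration only. The data is synthetic, `Fare_amount` is borrowed from the taxi data dictionary later in the deck, and threads are used purely to keep the example self-contained (for CPU-bound work in CPython a process pool or a cluster framework would be the realistic choice).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` records."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def partial_sum_count(chunk):
    """Per-chunk work: return (sum of fares, number of records)."""
    fares = [row["Fare_amount"] for row in chunk]
    return sum(fares), len(fares)


# Synthetic records standing in for a large table.
rows = [{"Fare_amount": float(i % 50)} for i in range(10_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_count, chunked(rows, 2_500)))

# Combine step: add up the partial sums and counts, then divide.
total, count = map(sum, zip(*partials))
mean_fare = total / count
print(count, mean_fare)
```

The mean is computed from per-chunk sums and counts rather than per-chunk means, so chunks of unequal size combine correctly.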
Parallelize Column Operations
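Column operations parallelize in the other direction: each column can be summarized or transformed independently of the rest. A minimal sketch, assuming a column-oriented table held as a dict of lists; the values are made up, and the column names are borrowed from the taxi data dictionary later in the deck.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev

# Synthetic column-oriented table: one list of values per column.
columns = {
    "Trip_distance": [0.5, 1.2, 3.4, 2.2, 8.0],
    "Fare_amount": [4.0, 7.5, 14.0, 10.0, 28.5],
    "Tip_amount": [0.0, 1.5, 2.0, 0.0, 5.0],
}


def summarize(item):
    """Compute summary statistics for a single (name, values) column."""
    name, values = item
    stats = {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),
    }
    return name, stats


# Each column is an independent task, so no combine step is needed.
with ThreadPoolExecutor() as pool:
    summary = dict(pool.map(summarize, columns.items()))

for name, stats in summary.items():
    print(name, stats)
```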
Parallelize Building Predictive Models Themselves
• The Target: Column
Days to Next Purchase <= 7 days
• The Target(s): Columns
– Suitable for same types of models for multiple target variables
Days to Next Purchase <= 1 day
Days to Next Purchase <= 3 days
Days to Next Purchase <= 7 days
Days to Next Purchase <= 15 days
Days to Next Purchase <= 30 days
Days to Next Purchase 30-60 days
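The target list above can be sketched as one binary column per threshold, derived from a single "days to next purchase" value, with one fitting job dispatched per target. Everything here is illustrative: the data is synthetic, the 30-60 bucket is assumed to mean 30 < d <= 60, and `fit_stand_in` is a placeholder that only returns the positive rate where a real model fit would go.

```python
from concurrent.futures import ThreadPoolExecutor

# Synthetic "days to next purchase" values, one per customer.
days_to_next_purchase = [0, 2, 5, 7, 12, 20, 29, 31, 45, 60, 90]

# One rule per target column from the slide.
target_rules = {
    "<= 1 day": lambda d: d <= 1,
    "<= 3 days": lambda d: d <= 3,
    "<= 7 days": lambda d: d <= 7,
    "<= 15 days": lambda d: d <= 15,
    "<= 30 days": lambda d: d <= 30,
    "30-60 days": lambda d: 30 < d <= 60,  # assumed bucket boundaries
}

# Derive the binary target columns.
targets = {
    name: [int(rule(d)) for d in days_to_next_purchase]
    for name, rule in target_rules.items()
}


def fit_stand_in(item):
    """Placeholder for model fitting: returns the target's positive rate."""
    name, y = item
    return name, sum(y) / len(y)


# The same kind of job runs once per target, so the jobs are independent.
with ThreadPoolExecutor() as pool:
    base_rates = dict(pool.map(fit_stand_in, targets.items()))

for name, rate in base_rates.items():
    print(f"{name:>10}: positive rate {rate:.2f}")
```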
NY City Taxi Data
• 5,199,911 observations
• 19 variables
• 1.05 GB
Field Name              Description
VendorID                A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc.
tpep_pickup_datetime    The date and time when the meter was engaged.
tpep_dropoff_datetime   The date and time when the meter was disengaged.
Passenger_count         The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance           The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude        Longitude where the meter was engaged.
Pickup_latitude         Latitude where the meter was engaged.
RateCodeID              The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride.
Store_and_fwd_flag      This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip.
Dropoff_longitude       Longitude where the meter was disengaged.
Dropoff_latitude        Latitude where the meter was disengaged.
Payment_type            A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip.
Fare_amount             The time-and-distance fare calculated by the meter.
Extra                   Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax                 $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge   $0.30 improvement surcharge assessed trips at the flag drop.
Tip_amount              Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount            Total amount of all tolls paid in trip.
Total_amount            The total amount charged to passengers. Does not include cash tips.
Thanks to Joshua Adams for the Azure test results
https://www.linkedin.com/in/joshuaadams3/
Processing Results in Azure
Cores       Algorithm       Rows      Features   Elapsed Time
Single      Random Forest   25000     19         0:02:58
Single      Random Forest   50000     19         0:07:08
Single      Random Forest   100000    19         1:11:48
Single      Random Forest   200000    19         1:43:05
Single      Random Forest   400000    19         5:25:05
Single      Random Forest   800000    19         19:25:50
Multiple    Random Forest   25000     19         0:01:32
Multiple    Random Forest   50000     19         0:03:47
Multiple    Random Forest   100000    19         0:34:12
Multiple    Random Forest   200000    19         0:57:16
Multiple    Random Forest   400000    19         1:48:23
Multiple    Random Forest   800000    19         3:48:03
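The elapsed times are easier to compare as speedup ratios (single-core time divided by multi-core time). The sketch below only re-expresses the numbers in the table. The slide does not say how many cores "Multiple" means, so no efficiency figure is computed; `to_seconds` is a hypothetical helper.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Elapsed times copied from the results table, keyed by row count.
single = {
    25000: "0:02:58", 50000: "0:07:08", 100000: "1:11:48",
    200000: "1:43:05", 400000: "5:25:05", 800000: "19:25:50",
}
multiple = {
    25000: "0:01:32", 50000: "0:03:47", 100000: "0:34:12",
    200000: "0:57:16", 400000: "1:48:23", 800000: "3:48:03",
}

speedup = {
    rows: to_seconds(single[rows]) / to_seconds(multiple[rows])
    for rows in single
}

for rows, ratio in speedup.items():
    print(f"{rows:>7} rows: {ratio:.2f}x faster on multiple cores")
```

On these numbers the ratio runs from roughly 1.8x at the smaller sizes to about 5.1x at 800,000 rows, so the benefit of parallelizing grows with the data.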
Algorithm searches we don’t always have “time” to do
• K-Means Clustering
– # clusters, optimal set of inputs
• k-NN
– k, find optimal set of inputs
• Neural Networks
– Architecture selection: # hidden layers and # neurons per hidden layer
– Learning parameters (learning rate, momentum for backprop)
• Logistic Regression
– Factorial design / interaction effects
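Each search listed above is a set of independent model fits, which is what makes it a good candidate for parallel execution. As one illustration, the sketch below searches over k for k-NN on a synthetic two-class dataset, scoring every candidate on a holdout set in parallel. The data, the candidate values, and all function names are assumptions made for the example. Threads keep it self-contained; CPU-bound searches in CPython would normally use a process pool or a cluster.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_data(n=200, seed=0):
    """Synthetic 2-D points from two overlapping Gaussian blobs."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        point = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        data.append((point, label))
    return data


def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    def squared_distance(row):
        (x, y), _ = row
        return (x - point[0]) ** 2 + (y - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def holdout_accuracy(k, train, test):
    """Fraction of holdout points classified correctly for a given k."""
    hits = sum(knn_predict(train, point, k) == label for point, label in test)
    return hits / len(test)


data = make_data()
train, test = data[:150], data[150:]
candidate_ks = [1, 3, 5, 7, 9, 15, 25]  # odd values avoid tied votes

# One independent evaluation per candidate k.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda k: holdout_accuracy(k, train, test), candidate_ks))

results = dict(zip(candidate_ks, scores))
best_k = max(results, key=results.get)
print(results, "best k:", best_k)
```

The same pattern extends to the other searches on the slide by swapping the candidate list for a grid of settings (number of clusters, hidden-layer sizes, learning rates, interaction terms).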
Conclusions
• The Good: Big data + AI is here and decision-makers care
• The Bad: Big data is big, but not smart; requires company buy-in
• The Ugly: Big data stresses infrastructure
• One Solution: cloud computing and parallelization
THANK YOU!
SmarterHQ.com | @deanabb | dabbott@SmarterHQ.com

900 keynote abbott

  • 1. How Predictive Modelers Should Think about Big Data Dean Abbott Co-Founder and Chief Data Scientist, SmarterHQ dabbott@smarterhq.com Twitter: @deanabb
  • 2. 2 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 3. 3 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 4. 4 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved From Olap.com
  • 5. 5 The Usual Big Data Talk Track © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved http://whatis.techtarget.com/definition/3Vs
  • 6. 6 Big Data: Google Trends Jan 1, 2004 Aug 1, 2008 Mar 1, 2013 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 7. 7 Big Data: Google Trends Jan 1, 2004 Aug 1, 2008 Mar 1, 2013 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 8. 8 Big Data: Google Trends Jan 1, 2004 Aug 1, 2008 Mar 1, 2013 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 9. 9 What is Big Data? © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 10. 10 What is Big Data? https://www.pinterest.com/pin/30962316158410859/ © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 11. 11 How Much Data is Big? More data than you can process efficiently ISBN-13: 978-1118824825
  • 12. 12 Big Data Contains…. © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 13. 13 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 14.
  • 15. 15 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved “70 percent of US millennials say they would appreciate a brand or retailer using AI technology to show more interesting products. And 72 percent believe that as the technology develops, brands using AI will be able to accurately predict what they want.”
  • 16. 16 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved https://venturebeat.com/2016/09/26/how-a-i-is-helping-retailers/ “The future of retail technology lies in solutions that are powered by machine learning, which can provide fast and intelligent automation as well as dynamic scalability. Machine learning unleashes powerful self-adapting algorithms to uncover latent patterns of behavior that are difficult or impossible for decision-makers to discover on their own. “
  • 17. 17 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved “Extreme Personalization “…modern commerce continues to evolve from ‘what’s new’ to the ‘next-new’ player on the block. To compete, every company — brick-and-mortar, e-commerce, and modern commerce — needs to perpetually innovate on every front.” “Engagement, not reach: AI and machine learning is advancing engagement tools to scale cross-channel, personalized messaging in the moments that matter in the channel customers prefer.”
  • 18. 18 Big Data Means Integrating Lots of Sources Database CRM Flat Files IoT ETL © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 19. 19 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 20. 20 “The vast majority of the challenges companies struggle as they operationalize Big Data are related to people, not technology: issues like organizational alignment, business process and adoption, and change management.” https://hbr.org/2016/02/just-using-big-data-isnt-enough-anymore © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 21. 21 Big Data is…. © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 22. 22 Big Data is…. © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 23. 23 Big Data Contains…. © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 24. 24 Big Data Can Cause RAM Problems Rows Columns GB 250,000 100 0.19 250,000 1,000 1.87 1,000,000 1,000 7.48 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 25. 25 Big Data Can Cause RAM Problems Rows Columns GB 250,000 100 0.19 250,000 1,000 1.87 1,000,000 1,000 7.48 10,000,000 1,000 74.77 10,000,000 10,000 747.66 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 26. 26 Big Data Can Overwhelm -> Width • Adding features & interactions make big data bigger (worse computationally!) © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 27. 27 Big Data can Mislead 2X 8X © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 28. 29 The Answer is… © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 29. 30 Be Judicious © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 30. 31 Leverage Scalable Environments © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 31. 32 Teradata © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 32. 33 Amazon AWS © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 33. 34 Azure © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 34. 35 Google © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 35. 36 © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved http://www.kdnuggets.com/2015/08/big-data-question-hadoop-spark.html
  • 36. 37 Parallelize Record Operations © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 37. 38 Parallelize Column Operations © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 38. 39 Parallelize Building Predictive Models Themselves • The Target: Column Days to Next Purchase <= 7 days © Abbott Analytics and SmarterHQ, Inc., All Rights Reserved
  • 39. Parallelize Building Predictive Models Themselves
      • The Target(s): Columns
          – Suitable for same types of models for multiple target variables
          Days to Next Purchase <= 1 day
          Days to Next Purchase <= 3 days
          Days to Next Purchase <= 7 days
          Days to Next Purchase <= 15 days
          Days to Next Purchase <= 30 days
          Days to Next Purchase 30-60 days
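A minimal sketch of the multiple-target idea: derive one binary target per threshold, then fit the same kind of model for each target concurrently. Everything here is an assumption for illustration: the toy customer data, the single input feature, the hand-rolled logistic regression, and the thread pool. In pure Python a thread pool will not speed up CPU-bound fitting; a real deployment would hand each target to a separate process or cluster worker, but the shape of the job is the same. The 30-60 day target is left out to keep the sketch to simple thresholds.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (days_since_last_visit, days_to_next_purchase).
customers = [(1, 1), (2, 2), (3, 5), (5, 6), (8, 12),
             (10, 14), (15, 25), (20, 28), (30, 45), (40, 55)]

# Thresholds from the slide: "Days to Next Purchase <= N days".
thresholds = [1, 3, 7, 15, 30]


def fit_logistic(threshold, epochs=2000, learning_rate=0.05):
    """Fit a one-feature logistic regression by gradient descent for one target."""
    xs = [days_since / 40 for days_since, _ in customers]  # scale feature to (0, 1]
    ys = [1.0 if days_to_next <= threshold else 0.0 for _, days_to_next in customers]
    weight = bias = 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
            grad_w += (p - y) * x
            grad_b += p - y
        weight -= learning_rate * grad_w / len(xs)
        bias -= learning_rate * grad_b / len(xs)
    return threshold, weight, bias


# One independent model per target column, submitted as separate tasks.
with ThreadPoolExecutor() as pool:
    models = {t: (w, b) for t, w, b in pool.map(fit_logistic, thresholds)}

for threshold, (weight, bias) in models.items():
    print(f"<= {threshold:>2} days: weight={weight:+.3f}, bias={bias:+.3f}")
```

Because the targets share inputs but not parameters, the fits never need to talk to each other, which is what makes this kind of workload easy to spread across workers.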
  • 40. NY City Taxi Data
      • 5,199,911 observations
      • 19 variables
      • 1.05 GB
      Field Name: Description
      VendorID: A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
      tpep_pickup_datetime: The date and time when the meter was engaged.
      tpep_dropoff_datetime: The date and time when the meter was disengaged.
      Passenger_count: The number of passengers in the vehicle. This is a driver-entered value.
      Trip_distance: The elapsed trip distance in miles reported by the taximeter.
      Pickup_longitude: Longitude where the meter was engaged.
      Pickup_latitude: Latitude where the meter was engaged.
      RateCodeID: The final rate code in effect at the end of the trip. 1= Standard rate; 2= JFK; 3= Newark; 4= Nassau or Westchester; 5= Negotiated fare; 6= Group ride
      Store_and_fwd_flag: This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip; N= not a store and forward trip
      Dropoff_longitude: Longitude where the meter was disengaged.
      Dropoff_latitude: Latitude where the meter was disengaged.
      Payment_type: A numeric code signifying how the passenger paid for the trip. 1= Credit card; 2= Cash; 3= No charge; 4= Dispute; 5= Unknown; 6= Voided trip
      Fare_amount: The time-and-distance fare calculated by the meter.
      Extra: Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
      MTA_tax: $0.50 MTA tax that is automatically triggered based on the metered rate in use.
      Improvement_surcharge: $0.30 improvement surcharge assessed on trips at the flag drop.
      Tip_amount: Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
      Tolls_amount: Total amount of all tolls paid in trip.
      Total_amount: The total amount charged to passengers. Does not include cash tips.
Thanks to Joshua Adams for the Azure test results https://www.linkedin.com/in/joshuaadams3/
  • 41. Processing Results in Azure
        Cores     Algorithm      Rows    Features  Elapsed Time
        Single    Random Forest  25000   19        0:02:58
        Single    Random Forest  50000   19        0:07:08
        Single    Random Forest  100000  19        1:11:48
        Single    Random Forest  200000  19        1:43:05
        Single    Random Forest  400000  19        5:25:05
        Single    Random Forest  800000  19        19:25:50
        Multiple  Random Forest  25000   19        0:01:32
        Multiple  Random Forest  50000   19        0:03:47
        Multiple  Random Forest  100000  19        0:34:12
        Multiple  Random Forest  200000  19        0:57:16
        Multiple  Random Forest  400000  19        1:48:23
        Multiple  Random Forest  800000  19        3:48:03
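The elapsed times are easier to compare as ratios. The sketch below converts each H:MM:SS value to seconds and divides the single-core time by the multiple-core time for the same row count. The timings are copied from the table; the helper name and the reading of the ratio as a "speedup" are illustrative, and the deck does not say how many cores "Multiple" means.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS elapsed time to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (rows, single-core time, multiple-core time) copied from the results table.
timings = [
    (25_000, "0:02:58", "0:01:32"),
    (50_000, "0:07:08", "0:03:47"),
    (100_000, "1:11:48", "0:34:12"),
    (200_000, "1:43:05", "0:57:16"),
    (400_000, "5:25:05", "1:48:23"),
    (800_000, "19:25:50", "3:48:03"),
]

speedups = {rows: to_seconds(single) / to_seconds(multi) for rows, single, multi in timings}

for rows, speedup in speedups.items():
    print(f"{rows:>8,} rows: {speedup:.2f}x faster on multiple cores")
```

On these numbers the ratio stays near 2x up to 200,000 rows and rises to about 3x at 400,000 and about 5x at 800,000, so the benefit of parallelizing grows with the size of the job.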
  • 45. Algorithm searches we don’t always have “time” to do
      • K-Means Clustering
          – # clusters, optimal set of inputs
      • k-NN
          – k, find optimal set of inputs
      • Neural Networks
          – Architecture selection: # hidden layers and # neurons per hidden layer
          – Learning parameters (learning rate, momentum for backprop)
      • Logistic Regression
          – Factorial design / interaction effects
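As one concrete instance of these searches, here is a sketch of a search over k for k-NN in which each candidate k is scored as an independent task. The toy points, the leave-one-out scoring, and the thread pool are assumptions for illustration; with real data each candidate would go to its own process or cluster worker, since threads do not speed up CPU-bound pure Python.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: (feature_1, feature_2, label), two well-separated classes.
points = [(1.0, 1.1, 0), (1.2, 0.9, 0), (0.8, 1.0, 0), (1.1, 1.3, 0), (0.9, 0.7, 0),
          (3.0, 3.1, 1), (3.2, 2.9, 1), (2.8, 3.0, 1), (3.1, 3.3, 1), (2.9, 2.7, 1)]


def loo_accuracy(k: int) -> float:
    """Leave-one-out accuracy of a k-NN majority vote for one value of k."""
    correct = 0
    for i, (x1, x2, label) in enumerate(points):
        others = points[:i] + points[i + 1:]
        nearest = sorted(others, key=lambda p: (p[0] - x1) ** 2 + (p[1] - x2) ** 2)[:k]
        votes_for_one = sum(p[2] for p in nearest)
        predicted = 1 if votes_for_one * 2 > k else 0
        correct += predicted == label
    return correct / len(points)


candidate_ks = [1, 3, 5, 7, 9]

# Each k is scored independently, so the candidates can run side by side.
with ThreadPoolExecutor() as pool:
    scores = dict(zip(candidate_ks, pool.map(loo_accuracy, candidate_ks)))

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

The same pattern applies to the other searches on the slide: the number of clusters for K-Means, or a grid of hidden-layer sizes and learning rates for a neural network, can be enumerated up front and farmed out one candidate per worker.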
  • 46. Conclusions
      • The Good: Big data + AI is here and decision-makers care
      • The Bad: Big data is big, but not smart; requires company buy-in
      • The Ugly: Big data stresses infrastructure
      • One Solution: cloud computing and parallelization
  • 47. THANK YOU! SmarterHQ.com | @deanabb | dabbott@SmarterHQ.com