Cutting Big Data Down to Size
Michael Kelly, PhD
Elaine Zanutto, PhD
July, 2016
1
Navigating a Big (Data) Universe
• Finding High Value Data
• Integrating Diverse Data Sources
• Cleaning, Organizing, Tagging Data
• Working with Vast Amounts of Data
Bits of Big Data = Stars in Universe
2
Tying Down Big Data: Big Price Tag
“Free and open source” but…
• Infrastructure: Compute, storage, networking
• Set-up costs, including design considerations for parallel architecture
• Talent: Developers, database engineers, data scientists
~$500K+ up front, $50K+ ongoing per month
Big Data in the Cloud
• Infrastructure: Scalable compute & storage resources
• Software: Database engine, analytics, Hadoop framework
• Connectivity: Networking investments
• Talent: Developers, database engineers, data scientists
$50K+ per month
3
Taking a Sampling Approach to Big Data
In survey research, we don’t need a census to learn about a population: a small, representative sample will do nicely
[Figure: a household (HH) population with a sample drawn from it, next to a data population (rows of bits) with a sample drawn from it]
Likewise, we can apply sampling techniques to Big Data to learn accurate characteristics of the population quickly and cost-effectively
4
Illustrate Using Database of NYC Yellow Cab Taxi Rides
• Information on hundreds of millions of NYC yellow cab rides per year since 2009
  – Available here
  – Various interesting analyses with dataset (e.g., How long does it take to get to JFK?)
• Using latitude and longitude information in raw data, we mapped to NYC boroughs and neighborhoods
Type of Information Available
• Pickup & dropoff time
• Pickup & dropoff coordinates (latitude, longitude)
• Trip distance
• Number of passengers
• Fare amount
• Cash or credit card payment
• Tip amount (for credit card payments)
• We also created new variables based on existing ones
  – Such as trip duration based on pickup and dropoff time
  – Tip percent on top of base fare
• And merged in external information (weather and Dow Jones Industrial Average at close of each business day) for modeling
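A minimal sketch of the derived variables described above, in Python (standard library only). The field names (`pickup_time`, `dropoff_time`, `fare`, `tip`), the timestamp layout, and the example ride are assumptions made for illustration; they are not the column names or values of the raw taxi files.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def add_derived_fields(ride):
    """Return a copy of `ride` with duration, tip percent, hour and weekday added."""
    pickup = datetime.strptime(ride["pickup_time"], TIME_FORMAT)
    dropoff = datetime.strptime(ride["dropoff_time"], TIME_FORMAT)
    out = dict(ride)
    # Trip duration in minutes, from pickup and dropoff time
    out["duration_min"] = (dropoff - pickup).total_seconds() / 60
    # Tip percent on top of base fare; undefined when the fare is zero
    out["tip_pct"] = 100 * ride["tip"] / ride["fare"] if ride["fare"] > 0 else None
    # Time stamp decomposed into hour of day and day of week (Monday = 0)
    out["dropoff_hour"] = dropoff.hour
    out["dropoff_weekday"] = dropoff.weekday()
    return out


# Invented example ride
ride = {
    "pickup_time": "2014-03-07 08:15:00",
    "dropoff_time": "2014-03-07 08:33:30",
    "fare": 12.50,
    "tip": 2.50,
}
enriched = add_derived_fields(ride)
print(enriched)
```

Any variable used later for sampling or stratification would need to be derived on the full population first; variables used only for modeling can be derived on the sample.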
5
Overview of What Follows
• Demonstrate how a small random sample will agree well with measures based on the ride population
• Discuss why we might want to stratify the sample, and the need for post-sampling weighting adjustments to align with population
• Move beyond discerning and describing patterns to predicting them
6
2014 Taxi Dropoffs by Hour: All 163 Million Rides
7
2014 Taxi Dropoffs by Hour: 6K Random Sample vs. All 163M Rides
A mere 0.004% random sample aligns tightly with the Big Data population
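The comparison on this slide can be sketched as follows: draw a simple random sample, compute the share of dropoffs in each hour for both sample and population, and correlate the two profiles. The population here is synthetic (the hourly weights are invented and the population is far smaller than 163M rides), so only the mechanics carry over, not the numbers.

```python
import random
from collections import Counter


def hourly_shares(hours):
    """Share of rides in each hour of the day (0..23)."""
    counts = Counter(hours)
    return [counts.get(h, 0) / len(hours) for h in range(24)]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2014)

# INVENTED hourly weights, standing in for the real dropoff pattern
hour_weights = [3, 2, 1, 1, 1, 1, 2, 4, 5, 5, 5, 5,
                5, 5, 5, 5, 4, 5, 6, 7, 6, 6, 5, 4]
population = rng.choices(range(24), weights=hour_weights, k=500_000)

# Simple random sample without replacement
sample = rng.sample(population, 6_000)

r = pearson(hourly_shares(population), hourly_shares(sample))
print(f"correlation between sample and population profiles: {r:.3f}")

# Sampling fraction quoted on the slide: 6K out of 163M rides
fraction_pct = 100 * 6_000 / 163_000_000
print(f"sampling fraction: {fraction_pct:.4f}%")
```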
8
From Simple to Stratified Random Sample
[Pie chart: share of 2014 dropoffs by borough]
Manhattan: 86.51%
Brooklyn: 5.45%
Queens: 5.07%
Bronx: 0.53%
Staten Island: 0.02%
Other*: 2.43%
Reflects population: More than 85% of 163M taxi rides in 2014 dropped off in Manhattan
[Bar chart: number of rides per borough in the 6K simple random sample, axis 0 to 6000; bar values not recoverable]
Manhattan dropoffs dominate our simple random sample; no sample at all from Staten Island
Stratify random sample by borough (1,000 rides each in new 6K sample)
*Other: Dropoffs outside the five NYC boroughs (e.g., NJ)
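A minimal sketch of the move from a simple to a stratified random sample. The population is synthetic, generated from the borough shares shown on this slide, and it is much smaller than the real one, so the sketch draws 100 rides per stratum rather than the 1,000 used in the deck. The function name `stratified_sample` is hypothetical.

```python
import random
from collections import Counter, defaultdict

# Borough shares from the slide (they sum to 1.0001 because of rounding;
# random.choices normalizes the weights)
POP_SHARE = {
    "Manhattan": 0.8651, "Brooklyn": 0.0545, "Queens": 0.0507,
    "Bronx": 0.0053, "Staten Island": 0.0002, "Other": 0.0243,
}

rng = random.Random(8)
boroughs = list(POP_SHARE)
# SYNTHETIC population: one dropoff borough label per ride
population = rng.choices(boroughs, weights=[POP_SHARE[b] for b in boroughs],
                         k=1_000_000)


def stratified_sample(labels, per_stratum, rng):
    """Pick `per_stratum` ride indices at random within each stratum."""
    by_stratum = defaultdict(list)
    for idx, label in enumerate(labels):
        by_stratum[label].append(idx)
    return {
        label: rng.sample(idxs, min(per_stratum, len(idxs)))
        for label, idxs in by_stratum.items()
    }


# Simple random sample: dominated by Manhattan, almost nothing from Staten Island
simple = Counter(rng.sample(population, 600))
# Stratified sample: the same number of rides from every borough
stratified = {b: len(ix) for b, ix in stratified_sample(population, 100, rng).items()}

print("simple:    ", dict(simple))
print("stratified:", stratified)
```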
9
2014 Taxi Dropoffs by Hour: Adding a Stratified Random Sample
• Alignment with population not nearly as good as simple random sample
• Can address by weighting stratified sample based on dropoffs per borough
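One common way to do this weighting is post-stratification: each sampled ride gets a weight equal to its borough's share of the population divided by its borough's share of the sample. The sketch below assumes that approach (the deck only says rides are weighted in proportion to dropoffs per borough) and uses the borough shares and the 1,000-per-borough design from the slides. The function name is hypothetical.

```python
# Borough shares of 2014 dropoffs, from the slides
POP_SHARE = {
    "Manhattan": 0.8651, "Brooklyn": 0.0545, "Queens": 0.0507,
    "Bronx": 0.0053, "Staten Island": 0.0002, "Other": 0.0243,
}
PER_STRATUM = 1_000  # rides sampled per borough in the stratified 6K sample


def poststrat_weights(pop_share, sample_counts):
    """Weight per ride in each stratum: population share / sample share."""
    n = sum(sample_counts.values())
    total_share = sum(pop_share.values())  # 1.0001 here, so renormalize
    return {
        s: (pop_share[s] / total_share) / (sample_counts[s] / n)
        for s in sample_counts
    }


sample_counts = {b: PER_STRATUM for b in POP_SHARE}
weights = poststrat_weights(POP_SHARE, sample_counts)

for borough, w in weights.items():
    print(f"{borough:14s} weight per ride = {w:.4f}")
```

With these weights the weighted sample size still adds up to 6,000, but a Manhattan ride counts for a little more than five rides and a Staten Island ride for a small fraction of one.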
10
2014 Taxi Dropoffs by Hour: Impact of a Weighted Stratified Sample
11
Looking at Other Ride Metrics: Average Fare
Average fare, and comparison to all 2014 rides:
• All 163M 2014 Rides: $12.66
• 6K Random Sample: $12.30 ($12.30 / $12.66 = 0.97)
• 6K Stratified Random Sample (unweighted): $27.83 ($27.83 / $12.66 = 2.20)
• 6K Stratified Random Sample (weighted): $12.50 ($12.50 / $12.66 = 0.99)
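The sketch below shows why the unweighted stratified average overshoots and how weighting pulls it back. The per-borough mean fares are HYPOTHETICAL (the deck does not report them); only the borough shares and the three ratios come from the slides. With equal-sized strata, the unweighted average over rides equals the plain average of the stratum means.

```python
# Borough shares of 2014 dropoffs, from the slides
POP_SHARE = {
    "Manhattan": 0.8651, "Brooklyn": 0.0545, "Queens": 0.0507,
    "Bronx": 0.0053, "Staten Island": 0.0002, "Other": 0.0243,
}

# HYPOTHETICAL mean fare per dropoff area in an equal-allocation stratified sample
STRATUM_MEAN_FARE = {
    "Manhattan": 11.0, "Brooklyn": 18.0, "Queens": 30.0,
    "Bronx": 20.0, "Staten Island": 45.0, "Other": 60.0,
}


def unweighted_mean(stratum_means):
    """Average over rides when every stratum has the same number of rides."""
    return sum(stratum_means.values()) / len(stratum_means)


def weighted_mean(stratum_means, pop_share):
    """Average with each stratum weighted by its population share."""
    total = sum(pop_share[s] for s in stratum_means)
    return sum(stratum_means[s] * pop_share[s] for s in stratum_means) / total


def alignment(sample_value, population_value):
    """Ratio used on the slides; 1 means perfect alignment."""
    return sample_value / population_value


print(f"unweighted: {unweighted_mean(STRATUM_MEAN_FARE):.2f}")
print(f"weighted:   {weighted_mean(STRATUM_MEAN_FARE, POP_SHARE):.2f}")

# Ratios reported on the slide, recomputed from the quoted averages
for label, fare in [("random", 12.30), ("stratified unweighted", 27.83),
                    ("stratified weighted", 12.50)]:
    print(f"{label:22s} {alignment(fare, 12.66):.2f}")
```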
12
Robustness of Big Data Sampling
[Chart: ratio of sample value to population value across ride metrics, for three samples: 6K Random Sample, 6K Stratified Random Sample (Unweighted), 6K Stratified Random Sample (Weighted)]
1 = Perfect alignment between sample and population values
13
Fitting Big Data with Small Models
• Move beyond discerning and describing patterns to predicting them
  – Illustrate value of ensemble modeling in which averaging over a number of small models agrees closely with results from a single population model
  – Just as we sample multiple respondents to infer population characteristics, so we can sample multiple models for greater accuracy
• We’ll use two approaches to predict total fare amount from characteristics of the pickup (e.g., where, when)
  – First approach: Build a model on a population of rides (defined as 5 million rides in this example)
  – Second approach: Build 100 models on 5,000 randomly selected rides each and average the results
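A minimal sketch of the two approaches, using a deliberately tiny stand-in model: fare predicted from a single categorical pickup characteristic. For one categorical predictor, least-squares regression with dummy coding reduces to the group means, which keeps the example free of external libraries. The zones, mean fares and noise level are INVENTED, and the synthetic population has 200,000 rides rather than 5 million; the deck's actual model uses several pickup characteristics.

```python
import random
from collections import defaultdict


def fit_group_means(rides):
    """Fit fare ~ pickup zone; returns the predicted fare per zone."""
    sums, counts = defaultdict(float), defaultdict(int)
    for zone, fare in rides:
        sums[zone] += fare
        counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


rng = random.Random(100)

# INVENTED zones, true mean fares and zone frequencies
TRUE_MEAN = {"zone_a": 11.0, "zone_b": 15.0, "zone_c": 28.0}
ZONE_WEIGHT = [0.85, 0.10, 0.05]
zones = rng.choices(list(TRUE_MEAN), weights=ZONE_WEIGHT, k=200_000)
population = [(z, rng.gauss(TRUE_MEAN[z], 6.0)) for z in zones]

# First approach: one model on the whole ride population
population_model = fit_group_means(population)

# Second approach: 100 models on 5,000 randomly selected rides each, averaged
small_models = [fit_group_means(rng.sample(population, 5_000)) for _ in range(100)]
ensemble_model = {
    z: sum(m[z] for m in small_models) / len(small_models) for z in TRUE_MEAN
}

max_rel_diff = max(
    abs(ensemble_model[z] - population_model[z]) / population_model[z]
    for z in TRUE_MEAN
)
for z in TRUE_MEAN:
    print(f"{z}: population {population_model[z]:.2f}, ensemble {ensemble_model[z]:.2f}")
print(f"largest relative difference: {max_rel_diff:.2%}")
```

Each small model is noisy on its own, especially for the rare zone; averaging over the 100 models is what brings the ensemble close to the population model.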
14
Pickup Characteristics that Predict Fare Amount in a Ride Population
Predicted fare for:
• Brooklyn pickup
• Between 12 and 6am
• On Friday:
$14.82
15
A Sample of Smaller Models Aligns with a Single Population Model
• Correlation between population and sample models: .99997
• Average difference: 1.7%
16
Conclusions
• Gain market insights faster and less expensively
• Small samples deliver accurate insights about a Big Data population
• Appropriate weighting may be needed
• An ensemble of small models accurately predicts characteristics of a Big Data population

MRMW N America 2016 presentation kelly and zanutto naxion


Editor's Notes

  1. IDC estimates that by 2020 the number of digital bits will be about the same as the number of stars in the universe. Finding high value data sometimes seems as difficult as the search for extraterrestrial life. Data janitor work: just as the universe contains dark matter, so corporate warehouses collect more and more dark data. Heard a talk where the speaker was frustrated about the time required to process all their data. Some might also be frustrated about the potential cost….
  2. Excludes costs to clean up your data. Example costs (source: https://www.mobomo.com/2014/2/big-data-on-small-budget/):
     – 10-node cluster with AWS Elastic MapReduce: 10-node m2.4xlarge cluster costs $16,435/month. Need to budget ~1/5 of above to cover network I/O and storage costs. Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $88,000/month.
     – DIY build using AWS EC2 instances: 10-node m2.4xlarge cluster costs $11,800/month. Need to budget ~1/5 of above to cover network I/O and storage costs. Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $70,000/month.
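As a quick arithmetic check on the 10-node figures in this note, the sketch below adds the roughly one-fifth allowance for network I/O and storage to each quoted cluster price. The function name and the exact 0.2 fraction are illustrative assumptions; the note only says "~1/5".

```python
def monthly_total(cluster_cost, overhead_fraction=0.2):
    """Cluster cost plus the ~1/5 allowance for network I/O and storage."""
    return cluster_cost * (1 + overhead_fraction)


# 10-node m2.4xlarge prices quoted in the note (USD per month)
emr_total = monthly_total(16_435)   # AWS Elastic MapReduce
ec2_total = monthly_total(11_800)   # DIY build on EC2 instances

print(f"Elastic MapReduce, 10 nodes: about ${emr_total:,.0f}/month")
print(f"DIY on EC2, 10 nodes:        about ${ec2_total:,.0f}/month")
```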
  3. “Early in the 20th Century, sampling for surveys was a radical idea. The notion that a thousand people selected from households throughout the United States could yield consistent and accurate estimates of characteristics of the entire population seemed to defy reason. Such sampling is now accepted as an essential cornerstone of the survey method.” D.A. Dillman
  4. Data processing and analyses conducted with a commodity PC and extra external storage, using open source software: PostgreSQL, PostGIS, and R. Like other types of Big Data, need to process it in various ways to make it more useful. Time stamps decomposed into hour of day and day of week. Mapping latitude/longitude to particular NYC areas was the most time-consuming step (in terms of code run-time). Any variable to be used in sampling will need to be defined in the Big Data population. External information didn’t come in as notable predictors, but it is easier to merge into a sample than into the entire Big Data population.
  5. First analyses will look at the distribution of taxi drop-offs by hour. Population defined as all yellow cab rides in 2014. 0.8% of records deleted because cash or credit payment information was not present. Ride rates pretty constant across the workday after a rise during rush hour. Pattern essentially identical for all rides between 2009 and 2014.
  6. Correlation = .988; r-squared = .975 (i.e., patterns in random sample account for 97.5% of variance in total ride population)
  7. Random sample very tightly aligned with borough distribution in 2014 ride population (e.g., 86.55% of sample rides dropped off in Manhattan compared with 86.51% in population). Motivation to stratify: You may want to do some analyses at the borough level. Stratified sample includes 1,000 randomly selected “Other” rides in which dropoff was not in one of the five NYC boroughs. 1,000 per borough / segment provides excellent power (.97, .89) to detect a small effect size (d=.2) at p < .01 or p < .001; at n=500, for example, power drops to .72 and .45 respectively at p < .01 and p < .001.
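The power figures in this note can be approximated with a short calculation. The sketch assumes a two-sided comparison of two groups with n rides each and uses a normal approximation; the note does not say which test was used, so that is an assumption, and the results land close to the quoted values rather than matching them digit for digit.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sample(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test for effect size d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the test statistic
    # Probability of landing in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


for n in (1_000, 500):
    for alpha in (0.01, 0.001):
        power = power_two_sample(0.2, n, alpha)
        print(f"n = {n:5d}, alpha = {alpha}: power is about {power:.2f}")
```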
  8. Correlation with 2014 ride population drops from .988 in simple random sample to .485 in stratified random sample. Each ride weighted in proportion to total drop-offs in the relevant borough.
  9. Correlation with 2014 ride population rises from .485 in stratified unweighted random sample to .952 in stratified weighted random sample
  10. Let’s go beyond dropoff hour to other metrics
  11. Simple random sample consistently aligns closely with population scores on each metric (range: .899 to 1.003). Stratified random sample (unweighted) generally does pretty well in matching total 2014 ride population. But when it’s off, it’s way off (range: .923 to 2.636). Stratified weighted random sample consistently does as well as a simple random sample while allowing for borough-level analysis if desired (range: .900 to 1.040).
  12. Linear regression model accounts for 25% of variance in fare, which is a reasonable model. Bars show amount to add to starting fare (intercept in the regression model) if pickup has a particular characteristic.
  13. Of course, some use cases may require a census-type approach to Big Data sets. Sampling approach helps address reproducibility crisis in science.