MRMW North America 2016 presentation: Kelly and Zanutto, NAXION
1. Cutting Big Data Down to Size
Michael Kelly, PhD
Elaine Zanutto, PhD
July, 2016
2. 1
Navigating a Big (Data) Universe
• Finding high value data
• Integrating diverse data sources
• Cleaning, organizing, tagging data
• Working with vast amounts of data
Bits of Big Data = Stars in the Universe
3. 2
Tying Down Big Data: Big Price Tag
DIY build: “Free and open source” but…
• Infrastructure: Compute, storage, networking
• Set-up costs, including design considerations for parallel architecture
• Talent: Developers, database engineers, data scientists
~$500K+ up front, $50K+ ongoing per month
Big Data in the Cloud
• Infrastructure: Scalable compute & storage resources
• Software: Database engine, analytics, Hadoop framework
• Connectivity: Networking investments
• Talent: Developers, database engineers, data scientists
$50K+ per month
4. 3
Taking a Sampling Approach to Big Data
In survey research, we don’t need a census to learn about a population – a small, representative sample will do nicely
[Figure: an HH (household) population and a sample drawn from it, alongside a Big Data population of bits and a small sample of those bits]
Likewise, we can apply sampling techniques to Big Data to learn accurate characteristics of the population quickly and cost effectively (a sampling sketch follows below)
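A minimal sketch of the simple random sampling step in R (the Editor's Notes mention the analyses were run with R and PostgreSQL); the `rides` data frame and its columns are hypothetical stand-ins for the full 2014 ride population, not names from the deck:

```r
# Draw a 6,000-ride simple random sample from the (hypothetical) 'rides'
# data frame holding the full 2014 population of ~163M trips.
set.seed(2014)                        # reproducible draw
n_pop    <- nrow(rides)               # population size
n_sample <- 6000                      # the deck's 6K sample (~0.004% of rides)
srs <- rides[sample.int(n_pop, n_sample), ]
```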
5. 4
Illustrate Using a Database of NYC Yellow Cab Taxi Rides
Information on hundreds of millions of NYC yellow cab rides per year since 2009
– Available here
– Various interesting analyses with the dataset (e.g., How long does it take to get to JFK?)
Using latitude and longitude information in the raw data, we mapped rides to NYC boroughs and neighborhoods
Type of Information Available
• Pickup & dropoff time
• Pickup & dropoff coordinates (latitude, longitude)
• Trip distance
• Number of passengers
• Fare amount
• Cash or credit card payment
• Tip amount (for credit card payments)
We also created new variables based on existing ones (see the sketch below)
– Such as trip duration based on pickup and dropoff time
– Tip percent on top of the base fare
And merged in external information (weather and the Dow Jones Industrial Average at the close of each business day) for modeling
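As a sketch, the derived variables might be built like this in R; the column names (`pickup_datetime`, `tip_amount`, etc.) are assumptions about the raw schema, not taken from the deck:

```r
# Parse the raw time stamps (assumed "YYYY-MM-DD HH:MM:SS" strings)
rides$pickup_dt  <- as.POSIXct(rides$pickup_datetime,  tz = "America/New_York")
rides$dropoff_dt <- as.POSIXct(rides$dropoff_datetime, tz = "America/New_York")

# Trip duration in minutes, from pickup and dropoff time
rides$trip_minutes <- as.numeric(difftime(rides$dropoff_dt,
                                          rides$pickup_dt, units = "mins"))

# Tip percent on top of the base fare (tips recorded for credit cards only)
rides$tip_pct <- 100 * rides$tip_amount / rides$fare_amount

# Time stamps decomposed into hour of day and day of week (per the notes)
rides$pickup_hour <- as.integer(format(rides$pickup_dt, "%H"))
rides$pickup_dow  <- weekdays(rides$pickup_dt)
```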
6. 5
Overview of What Follows
Demonstrate how a small random sample will agree well with measures based on the ride population
Discuss why we might want to stratify the sample, and the need for post-sampling weighting adjustments to align with the population
Move beyond discerning and describing patterns to predicting them
8. 7
2014 Taxi Dropoffs by Hour: 6K Random Sample vs. All 163M Rides
A mere 0.004% random sample aligns tightly with the Big Data population (see the comparison sketch below)
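A sketch of the comparison behind this slide, reusing the hypothetical `rides` and `srs` objects from the earlier sketches; the speaker notes report a correlation of .988 between the two hourly profiles:

```r
# Hour of dropoff, derived the same way as pickup_hour above
rides$dropoff_hour <- as.integer(format(rides$dropoff_dt, "%H"))
srs$dropoff_hour   <- as.integer(format(srs$dropoff_dt,   "%H"))

# Share of dropoffs per hour in the population and in the 6K sample
pop_by_hour <- prop.table(table(factor(rides$dropoff_hour, levels = 0:23)))
srs_by_hour <- prop.table(table(factor(srs$dropoff_hour,   levels = 0:23)))

# Correlation of the two 24-point hourly profiles (notes: r = .988)
cor(as.numeric(pop_by_hour), as.numeric(srs_by_hour))
```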
9. 8
From Simple to Stratified Random Sample
[Pie chart: 2014 dropoffs by borough – Manhattan 86.51%, Brooklyn 5.45%, Queens 5.07%, Bronx 0.53%, Staten Island 0.02%, Other* 2.43%]
Reflects population: More than 85% of the 163M taxi rides in 2014 dropped off in Manhattan
[Bar chart: dropoff counts by borough in the 6K simple random sample, on a 0–6,000 scale]
Manhattan dropoffs dominate our simple random sample; no sample at all from Staten Island
Stratify the random sample by borough (1,000 rides each in a new 6K sample; see the sketch below)
*Other: Dropoffs outside the five NYC boroughs (e.g., NJ)
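A sketch of the stratified draw in R, assuming the lat/long mapping produced a `borough` column (with "Other" for dropoffs outside the five boroughs, per the notes):

```r
# Stratified design: 1,000 randomly selected rides per borough,
# plus 1,000 "Other" rides, for a new 6K sample.
set.seed(2014)
strata <- c("Manhattan", "Brooklyn", "Queens", "Bronx",
            "Staten Island", "Other")
take_1000 <- function(b) {
  pool <- rides[rides$borough == b, ]
  pool[sample.int(nrow(pool), 1000), ]
}
strat_sample <- do.call(rbind, lapply(strata, take_1000))  # 6,000 rides total
```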
10. 9
2014 Taxi Dropoffs by Hour: Adding a Stratified Random Sample
• Alignment with the population not nearly as good as the simple random sample
• Can address by weighting the stratified sample based on dropoffs per borough (see the sketch below)
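A sketch of the weighting step, following the note that each ride is weighted in proportion to total dropoffs in its borough; `rides` and `strat_sample` are the hypothetical objects from the earlier sketches:

```r
# Weight = borough's share of all 2014 dropoffs / its share of the
# stratified sample (1/6 per borough by design).
pop_share  <- prop.table(table(rides$borough))         # e.g., Manhattan ~ .8651
samp_share <- prop.table(table(strat_sample$borough))  # ~ 1/6 per borough
strat_sample$weight <- as.numeric(pop_share[strat_sample$borough] /
                                  samp_share[strat_sample$borough])

# Weighted hourly profile of the stratified sample
# (notes: correlation with the population rises from .485 to .952)
strat_sample$dropoff_hour <- as.integer(format(strat_sample$dropoff_dt, "%H"))
wt_by_hour <- tapply(strat_sample$weight,
                     factor(strat_sample$dropoff_hour, levels = 0:23), sum)
wt_by_hour <- wt_by_hour / sum(strat_sample$weight)
```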
12. 11
Looking at Other Ride Metrics: Average Fare
Sample                                      Average Fare   Comparison to All 2014 Rides
All 163M 2014 rides (population)            $12.66         (baseline)
6K random sample                            $12.30         $12.30 / $12.66 = 0.97
6K stratified random sample (unweighted)    $27.83         $27.83 / $12.66 = 2.20
6K stratified random sample (weighted)      $12.50         $12.50 / $12.66 = 0.99
13. 12
Robustness of Big Data Sampling
[Chart: ratio of sample to population values across several ride metrics for the 6K random sample, the 6K stratified random sample (unweighted), and the 6K stratified random sample (weighted); 1 = perfect alignment between sample and population values]
14. 13
Fitting Big Data with Small Models
Move beyond discerning and describing patterns to predicting them
– Illustrate the value of ensemble modeling, in which averaging over a number of small models agrees closely with results from a single population model
– Just as we sample multiple respondents to infer population characteristics, so we can sample multiple models for greater accuracy
We’ll use two approaches to predict total fare amount from characteristics of the pickup (e.g., where, when); a sketch follows below
– First approach: Build a model on a population of rides (defined as 5 million rides in this example)
– Second approach: Build 100 models on 5,000 randomly selected rides each and average the results
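A sketch of the two approaches in R; `pop5m` and the predictor names (`pickup_hour_band`, `pickup_dow`) are hypothetical, since the deck does not give the exact model specification:

```r
set.seed(2016)

# First approach: a single model on the full 5M-ride population
pop_fit <- lm(fare_amount ~ borough + pickup_hour_band + pickup_dow,
              data = pop5m)

# Second approach: 100 small models on 5,000 randomly selected rides each
fits <- lapply(1:100, function(i) {
  s <- pop5m[sample.int(nrow(pop5m), 5000), ]
  lm(fare_amount ~ borough + pickup_hour_band + pickup_dow, data = s)
})

# Average the 100 coefficient vectors into one ensemble model
# (assumes every factor level appears in each 5,000-ride sample)
ens_coef <- rowMeans(sapply(fits, coef))
```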
15. 14
Pickup Characteristics that Predict Fare Amount in a Ride Population
Predicted fare for a Brooklyn pickup, between 12 and 6am, on a Friday: $14.82 (see the scoring sketch below)
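Scoring the slide's example ride, continuing the hypothetical model sketch above (the band labels are assumptions, not the deck's coding):

```r
# Score the example: Brooklyn pickup, between 12 and 6am, on a Friday.
# The deck reports a predicted fare of $14.82 for this profile.
new_ride <- data.frame(borough          = "Brooklyn",
                       pickup_hour_band = "12am-6am",
                       pickup_dow       = "Friday")
predict(pop_fit, newdata = new_ride)
```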
16. 15
A Sample of Smaller Models Aligns with a Single Population Model
• Correlation between population and sample models: .99997
• Average difference: 1.7%
17. 16
Conclusions
• Small samples deliver accurate insights about a Big Data population; appropriate weighting may be needed
• An ensemble of small models accurately predicts characteristics of a Big Data population
• Gain market insights faster and less expensively
Editor's Notes
IDC estimates that by 2020 the number of digital bits will be about the same as the number of stars in the universe
Finding high value data sometimes seems as difficult as the search for extraterrestrial life
Data janitor work (Just as the universe contains dark matter, so corporate warehouses collect more and more dark data)
Heard a talk where the speaker was frustrated about the time required to process all their data
Some might also be frustrated about the potential cost….
Excludes costs to clean up your data
Example costs - Source: https://www.mobomo.com/2014/2/big-data-on-small-budget/
10-node cluster with AWS Elastic Map/Reduce
– 10-node m2.4xlarge cluster: $16,435/month
– Need to budget ~1/5 of the above to cover network I/O and storage costs
– Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $88,000/month
DIY build using AWS EC2 instances
– 10-node m2.4xlarge cluster: $11,800/month
– Need to budget ~1/5 of the above to cover network I/O and storage costs
– Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $70,000/month
“Early in the 20th Century, sampling for surveys was a radical idea. The notion that a thousand people selected from households throughout the United States could yield consistent and accurate estimates of characteristics of the entire population seemed to defy reason. Such sampling is now accepted as an essential cornerstone of the survey method.” D.A. Dillman
Data processing and analyses conducted with commodity PC and extra external storage using open source software PostgreSQL, PostGIS, and R
Like other types of Big Data, it needs to be processed in various ways to make it more useful
Time stamps decomposed into hour of day and day of week
Mapping latitude/longitude to particular NYC areas was the most time consuming step (in terms of code run-time)
Any variable to be used in sampling will need to be defined in the Big Data population
External information didn’t come in as a notable predictor – but it is easier to merge into a sample than into the entire Big Data population
First analyses will look at the distribution of taxi drop-offs by hour
Population defined as all yellow cab rides in 2014
0.8% of records were deleted because cash or credit payment information was not present
Ride rates are fairly constant across the workday after a rise during rush hour
Pattern essentially identical for all rides between 2009 and 2014
Correlation = .988; r-squared = .975 (i.e., patterns in the random sample account for 97.5% of the variance in the total ride population)
Random sample very tightly aligned with borough distribution in 2014 ride population (e.g., 86.55% of sample rides dropped off in Manhattan compared with 86.51% in population)
Motivation to stratify: You may want to do some analyses at the borough level
Stratified sample includes 1,000 randomly selected “Other” rides in which dropoff was not in one of the five NYC boroughs
1,000 per borough / segment provides excellent power (.97, .89) to detect a small effect size (d=.2) at p < .01 or p < .001; at n=500, for example, power drops to .72 and .45 respectively at p < .01 and p < .001
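These power figures can be reproduced with base R's power.t.test, assuming a two-sample t-test on a small effect (d = .2):

```r
# Power to detect d = .2 at the per-stratum sample sizes in this note
power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = 0.01)   # power ~ .97
power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = 0.001)  # power ~ .89
power.t.test(n = 500,  delta = 0.2, sd = 1, sig.level = 0.01)   # power ~ .72
power.t.test(n = 500,  delta = 0.2, sd = 1, sig.level = 0.001)  # power ~ .45
```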
Correlation with 2014 ride population drops from .988 in simple random sample to .485 in stratified random sample
Each ride weighted in proportion to total drop-offs in the relevant borough
Correlation with 2014 ride population rises from .485 in stratified unweighted random sample to .952 in stratified weighted random sample
Let’s go beyond dropoff hour to other metrics
Simple random sample consistently aligns closely with population scores on each metric (range: .899 to 1.003)
Stratified random sample (unweighted) generally does pretty well in matching total 2014 ride population. But when it’s off, it’s way off (range: .923 to 2.636)
Stratified weighted random sample consistently does as well as a simple random sample while allowing for borough level analysis if desired (range: .900 to 1.040)
The linear regression model accounts for 25% of the variance in fare, a reasonable fit
Bars show the amount to add to the starting fare (the intercept in the regression model) if the pickup has a particular characteristic
Of course, some use cases may require a census-type approach to Big Data sets
Sampling approach helps address reproducibility crisis in science