MRMW North America 2016 presentation: Kelly and Zanutto, NAXION
1. Cutting Big Data Down to Size
Michael Kelly, PhD
Elaine Zanutto, PhD
July, 2016
2. 1
Navigating a Big (Data) Universe
• Finding high value data
• Integrating diverse data sources
• Cleaning, organizing, tagging data
• Working with vast amounts of data
Bits of Big Data = Stars in the Universe
3. 2
Tying Down Big Data: Big Price Tag
DIY build: “Free and open source” but…
• Infrastructure: Compute, storage, networking
• Set-up costs, including design considerations for parallel architecture
• Talent: Developers, database engineers, data scientists
~$500K+ up front, $50K+ ongoing per month
Big Data in the Cloud
• Infrastructure: Scalable compute & storage resources
• Software: Database engine, analytics, Hadoop framework
• Connectivity: Networking investments
• Talent: Developers, database engineers, data scientists
$50K+ per month
4. 3
Taking a Sampling Approach to Big Data
In survey research, we don’t need a census to learn about a population – a small, representative sample will do nicely
[Figure: an HH (household) population and a sample drawn from it, alongside a Big Data population of bits and a small sample of those bits]
Likewise, we can apply sampling techniques to Big Data to learn accurate characteristics of the population quickly and cost effectively (a sampling sketch follows below)
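A minimal sketch of the simple random sampling step in R (the Editor's Notes mention the analyses were run with R and PostgreSQL); the `rides` data frame and its columns are hypothetical stand-ins for the full 2014 ride population, not names from the deck:

```r
# Draw a 6,000-ride simple random sample from the (hypothetical) 'rides'
# data frame holding the full 2014 population of ~163M trips.
set.seed(2014)                        # reproducible draw
n_pop    <- nrow(rides)               # population size
n_sample <- 6000                      # the deck's 6K sample (~0.004% of rides)
srs <- rides[sample.int(n_pop, n_sample), ]
```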
5. 4
Illustrate Using a Database of NYC Yellow Cab Taxi Rides
Information on hundreds of millions of NYC yellow cab rides per year since 2009
– Available here
– Various interesting analyses with the dataset (e.g., How long does it take to get to JFK?)
Using latitude and longitude information in the raw data, we mapped rides to NYC boroughs and neighborhoods
Type of Information Available
• Pickup & dropoff time
• Pickup & dropoff coordinates (latitude, longitude)
• Trip distance
• Number of passengers
• Fare amount
• Cash or credit card payment
• Tip amount (for credit card payments)
We also created new variables based on existing ones (see the sketch below)
– Such as trip duration based on pickup and dropoff time
– Tip percent on top of the base fare
And merged in external information (weather and the Dow Jones Industrial Average at the close of each business day) for modeling
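As a sketch, the derived variables might be built like this in R; the column names (`pickup_datetime`, `tip_amount`, etc.) are assumptions about the raw schema, not taken from the deck:

```r
# Parse the raw time stamps (assumed "YYYY-MM-DD HH:MM:SS" strings)
rides$pickup_dt  <- as.POSIXct(rides$pickup_datetime,  tz = "America/New_York")
rides$dropoff_dt <- as.POSIXct(rides$dropoff_datetime, tz = "America/New_York")

# Trip duration in minutes, from pickup and dropoff time
rides$trip_minutes <- as.numeric(difftime(rides$dropoff_dt,
                                          rides$pickup_dt, units = "mins"))

# Tip percent on top of the base fare (tips recorded for credit cards only)
rides$tip_pct <- 100 * rides$tip_amount / rides$fare_amount

# Time stamps decomposed into hour of day and day of week (per the notes)
rides$pickup_hour <- as.integer(format(rides$pickup_dt, "%H"))
rides$pickup_dow  <- weekdays(rides$pickup_dt)
```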
6. 5
Overview of What Follows
Demonstrate how a small random sample will agree well with measures based on the ride population
Discuss why we might want to stratify the sample, and the need for post-sampling weighting adjustments to align with the population
Move beyond discerning and describing patterns to predicting them
8. 7
2014 Taxi Dropoffs by Hour: 6K Random Sample vs. All 163M Rides
A mere 0.004% random sample aligns tightly with the Big Data population (see the comparison sketch below)
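A sketch of the comparison behind this slide, reusing the hypothetical `rides` and `srs` objects from the earlier sketches; the speaker notes report a correlation of .988 between the two hourly profiles:

```r
# Hour of dropoff, derived the same way as pickup_hour above
rides$dropoff_hour <- as.integer(format(rides$dropoff_dt, "%H"))
srs$dropoff_hour   <- as.integer(format(srs$dropoff_dt,   "%H"))

# Share of dropoffs per hour in the population and in the 6K sample
pop_by_hour <- prop.table(table(factor(rides$dropoff_hour, levels = 0:23)))
srs_by_hour <- prop.table(table(factor(srs$dropoff_hour,   levels = 0:23)))

# Correlation of the two 24-point hourly profiles (notes: r = .988)
cor(as.numeric(pop_by_hour), as.numeric(srs_by_hour))
```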
9. 8
From Simple to Stratified Random Sample
[Pie chart: 2014 dropoffs by borough – Manhattan 86.51%, Brooklyn 5.45%, Queens 5.07%, Bronx 0.53%, Staten Island 0.02%, Other* 2.43%]
Reflects population: More than 85% of the 163M taxi rides in 2014 dropped off in Manhattan
[Bar chart: dropoff counts by borough in the 6K simple random sample, on a 0–6,000 scale]
Manhattan dropoffs dominate our simple random sample; no sample at all from Staten Island
Stratify the random sample by borough (1,000 rides each in a new 6K sample; see the sketch below)
*Other: Dropoffs outside the five NYC boroughs (e.g., NJ)
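A sketch of the stratified draw in R, assuming the lat/long mapping produced a `borough` column (with "Other" for dropoffs outside the five boroughs, per the notes):

```r
# Stratified design: 1,000 randomly selected rides per borough,
# plus 1,000 "Other" rides, for a new 6K sample.
set.seed(2014)
strata <- c("Manhattan", "Brooklyn", "Queens", "Bronx",
            "Staten Island", "Other")
take_1000 <- function(b) {
  pool <- rides[rides$borough == b, ]
  pool[sample.int(nrow(pool), 1000), ]
}
strat_sample <- do.call(rbind, lapply(strata, take_1000))  # 6,000 rides total
```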
10. 9
2014 Taxi Dropoffs by Hour: Adding a Stratified Random Sample
• Alignment with the population not nearly as good as the simple random sample
• Can address by weighting the stratified sample based on dropoffs per borough (see the sketch below)
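A sketch of the weighting step, following the note that each ride is weighted in proportion to total dropoffs in its borough; `rides` and `strat_sample` are the hypothetical objects from the earlier sketches:

```r
# Weight = borough's share of all 2014 dropoffs / its share of the
# stratified sample (1/6 per borough by design).
pop_share  <- prop.table(table(rides$borough))         # e.g., Manhattan ~ .8651
samp_share <- prop.table(table(strat_sample$borough))  # ~ 1/6 per borough
strat_sample$weight <- as.numeric(pop_share[strat_sample$borough] /
                                  samp_share[strat_sample$borough])

# Weighted hourly profile of the stratified sample
# (notes: correlation with the population rises from .485 to .952)
strat_sample$dropoff_hour <- as.integer(format(strat_sample$dropoff_dt, "%H"))
wt_by_hour <- tapply(strat_sample$weight,
                     factor(strat_sample$dropoff_hour, levels = 0:23), sum)
wt_by_hour <- wt_by_hour / sum(strat_sample$weight)
```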
12. 11
Looking at Other Ride Metrics: Average Fare
Sample                                      Average Fare   Comparison to All 2014 Rides
All 163M 2014 rides (population)            $12.66         (baseline)
6K random sample                            $12.30         $12.30 / $12.66 = 0.97
6K stratified random sample (unweighted)    $27.83         $27.83 / $12.66 = 2.20
6K stratified random sample (weighted)      $12.50         $12.50 / $12.66 = 0.99
13. 12
Robustness of Big Data Sampling
[Chart: ratio of sample to population values across several ride metrics for the 6K random sample, the 6K stratified random sample (unweighted), and the 6K stratified random sample (weighted); 1 = perfect alignment between sample and population values]
14. 13
Fitting Big Data with Small Models
Move beyond discerning and describing patterns to predicting them
– Illustrate the value of ensemble modeling, in which averaging over a number of small models agrees closely with results from a single population model
– Just as we sample multiple respondents to infer population characteristics, so we can sample multiple models for greater accuracy
We’ll use two approaches to predict total fare amount from characteristics of the pickup (e.g., where, when); a sketch follows below
– First approach: Build a model on a population of rides (defined as 5 million rides in this example)
– Second approach: Build 100 models on 5,000 randomly selected rides each and average the results
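A sketch of the two approaches in R; `pop5m` and the predictor names (`pickup_hour_band`, `pickup_dow`) are hypothetical, since the deck does not give the exact model specification:

```r
set.seed(2016)

# First approach: a single model on the full 5M-ride population
pop_fit <- lm(fare_amount ~ borough + pickup_hour_band + pickup_dow,
              data = pop5m)

# Second approach: 100 small models on 5,000 randomly selected rides each
fits <- lapply(1:100, function(i) {
  s <- pop5m[sample.int(nrow(pop5m), 5000), ]
  lm(fare_amount ~ borough + pickup_hour_band + pickup_dow, data = s)
})

# Average the 100 coefficient vectors into one ensemble model
# (assumes every factor level appears in each 5,000-ride sample)
ens_coef <- rowMeans(sapply(fits, coef))
```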
15. 14
Pickup Characteristics that Predict Fare Amount in a Ride Population
Predicted fare for a Brooklyn pickup, between 12 and 6am, on a Friday: $14.82 (see the scoring sketch below)
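Scoring the slide's example ride, continuing the hypothetical model sketch above (the band labels are assumptions, not the deck's coding):

```r
# Score the example: Brooklyn pickup, between 12 and 6am, on a Friday.
# The deck reports a predicted fare of $14.82 for this profile.
new_ride <- data.frame(borough          = "Brooklyn",
                       pickup_hour_band = "12am-6am",
                       pickup_dow       = "Friday")
predict(pop_fit, newdata = new_ride)
```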
16. 15
A Sample of Smaller Models Aligns with a Single Population Model
• Correlation between population and sample models: .99997
• Average difference: 1.7%
17. 16
Conclusions
• Small samples deliver accurate insights about a Big Data population; appropriate weighting may be needed
• An ensemble of small models accurately predicts characteristics of a Big Data population
• Gain market insights faster and less expensively
Editor's Notes
IDC estimates that by 2020 the number of digital bits will be about the same as the number of stars in the universe
Finding high value data sometimes seems as difficult as the search for extraterrestrial life
Data janitor work (Just as the universe contains dark matter, so corporate warehouses collect more and more dark data)
Heard a talk where the speaker was frustrated about the time required to process all their data
Some might also be frustrated about the potential cost….
Excludes costs to clean up your data
Example costs - Source: https://www.mobomo.com/2014/2/big-data-on-small-budget/
10-node cluster with AWS Elastic Map/Reduce
– 10-node m2.4xlarge cluster: $16,435/month
– Need to budget ~1/5 of the above to cover network I/O and storage costs
– Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $88,000/month
DIY build using AWS EC2 instances
– 10-node m2.4xlarge cluster: $11,800/month
– Need to budget ~1/5 of the above to cover network I/O and storage costs
– Processing a petabyte worth of data requires a 21-node hs1.8xlarge cluster, which costs $70,000/month
“Early in the 20th Century, sampling for surveys was a radical idea. The notion that a thousand people selected from households throughout the United States could yield consistent and accurate estimates of characteristics of the entire population seemed to defy reason. Such sampling is now accepted as an essential cornerstone of the survey method.” D.A. Dillman
Data processing and analyses conducted with commodity PC and extra external storage using open source software PostgreSQL, PostGIS, and R
Like other types of Big Data, it needs to be processed in various ways to make it more useful
Time stamps decomposed into hour of day and day of week
Mapping latitude/longitude to particular NYC areas was the most time consuming step (in terms of code run-time)
Any variable to be used in sampling will need to be defined in the Big Data population
External information didn’t come in as a notable predictor – but it is easier to merge into a sample than into the entire Big Data population
First analyses will look at the distribution of taxi drop-offs by hour
Population defined as all yellow cab rides in 2014
0.8% of records were deleted because cash or credit payment information was not present
Ride rates are fairly constant across the workday after a rise during rush hour
Pattern essentially identical for all rides between 2009 and 2014
Correlation = .988; r-squared = .975 (i.e., patterns in the random sample account for 97.5% of the variance in the total ride population)
Random sample very tightly aligned with borough distribution in 2014 ride population (e.g., 86.55% of sample rides dropped off in Manhattan compared with 86.51% in population)
Motivation to stratify: You may want to do some analyses at the borough level
Stratified sample includes 1,000 randomly selected “Other” rides in which dropoff was not in one of the five NYC boroughs
1,000 per borough / segment provides excellent power (.97, .89) to detect a small effect size (d=.2) at p < .01 or p < .001; at n=500, for example, power drops to .72 and .45 respectively at p < .01 and p < .001
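These power figures can be reproduced with base R's power.t.test, assuming a two-sample t-test on a small effect (d = .2):

```r
# Power to detect d = .2 at the per-stratum sample sizes in this note
power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = 0.01)   # power ~ .97
power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = 0.001)  # power ~ .89
power.t.test(n = 500,  delta = 0.2, sd = 1, sig.level = 0.01)   # power ~ .72
power.t.test(n = 500,  delta = 0.2, sd = 1, sig.level = 0.001)  # power ~ .45
```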
Correlation with 2014 ride population drops from .988 in simple random sample to .485 in stratified random sample
Each ride weighted in proportion to total drop-offs in the relevant borough
Correlation with 2014 ride population rises from .485 in stratified unweighted random sample to .952 in stratified weighted random sample
Let’s go beyond dropoff hour to other metrics
Simple random sample consistently aligns closely with population scores on each metric (range: .899 to 1.003)
Stratified random sample (unweighted) generally does pretty well in matching total 2014 ride population. But when it’s off, it’s way off (range: .923 to 2.636)
Stratified weighted random sample consistently does as well as a simple random sample while allowing for borough level analysis if desired (range: .900 to 1.040)
The linear regression model accounts for 25% of the variance in fare, a reasonable fit
Bars show the amount to add to the starting fare (the intercept in the regression model) if the pickup has a particular characteristic
Of course, some use cases may require a census-type approach to Big Data sets
Sampling approach helps address reproducibility crisis in science