Big data hubris is the belief that big data sets are a substitute for, rather than a supplement to, traditional data collection and analysis. Thomas will share how his understanding of big data has evolved during his economics research and why 273 million Uber trips is not considered “big” enough. He will also discuss lessons learned from the rise and fall of Google Flu Trends.
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
Big Data Hubris: Limitations in Aggregating Uber and Google Data
1. Big Data Hubris
Limitations in Aggregating Uber and Google Data
Thomas J. Weinandy
Ph.D. Candidate in Applied Economics
Western Michigan University
March 17, 2020
Data Science and Analytics West Michigan
West Michigan R Users Group
Thomas J. Weinandy Big Data Hubris March 17, 2020 1 / 33
2. Overview
Slides available at https://www.slideshare.net/TomWeinandy
R code available at https://github.com/tomweinandy
Part I: The Parable of Google Flu Trends
Part II: The impact of fuel prices on Uber and taxi trips
Thomas J. Weinandy Big Data Hubris March 17, 2020 2 / 33
3. The Gartner Hype Cycle
Where is Big Data at?
https://en.wikipedia.org/wiki/Hype_cycle
Thomas J. Weinandy Big Data Hubris March 17, 2020 3 / 33
4. From Big Data to Data Science
August 23, 2019
Thomas J. Weinandy Big Data Hubris March 17, 2020 4 / 33
5. Stage 1: Technology Trigger
The Problem
CDC collects national health data on seasonal influenza cases.
Publishes data at the weekly level with a 1-2 week reporting lag.
The Solution
Google Flu Trends was launched in 2009 based on search engine
data.
It was modeled with 45 queries selected to have the best fit with
CDC reports.
Provided daily estimates for the US with only a one-day lag
(Ginsberg et al. 2009)
Thomas J. Weinandy Big Data Hubris March 17, 2020 5 / 33
6. Stage 2: Peak of Inflated Expectations
“Harnessing the collective intelligence of millions of users, Google
web search logs can provide one of the most timely, broad-reaching
influenza monitoring systems available today.” (Ginsberg et al.
2009, Nature p. 1014)
It became the quintessential example of how Big Data can solve
problems.
It was new, relatable, intuitive and brand-name.
By 2013, Google Flu Trends was rolled out to 29 countries and
expanded to include dengue fever (Butler 2013).
Thomas J. Weinandy Big Data Hubris March 17, 2020 6 / 33
7. Stage 3: Trough of Disillusionment
Lazer et al. (2014)
The 0 line is the number of influenza-like illnesses from the CDC.
Each line is the percent error from the baseline for three models.
Not robust to anomalous flu seasons.
Thomas J. Weinandy Big Data Hubris March 17, 2020 7 / 33
8. Stage 3: Trough of Disillusionment
Lazer et al. (2014)
Google Flu Trends (orange) does not fully correct for seasonality,
shown by the wave pattern of the error.
The error is temporally related (autocorrelation).
Google Flu Trends from 8/21/11 to 9/1/13 systematically
overestimated cases, sometime by more than 100%.
Thomas J. Weinandy Big Data Hubris March 17, 2020 8 / 33
9. Stage 4: Slope of Enlightenment
What went wrong?
1. Complexity & dynamism
Google Search is a complicated, always-changing algorithm
2. Overfitting big data to small data
Google data: 50 million search terms at the daily level
CDC data: 1152 influenza-like illness reports at the weekly level
Thomas J. Weinandy Big Data Hubris March 17, 2020 9 / 33
10. Stage 5: Plateau of Productivity
Lazer et al. (2014)
Out-of-sample mean absolute error:
0.463 for Google Flu model
0.311 for Lagged CDC model
0.232 for combined Google Flu + Lagged CDC model
Thomas J. Weinandy Big Data Hubris March 17, 2020 10 / 33
11. The Parable of Google Flu Trends
Big data hubris is the belief that big data sets are a substitute for,
rather than a supplement to, traditional data collection and
analysis.
Big data is not a silver bullet.
Aggregating big data to match small data is inherently destructive.
Thomas J. Weinandy Big Data Hubris March 17, 2020 11 / 33
12. How this relates to my dissertation
Intended Research Questions
How does fuel price impact the quantity of car service trips?
Will taxi and Uber drivers react differently?
Unintended Research Questions
Is data still “big” once I aggregate it?
Thomas J. Weinandy Big Data Hubris March 17, 2020 12 / 33
13. Types of Car Services in New York City (NYC)
Thomas J. Weinandy Big Data Hubris March 17, 2020 13 / 33
14. Taxi Background
Taxis are regulated by the New York City Taxi and Limousine
Commission (TLC) that sets the price schedule and determines a
fixed supply through medallions.
The medallion system makes licensed taxis very expensive, so most
cab drivers must lease a taxi for regularly scheduled 12-hour shifts
(Bagchi 2018).
Drivers decide when to end their shift and tend to work longer as
expected income goes up (Farber 2015, Chen et al. 2017).
Taxi drivers keep all revenue but pay for their lease and their own
gasoline.
Thomas J. Weinandy Big Data Hubris March 17, 2020 14 / 33
15. TNC Background
TNCs are not subject to taxi regulation since they pick up
customers through an app, not from a street hail.
TNCs have a flexible supply, flexible price model that adjusts in
real time according to relative supply and demand.
84% of surveyed Uber drivers chose the job “to have more
flexibility in my schedule and balance my work with my life and
family” (Hall & Krueger 2018, p713).
TNC drivers pay about 25% of their revenue to the platform while
also paying for all vehicle costs, including gasoline.
Thomas J. Weinandy Big Data Hubris March 17, 2020 15 / 33
16. Estimated Model
quantity of car service trips = f(fuel price,
previous supply & demand, economic activity,
weather, traffic flow, supply & demand patterns)
ln(tripst) = β0 + β1ln(fuel pricet) + β2ln(tripst-1)
+ β3ln(stock activityt) + β4temperaturet + β5raint
+ β7ln(collisionst) + β8ln(collisionst) ∗ raint
+ β9ln(collisionst) ∗ snowt + β10holidayt + β11trend
+ γ day-of-weekt + φ week-of-yeart + t
Thomas J. Weinandy Big Data Hubris March 17, 2020 16 / 33
18. Big Data
Collected over 726 million ride-level observations in NYC from
https://opendata.cityofnewyork.us
However, fuel prices are at the day level with 1002 observations.
Thomas J. Weinandy Big Data Hubris March 17, 2020 18 / 33
19. Car Service Trips
Daily Taxi and TNC Trips
Thomas J. Weinandy Big Data Hubris March 17, 2020 19 / 33
20. Fuel Prices
Price of New York Harbor Gasoline
Thomas J. Weinandy Big Data Hubris March 17, 2020 20 / 33
21. Estimation Method
Estimate separately for TNCs and taxis
Include multiple specifications for robustness
With and without a lagged dependent variable
With and without a measure of economic activity
Evaluate using OLS regression and SUR estimation
I do not include the competing car service in the model since that
would introduce endogeneity.
Seemingly Unrelated Regression (SUR) estimation corrects for the
contemporaneous correlation of residuals between the TNCs and
taxis by estimating them together (Zellner 1962).
Thomas J. Weinandy Big Data Hubris March 17, 2020 21 / 33
26. Marginal Effects: Methodology
Consider the following parsimonious equation
yt = αxt + βyt-1 + γ Γt
where yt is the number of car service trips, xt is fuel price, and Γt is a
vector of all other variables.
Due to the lagged dependent variable yt-1, the impact of fuel prices on
the number of trips will persist for all future periods.
Thus, the cumulative marginal effect of a one-day change in gas prices
is
∞
i=0
∂yt+i
∂xt
=
α
1 − β
Thomas J. Weinandy Big Data Hubris March 17, 2020 26 / 33
27. Marginal Effects: Methodology
∞
i=0
∂yt+i
∂xt
=
α
1 − β
To calculate the standard error around the above marginal effect, I
follow De Boef & Keele (2008) to estimate the variance using
V ar
a
b
=
a2
b2
V ar (a)
a2
+
V ar (b)
b2
−
2Cov (a, b)
ab
where a = α and b = 1 − β. This gives me the estimator and its
statistical significance.
Thomas J. Weinandy Big Data Hubris March 17, 2020 27 / 33
30. Summary of Results
1 A one-day 1% increase in fuel prices is associated with a 0.367%
decrease in TNC trips and a 0.088% increase in taxi trips.
2 On a typical day when fuel prices change 1.729%, there will be an
average change of 2,190 TNC trips and 576 taxi trips in New York
City alone.
3 TNC drivers are very flexible and can reduce their supply when an
input cost like gasoline increases.
4 Due to heavy taxi regulation, cab drivers are stuck paying the
higher fuel price but benefit when competing TNC drivers exit the
market.
Thomas J. Weinandy Big Data Hubris March 17, 2020 30 / 33
32. Works Cited
Bagchi, Sutirtha. “A Tale of Two Cities: An Examination of Medallion Prices in New York and
Chicago.” Review of Industrial Organization 53.2 (2018): 295-319.
Butler, Declan. ”When Google got flu wrong: US outbreak foxes a leading web-based method for
tracking seasonal flu.” Nature 494.7436 (2013): 155-157.
Chen, Minzhang, Judith A. Chevalier, Peter E. Rossi and Emily Oehlsen. “The Value of Flexible
Work: Evidence from Uber Drivers.” NBER Working Paper No. 23296 (March 2017).
De Boef, Suzanna, and Luke Keele. ”Taking time seriously.” American Journal of Political Science
52.1 (2008): 184-200.
Farber, Henry S. “Why you Can’t Find a Taxi in the Rain and Other Labor Supply Lessons from
Cab Drivers,” The Quarterly Journal of Economics, Volume 130, Issue 4, (November 1, 2015).
1975–2026.
Ginsberg, Jeremy, et al. ”Detecting influenza epidemics using search engine query data.” Nature
457.7232 (2009): 1012-1014.
Hall, Jonathan V., and Alan B. Krueger. “An analysis of the labor market for Uber’s
driver-partners in the United States.” ILR Review 71.3 (2018). 705-732.
Lazer, David, et al. ”The parable of Google Flu: traps in big data analysis.” Science 343.6176
(2014): 1203-1205.
Zellner, Arnold. “An Efficient Method of Estimating Seemingly Unrelated Regression Equations
and Tests of Aggregation Bias,” Journal of the American Statistical Association, 57. (1962). 500-509.
Thomas J. Weinandy Big Data Hubris March 17, 2020 32 / 33