Inference with big data: SCECR 2012 Presentation


  1. Inference with Big Data: A Superpower Approach. Galit Shmueli (Indian School of Business), with Mohit Dayal, Mingfeng Lin, Lalita Reddi, Hank Lucas, and Bhim Pochiraju
  2. Big data studies (in information systems) are increasingly common. [Chart: # IS papers with n > 10,000, 2004-2010]
  3. Large-study IS papers: How Big?
     • “over 10,000 publicly available feedback text comments… in eBay” (The Nature and Role of Feedback Text Comments in Online Marketplaces, Pavlou & Dimoka, ISR 2006)
     • “For our analysis, we have … 784,882 [portal visits]” (Household-Specific Regressions Using Clickstream Data, Goldfarb & Lu, Statistical Science 2006)
     • “51,062 rare coin auctions that took place… on eBay” (The Sound of Silence in Online Feedback, Dellarocas & Wood, Management Science 2006)
     • “We collected data on … [175,714] reviews from Amazon” (Examining the Relationship Between Reviews and Sales, Forman et al., ISR 2008)
     • “108,333 used vehicles offered in the wholesale automotive market” (Electronic vs. Physical Market Mechanisms, Overby & Jap, Management Science 2009)
     • “we use… 3.7 million records, encompassing transactions for the Federal Supply Service (FSS) of the U.S. Federal government in fiscal year 2000” (Using Transaction Prices to Re-Examine Price Dispersion in Electronic Markets, Ghose & Yao, ISR 2011)
  4. Apply the small-sample approach to Big Data studies?
  5. It’s all about Power
  6. Magnify effects. Separate signal from noise.
  7. Artwork: “Running the Numbers” by Chris Jordan: 426,000 cell phones retired in the US every day
  8. Power = Prob(detect H1 effect) = f(sample size, effect size, α, noise)
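The power function above can be illustrated with a small Monte Carlo sketch (stdlib-only; the effect size, noise level, and choice of a one-sample z-test are illustrative assumptions, not from the talk): holding the effect and the noise fixed, power climbs toward 1 as n grows.

```python
# Monte Carlo sketch of: power = Prob(detect H1 effect) = f(n, effect, alpha, noise).
# Illustrative numbers only; two-sided one-sample z-test with known sigma.
import math
import random
from statistics import NormalDist, mean


def estimated_power(n, effect, sigma, alpha=0.05, reps=500, seed=42):
    """Estimate power of a two-sided one-sample z-test by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha=0.05
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(effect, sigma) for _ in range(n)]
        z = mean(sample) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


# Same tiny effect (0.1 sd): low power at n=50, near-certain detection at n=2000.
p_small = estimated_power(n=50, effect=0.1, sigma=1.0)
p_large = estimated_power(n=2000, effect=0.1, sigma=1.0)
print(p_small, p_large)
```

This is the deck's "superpower" in one picture: sample size is one of the four dials, and big data turns it all the way up.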
  9. The Promise: rare events, stronger validity, small & complex effects
  10. Statistical Technology: Hypotheses, Data, Exploration, Models, Model Validation, Inference
  11. Chapter 1 (with Mohit Dayal & Lalita Reddi, ISB): DATA VIZ: “BIG DATA” CHARTS
  12. Scaling Up Data Visualization: missing values
  13. Big Data scatter plot
  14. Visualization: Big Data boxplot
  15. BIG DATA (SUPERPOWER) APPROACH: charts based on aggregation; interactive viz (zoom & pan, filter, etc.)
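One way to read "charts based on aggregation": instead of drawing every raw point, pre-bin the data into a coarse grid and plot the counts, which is the idea behind heatmaps and hexbin plots. A minimal stdlib sketch with made-up Gaussian data:

```python
# Aggregate a big scatter plot into a grid of counts before plotting.
# Toy data; grid size is an illustrative choice.
import random
from collections import Counter


def bin_counts(xs, ys, bins=10):
    """Aggregate (x, y) points into a bins-by-bins grid of counts."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    counts = Counter()
    for x, y in zip(xs, ys):
        i = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        j = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[(i, j)] += 1
    return counts


rng = random.Random(0)
n = 100_000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [rng.gauss(0, 1) for _ in range(n)]
grid = bin_counts(xs, ys)
# At most 100 cells now summarize 100,000 raw points, with no lost total.
print(len(grid), sum(grid.values()))
```

The grid of counts is what a plotting layer would render as color intensity; the raw points never need to hit the screen.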
  17. Simple hypotheses (H1: β1 > 0), few hypotheses, few control variables. Assumptions?
     f(y) = β0 + β1X1 + β2X2 + β3X1X2 + … + βkXk + ε
     Which model? What data? Sign + statistical significance; model fit; robustness.
  18. What doesn’t scale up? (wrong conclusions) What are the missed opportunities?
  19. Chapter 2 (with Hank Lucas, UMD & Mingfeng Lin, UoA): TOO BIG TO FAIL: LARGE SAMPLES AND FALSE DISCOVERIES
  20. Small p-values*** are not interesting
  21. p-value ~ proximity of the sample to H0 = f(effect size, sample size, noise). H0: β = 0. Large-sample result: deflated p-values.
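The deflation can be made deterministic: fix a tiny standardized effect and let n grow, and the two-sided z-test p-value collapses toward zero even though the effect stays practically negligible. (The 0.01 effect and unit noise are illustrative assumptions; 341,136 is the eBay sample size used later in the deck.)

```python
# p-value of a z-test of H0: beta = 0 as a function of n, for a fixed tiny effect.
import math
from statistics import NormalDist


def two_sided_p(effect, sigma, n):
    """Two-sided z-test p-value for a fixed observed effect."""
    z = effect / (sigma / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


for n in (100, 10_000, 341_136):
    print(n, two_sided_p(effect=0.01, sigma=1.0, n=n))
# The same 0.01 effect is nowhere near significant at n=100,
# borderline at n=10,000, and *** at n=341,136.
```

Nothing about the effect changed between the three lines; only n did.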
  22. Auctions for digital cameras, Aug ’07 - Jan ’08 [thanks to Wolfgang Jank for the data!]
     ln(Price) = β0 + β1·ln(minimumBid) + β2·reserve + β3·ln(sellerFeedback) + β4·duration + β5·controls + ε
     • H1: Higher minimum bids lead to higher final prices (β1 > 0)
     • H2: Auctions with a reserve price will sell for higher prices (β2 > 0)
     • H3: Duration affects price (β4 ≠ 0)
     • H4: The higher the seller feedback, the higher the price (β3 > 0)
     n = 341,136
  23. n = 341,136
  24. “In a large sample, we can obtain very large t-statistics with low p-values for our predictors, when, in fact, their effect on Y is very slight” (Applied Statistics in Business & Economics, Doane & Seward)
  25. BIG DATA (SUPERPOWER) APPROACH: focus on effect size (ignore p-values); subsamples for robustness: “results quantitatively similar”
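The subsample-robustness recipe can be sketched as follows: refit a simple slope on disjoint random subsamples and check that the effect sizes come out quantitatively similar, rather than leaning on the full-sample p-value. (The data and the 0.5 slope are made up for illustration.)

```python
# Effect-size robustness across disjoint random subsamples of a big dataset.
import random


def ols_slope(xs, ys):
    """Closed-form simple-regression slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(7)
n = 50_000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]  # true slope 0.5 (illustrative)

idx = list(range(n))
rng.shuffle(idx)
slopes = []
for k in range(5):  # five disjoint subsamples of 10,000
    part = idx[k * 10_000:(k + 1) * 10_000]
    slopes.append(ols_slope([xs[i] for i in part], [ys[i] for i in part]))
print([round(s, 3) for s in slopes])  # all estimates close to 0.5
```

If the five estimates agree to the second decimal, "results quantitatively similar" is an evidenced claim rather than a footnote.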
  27. With big data, we’re in the realm of asymptotic behaviour: n → ∞
  28. Violated assumptions: less tinkering
     Assumption violation | Coefficient bias | Standard errors | Redundant diagnostic tests
     Under-specification  | all biased       |                 |
     Endogeneity*         | all biased       |                 | Instrument strength (Sargan); 2SLS is worse
     E(ε) = 0             | β0 biased        |                 | Lack-of-fit
     Non-normality        |                  |                 | Anderson-Darling
     Heteroscedasticity   |                  | biased          | Breusch-Pagan
     Over-specification   |                  |                 |
     Serial dependence    |                  | biased          | Durbin-Watson
     Multicollinearity    |                  | increased       | Significant correlations
     Influential outliers |                  |                 | Leverage (multiple testing)
     *IV estimators only have desirable asymptotic, not finite-sample, properties
  29. BIG DATA (SUPERPOWER) APPROACH: focus on bias-related assumptions; avoid statistical tests (the p-value challenge)
  30. Chapter 4 (with Bhimsankaram Pochiraju & Mohit Dayal, ISB): COMPLEX EFFECTS & HETEROGENEITY
  31. With Big Data: detect small (but important) effects; detect rare events (in rare minorities)
  32. Complex hypotheses (H1: β3 > 2), fixed effects, fewer assumptions, control variables
     f(y) = β0 + β1X1 + β2X1² + β3X2X3 + … + βkXk + ε
     Which model? Heterogeeity: clustering/mixtures, sub-samples. What data? Propensity scores, 2SLS. Predictive magnitude; model fit; robustness.
  33. Test complex hypotheses: moderators, nonlinear relationships, multiple categories, control variables. Quantify subtle effects: specific measures (20 eBay categories); low R², yet non-zero coefficients.
     “The rovers have a magnifying camera… that scientists can use to carefully look at the fine structure of a rock”
     “Moderator variables are difficult to detect” (Aguinis, 1994)
  34. Stepwise Selection: OLS with stepwise (AIC criterion); logistic regression with variable selection (RELR). Candidate set: all independent variables, all control variables, quadratic terms of continuous variables, 2-way interactions. Choose software carefully (R: “out of memory”).
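A hedged sketch of the "OLS with stepwise (AIC)" idea: forward selection that repeatedly adds whichever candidate variable lowers AIC the most and stops when none does. This stdlib-only toy (variable names and data are illustrative, not the deck's eBay model) only shows the mechanics:

```python
# Forward stepwise OLS selection by AIC, from scratch (toy scale only).
import math
import random


def solve(a, b):
    """Gaussian elimination with partial pivoting for A x = b."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def rss(cols, y):
    """Residual sum of squares of OLS of y on an intercept plus cols."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * xi for b, xi in zip(beta, r))) ** 2
               for r, yi in zip(X, y))


def aic(cols, y):
    n = len(y)
    return n * math.log(rss(cols, y) / n) + 2 * (len(cols) + 2)


def forward_stepwise(candidates, y):
    """Add the AIC-minimizing candidate each round; stop when AIC stops falling."""
    chosen, names = [], []
    current = aic(chosen, y)
    while True:
        trials = [(aic(chosen + [c], y), name, c)
                  for name, c in candidates.items() if name not in names]
        if not trials:
            break
        best, name, col = min(trials)
        if best >= current:
            break
        current, chosen = best, chosen + [col]
        names.append(name)
    return names


rng = random.Random(5)
n = 1000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]  # pure noise candidate
y = [2 * a - b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

selected = forward_stepwise({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected)  # the two true predictors enter first, strongest effect first
```

The slide's "out of memory" warning is the point: this O(p²·n) inner loop is exactly what blows up when the candidate set includes all quadratics and 2-way interactions at big-data n.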
  35. Heterogeneity: CART
     • Identifies non-linearities and interactions
     • Does not identify different models
     • Challenge: independent variables vs. control variables
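CART's core move, in miniature: scan candidate split points on a variable and keep the one that most reduces squared error; a full tree just applies this recursively. The one-split "stump" below (stdlib-only, toy step-function data) is enough to surface a non-linearity a linear term would miss:

```python
# Find the single best split on x by minimizing total within-node SSE.
import random


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, total SSE) of the best single split on x."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = (None, sse(ys))  # baseline: no split at all
    for k in range(1, len(xs)):
        left = [ys[order[i]] for i in range(k)]
        right = [ys[order[i]] for i in range(k, len(xs))]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = ((xs[order[k - 1]] + xs[order[k]]) / 2, total)
    return best


rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(500)]
ys = [(1.0 if x > 0 else 0.0) + rng.gauss(0, 0.1) for x in xs]  # step at x = 0
thr, _ = best_split(xs, ys)
print(round(thr, 2))  # recovered threshold near 0.0
```

As the slide notes, the tree finds *where* the relationship changes; it does not tell you the two regimes follow different regression models.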
  36. Clustering
     1. Cluster on all independent and control variables
     2. Fit separate regression models to each cluster
     • Popular in risk analytics
     • Fast, easy
     • Does not guarantee distinct relationships
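The two-step recipe above can be sketched with a tiny one-dimensional 2-means followed by per-cluster OLS. As the slide warns, clustering on the X's does not guarantee distinct relationships; in this made-up example the groups are also separated in X, so it happens to recover the two slopes:

```python
# Step 1: cluster on X (1-D Lloyd's algorithm, k=2). Step 2: OLS per cluster.
import random


def two_means(xs, iters=20):
    """Lloyd's algorithm with k=2 in one dimension; returns cluster labels."""
    c0, c1 = min(xs), max(xs)
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        for lab in (0, 1):
            pts = [x for x, l in zip(xs, labels) if l == lab]
            if pts and lab == 0:
                c0 = sum(pts) / len(pts)
            elif pts:
                c1 = sum(pts) / len(pts)
    return labels


def ols_slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


rng = random.Random(3)
xs, ys = [], []
for _ in range(2000):            # group A: x near -3, slope +1
    x = rng.gauss(-3, 0.5)
    xs.append(x)
    ys.append(1.0 * x + rng.gauss(0, 0.2))
for _ in range(2000):            # group B: x near +3, slope -1
    x = rng.gauss(3, 0.5)
    xs.append(x)
    ys.append(-1.0 * x + rng.gauss(0, 0.2))

labels = two_means(xs)
slopes = []
for lab in (0, 1):
    cx = [x for x, l in zip(xs, labels) if l == lab]
    cy = [y for y, l in zip(ys, labels) if l == lab]
    slopes.append(ols_slope(cx, cy))
print([round(s, 2) for s in slopes])  # roughly +1 and -1
```

Move the two groups to the same X region and the clusters would mix the two regimes, illustrating exactly the "does not guarantee distinct relationships" caveat.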
  37. Finite Mixture Regression: search for k separate regressions. Convergence issues on the entire dataset; for 10 subsamples (n = 30K), it converged for seven.
  38. Chapter 5: MODEL VALIDATION
  39. Improve model validation, comparison, and generalization: internal & external validity; robustness across subsamples (non-random; random; overlapping/non-overlapping)
  40. Improve predictive validation: training set / holdout set
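A minimal sketch of the training/holdout split: estimate on the training rows only, then judge the model by its error on the held-out rows. (The linear model, the 80/20 split, and the noise level are illustrative assumptions.)

```python
# Fit on a random 80% training set; report RMSE on the 20% holdout.
import math
import random


def fit_line(xs, ys):
    """Simple OLS fit; returns (intercept, slope)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1


rng = random.Random(11)
n = 10_000
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [2.0 + 0.3 * x + rng.gauss(0, 1) for x in xs]  # toy data-generating model

idx = list(range(n))
rng.shuffle(idx)
train, hold = idx[:8000], idx[8000:]

b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
resid = [ys[i] - (b0 + b1 * xs[i]) for i in hold]
rmse = math.sqrt(sum(r * r for r in resid) / len(resid))
print(round(b1, 2), round(rmse, 2))  # slope near 0.3; RMSE near the noise sd of 1
```

With big-data n, setting aside a holdout costs almost nothing in estimation precision, which is why predictive validation scales up so naturally.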
  42. Clark Kent ≤ Superman