Common Limitations with Big Data
in the Field
8/7/2014
Sean J. Taylor
Joint Statistical Meetings
Source: “big data” from Google
Mo’ Data,
Mo’ Problems
Why are the data so big?
data
“To consult the
statistician after an
experiment is finished is
often merely to ask him
to conduct a post-
mortem examination.
He can perhaps say what
the experiment died of.”
— Ronald Fisher
Data Alchemy?
Common Big Data Limitations:
1. Measuring the wrong thing
2. Biased samples
3. Dependence
4. Uncommon support
1. Measuring the wrong thing
The data you have a lot of doesn’t
measure what you want.
Common Pattern
▪ High volume of a cheap, easy-to-measure “surrogate”
(e.g. steps, clicks)
▪ Surrogate is correlated with true measurement of interest
(e.g. overall health, purchase intention)
▪ key question: sign and magnitude of “interpretation bias”
▪ obvious strategy: design better measurement
▪ opportunity: joint, causal modeling of high-resolution
surrogate measurement with low-resolution “true”
measurement.
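One simple way to combine a high-volume surrogate with a scarce "true" measurement is to calibrate the surrogate on a small validation sample where both were recorded. The sketch below is a much simpler stand-in for the joint modeling mentioned above: it assumes the relationship is linear and identical in both data sets, and the step counts and health scores are made-up numbers. `fit_line` is a hypothetical helper name.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical validation sample: both measurements observed for a few people.
steps = [2000, 4000, 6000, 8000, 10000]   # surrogate (cheap, high volume)
health = [52, 54, 56, 58, 60]             # "true" measurement (expensive)
slope, intercept = fit_line(steps, health)

# Apply the calibration to surrogate-only observations from the large data set.
surrogate_only = [3000, 7000, 12000]
predicted = [intercept + slope * s for s in surrogate_only]
```

If the validation sample is itself unrepresentative, the calibration carries that bias into every prediction, so the sign and magnitude question above still has to be answered.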
2. Biased samples
Size doesn’t guarantee your data is
representative of the population.
Bias?
World Cup Arrivals by Country
What we want: the relationship between “Has Daughter” and “Is Liberal”
What we get: the relationship between “Has Daughter on Facebook” and “Is Liberal on Facebook”
Common Pattern
▪ Your sample conditions on your own convenience, user
activity level, and/or technological sophistication of
participants.
▪ You’d like to extrapolate your statistic to a more general
population of interest.
▪ key question: (again) sign and magnitude of bias
▪ obvious strategy: weighting or regression
▪ opportunity: understand effect of selection on latent
variables on bias
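The weighting strategy can be sketched as post-stratification: compute the statistic within each stratum, then weight strata by their known population shares. The strata, shares, and outcomes below are invented for illustration, and the sketch assumes selection into the sample depends only on the stratum. `poststratified_mean` is a hypothetical helper name.

```python
from collections import defaultdict

def poststratified_mean(sample, population_shares):
    """Weight per-stratum sample means by known population shares.

    sample: list of (stratum, outcome) pairs from the convenience sample.
    population_shares: dict mapping stratum -> share of the target population.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for stratum, y in sample:
        totals[stratum] += y
        counts[stratum] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample observations for strata: {sorted(missing)}")
    return sum(share * totals[s] / counts[s]
               for s, share in population_shares.items())

# Hypothetical data: heavy users are 80% of the sample but 20% of the population.
sample = ([("heavy", 1.0)] * 60 + [("heavy", 0.0)] * 20
          + [("light", 1.0)] * 5 + [("light", 0.0)] * 15)
shares = {"heavy": 0.2, "light": 0.8}

naive = sum(y for _, y in sample) / len(sample)   # 0.65
weighted = poststratified_mean(sample, shares)    # 0.2 * 0.75 + 0.8 * 0.25 = 0.35
```

Weighting only removes bias from variables you observe and stratify on. Selection on latent variables, the opportunity named above, is untouched by this correction.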
3. Dependence
Effective size of data is small due
to repeated measurements of
units.
Bakshy et al. (2012)
“Social Influence in Social Advertising”
Person   D  Y
Evan     1  1
Evan     0  0
Ashley   0  1
Ashley   1  1
Ashley   0  1
Greg     1  0
Leena    1  0
Leena    1  1
Ema      0  0
Seamus   1  1
Person   Ad      D  Y
Evan     Sprite  1  1
Evan     Coke    0  0
Ashley   Pepsi   0  1
Ashley   Pepsi   1  1
Ashley   Coke    0  1
Greg     Coke    1  0
Leena    Pepsi   1  0
Leena    Coke    1  1
Ema      Sprite  0  0
Seamus   Sprite  1  1
One-way Dependency: $Y_{ij} = D_{ij} + \alpha_i + \epsilon_{ij}$
Two-way Dependency: $Y_{ij} = D_{ij} + \alpha_i + \beta_j + \epsilon_{ij}$
Bakshy and Eckles (2013)
“Uncertainty in Online Experiments with Dependent Data”
Common Pattern
▪ Your sample is large because it includes many repeated
measurements of the most active units.
▪ An inferential procedure that assumes independence will
be anti-conservative.
▪ key question: how to efficiently produce confidence
intervals with the right coverage.
▪ obvious strategy: multi-way bootstrap (Owen and Eckles)
▪ opportunity: scalable inference for crossed random effects
models
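A minimal sketch of why independence-based intervals are anti-conservative here, using simulated data in which each person contributes many correlated rows. The resampling scheme is a simplified illustration: it draws the levels of each factor with replacement and weights a row by the product of its draw counts. It is not necessarily the exact procedure of the cited work, and `multiway_bootstrap_se` is a hypothetical name.

```python
import random
from collections import Counter
from statistics import stdev

def multiway_bootstrap_se(rows, factors, n_boot=500, seed=0):
    """Bootstrap standard error of the mean of row["y"].

    Each replicate resamples the levels of every factor with replacement;
    a row's weight is the product of the draw counts of its levels.
    """
    rng = random.Random(seed)
    levels = {f: sorted({r[f] for r in rows}) for f in factors}
    means = []
    for _ in range(n_boot):
        counts = {f: Counter(rng.choices(levels[f], k=len(levels[f])))
                  for f in factors}
        num = den = 0.0
        for r in rows:
            w = 1
            for f in factors:
                w *= counts[f][r[f]]   # Counter returns 0 for undrawn levels
            num += w * r["y"]
            den += w
        if den > 0:  # skip a replicate in which every row received weight 0
            means.append(num / den)
    return stdev(means)

# Simulated data (an assumption): 30 people, 20 rows each, strong person effect.
rng = random.Random(1)
rows = []
for p in range(30):
    person_effect = rng.gauss(0, 1)
    for k in range(20):
        rows.append({"row": len(rows), "person": p, "ad": k % 5,
                     "y": person_effect + rng.gauss(0, 0.1)})

naive_se = multiway_bootstrap_se(rows, ["row"])            # treats rows as independent
person_se = multiway_bootstrap_se(rows, ["person"])        # one-way: resample people
two_way_se = multiway_bootstrap_se(rows, ["person", "ad"]) # two-way: people and ads
```

With 600 rows but only 30 people, the row-level standard error is several times smaller than the person-level one, so intervals built from it would cover far less often than advertised.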
4. Uncommon support
Your data are sparse in part of the
distribution you care about.
[Figure: ratio of positive to negative emotion words (y-axis, 1.0 to 3.0) against time relative to kickoff in hours (x-axis, −2 to 5), by outcome (loss, win)]
“The Emotional Highs and Lows of the NFL Season”
Facebook Data Science Blog
Which check-ins are useful?
▪ We can only study people who post enough text before
and after their check-ins to yield reliable measurement.
What is the control check-in?
▪ Ideally we’d do a within-subjects design. If so, we’re stuck
with only other check-ins (with enough status updates)
from the same people.
▪ National Park visit is likely a vacation, so we need to find
other vacation check-ins. Similar day, time, and distance.



Only a small fraction of the “treatment” cases yield
reliable measurement plus control cases.
Common Pattern
▪ Your sample is large but observations satisfying some set
of qualifying criteria (usually matching) are rare.
▪ Even with strong assumptions, hard to measure precise
differences due to lack of similar cases.
▪ key question: how to use as many of the observations as
possible without introducing bias.
▪ obvious strategy: randomized experiments
▪ opportunity: more efficient causal inference techniques
that work at scale (high dimensional regression/matching)
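The tension between using many observations and avoiding bias shows up even in a toy exact-matching sketch. The check-in records and covariates (`weekday`, `distance`) below are invented for illustration, and `exact_match` is a hypothetical helper, not the method used in the study described above.

```python
from collections import defaultdict

def exact_match(units, keys):
    """Map each treated unit's id to the ids of controls with identical covariates."""
    controls = defaultdict(list)
    for u in units:
        if not u["treated"]:
            controls[tuple(u[k] for k in keys)].append(u["id"])
    matched = {}
    for u in units:
        if u["treated"]:
            pool = controls.get(tuple(u[k] for k in keys), [])
            if pool:  # treated units without any control are dropped
                matched[u["id"]] = list(pool)
    return matched

# Hypothetical check-ins: treated = national park visit, control = other check-in.
units = [
    {"id": 1, "treated": True,  "weekday": "Sat", "distance": "far"},
    {"id": 2, "treated": True,  "weekday": "Sat", "distance": "near"},
    {"id": 3, "treated": True,  "weekday": "Tue", "distance": "far"},
    {"id": 4, "treated": True,  "weekday": "Sun", "distance": "far"},
    {"id": 5, "treated": False, "weekday": "Sat", "distance": "far"},
    {"id": 6, "treated": False, "weekday": "Sat", "distance": "far"},
    {"id": 7, "treated": False, "weekday": "Mon", "distance": "near"},
    {"id": 8, "treated": False, "weekday": "Tue", "distance": "near"},
]

strict = exact_match(units, ["weekday", "distance"])  # 1 of 4 treated units matched
coarse = exact_match(units, ["weekday"])              # 3 of 4 treated units matched
```

Matching on fewer covariates keeps more treated cases but compares less similar check-ins, which is exactly where bias can enter.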
Conclusion 1:
Making your own quality data is
better than being a data alchemist.
Conclusion 2:
Great research opportunities in
addressing limitations of
“convenience” big data.
(or how I stopped worrying and
learned to be a data alchemist)
Sustaining versus Disruptive Innovations
Sustaining innovations
▪ improve product performance
Disruptive innovations
▪ Result in worse performance (short-term)
▪ Eventually surpass sustaining technologies in satisfying
market demand with lower costs
▪ cheaper, simpler, smaller, and frequently more
convenient to use
Innovator’s Dilemma
sjt@fb.com
http://seanjtaylor.com

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 

Jsm big-data

  • 1. Common Limitations with Big Data in the Field 8/7/2014 Sean J. Taylor Joint Statistical Meetings
  • 7. Why are the data so big?
  • 8.
  • 10. “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.” (Ronald Fisher)
  • 12. Common Big Data Limitations:
     1. Measuring the wrong thing
     2. Biased samples
     3. Dependence
     4. Uncommon support
  • 13. 1. Measuring the wrong thing The data you have a lot of doesn’t measure what you want.
  • 16. Common Pattern
     ▪ High volume of a cheap, easy-to-measure “surrogate” (e.g. steps, clicks)
     ▪ Surrogate is correlated with the true measurement of interest (e.g. overall health, purchase intention)
     ▪ key question: sign and magnitude of “interpretation bias”
     ▪ obvious strategy: design better measurement
     ▪ opportunity: joint, causal modeling of high-resolution surrogate measurement with low-resolution “true” measurement
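One simple way to picture the “opportunity” bullet, under strong assumptions: suppose a small validation sample records both the surrogate and the true measurement, and the relationship between them is roughly linear. A calibration line fit on that sample can then translate the high-volume surrogate onto the true scale. The function names and all numbers below are hypothetical, and a real analysis would need a causal model rather than a single regression line.

```python
def fit_calibration(surrogate, truth):
    """Ordinary least squares fit of truth = intercept + slope * surrogate."""
    n = len(surrogate)
    mean_s = sum(surrogate) / n
    mean_t = sum(truth) / n
    sxx = sum((s - mean_s) ** 2 for s in surrogate)
    sxy = sum((s - mean_s) * (t - mean_t) for s, t in zip(surrogate, truth))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_s
    return intercept, slope


def calibrated_mean(surrogate_only, intercept, slope):
    """Mean of the predicted true measurement for surrogate-only records."""
    predictions = [intercept + slope * s for s in surrogate_only]
    return sum(predictions) / len(predictions)


# Hypothetical validation sample: both measurements observed.
validation_surrogate = [0.0, 1.0, 2.0, 3.0]
validation_truth = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_calibration(validation_surrogate, validation_truth)

# Hypothetical high-volume data: surrogate only.
estimate = calibrated_mean([1.0, 2.0, 3.0], intercept, slope)
```

The sign and size of `slope` are exactly what the “key question” bullet asks about: if the validation sample is itself unrepresentative, the calibration inherits that bias.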
  • 17. 2. Biased samples Size doesn’t guarantee your data is representative of the population.
  • 22. Common Pattern
     ▪ Your sample conditions on your own convenience, user activity level, and/or technological sophistication of participants.
     ▪ You’d like to extrapolate your statistic to a more general population of interest.
     ▪ key question: (again) sign and magnitude of bias
     ▪ obvious strategy: weighting or regression
     ▪ opportunity: understand how selection on latent variables affects bias
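The “weighting” strategy can be sketched as post-stratification: compute the statistic within each stratum of the sample, then weight the strata by their shares in the target population. This is a minimal sketch, assuming the population shares are known from an outside source; the strata names, data, and shares are hypothetical.

```python
from collections import defaultdict


def poststratified_mean(sample, population_shares):
    """Weight each stratum's sample mean by its share of the target population."""
    by_stratum = defaultdict(list)
    for stratum, y in sample:
        by_stratum[stratum].append(y)
    missing = set(population_shares) - set(by_stratum)
    if missing:
        raise ValueError(f"no sample observations for strata: {sorted(missing)}")
    return sum(
        share * sum(by_stratum[s]) / len(by_stratum[s])
        for s, share in population_shares.items()
    )


# Hypothetical sample that over-represents highly active users.
sample = [("active", 1), ("active", 1), ("active", 1), ("active", 0),
          ("casual", 0), ("casual", 1)]
# Hypothetical population shares, assumed known (e.g. from a census).
population_shares = {"active": 0.2, "casual": 0.8}

naive = sum(y for _, y in sample) / len(sample)
adjusted = poststratified_mean(sample, population_shares)
```

Weighting of this kind only corrects for selection on the observed stratum variable. Selection on latent variables, the case flagged as the research opportunity, is left untouched.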
  • 23. 3. Dependence Effective size of data is small due to repeated measurements of units.
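A standard way to quantify this is Kish's design effect. As a simplification, assume every unit contributes the same number of measurements $m$, with intraclass correlation $\rho$ between measurements of the same unit (both symbols are introduced here, not taken from the slides). The effective sample size for $n$ total observations is then:

```latex
n_{\text{eff}} = \frac{n}{1 + (m - 1)\,\rho}
```

For example, $n = 1{,}000{,}000$ rows from $10{,}000$ users with $m = 100$ rows each and $\rho = 0.1$ give a design effect of $1 + 99 \times 0.1 = 10.9$, so $n_{\text{eff}} \approx 91{,}743$.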
  • 24. Bakshy et al. (2012) “Social Influence in Social Advertising”
  • 25. One-way Dependency

        Person   D  Y
        Evan     1  1
        Evan     0  0
        Ashley   0  1
        Ashley   1  1
        Ashley   0  1
        Greg     1  0
        Leena    1  0
        Leena    1  1
        Ema      0  0
        Seamus   1  1

        $Y_{ij} = \delta D_{ij} + \alpha_i + \epsilon_{ij}$

     Two-way Dependency

        Person   Ad      D  Y
        Evan     Sprite  1  1
        Evan     Coke    0  0
        Ashley   Pepsi   0  1
        Ashley   Pepsi   1  1
        Ashley   Coke    0  1
        Greg     Coke    1  0
        Leena    Pepsi   1  0
        Leena    Coke    1  1
        Ema      Sprite  0  0
        Seamus   Sprite  1  1

        $Y_{ij} = \delta D_{ij} + \alpha_i + \beta_j + \epsilon_{ij}$

     Here $i$ indexes persons and $j$ indexes ads, so $\alpha_i$ is a person effect and $\beta_j$ is an ad effect.
  • 26. Bakshy and Eckles (2013) “Uncertainty in Online Experiments with Dependent Data”
  • 27. Common Pattern
     ▪ Your sample is large because it includes many repeated measurements of the most active units.
     ▪ An inferential procedure that assumes independence will be anti-conservative.
     ▪ key question: how to efficiently produce confidence intervals with the right coverage
     ▪ obvious strategy: multi-way bootstrap (Owen and Eckles)
     ▪ opportunity: scalable inference for crossed random effects models
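A rough sketch in the spirit of the multi-way bootstrap, using the person and ad rows from the two-way table on slide 25: each replicate draws one random weight per person and one per ad, and every row is weighted by the product of the two. The “double-or-nothing” weights in {0, 2}, the function names, and the replicate count are assumptions made for this illustration, not details taken from the talk.

```python
import random
import statistics

# Rows from the two-way table on slide 25: (person, ad, Y).
rows = [
    ("Evan", "Sprite", 1), ("Evan", "Coke", 0),
    ("Ashley", "Pepsi", 1), ("Ashley", "Pepsi", 1), ("Ashley", "Coke", 1),
    ("Greg", "Coke", 0), ("Leena", "Pepsi", 0), ("Leena", "Coke", 1),
    ("Ema", "Sprite", 0), ("Seamus", "Sprite", 1),
]


def multiway_bootstrap_se(rows, replicates=500, seed=0):
    """Bootstrap standard error of mean(Y), reweighting persons and ads independently."""
    rng = random.Random(seed)
    persons = sorted({p for p, _, _ in rows})
    ads = sorted({a for _, a, _ in rows})
    means = []
    for _ in range(replicates):
        # Assumed weight scheme: weights in {0, 2}, which have mean 1 and variance 1.
        w_person = {p: rng.choice((0, 2)) for p in persons}
        w_ad = {a: rng.choice((0, 2)) for a in ads}
        # Each row's weight is the product of its person weight and its ad weight.
        weights = [w_person[p] * w_ad[a] for p, a, _ in rows]
        total = sum(weights)
        if total == 0:
            continue  # every row was dropped in this replicate; skip it
        means.append(sum(w * y for w, (_, _, y) in zip(weights, rows)) / total)
    return statistics.pstdev(means)


def naive_se(rows):
    """Standard error of mean(Y) that treats all rows as independent."""
    ys = [y for _, _, y in rows]
    return statistics.stdev(ys) / len(ys) ** 0.5
```

Calling `multiway_bootstrap_se(rows)` and `naive_se(rows)` side by side shows how much the independence assumption changes the reported uncertainty. Because each replicate needs only one weight per person and per ad, the same idea can be applied in a single pass over very large tables.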
  • 28. 4. Uncommon support Your data are sparse in part of the distribution you care about.
  • 30. [Figure: positive/negative emotion words (y-axis) against time relative to kickoff in hours (x-axis), by game outcome (loss vs. win). Source: “The Emotional Highs and Lows of the NFL Season,” Facebook Data Science Blog]
  • 31. Which check-ins are useful?
     ▪ We can only study people who post enough text before and after their check-ins to yield reliable measurement.
     What is the control check-in?
     ▪ Ideally we’d do a within-subjects design. If so, we’re stuck with only other check-ins (with enough status updates) from the same people.
     ▪ A National Park visit is likely a vacation, so we need to find other vacation check-ins: similar day, time, and distance.
     Only a small fraction of the “treatment” cases yield reliable measurement plus control cases.
  • 32. Common Pattern
     ▪ Your sample is large but observations satisfying some set of qualifying criteria (usually matching) are rare.
     ▪ Even with strong assumptions, hard to measure precise differences due to lack of similar cases.
     ▪ key question: how to use as many of the observations as possible without introducing bias
     ▪ obvious strategy: randomized experiments
     ▪ opportunity: more efficient causal inference techniques that work at scale (high dimensional regression/matching)
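The first bullet can be made concrete with a toy exact-matching example: as the matching criteria get stricter, fewer “treatment” cases have any comparable control. The records, field names, and values below are invented for illustration only.

```python
def exact_matches(treated, controls, keys):
    """Return treated records that have at least one control equal on every key."""
    control_cells = {tuple(c[k] for k in keys) for c in controls}
    return [t for t in treated if tuple(t[k] for k in keys) in control_cells]


# Hypothetical check-ins; field names and values are invented for illustration.
treated = [
    {"user": "a", "day": "Sat", "hour": 10, "distance": "far"},
    {"user": "b", "day": "Sun", "hour": 14, "distance": "far"},
    {"user": "c", "day": "Sat", "hour": 9, "distance": "near"},
    {"user": "d", "day": "Wed", "hour": 16, "distance": "far"},
]
controls = [
    {"user": "e", "day": "Sat", "hour": 10, "distance": "far"},
    {"user": "f", "day": "Sun", "hour": 14, "distance": "near"},
    {"user": "g", "day": "Mon", "hour": 9, "distance": "near"},
]

# Matching on distance alone keeps every treated case ...
coarse = exact_matches(treated, controls, keys=["distance"])
# ... while matching on day, hour, and distance keeps only one.
strict = exact_matches(treated, controls, keys=["day", "hour", "distance"])

coarse_share = len(coarse) / len(treated)
strict_share = len(strict) / len(treated)
```

The trade-off in the “key question” bullet shows up directly: coarse criteria retain more observations but compare less similar cases, while strict criteria leave very little data to estimate from.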
  • 33. Conclusion 1: Making your own quality data is better than being a data alchemist.
  • 34. Conclusion 2: Great research opportunities in addressing limitations of “convenience” big data. (or how I stopped worrying and learned to be a data alchemist)
  • 36. Sustaining versus Disruptive Innovations
     Sustaining innovations
     ▪ Improve product performance
     Disruptive innovations
     ▪ Result in worse performance (short-term)
     ▪ Eventually surpass sustaining technologies in satisfying market demand with lower costs
     ▪ Cheaper, simpler, smaller, and frequently more convenient to use