Introduction to Data Science
Lecture 6
Exploratory Data Analysis
CS 194 Spring 2014
Michael Franklin
Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline for this Evening
• Class Lecture
• Exploratory Data Analysis
• Hypothesis Testing
• Exercise – EDA and HT in Python
(Evan: Tutorial and Lab)
next week: we’ll play with “R”
• Review of exercise
• Time for Project Group Discussions
Topics Today and Next Time
• Exploratory Data Analysis
• Data Diagnosis
• Graphical/Visual Methods
• Data Transformation
• Confirmatory Data Analysis
• Statistical Hypothesis Testing
• Graphical Inference
Descriptive vs. Inferential
• Descriptive: e.g., Mean; describes data you
have but can't be generalized beyond that
• We’ll talk about Exploratory Data Analysis
• Inferential: e.g., t-test; enables inferences
about the population beyond our data
• These are the techniques we’ll leverage for
Machine Learning and Prediction
Examples of Business Questions
• Simple (descriptive) Stats
• “Who are the most profitable customers?”
• Hypothesis Testing
• “Is there a difference in value to the company of these
customers?”
• Segmentation/Classification
• What are the common characteristics of these
customers?
• Prediction
• Will this new customer become a profitable
customer? If so, how profitable?
adapted from Provost and Fawcett, “Data Science for Business”
Applying techniques
• What models/techniques to use depends on
the problem context, data and underlying
assumptions.
• e.g., Classification problem with binary
outcome? -> logistic regression, Naïve Bayes,
…
• e.g., Classification-style problem but no labels (i.e., clustering)?
• -> Perhaps use K-means clustering
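To make the "no labels" case concrete, here is a minimal sketch of K-means on one-dimensional data. The function name `kmeans_1d`, the evenly spread initial centers, and the `spend` values are all assumptions made for illustration; real implementations usually use random or k-means++ initialisation and handle many dimensions.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Tiny 1-D k-means (k >= 2): returns the sorted list of k cluster centers."""
    # Assumption: initialise the centers evenly across the data range.
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

spend = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]   # made-up, unlabeled customer values
centers = kmeans_1d(spend, k=2)
print(centers)   # two centers, one near each group of values
```

No labels are supplied anywhere: the two groups emerge only from the distances between points.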
Exploratory Data Analysis
(John Tukey, 1977)
• Based on insights developed at Bell Labs
in the 1960s
• Techniques for visualizing and
summarizing data
• What can the data tell us? (in contrast to
“confirmatory” data analysis)
• Introduced many basic techniques:
• 5-number summary, box plots, stem
and leaf diagrams,…
• 5 Number summary:
• extremes (min and max)
• median & quartiles
• More robust than the mean and variance to skewed & long-tailed
distributions
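A small sketch of the 5-number summary, using only Python's standard library. The helper name `five_number_summary` and both data sets are made up for illustration; quartile conventions differ between tools, and `method="inclusive"` is just one reasonable choice.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # Quartile conventions vary; "inclusive" interpolates between the
    # sample's own minimum and maximum.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 900]   # same data with one extreme value

print(five_number_summary(clean))
print(five_number_summary(skewed))
print(statistics.mean(clean), statistics.mean(skewed))
```

Only the maximum changes between the two summaries, while the mean jumps from 5 to 104. That is the sense in which the median and quartiles are robust.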
The Trouble with Summary Stats
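The figure for this slide did not survive extraction. A standard illustration of the point, used here as an assumption about what the slide showed, is Anscombe's quartet: data sets that look very different when plotted yet share nearly identical summary statistics. The sketch below compares two of the four sets; `pearson_r` is a helper written for this example.

```python
import statistics

# Two of the four data sets from Anscombe's quartet (Anscombe, 1973).
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    ss_a = sum((p - ma) ** 2 for p in a)
    ss_b = sum((q - mb) ** 2 for q in b)
    return cov / (ss_a * ss_b) ** 0.5

for name, y in (("y1", y1), ("y2", y2)):
    print(name,
          round(statistics.mean(y), 2),
          round(statistics.variance(y), 2),
          round(pearson_r(x, y), 2))
```

Plotted against `x`, `y1` is a noisy line and `y2` is a smooth curve, which the mean, variance, and correlation cannot reveal.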
Looking at Data
Data Presentation
• Dashboard
Data Presentation
• Data Art
Chart types
• Single variable
• Dot plot
• Jitter plot
• Box plot
• Histogram
• Kernel density estimate
• Cumulative distribution function
(note: examples use qplot from R's ggplot2 library)
Chart examples from Jeff Hammerbacher’s 2012 CS194 class
Chart types
• Dot plot
Chart types
• Jitter plot
Chart types
• Box plot
Chart types
• Box plot
Chart types
• Histogram
Chart types
• Kernel density estimate
Chart types
• Histogram and Kernel Density Estimates
• Histogram
• Proper selection of bin width is important
• Outliers distort the bins; handle them separately (discard only with justification)
• KDE
• Kernel function
• Box, Epanechnikov, Gaussian
• Kernel bandwidth
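A sketch of why bin width and kernel bandwidth matter, written in plain Python so nothing is hidden. The data values, the bin convention `[k*w, (k+1)*w)`, and the function names are assumptions for illustration; only the Gaussian kernel is shown.

```python
import math
from collections import Counter

data = [1.1, 1.9, 2.0, 2.1, 2.4, 3.0, 3.2, 4.8, 5.0, 5.1]   # made-up sample

def histogram(values, bin_width):
    """Count values per bin; a bin starting at s covers [s, s + bin_width)."""
    return Counter(math.floor(v / bin_width) * bin_width for v in values)

def gaussian_kde(values, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per value."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                          for v in values)
    return density

# Same data, different bin widths: the apparent shape changes.
print(sorted(histogram(data, 1.0).items()))
print(sorted(histogram(data, 2.0).items()))

# Same data, different bandwidths: a narrow kernel shows a gap near x = 4,
# while a wide kernel smooths it over.
narrow, wide = gaussian_kde(data, 0.2), gaussian_kde(data, 1.0)
print(narrow(4.0), wide(4.0))
```

Bandwidth plays the same role for a KDE that bin width plays for a histogram: too small shows noise, too large hides structure.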
Chart types
• Cumulative distribution function
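For a sample, the cumulative distribution function is usually estimated by the empirical CDF: the fraction of observations at or below x. A minimal sketch, with the function name `ecdf` and the sample chosen for illustration:

```python
from bisect import bisect_right

def ecdf(values):
    """Return F where F(x) is the fraction of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right counts how many sorted values are <= x.
    return lambda x: bisect_right(ordered, x) / n

F = ecdf([3, 1, 2, 2])
print(F(0), F(2), F(2.5), F(3))
```

Unlike a histogram or KDE, this needs no bin width or bandwidth choice.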
Chart types
• Two variables
• Scatter plot
• Line plot
• Log-log plot
• Cut-and-stack plot
• Pairs plot
Chart types
• Scatter plot
Chart types
• Line plot
Chart types
• Log-log plot
Chart types
• Coxcomb plot
Chart types
• Treemap
Chart types
• Heatmap
Chart types
• Gapminder
The Need for Models
“All models are wrong, but some models are useful.” (George Box)
• Data represent the traces of real-world processes.
• Two sources of randomness and uncertainty:
1) those underlying the processes themselves
2) those associated with the data collection methods
• To simplify the traces into something more
comprehensible you need:
• mathematical models or functions of the data -> Statistical
estimators
More on Models
• N is size of population
• n is sample size (subset of the population)
• Getting the subset (i.e. sampling) can
introduce "bias" leading to incorrect
conclusions
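A small simulation of how the way a subset is drawn can bias conclusions. The population, the sample size, and the "take the first n records" convenience sample are assumptions made up for this sketch.

```python
import random
import statistics

random.seed(0)                          # fixed seed so the run is repeatable

population = list(range(1000))          # N = 1000 made-up values
n = 100                                 # sample size

random_sample = random.sample(population, n)   # every unit equally likely
biased_sample = population[:n]                 # convenience sample: first n only

pop_mean = statistics.mean(population)
print("population mean:", pop_mean)
print("random sample mean:", statistics.mean(random_sample))
print("biased sample mean:", statistics.mean(biased_sample))
```

The random sample's mean lands near the population mean, while the convenience sample's mean is far off, and collecting more data the same way would not fix that.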
Probability Distributions
• Natural processes tend to generate
measurements whose empirical shape could
be approximated by mathematical functions
with a few parameters that could be
estimated from the data.
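As a sketch of "a few parameters estimated from the data": the code below simulates measurements from a normal distribution, then recovers its two parameters from the sample alone. The true values (5.0 and 2.0), the sample size, and the choice of a normal model are assumptions for illustration.

```python
import random
import statistics

random.seed(42)                          # fixed seed so the run is repeatable

true_mu, true_sigma = 5.0, 2.0           # hidden "process" parameters
measurements = [random.gauss(true_mu, true_sigma) for _ in range(10_000)]

# Estimate the two parameters of the normal model from the data.
mu_hat = statistics.mean(measurements)
sigma_hat = statistics.stdev(measurements)

# The fitted model can then answer questions about the process,
# e.g. the estimated probability of a measurement below 7.0.
fitted = statistics.NormalDist(mu_hat, sigma_hat)
print(mu_hat, sigma_hat, fitted.cdf(7.0))
```

Ten thousand numbers are summarised by two, which is only a good summary if the normal shape is a reasonable approximation for the data at hand.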
Note on ML Algos vs. Stat Models
• Techniques and underlying concepts in common
• Difference in goals/use:
• ML Algos – goal: predict or classify with high
accuracy.
• basis of many data products
• Models – get at the underlying generative process
• “Black box” vs. “White box”
• Dealing with uncertainty (at the heart of stats)
• Distributions vs. non-parametric approaches
More on Hypothesis Testing
• Null Hypothesis is given the benefit of the
doubt (e.g., innocent until proven guilty).
• Alternative Hypothesis directly contradicts the
Null Hypothesis
• "Step 1: State the hypotheses."
• "Step 2: Set the criteria for a decision."
• "Step 3: Compute the test statistic."
• "Step 4: Make a decision."
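The four steps can be sketched with a two-tailed one-sample z-test. This test is chosen only because it needs nothing beyond the standard library; it assumes the population standard deviation `sigma` is known. The function name, the sample values, and `mu0 = 100` are made up for illustration.

```python
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed z-test of H0: population mean == mu0 (sigma assumed known)."""
    # Step 1: state the hypotheses. H0: mean == mu0, H1: mean != mu0.
    # Step 2: set the criterion for a decision: the significance level alpha.
    # Step 3: compute the test statistic and its p value.
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 4: make a decision: reject H0 when p < alpha.
    return z, p_value, p_value < alpha

sample = [104, 98, 110, 107, 101, 109, 105, 103, 108, 106]
z, p, reject = one_sample_z_test(sample, mu0=100, sigma=5)
print(z, p, reject)
```

Note that alpha is fixed in Step 2, before the data are examined; choosing it after seeing the p value defeats the purpose of the procedure.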
p Value
• A p value is the probability of obtaining a
sample outcome at least as extreme as the one observed,
given that the value stated in the null hypothesis is true.
• In many cases: when the p value is less than
5% (p < .05), we reject the null hypothesis
• Note this means that when the null hypothesis is true,
about 1 out of 20 tests will incorrectly reject it (a Type I error)
• Do “green jelly beans cause acne?” (see XKCD)
From G.J. Privitera, “Statistics for the Behavioral Sciences”
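The "1 out of 20" point compounds when many hypotheses are tested on the same data. A short sketch, assuming the tests are independent and every null hypothesis is true; the function name is made up for this example.

```python
def prob_any_false_positive(alpha, num_tests):
    """Chance that at least one of num_tests independent tests rejects a true null."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    print(m, round(prob_any_false_positive(0.05, m), 3))
```

With 20 tests at the 5% level, the chance of at least one spurious "significant" result is about 64%, so one significant finding among many tests is weak evidence on its own.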
Two-tailed Significance
When the p value is less than 5% (p < .05), we
reject the null hypothesis
From G.J. Privitera, “Statistics for the Behavioral Sciences”
Hypothesis Testing
From G.J. Privitera, “Statistics for the Behavioral Sciences”
Are Two Sets of Data Really Different?
• Null Hypothesis: The differences we see are
due to “chance”
• For small sample sizes: use a t-test
• We’ll do this next in the lab.
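A sketch of the statistic behind a two-sample t-test, using Welch's version (which does not assume equal variances). This is an illustration, not necessarily the variant used in the lab, and the two groups are made-up numbers.

```python
import statistics

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    # Squared standard error contributed by each sample.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]
t, df = welch_t(group_a, group_b)
print(t, df)
```

Turning `t` and `df` into a p value requires the t distribution, which the standard library does not provide. In practice a library routine such as `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` computes the statistic and the p value together; that tooling choice is an assumption here, not something the slides specify.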
Some Notes on the Class
• 3/17 Intro to Supervised Learning
• HW2 coming out tomorrow night
• Due after Spring Break but do it before!
• FINAL PROJECTS
• Group size = 3
• What’s expected – find data, build a COOL Data
Product, integration & viz or good reason why not
• Schedule:
• Groups Formed
• 1-2 page proposal DUE 3/11 at midnight
• Midway review meeting with Prof or GSIs in the following 1-2
weeks
• Final Presentation (Posters and/or Lightning talks)
• Final Report

Editor's Notes

  1. Attributed to Florence Nightingale