SlideShare a Scribd company logo
1 of 27
+ 
Working with Numerical Data 
Slides edited by Valerio Di Fonzo for www.globalpolis.org 
Based on the work of Mine Çetinkaya-Rundel of OpenIntro 
The slides may be copied, edited, and/or shared via the CC BY-SA license 
Some images may be included under fair use guidelines (educational purposes)
Scatterplot 
Scatterplots are useful for visualizing the relationship between two numerical 
variables. 
Do life expectancy and total fertility 
appear to be associated or independent? 
They appear to be linearly and negatively 
associated: as fertility increases, life 
expectancy decreases. 
Was the relationship the same throughout 
the years, or did it change? 
The relationship changed over the years. http://www.gapminder.org/world
Dot Plots 
Useful for visualizing one numerical variable. Darker colors 
represent areas where there are more observations. 
How would you describe the distribution of GPAs in this data 
set? Make sure to say something about the center, shape, and 
spread of the distribution.
Dot Plots & Mean 
The mean, also called the average (marked with a triangle in the 
above plot), is one way to measure the center of a distribution 
of data. 
The mean GPA is 3.59.
Mean 
The sample mean, denoted as , can be calculated as 
where x1, x2, ..., xn represent the n observed values. 
The population mean is also computed the same way but is denoted 
as μ. It is often not possible to calculate μ since population data are 
rarely available. 
The sample mean is a sample statistic, and serves as a point estimate 
of the population mean. This estimate may not be perfect, but if the 
sample is good (representative of the population), 
it is usually a pretty good estimate.
Stacked Dot Plot 
Higher bars represent areas where there are more 
observations, makes it a little easier to judge the center and 
the shape of the distribution.
Histograms 
● Histograms provide a view of the data density. Higher bars 
represent where the data are relatively more common. 
● Histograms are especially convenient for describing the shape of 
the data distribution. 
● The chosen bin width can alter the story the histogram is telling.
Bin Width 
The chosen bin width can alter the story the histogram is telling.
Shape of a Distribution: Modality 
Does the histogram have a single prominent peak (unimodal), several 
prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? 
Note: In order to determine modality, step back and imagine a smooth 
curve over the histogram -- imagine that the bars are wooden blocks 
and you drop a limp spaghetti over them, the shape the spaghetti would 
take could be viewed as a smooth curve.
Shape of a Distribution: Skewness 
Is the histogram right skewed, left skewed, or symmetric? 
Histograms are said to be skewed to the side of the long tail.
Shape of a Distribution: 
Unusual Observations 
Are there any unusual observations or potential outliers?
Commonly observed shapes of 
distributions 
Modality 
Skewness
Variance 
Variance is roughly the average squared deviation from the mean. 
● The sample mean is 
and the sample size is n = 217. 
● The variance of amount of sleep 
students get per night can be 
calculated as:
Variance (cont.) 
Why do we use the squared deviation in the calculation of 
variance? 
● To get rid of negatives so that observations equally distant from 
the mean are weighed equally. 
● To weigh larger deviations more heavily.
Standard Deviation 
The standard deviation is the square root of the variance, and has the 
same units as the data. 
The standard deviation of amount 
of sleep students get per night 
can be calculated as: 
We can see that all of the data are within 3 standard deviations of the 
mean.
Median 
The median is the value that splits the data in half when ordered in 
ascending order. 
If there are an even number of observations, then the median is the 
average of the two values in the middle. 
Since the median is the midpoint of the data, 50% of the values are below 
it. Hence, it is also the 50th percentile.
Q1, Q3, and IQR 
● The 25th percentile is also called the first quartile, Q1. 
● The 50th percentile is also called the median. 
● The 75th percentile is also called the third quartile, Q3. 
Between Q1 and Q3 is the middle 50% of the data. The range 
these data span is called the interquartile range, or the IQR. 
IQR = Q3 - Q1
Box Plot 
The box in a box plot represents the middle 50% of the data, and 
the thick line in the box is the median.
Anatomy of a Box Plot
Whiskers and Outliers 
Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles. 
max upper whisker reach = Q3 + 1.5 x IQR 
max lower whisker reach = Q1 - 1.5 x IQR 
IQR: 20 - 10 = 10 
max upper whisker reach = 20 + 1.5 x 10 = 35 
max lower whisker reach = 10 - 1.5 x 10 = -5 
A potential outlier is defined as an observation beyond the maximum 
reach of the whiskers. It is an observation that appears extreme relative 
to the rest of the data. Why is it important to look for outliers? 
 Identify extreme skew in the distribution. 
 Identify data collection and entry errors. 
 Provide insight into interesting features of the data.
Extreme Observations 
How would sample statistics such as mean, median, SD, and IQR 
of household income be affected if the largest value was replaced 
with $10 million? What if the smallest value was replaced with 
$10 million?
Robust Statistics
Robust Statistics 
Median and IQR are more robust to skewness and outliers than 
mean and SD. Therefore, 
● for skewed distributions it is often more helpful to use median 
and IQR to describe the center and spread 
● for symmetric distributions it is often more helpful to use the 
mean and SD to describe the center and spread 
If you would like to estimate the typical household income for 
a student, would you be more interested in the mean or 
median income? 
Median
Mean vs. Median 
If the distribution is symmetric, center is often defined as the mean: 
mean ~ median 
If the distribution is skewed or has extreme outliers, center is often defined 
as the median 
● Right-skewed: mean > median 
● Left-skewed: mean < median
Extremely Skewed Data 
When data are extremely skewed, transforming them might make 
modeling easier. A common transformation is the log transformation. 
The histograms on the left shows the distribution of number of basketball 
games attended by students. The histogram on the right shows the 
distribution of log of number of games attended.
Pros and Cons of Transformations 
Skewed data are easier to model with when they are transformed because 
outliers tend to become far less prominent after an appropriate 
transformation. 
# of games 70 50 25 … 
# of games 4.25 3.91 3.22 ... 
However, results of an analysis might be difficult to interpret because the 
log of a measured variable is usually meaningless. 
What other variables would you expect to be extremely skewed? 
Salary, housing prices, etc.
Intensity Maps 
What patterns are apparent in the change in population between 
2000 and 2010? 
http://projects.nytimes.com/census/2010/map

More Related Content

Similar to Working with Numerical Data

Statistics for machine learning shifa noorulain
Statistics for machine learning   shifa noorulainStatistics for machine learning   shifa noorulain
Statistics for machine learning shifa noorulainShifaNoorUlAin1
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of datadrasifk
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAnand Thokal
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfHimakshi7
 
Summary statistics
Summary statisticsSummary statistics
Summary statisticsRupak Roy
 
CJ 301 – Measures of DispersionVariability Think back to the .docx
CJ 301 – Measures of DispersionVariability Think back to the .docxCJ 301 – Measures of DispersionVariability Think back to the .docx
CJ 301 – Measures of DispersionVariability Think back to the .docxmonicafrancis71118
 
These is info only ill be attaching the questions work CJ 301 – .docx
These is info only ill be attaching the questions work CJ 301 – .docxThese is info only ill be attaching the questions work CJ 301 – .docx
These is info only ill be attaching the questions work CJ 301 – .docxmeagantobias
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data DescriptionFarhan Alfin
 
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptx
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptxGraphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptx
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptxRanggaMasyhuriNuur
 
Data Representations
Data RepresentationsData Representations
Data Representationsbujols
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxMesfinMelese4
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1Ashwini Mathur
 
Measures of central tendency
Measures of central tendencyMeasures of central tendency
Measures of central tendencyrenukamorani143
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...What is Descriptive Statistics and How Do You Choose the Right One for Enterp...
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...Smarten Augmented Analytics
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive StatisticsBhagya Silva
 

Similar to Working with Numerical Data (20)

Introduction to Descriptive Statistics
Introduction to Descriptive StatisticsIntroduction to Descriptive Statistics
Introduction to Descriptive Statistics
 
Statistics for machine learning shifa noorulain
Statistics for machine learning   shifa noorulainStatistics for machine learning   shifa noorulain
Statistics for machine learning shifa noorulain
 
Statistics for ess
Statistics for essStatistics for ess
Statistics for ess
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of data
 
Statistics.pdf
Statistics.pdfStatistics.pdf
Statistics.pdf
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Data presenatation
Data presenatationData presenatation
Data presenatation
 
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdfGraphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
Graphicalrepresntationofdatausingstatisticaltools2019_210902_105156.pdf
 
Summary statistics
Summary statisticsSummary statistics
Summary statistics
 
CJ 301 – Measures of DispersionVariability Think back to the .docx
CJ 301 – Measures of DispersionVariability Think back to the .docxCJ 301 – Measures of DispersionVariability Think back to the .docx
CJ 301 – Measures of DispersionVariability Think back to the .docx
 
These is info only ill be attaching the questions work CJ 301 – .docx
These is info only ill be attaching the questions work CJ 301 – .docxThese is info only ill be attaching the questions work CJ 301 – .docx
These is info only ill be attaching the questions work CJ 301 – .docx
 
Ch2 Data Description
Ch2 Data DescriptionCh2 Data Description
Ch2 Data Description
 
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptx
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptxGraphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptx
Graphical Presentation of Data - Rangga Masyhuri Nuur LLU 27.pptx
 
Data Representations
Data RepresentationsData Representations
Data Representations
 
m2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptxm2_2_variation_z_scores.pptx
m2_2_variation_z_scores.pptx
 
R for statistics session 1
R for statistics session 1R for statistics session 1
R for statistics session 1
 
Measures of central tendency
Measures of central tendencyMeasures of central tendency
Measures of central tendency
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...What is Descriptive Statistics and How Do You Choose the Right One for Enterp...
What is Descriptive Statistics and How Do You Choose the Right One for Enterp...
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 

More from Global Polis

Binomial distribution
Binomial distributionBinomial distribution
Binomial distributionGlobal Polis
 
Normal distribution
Normal distributionNormal distribution
Normal distributionGlobal Polis
 
Bayeasian inference
Bayeasian inferenceBayeasian inference
Bayeasian inferenceGlobal Polis
 
Conditional probability, and probability trees
Conditional probability, and probability treesConditional probability, and probability trees
Conditional probability, and probability treesGlobal Polis
 
Introduction to probability
Introduction to probabilityIntroduction to probability
Introduction to probabilityGlobal Polis
 
Observational Studies and Sampling
Observational Studies and Sampling Observational Studies and Sampling
Observational Studies and Sampling Global Polis
 
Experimental Design and Blocking
Experimental Design and BlockingExperimental Design and Blocking
Experimental Design and BlockingGlobal Polis
 
Data, Variables, Hypothesis
Data, Variables, Hypothesis Data, Variables, Hypothesis
Data, Variables, Hypothesis Global Polis
 
Data Collection and Sampling
Data Collection and SamplingData Collection and Sampling
Data Collection and SamplingGlobal Polis
 
Gender Discrimination: A case study
Gender Discrimination: A case studyGender Discrimination: A case study
Gender Discrimination: A case studyGlobal Polis
 

More from Global Polis (11)

Binomial distribution
Binomial distributionBinomial distribution
Binomial distribution
 
Normal distribution
Normal distributionNormal distribution
Normal distribution
 
Bayeasian inference
Bayeasian inferenceBayeasian inference
Bayeasian inference
 
Conditional probability, and probability trees
Conditional probability, and probability treesConditional probability, and probability trees
Conditional probability, and probability trees
 
Introduction to probability
Introduction to probabilityIntroduction to probability
Introduction to probability
 
Observational Studies and Sampling
Observational Studies and Sampling Observational Studies and Sampling
Observational Studies and Sampling
 
Experimental Design and Blocking
Experimental Design and BlockingExperimental Design and Blocking
Experimental Design and Blocking
 
Data, Variables, Hypothesis
Data, Variables, Hypothesis Data, Variables, Hypothesis
Data, Variables, Hypothesis
 
Data Collection and Sampling
Data Collection and SamplingData Collection and Sampling
Data Collection and Sampling
 
Categorical Data
Categorical DataCategorical Data
Categorical Data
 
Gender Discrimination: A case study
Gender Discrimination: A case studyGender Discrimination: A case study
Gender Discrimination: A case study
 

Recently uploaded

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 

Recently uploaded (20)

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 

Working with Numerical Data

  • 1. + Working with Numerical Data Slides edited by Valerio Di Fonzo for www.globalpolis.org Based on the work of Mine Çetinkaya-Rundel of OpenIntro The slides may be copied, edited, and/or shared via the CC BY-SA license Some images may be included under fair use guidelines (educational purposes)
  • 2. Scatterplot Scatterplots are useful for visualizing the relationship between two numerical variables. Do life expectancy and total fertility appear to be associated or independent? They appear to be linearly and negatively associated: as fertility increases, life expectancy decreases. Was the relationship the same throughout the years, or did it change? The relationship changed over the years. http://www.gapminder.org/world
  • 3. Dot Plots Useful for visualizing one numerical variable. Darker colors represent areas where there are more observations. How would you describe the distribution of GPAs in this data set? Make sure to say something about the center, shape, and spread of the distribution.
  • 4. Dot Plots & Mean The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data. The mean GPA is 3.59.
  • 5. Mean The sample mean, denoted as , can be calculated as where x1, x2, ..., xn represent the n observed values. The population mean is also computed the same way but is denoted as μ. It is often not possible to calculate μ since population data are rarely available. The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
  • 6. Stacked Dot Plot Higher bars represent areas where there are more observations, makes it a little easier to judge the center and the shape of the distribution.
  • 7. Histograms ● Histograms provide a view of the data density. Higher bars represent where the data are relatively more common. ● Histograms are especially convenient for describing the shape of the data distribution. ● The chosen bin width can alter the story the histogram is telling.
  • 8. Bin Width The chosen bin width can alter the story the histogram is telling.
  • 9. Shape of a Distribution: Modality Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: In order to determine modality, step back and imagine a smooth curve over the histogram -- imagine that the bars are wooden blocks and you drop a limp spaghetti over them, the shape the spaghetti would take could be viewed as a smooth curve.
  • 10. Shape of a Distribution: Skewness Is the histogram right skewed, left skewed, or symmetric? Histograms are said to be skewed to the side of the long tail.
  • 11. Shape of a Distribution: Unusual Observations Are there any unusual observations or potential outliers?
  • 12. Commonly observed shapes of distributions Modality Skewness
  • 13. Variance Variance is roughly the average squared deviation from the mean. ● The sample mean is and the sample size is n = 217. ● The variance of amount of sleep students get per night can be calculated as:
  • 14. Variance (cont.) Why do we use the squared deviation in the calculation of variance? ● To get rid of negatives so that observations equally distant from the mean are weighed equally. ● To weigh larger deviations more heavily.
  • 15. Standard Deviation The standard deviation is the square root of the variance, and has the same units as the data. The standard deviation of amount of sleep students get per night can be calculated as: We can see that all of the data are within 3 standard deviations of the mean.
  • 16. Median The median is the value that splits the data in half when ordered in ascending order. If there are an even number of observations, then the median is the average of the two values in the middle. Since the median is the midpoint of the data, 50% of the values are below it. Hence, it is also the 50th percentile.
  • 17. Q1, Q3, and IQR ● The 25th percentile is also called the first quartile, Q1. ● The 50th percentile is also called the median. ● The 75th percentile is also called the third quartile, Q3. Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR. IQR = Q3 - Q1
  • 18. Box Plot The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
  • 19. Anatomy of a Box Plot
  • 20. Whiskers and Outliers Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles. max upper whisker reach = Q3 + 1.5 x IQR max lower whisker reach = Q1 - 1.5 x IQR IQR: 20 - 10 = 10 max upper whisker reach = 20 + 1.5 x 10 = 35 max lower whisker reach = 10 - 1.5 x 10 = -5 A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data. Why is it important to look for outliers?  Identify extreme skew in the distribution.  Identify data collection and entry errors.  Provide insight into interesting features of the data.
  • 21. Extreme Observations How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
  • 23. Robust Statistics Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, ● for skewed distributions it is often more helpful to use median and IQR to describe the center and spread ● for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income? Median
  • 24. Mean vs. Median If the distribution is symmetric, center is often defined as the mean: mean ~ median If the distribution is skewed or has extreme outliers, center is often defined as the median ● Right-skewed: mean > median ● Left-skewed: mean < median
  • 25. Extremely Skewed Data When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log transformation. The histograms on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended.
  • 26. Pros and Cons of Transformations Skewed data are easier to model with when they are transformed because outliers tend to become far less prominent after an appropriate transformation. # of games 70 50 25 … # of games 4.25 3.91 3.22 ... However, results of an analysis might be difficult to interpret because the log of a measured variable is usually meaningless. What other variables would you expect to be extremely skewed? Salary, housing prices, etc.
  • 27. Intensity Maps What patterns are apparent in the change in population between 2000 and 2010? http://projects.nytimes.com/census/2010/map