Fundamental Statistics
Lecture 2
1.3. Descriptive Statistics—Graphical
Data Representation
1.4. Descriptive Methods in Regression
Analysis
Masayo Hirose
Kyushu University
©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
Descriptive Statistics—
Graphical Data Representation
Data Representation
• Data example: heights (cm) of 100 males.
• It is difficult to get a clear picture from the raw quantitative data.
• To summarize them, the data are grouped into categories, or classes.
177.04 172.93 169.72 174.56 181.85 167.12 177.05 176.3 169.49 180.3 174.13 167.35 169.19 167.57 173.79
169.79 170.42 180.7 175.44 172.55 171.61 163.56 172.74 180.66 165.65 177.75 165.67 178.62 178.44 180.14
176.16 172.1 175.12 168.5 172.2 174.35 173.54 179.6 176.92 171.69 173.22 178.69 171.58 171.12 173.35
172.17 173.16 180.93 171.66 166.13 179.3 172.38 171.05 178.63 169.48 176.21 175.18 168.34 174.88 169.81
176.82 177.42 181.12 185.88 177.56 172.6 178.39 166.83 194.68 177.47 177.12 169.77 172.95 173.52 172.44
164.79 168.82 171.53 175.08 173.2 180.01 176.31 175.61 166.68 173.95 174.35 173.64 172.81 167.98 168.51
169.3 172.28 181.86 173.59 169.89 174.61 176.55 177.78 168.66 181.75
Grouping Data
• A frequency (distribution) table displays a frequency distribution.

Height (cm)   Frequency   Relative frequency   Midpoint
160~165       2           0.02                 162.5
165~170       23          0.23                 167.5
170~175       36          0.36                 172.5
175~180       27          0.27                 177.5
180~185       10          0.10                 182.5
185~190       1           0.01                 187.5
190~195       1           0.01                 192.5

Each class is written as "lower bound~upper bound".

Classes: Categories for grouping data.
Frequency: The number of observations that fall in a particular class.
Frequency distribution: A listing of all classes along with their frequencies.
Grouping Data
Relative frequency: The ratio of the frequency of a class to the total number of observations.
Relative frequency distribution: A listing of all the classes along with their relative frequencies.
Midpoint: The middle of a class (the average of its lower and upper bounds).
For example, the class 165~170 in the height table has relative frequency 23/100 = 0.23 and midpoint (165 + 170)/2 = 167.5.
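The definitions above can be sketched in a few lines of Python. This is only an illustration: it uses the first 15 of the 100 heights, the function name `frequency_table` is hypothetical, and it assumes that each class includes its lower bound and excludes its upper bound, which the slides do not state.

```python
# First 15 of the 100 heights (cm) from the data slide, used as a small sample.
heights = [177.04, 172.93, 169.72, 174.56, 181.85, 167.12, 177.05, 176.3,
           169.49, 180.3, 174.13, 167.35, 169.19, 167.57, 173.79]

def frequency_table(data, start, width, n_classes):
    """Return (lower, upper, frequency, relative frequency, midpoint) per class.

    Assumption: a class includes its lower bound and excludes its upper bound.
    """
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for v in data if lower <= v < upper)
        rows.append((lower, upper, freq, freq / len(data), (lower + upper) / 2))
    return rows

table = frequency_table(heights, start=160, width=5, n_classes=7)
for lower, upper, freq, rel, mid in table:
    print(f"{lower}~{upper}  {freq:2d}  {rel:.3f}  {mid}")
```

Because only 15 values are used, the frequencies differ from those in the table for all 100 heights, but the relative frequencies still sum to 1.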
Histogram
[Figure: histogram of the 100 heights; horizontal axis: Height (cm), from 160 to 195; vertical axis: Frequency, from 0 to 35]
(Frequency) Histogram: A graph that displays the classes on the horizontal axis and the frequencies of the classes on the vertical axis.
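As a rough text-only sketch of the same idea, the frequencies from the height table can be drawn as bars of `#` characters. Note that this sketch places the classes on the vertical axis and the frequencies along the horizontal direction, the reverse of the histogram on the slide; the function name `text_histogram` is hypothetical.

```python
# Classes and frequencies copied from the frequency table for the 100 heights.
classes = ["160~165", "165~170", "170~175", "175~180",
           "180~185", "185~190", "190~195"]
frequencies = [2, 23, 36, 27, 10, 1, 1]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class; the bar length equals the class frequency."""
    return [f"{label} | {symbol * count} {count}"
            for label, count in zip(labels, counts)]

for line in text_histogram(classes, frequencies):
    print(line)
```

The printed bars show a single peak in the class 170~175 and a longer tail toward the taller classes.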
Histogram
• We can check the shape of the frequency distribution. Is it:
  - Unimodal (one peak)?
  - Bimodal or multimodal?
  - Symmetric? If so, is it bell-shaped or uniform?
  - Skewed? If so, is it skewed to the right or to the left (does it have a long right or left tail)?
[Figure: the histogram of the 100 heights; horizontal axis: Height (cm), from 160 to 195; vertical axis: Frequency, from 0 to 35]
One Slide from Lecture 1:
Interpretation of Standard Deviation
• Three-standard-deviations rule:
  - The interval $(\bar{x} - 3S, \bar{x} + 3S)$ usually includes almost all the observations in any data set.
• If a data set is obtained by sampling a particular type of population* (keyword: normal distribution*), then, for large $N$:
  - The interval $(\bar{x} - S, \bar{x} + S)$ usually includes about 68% of the observations.
  - The interval $(\bar{x} - 1.96S, \bar{x} + 1.96S)$ usually includes about 95% of the observations.
  - The interval $(\bar{x} - 3S, \bar{x} + 3S)$ usually includes about 99.7% of the observations.
*"Population" and "normal distribution" will be explained in another lecture.
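The rule can be checked on data with a short Python sketch. It uses the first 15 of the 100 heights as a small sample, so the percentages will not match 68%, 95%, and 99.7% closely. The definition of $S$ from Lecture 1 is not repeated here, so computing it with divisor $N$ (`statistics.pstdev`) is an assumption, and `fraction_within` is a hypothetical helper name.

```python
from statistics import mean, pstdev

# First 15 of the 100 heights (cm) from the data slide, used as a small sample.
heights = [177.04, 172.93, 169.72, 174.56, 181.85, 167.12, 177.05, 176.3,
           169.49, 180.3, 174.13, 167.35, 169.19, 167.57, 173.79]

def fraction_within(data, k):
    """Fraction of observations strictly inside (mean - k*S, mean + k*S)."""
    center = mean(data)
    s = pstdev(data)  # assumption: S is computed with divisor N
    return sum(1 for v in data if abs(v - center) < k * s) / len(data)

fractions = {k: fraction_within(heights, k) for k in (1, 1.96, 3)}
for k, frac in fractions.items():
    print(f"within {k} SD: {frac:.3f}")
```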
Question
Please look at the following histogram.
Do you think the mean is appropriate as a measure of center?
[Figure: histogram with a horizontal axis from 0 to 1000 and a vertical axis (Frequency) from 0 to 40; the distribution is skewed to the right]
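A small numerical experiment may help with this question. The values below are hypothetical (they are not the data behind the histogram): most are small and one is very large, which gives a long right tail.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is very large.
values = [10, 20, 30, 40, 1000]

print("mean:  ", mean(values))    # pulled toward the long right tail
print("median:", median(values))  # stays with the bulk of the data
```

Here the mean is 220 while the median is 30, so the mean lies far from where most of the observations are.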
Other Graphical Data Representations
• Histograms are designed for use with quantitative data.
• What about qualitative data?
  - They can also be summarized by a frequency distribution, e.g., Table 2.15 in Weiss's book (shown below).
• For displaying qualitative data, pie charts and bar graphs are two common methods.

Party        Frequency   Relative frequency
Democratic   13          0.325
Republican   18          0.450
Other        9           0.225
Total        40          1.000
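The relative frequencies in this table can be reproduced from the frequencies alone. A minimal sketch, with variable names chosen only for illustration:

```python
# Frequencies from the party-affiliation table.
counts = {"Democratic": 13, "Republican": 18, "other": 9}

total = sum(counts.values())
relative = {party: n / total for party, n in counts.items()}

for party, n in counts.items():
    print(f"{party:<12}{n:>4}{relative[party]:>8.3f}")
print(f"{'Total':<12}{total:>4}{sum(relative.values()):>8.3f}")
```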
Pie Chart
Pie chart: a circular statistical graph divided into slices whose sizes are proportional to the relative frequencies.
[Figure: pie chart of political party affiliation with slices for Democratic, Republican, and other]
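"Proportional to the relative frequencies" means that each slice takes the same share of the 360 degrees of the circle as its class takes of the observations. A short sketch using the party frequencies (the variable names are illustrative):

```python
# Frequencies from the party-affiliation table.
counts = {"Democratic": 13, "Republican": 18, "other": 9}
total = sum(counts.values())

# Each slice's central angle is its relative frequency times 360 degrees.
angles = {party: 360 * n / total for party, n in counts.items()}

for party, angle in angles.items():
    print(f"{party}: {angle:.1f} degrees")
```

The three angles are 117, 162, and 81 degrees, which add up to the full circle.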
Bar Graph
Bar graph: a graph that uses bars to display the frequencies or relative frequencies of the classes on the vertical axis.
[Figure: two bar graphs for the classes Democratic, Republican, and other; the vertical axis shows Frequency (0 to 15) in one and Relative frequency (0.0 to 0.4) in the other]
Scatterplot
• A scatterplot can visualize the relationship between two variables of interest.
[Figure: scatterplot of $y$ against $x$]
Are these variables related? If so, how are they related?
Linear Correlation
[Figure: three scatterplots of $y$ against $x$, labeled Case 1, Case 2, and Case 3]
For example, consider the relationship between:
・Height and test score.
・Height and weight.
・Height and the failure rate of volleyball serves.
Linear Correlation
• Case 1: no linear correlation.
  - As $x$ increases, $y$ does not tend to increase or decrease linearly.
• Case 2: positive linear correlation.
  - As $x$ increases, $y$ tends to increase linearly.
• Case 3: negative linear correlation.
  - As $x$ increases, $y$ tends to decrease linearly.
[Figure: scatterplots of $y$ against $x$ for the three cases]
Linear Correlation Coefficient $\rho$
• Linear correlation coefficient:
  - A descriptive measure of the strength of the linear relationship between two variables.
• For a data set $(x_1, y_1), \dots, (x_N, y_N)$, the linear correlation coefficient is defined by

$$
\rho
= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
= \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2}}
$$

First expression: the numerator is the sum of products of deviations, and the denominator is the square root of the sums of squared deviations.
Second expression: the numerator is the covariance of $x$ and $y$, and the denominator is the SD of $x$ times the SD of $y$.
Property: $-1 \le \rho \le 1$
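The definition translates directly into code. The sketch below follows the first expression (sums of deviations); the data are hypothetical hours of study and test scores for five students, and the function name `correlation` is chosen only for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Linear correlation coefficient computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / (sqrt(sum_xx) * sqrt(sum_yy))

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 82]

rho = correlation(hours, scores)
print(round(rho, 3))
```

For these data the sum of products of deviations is 70 and the two sums of squared deviations are 10 and 524, so $\rho = 70 / \sqrt{5240}$, a value close to 1.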
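Covariance?
• Covariance is defined as:

$$
S_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
$$

The mean of $(x_i - \bar{x})(y_i - \bar{y})$ is positive, i.e., $S_{xy} > 0$
⇔ the linear correlation coefficient $\rho > 0$
⇔ positive linear correlation.

The first equivalence holds because the denominator of

$$
\rho = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
            {\sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2}}
$$

is positive.
[Figure: scatterplot with a second set of axes centered at the point $(\bar{x}, \bar{y})$; the points show a positive linear correlation, $0 < \rho < 1$]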
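Covariance?

$$
S_{xy} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
$$

• $S_{xy} < 0$ ⇔ negative linear correlation
• $S_{xy} \approx 0$ ⇔ no linear correlation tendency
[Figure: two scatterplots with axes centered at the point $(\bar{x}, \bar{y})$; one shows a negative linear correlation, $-1 < \rho < 0$, and the other shows no linear tendency, $\rho \approx 0$]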
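The sign of the covariance can be seen on three small data sets. All values below are hypothetical and were chosen only so that the three cases are easy to recognize; `covariance` is an illustrative function name and uses the divisor $N$, as in the definition above.

```python
def covariance(xs, ys):
    """Covariance S_xy with divisor N."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Hypothetical data sets illustrating the three cases.
xs = [1, 2, 3, 4, 5]
increasing = [2, 4, 5, 4, 6]  # y tends to increase with x
decreasing = [6, 4, 5, 4, 2]  # y tends to decrease with x
unrelated = [4, 2, 5, 2, 4]   # no linear tendency

for name, ys in [("increasing", increasing),
                 ("decreasing", decreasing),
                 ("unrelated", unrelated)]:
    print(name, round(covariance(xs, ys), 3))
```

Points in the upper-right and lower-left quadrants around $(\bar{x}, \bar{y})$ contribute positive products, and points in the other two quadrants contribute negative products; the covariance is the average of these contributions.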
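Linear Correlation Coefficient $\rho$
• If all the data points lie on a straight line whose slope is not zero, then $y_i = a x_i + b$ for all $i = 1, \dots, N$ with $a \neq 0$. The intercept $b$ cancels in the deviations, because $\bar{y} = a\bar{x} + b$ and hence $y_i - \bar{y} = a x_i - a\bar{x}$.
• In this case, the linear correlation coefficient reduces to

$$
\rho
= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
       {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(a x_i - a\bar{x})}
       {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (a x_i - a\bar{x})^2}}
= \frac{a}{|a|}
= \begin{cases} 1 & (a > 0) \\ -1 & (a < 0) \end{cases}
$$

[Figure: two scatterplots of $y$ against $x$ with all points on a straight line; an increasing line gives $\rho = 1$ and a decreasing line gives $\rho = -1$]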
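This special case can be confirmed numerically. In the sketch below the $x$ values, the slopes, and the intercept 3 are all hypothetical; the intercept is included to show that it drops out of the deviations and does not affect the result. Floating-point arithmetic may give values that differ from 1 and -1 in the last digits.

```python
from math import sqrt

def correlation(xs, ys):
    """Linear correlation coefficient computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / (sqrt(sum_xx) * sqrt(sum_yy))

xs = [1.0, 2.0, 4.0, 7.0]
rising = [2.0 * x + 3.0 for x in xs]    # hypothetical line with slope 2 > 0
falling = [-0.5 * x + 3.0 for x in xs]  # hypothetical line with slope -0.5 < 0

print(correlation(xs, rising), correlation(xs, falling))
```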
Linear Correlation Coefficient $\rho$
• The linear correlation coefficient can help us understand the shape of a distribution, but not always. Hence, also check the scatterplot visually.
[Figure: six scatterplots with correlation coefficients $\rho = 1$, $\rho = 0.7$, $\rho = 0.3$, $\rho = 0$, $\rho = -0.3$, and $\rho = -0.7$]
Attention!!: Correlation Represents a
Linear Association, not Causality!
• For example, consider the linear correlation between chocolate consumption per capita and the number of Nobel laureates.
"The principal finding of this study is a surprisingly powerful correlation between chocolate intake per capita and the number of Nobel laureates in various countries." (Messerli, 2012)
Linear correlation coefficient: $\rho = 0.791$ (a significant linear correlation between chocolate consumption per capita and the number of Nobel laureates per 10 million people).
Should we eat more chocolate to win the Nobel Prize?!
Attention!!: Correlation Represents a
Linear Association, not Causality!
[Diagram: chocolate consumption and the number of Nobel laureates are linked by an association, not by causality]
"Of course, a correlation between X and Y does not prove causation but indicates that either X influences Y, Y influences X, or X and Y are influenced by a common underlying mechanism." (Messerli, 2012)
Descriptive Methods in Regression
Analysis
Regression
[Figure: scatterplot of test score against hours of study, with one point each for Mr. A, Mrs. B, Mr. C, Ms. D, and Mr. E]
(Potential) cause: hours of study. Result: test score.
Assumption: test scores increase as the number of hours of study increases.
How strong is this tendency?
How can we predict the approximate test score from the hours of study?
Linear Regression
[Figure: scatterplot of test score against hours of study]
・The data points show a tendency toward positive linear correlation.
・They are clustered around a straight line.
Prediction
If we fit a straight line to such data points…
[Figure: scatterplot of test score against hours of study with a fitted straight line; 3 hours is marked on the horizontal axis]
Prediction
… we could predict the test score from the number of hours of study using the straight line.
[Figure: the fitted straight line gives a predicted score of 65 at 3 hours of study]
Best Straight Line?
[Figure: scatterplot of test score against hours of study]
How can we choose the "best" straight line?
It is better to choose the straight line that best fits the set of data points.
Regression analysis requires choosing the best-fitting function.
Method for Choosing the “Best” Line
using Least Squares Criterion
[Figure: scatterplot of test score against hours of study with a straight line; the error of fitting is marked]
Least squares criterion: the best straight line is the one with the smallest sum of squared errors.
Method for Choosing the "Best" Line
using Least Squares Criterion
[Figure: test score $y$ against hours of study $x$ with the line $y = ax + b$; for a data point $(x_i, y_i)$, the corresponding point on the line is $(x_i, a x_i + b)$ and the squared error is $\{y_i - (a x_i + b)\}^2$]
According to the least squares criterion, we seek the $a$ and $b$ that minimize

$$
\sum_{i} \{ y_i - (a x_i + b) \}^2
$$

We call this the "method of least squares".
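To make the criterion concrete, the sum of squared errors can be computed for a few candidate lines and compared. The data and the three candidate lines below are hypothetical, and `sum_squared_errors` is an illustrative name.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum over i of (y_i - (a * x_i + b))^2 for the line y = a x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 82]

# Three hypothetical candidate lines, given as (slope a, intercept b).
candidates = [(5, 50), (7, 44), (9, 38)]
sse_values = [sum_squared_errors(hours, scores, a, b) for a, b in candidates]

for (a, b), sse in zip(candidates, sse_values):
    print(f"y = {a}x + {b}: sum of squared errors = {sse}")
```

Among these three candidates, the line $y = 7x + 44$ has the smallest sum of squared errors (34, against 74 for the other two), so the criterion prefers it.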
Method of Least Squares
• Given a data set $(x_1, y_1), \dots, (x_N, y_N)$, the method of least squares provides the following $\hat{a}$ and $\hat{b}$:

$$
\hat{a} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
\hat{b} = \bar{y} - \hat{a}\,\bar{x}
$$

Regression line: the straight line that has the smallest sum of squared errors.
Regression equation: the equation of the regression line,

$$
\hat{y} = \hat{a} x + \hat{b}
$$
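These two formulas are enough to fit a line and make a prediction. In the sketch below the data are hypothetical; they were chosen so that the prediction at 3 hours of study equals 65, the value shown on the Prediction slide, but they are not the data behind that slide. The function name `least_squares` is illustrative.

```python
def least_squares(xs, ys):
    """Return (a_hat, b_hat) for the regression line y = a_hat * x + b_hat."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    a_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 82]

a_hat, b_hat = least_squares(hours, scores)
predicted = a_hat * 3 + b_hat  # predicted test score for 3 hours of study
print(a_hat, b_hat, predicted)
```

Here $\hat{a} = 70/10 = 7$ and $\hat{b} = 65 - 7 \times 3 = 44$. Since $\hat{b} = \bar{y} - \hat{a}\bar{x}$, the regression line always passes through the point $(\bar{x}, \bar{y})$, which is $(3, 65)$ for these data.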
Coefficient of Determination
How valuable is the regression equation?
How well does the regression equation fit?
[Figure: scatterplot of test score against hours of study]
Coefficient of Determination
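One way is to determine the percentage of the variation in the observed values of the variable that is explained by the regression.
[Figure: test score against hours of study with the regression line and the mean $\bar{y}$; for an observed point $(x_i, y_i)$ and the point $(x_i, \hat{y}_i)$ on the regression line, the deviation of the observed $y$ value from the mean is marked (refer to Weiss's book)]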
Coefficient of Determination
[Figure: the same plot, with the deviation of $y_i$ from the mean $\bar{y}$ split into the deviation explained by the regression, $\hat{y}_i - \bar{y}$, and the deviation not explained by the regression, $y_i - \hat{y}_i$ (refer to Weiss's book)]
Coefficient of Determination
[Figure: test score with the regression line, the mean $\bar{y}$, an observed point $(x_i, y_i)$, and the point $(x_i, \hat{y}_i)$ on the regression line]
In fact, the following equation holds:

$$
\sum_{i=1}^{N} (y_i - \bar{y})^2
= \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
+ \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2
$$
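Coefficient of Determination

$$
\underbrace{\sum_{i=1}^{N} (y_i - \bar{y})^2}_{\text{Total sum of squares (SST)}}
= \underbrace{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}_{\text{Error sum of squares (SSE)}}
+ \underbrace{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}_{\text{Regression sum of squares (SSR)}}
$$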
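The decomposition can be checked numerically on a small example. The data below are hypothetical, and the helper name `fit_line` is illustrative; it applies the least squares formulas for the slope and intercept.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for the line y = a x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    a_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return a_hat, y_bar - a_hat * x_bar

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 82]

a_hat, b_hat = fit_line(hours, scores)
fitted = [a_hat * x + b_hat for x in hours]  # points on the regression line
y_bar = sum(scores) / len(scores)

sst = sum((y - y_bar) ** 2 for y in scores)              # total
sse = sum((y - f) ** 2 for y, f in zip(scores, fitted))  # error
ssr = sum((f - y_bar) ** 2 for f in fitted)              # regression
print(sst, sse, ssr)
```

For these data SST = 524, SSE = 34, and SSR = 490, and indeed 34 + 490 = 524. The identity relies on the line being the least squares line; for an arbitrary line the two parts need not add up to SST.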
Coefficient of Determination
• Coefficient of determination $R^2$: the proportion of the variation in the observed values of the variable that is explained by the regression. We also call this "R squared".

$$
R^2 = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = \frac{SSR}{SST}
$$

• $R^2$ lies between 0 and 1.
  - An $R^2$ value near 0 ⇒ the regression equation does not fit well.
  - An $R^2$ value near 1 ⇒ the regression equation fits well.
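A short sketch of the computation, again on hypothetical data. It also prints the square of the linear correlation coefficient: for a straight line fitted by least squares, $R^2$ equals $\rho^2$, a standard fact that is not stated on the slides.

```python
from math import sqrt

# Hypothetical data: hours of study and test scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 82]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
s_xx = sum((x - x_bar) ** 2 for x in hours)
s_yy = sum((y - y_bar) ** 2 for y in scores)  # this is SST

a_hat = s_xy / s_xx            # least squares slope
b_hat = y_bar - a_hat * x_bar  # least squares intercept
fitted = [a_hat * x + b_hat for x in hours]

ssr = sum((f - y_bar) ** 2 for f in fitted)
r_squared = ssr / s_yy
rho = s_xy / sqrt(s_xx * s_yy)

print(round(r_squared, 3), round(rho ** 2, 3))
```

Here $R^2 = 490/524$, so under these hypothetical data the regression explains most of the variation in the test scores.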
References
• Messerli, F. H. (2012). Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 367, 1562-1564.
• Weiss, N. A. (1999). Elementary Statistics, 4th ed. Addison Wesley.
• http://mdsc.kyushu-u.ac.jp/lectures
Acknowledgements:
We would like to thank Editage (www.editage.com) for
English language editing.
©2019 Education and Research Center for Mathematical and Data Science, Kyushu University

More Related Content

Recently uploaded

AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxDwiAyuSitiHartinah
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxVenkatasubramani13
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationGiorgio Carbone
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Guido X Jansen
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityAggregage
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructuresonikadigital1
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.JasonViviers2
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerPavel Šabatka
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best PracticesDataArchiva
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 

Recently uploaded (17)

AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual intervention
 
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptxTINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
TINJUAN PEMROSESAN TRANSAKSI DAN ERP.pptx
 
Mapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptxMapping the pubmed data under different suptopics using NLP.pptx
Mapping the pubmed data under different suptopics using NLP.pptx
 
Master's Thesis - Data Science - Presentation
Master's Thesis - Data Science - PresentationMaster's Thesis - Data Science - Presentation
Master's Thesis - Data Science - Presentation
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
 
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for ClarityStrategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
Strategic CX: A Deep Dive into Voice of the Customer Insights for Clarity
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?
 
ChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics InfrastructureChistaDATA Real-Time DATA Analytics Infrastructure
ChistaDATA Real-Time DATA Analytics Infrastructure
 
YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.YourView Panel Book.pptx YourView Panel Book.
YourView Panel Book.pptx YourView Panel Book.
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introduction
 
The Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayerThe Universal GTM - how we design GTM and dataLayer
The Universal GTM - how we design GTM and dataLayer
 
5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices5 Ds to Define Data Archiving Best Practices
5 Ds to Define Data Archiving Best Practices
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .ppt
 
Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023
 

Lecture_2.slide

  • 1. 1 Fundamental Statistics Lecture 2 1.3. Descriptive Statistics—Graphical Data Representation 1.4. Descriptive Methods in Regression Analysis Masayo Hirose Kyushu University ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 2. 2 Descriptive Statistics— Graphical Data Representation ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 3. 33 Data Representation  Data example: Height (cm) of 100 males  It is difficult to get a clear picture from this quantitative data.  Data are grouped into categories or classes. 177.04 172.93 169.72 174.56 181.85 167.12 177.05 176.3 169.49 180.3 174.13 167.35 169.19 167.57 173.79 169.79 170.42 180.7 175.44 172.55 171.61 163.56 172.74 180.66 165.65 177.75 165.67 178.62 178.44 180.14 176.16 172.1 175.12 168.5 172.2 174.35 173.54 179.6 176.92 171.69 173.22 178.69 171.58 171.12 173.35 172.17 173.16 180.93 171.66 166.13 179.3 172.38 171.05 178.63 169.48 176.21 175.18 168.34 174.88 169.81 176.82 177.42 181.12 185.88 177.56 172.6 178.39 166.83 194.68 177.47 177.12 169.77 172.95 173.52 172.44 164.79 168.82 171.53 175.08 173.2 180.01 176.31 175.61 166.68 173.95 174.35 173.64 172.81 167.98 168.51 169.3 172.28 181.86 173.59 169.89 174.61 176.55 177.78 168.66 181.75 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 4. 44 Grouping Data  Frequency (distribution) table, which displays Frequency distribution. Height Frequency Relative frequency Midpoint 160~165 2 0.02 162.5 165~170 23 0.23 167.5 170~175 36 0.36 172.5 175~180 27 0.27 177.5 180~185 10 0.1 182.5 185~190 1 0.01 187.5 190~195 1 0.01 192.5 Lower bound Upper bound classes Classes: Categories for grouping data. Frequency: The number of observations that fall in a particular class. Frequency distribution: A listing of all classes along with their frequencies. ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 5. 55 Grouping Data Height Frequency Relative frequency Midpoint 160~165 2 0.02 162.5 165~170 23 0.23 167.5 170~175 36 0.36 172.5 175~180 27 0.27 177.5 180~185 10 0.1 182.5 185~190 1 0.01 187.5 190~195 1 0.01 192.5 classes Relative frequency: The ratio of the frequency of a class to the total number of observations. Relative frequency distribution: A listing of all the classes along with their relative frequencies. Midpoint: The middle of a class (average of its lower and upper bounds). ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 6. 66 Histogram Height (cm) Frequency 160 165 170 175 180 185 190 195 05101520253035 (Frequency) Histogram: A graph that displays the classes on the horizontal axis and the frequencies of the classes on the vertical axis. ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 7. 77 Histogram  We can check the shape of the frequency distribution—is it:  Unimodal? (has one peak)  Bimodal? Multimodal?  Symmetric? ⇒ Bell shape? Uniform?  Skewed? ⇒ Skewed to the right or left? (has long right or left tail?) Height (cm) Frequency 160 165 170 175 180 185 190 195 05101520253035 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 8. 88 One Slide from Lecture 1: Interpretation of Standard Deviation  Three-standard-deviations rule:  The interval ( ̅𝑥𝑥 − 3𝑆𝑆, ̅𝑥𝑥 + 3𝑆𝑆) usually includes almost all the observations in any data set.  If a data set is obtained by sampling a particular type of population* (keyword: normal distribution*), then:  The interval ( ̅𝑥𝑥 − 𝑆𝑆, ̅𝑥𝑥 + 𝑆𝑆) usually includes about 68% of the observations with large N.  The interval ( ̅𝑥𝑥 − 1.96𝑆𝑆, ̅𝑥𝑥 + 1.96𝑆𝑆) usually includes about 95% of the observations with large N.  The interval ( ̅𝑥𝑥 − 3𝑆𝑆, ̅𝑥𝑥 + 3𝑆𝑆) usually includes about 99.7% of the observations with large N. *”Population” and “normal distribution” will be explained in another lecture. ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 9. 99 Question Please observe the following histogram. What do you think: is “mean” appropriate as a measure of center? Frequency 0 200 400 600 800 1000 010203040 Skewed to the right ! ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 10. 1010 Other Graphical Data Representations  Histograms are designed for use with quantitative data.  For Qualitative data?  e.g., frequency distribution (Table 2.15 in Weiss’s book)  For displaying qualitative data, pie charts and bar graphs are two common methods. Party Frequency Relative frequency Democratic 13 0.325 Republican 18 0.450 other 9 0.225 40 1.000 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 11. 1111 Pie Chart Pie chart: a circular statistical graph divided into slices proportional to the relative frequencies. other Democratic Republican Pie chart of Political Party affiliation ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 12. 12 Bar Graph Democratic Republican other Frequency 051015 Democratic Republican other RelativeFrequency 0.00.10.20.30.4 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University Bar graph: a graph using bar that displays the (relative) frequencies of the classes on the vertical axis.
  • 13. 1313 Scatterplot  A scatterplot can visualize the relationship between two variables of interest. 𝑥𝑥 𝑦𝑦 Are these variables related? If so, how are they related? ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 14. 14 Linear Correlation 𝑥𝑥 𝑦𝑦 𝑥𝑥 𝑥𝑥 Case 1 Case 2 Case 3 For example, the relationship between: ・Height and Test score. ・Height and Weight. ・Height and Failure rate of Volleyball’s serve. 𝑦𝑦 𝑦𝑦 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 15. 1515 Linear Correlation  Case 1: no linear correlation:  As 𝑥𝑥 increases, y does not tend to increase or decrease linearly.  Case 2: positive linear correlation:  As 𝑥𝑥 increases, y tends to increase linearly.  Case 3: negative linear correlation:  As 𝑥𝑥 increases, y tends to decrease linearly. 𝑥𝑥 𝑦𝑦 𝑥𝑥 𝑦𝑦 𝑥𝑥 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 16. 1616 Linear Correlation Coefficient 𝜌𝜌  Linear correlation coefficient  Descriptive measure of the strength of the linear relationship between two variables.  For a data set (𝑥𝑥1, 𝑦𝑦1), ⋯ , (𝑥𝑥𝑁𝑁, 𝑦𝑦𝑁𝑁), linear correlation coefficient is defined by 𝜌𝜌 = ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)(𝑦𝑦𝑖𝑖 − �𝑦𝑦) ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)2 ∑𝑖𝑖=1 𝑁𝑁 (𝑦𝑦𝑖𝑖 − 𝑦𝑦)2 = 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖− ̅𝑥𝑥)(𝑦𝑦𝑖𝑖− �𝑦𝑦) 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖− ̅𝑥𝑥)2 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑦𝑦𝑖𝑖−𝑦𝑦)2 Numerator: Covariance of x and y Denominator: SD of x times SD of y Numerator: Sum of products of deviation Denominator: Squared root of sum of squared deviations Property: −𝟏𝟏 ≤ 𝝆𝝆 ≤ 𝟏𝟏 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 17. 1717 Covariance?  Covariance is defined as: 𝑆𝑆𝑥𝑥𝑥𝑥 = 1 𝑁𝑁 � 𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)(𝑦𝑦𝑖𝑖 − �𝑦𝑦) Mean of 𝑥𝑥𝑖𝑖 − ̅𝑥𝑥 𝑦𝑦𝑖𝑖 − �𝑦𝑦 > 0 ⇔ a linear correlation coefficient 𝜌𝜌 > 0 ⇔ positive linear correlation ( ̅𝑥𝑥, �𝑦𝑦) 𝜌𝜌 = 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)(𝑦𝑦𝑖𝑖 − �𝑦𝑦) 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)2 1 𝑁𝑁 ∑𝑖𝑖=1 𝑁𝑁 (𝑦𝑦𝑖𝑖 − 𝑦𝑦)2 Denominator is positive i.e., 𝑆𝑆𝑥𝑥𝑥𝑥>0 0 < 𝜌𝜌 < 1 Second set of axes centered at the point ( ̅𝑥𝑥, �𝑦𝑦) ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 18. 1818 Covariance? 𝑆𝑆𝑥𝑥𝑥𝑥 = 1 𝑁𝑁 � 𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)(𝑦𝑦𝑖𝑖 − �𝑦𝑦)  𝑆𝑆𝑥𝑥𝑥𝑥 < 0 ⇔ negative linear correlation  𝑆𝑆𝑥𝑥𝑥𝑥 ≈ 0 ⇔ no linear correlation tendency ( ̅𝑥𝑥, �𝑦𝑦) ( ̅𝑥𝑥, �𝑦𝑦) −1 < 𝜌𝜌 < 0 𝜌𝜌 ≈ 0 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 19. 1919 Linear Correlation Coefficient 𝜌𝜌  If the data points are on a linear line (slope is not zero), then it implies that 𝑦𝑦𝑖𝑖 = 𝑎𝑎𝑥𝑥𝑖𝑖 for all i=1,…N with 𝑎𝑎 ≠ 0  In this case, the linear correlation coefficient reduces to 𝜌𝜌 = ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)(𝑦𝑦𝑖𝑖 − ̅𝑦𝑦) ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖 − ̅𝑥𝑥)2 ∑𝑖𝑖=1 𝑁𝑁 (𝑦𝑦𝑖𝑖 − ̅𝑦𝑦)2 = ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖− ̅𝑥𝑥)(𝑎𝑎𝑎𝑎𝑖𝑖−𝑎𝑎 ̅𝑥𝑥) ∑𝑖𝑖=1 𝑁𝑁 (𝑥𝑥𝑖𝑖− ̅𝑥𝑥)2 ∑𝑖𝑖=1 𝑁𝑁 (𝑎𝑎𝑥𝑥𝑖𝑖−𝑎𝑎 ̅𝑥𝑥) 2 =� 1 𝛼𝛼 > 0 −1 𝛼𝛼 < 0 𝑥𝑥 𝑦𝑦 𝑥𝑥 𝑦𝑦 𝜌𝜌 = 1 𝜌𝜌 = −1 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 20. 2020 Linear Correlation Coefficient 𝜌𝜌  Linear correlation coefficient might help understand a distribution shape (not always!!Hence, check visually also) 相 関 係 数: 1 相 関 係 数: 0.7 相 関 係 数: 0.3 相 関 係 数: 0 相 関 係 数: -0.3 相 関 係 数: -0.7 𝜌𝜌 = 1 𝜌𝜌 = 0.7 𝜌𝜌 = 0.3 𝜌𝜌 = 0 𝜌𝜌 = −0.3 𝜌𝜌 = −0.7 ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 21. 2121 Attention!!: Correlation Represents a Linear Association, not Causality!  For e.g., linear correlation between chocolate consumption per capita and the number of Nobel laureates. Linear correlation coefficient 𝜌𝜌 = 0.791 For winning the Nobel Prize, we should eat more chocolate?! “The principal finding of this study is a surprisingly powerful correlation between chocolate intake per capita and the number of Nobel laureates in various countries.” (Messerli, 2012) (Significant linear correlation between chocolate consumption per capita and the number of Nobel laureates per 10 million people) ©2019 Education and Research Center for Mathematical and Data Science, Kyushu University
  • 22. Attention!!: Correlation Represents a Linear Association, not Causality! [Figure: chocolate consumption and the number of Nobel laureates, linked by an association, not causality.] “Of course, a correlation between X and Y does not prove causation but indicates that either X influences Y, Y influences X, or X and Y are influenced by a common underlying mechanism.” (Messerli, 2012)
  • 23. Descriptive Methods in Regression Analysis
  • 24. Regression. [Figure: scatter plot of test score against hours of study for Mr. A, Mrs. B, Mr. C, Ms. D, and Mr. E.] Hours of study is the (potential) cause and test score is the result. Assumption: test scores increase as the number of hours of study increases. How strong is this tendency? How can we predict the approximate test score from the hours of study?
  • 25. Linear Regression. The data points show a tendency toward positive linear correlation and are clustered around a straight line. [Figure: scatter plot of test score against hours of study.]
  • 26. Prediction. If we fit a straight line to such data points… [Figure: test score against hours of study with a fitted line; 3 hours is marked on the horizontal axis.]
  • 27. Prediction. … we could predict the test score given the number of hours of study using the straight line. [Figure: for 3 hours of study, the fitted line gives a predicted score of 65.]
  • 28. Best Straight Line? How can we choose the “best” straight line? It is better to choose the straight line that best fits the set of data points. Regression analysis requires choosing the best-fitting function. [Figure: test score against hours of study with several candidate lines.]
  • 29. Method for Choosing the “Best” Line using Least Squares Criterion. Least squares criterion: the best straight line is the one with the smallest sum of squared errors. [Figure: test score against hours of study; the error of fitting is the vertical distance between a data point and the line.]
  • 30. Method for Choosing the “Best” Line using Least Squares Criterion. For a candidate line $y = ax + b$, the squared error at a data point $(x_i, y_i)$ is $(y_i - (a x_i + b))^2$, where $(x_i, a x_i + b)$ is the point on the line at $x_i$. According to the least squares criterion, we seek the $a$ and $b$ that minimize $\sum_{i}(y_i - (a x_i + b))^2$. We call this the “method of least squares”. [Figure: test score $y$ against hours of study $x$.]
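The minimizing $a$ and $b$ can be found by setting the partial derivatives of the sum of squared errors to zero. The following is a sketch of that standard calculus argument, not a derivation given on the slides; the symbol $E(a, b)$ is introduced here only for convenience, and $\sum_{i=1}^{N}(x_i - \bar{x})^2 > 0$ is assumed.

```latex
\begin{aligned}
E(a, b) &= \sum_{i=1}^{N} \bigl(y_i - (a x_i + b)\bigr)^2 \\
\frac{\partial E}{\partial b} &= -2 \sum_{i=1}^{N} \bigl(y_i - a x_i - b\bigr) = 0
  \;\Longrightarrow\; b = \bar{y} - a \bar{x} \\
\frac{\partial E}{\partial a} &= -2 \sum_{i=1}^{N} x_i \bigl(y_i - a x_i - b\bigr) = 0
  \;\Longrightarrow\; \sum_{i=1}^{N} x_i \bigl((y_i - \bar{y}) - a (x_i - \bar{x})\bigr) = 0 \\
&\;\Longrightarrow\; a = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}
\end{aligned}
```

The last implication uses $\sum_{i=1}^{N} \bar{x}\bigl((y_i - \bar{y}) - a(x_i - \bar{x})\bigr) = 0$, so $x_i$ may be replaced by $x_i - \bar{x}$ in the preceding sum.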
  • 31. Method of Least Squares. Given a data set $(x_1, y_1), \dots, (x_N, y_N)$, the method of least squares provides the following $a$ and $b$: $\hat{a} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$ and $\hat{b} = \bar{y} - \hat{a}\bar{x}$. Regression line: the straight line having the smallest sum of squared errors. Regression equation: the equation of the regression line, $\hat{y} = \hat{a}x + \hat{b}$.
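The closed-form expressions for $\hat{a}$ and $\hat{b}$ translate directly into code. The sketch below is one possible plain-Python version; the function name `fit_line` and the data are hypothetical and are not the data behind the slides' figures.

```python
def fit_line(xs, ys):
    """Return (a_hat, b_hat) of the least squares line y_hat = a_hat * x + b_hat."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a_hat = s_xy / s_xx              # slope
    b_hat = y_bar - a_hat * x_bar    # intercept
    return a_hat, b_hat

# Hypothetical data, invented for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 65.0, 71.0, 74.0]

a_hat, b_hat = fit_line(hours, scores)

# Predict the test score for 3 hours of study from the regression equation.
predicted = a_hat * 3.0 + b_hat
print(a_hat, b_hat, predicted)
```

Note that the regression line always passes through $(\bar{x}, \bar{y})$, because $\hat{b} = \bar{y} - \hat{a}\bar{x}$.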
  • 32. Coefficient of Determination. How valuable is the regression equation? How well does the regression equation fit? [Figure: test score against hours of study.]
  • 33. Coefficient of Determination. One way is to determine the percentage of variation in the observed values of the variable that is explained by the regression. [Figure: test score against hours of study, showing the regression line, the mean $\bar{y}$, a data point $(x_i, y_i)$, and the point $(x_i, \hat{y}_i)$ on the regression line; the deviation of the observed $y$ value from the mean is marked.] (Refer to Weiss’s book.)
  • 34. Coefficient of Determination. One way is to determine the percentage of variation in the observed values of the variable that is explained by the regression. [Figure: the deviation of $y_i$ from the mean $\bar{y}$ is split at the regression line into the deviation explained by regression, $\hat{y}_i - \bar{y}$, and the deviation not explained by regression, $y_i - \hat{y}_i$.] (Refer to Weiss’s book.)
  • 35. Coefficient of Determination. One way is to determine the percentage of variation in the observed values of the variable that is explained by the regression. In fact, the following equation holds: $\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2$. [Figure: regression line, mean $\bar{y}$, data point $(x_i, y_i)$, and fitted point $(x_i, \hat{y}_i)$.]
  • 36. Coefficient of Determination. $\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2$, that is, total sum of squares (SST) = error sum of squares (SSE) + regression sum of squares (SSR).
  • 37. Coefficient of Determination. The coefficient of determination $R^2$ is the proportion of variation in the observed values of the variable that is explained by the regression: $R^2 = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} = \frac{SSR}{SST}$. We also call this "R squared". $R^2$ lies between 0 and 1. An $R^2$ value near 0 ⇒ the regression equation does not fit well. An $R^2$ value near 1 ⇒ the regression equation fits well.
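The decomposition SST = SSE + SSR and the resulting $R^2$ can be verified on a small example. The following sketch uses plain Python with hypothetical data; the function name `sums_of_squares` is an assumption introduced for illustration.

```python
def sums_of_squares(xs, ys):
    """Fit the least squares line and return (SST, SSE, SSR)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    a_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    b_hat = y_bar - a_hat * x_bar
    fitted = [a_hat * x + b_hat for x in xs]                 # y_hat_i on the line
    sst = sum((y - y_bar) ** 2 for y in ys)                  # total variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))      # not explained
    ssr = sum((f - y_bar) ** 2 for f in fitted)              # explained
    return sst, sse, ssr

# Hypothetical data, invented for illustration only.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 65.0, 71.0, 74.0]

sst, sse, ssr = sums_of_squares(hours, scores)
r_squared = ssr / sst
print(sst, sse, ssr, r_squared)
```

For a straight-line fit with an intercept, $R^2$ equals the square of the linear correlation coefficient $\rho$, which ties this measure back to the earlier slides.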
  • 38. Reference. Messerli, F.H. (2012). Chocolate consumption, cognitive function, and Nobel laureates. New England Journal of Medicine, 367, 1562-1564. Weiss, N.A. (1999). Elementary Statistics, 4th ed. Addison Wesley. http://mdsc.kyushu-u.ac.jp/lectures. Acknowledgements: We would like to thank Editage (www.editage.com) for English language editing.