Exploratory Data Analysis
Introduction
• Exploratory data analysis was promoted by John Tukey in 1977 to
encourage statisticians to examine their data sets visually and to formulate
hypotheses that could then be tested on data sets
• Exploratory data analysis (EDA) is an approach for analysing data to
summarize the main characteristics of variables in easy-to-understand
form, often with visual graphs, without using a statistical model or
having formulated a hypothesis
• EDA techniques are generally graphical. They include scatter plots,
stem-and-leaf plots, box plots, histograms, quantile plots, residual
plots, and mean plots
• Exploratory data analysis is generally cross-classified in two ways.
First, each method is either non-graphical or graphical. And second,
each method is either univariate or multivariate (usually just bivariate)
• EDA offers several techniques to comprehend data
• But EDA is more than a library of data analysis techniques
• EDA is an approach to data analysis
• EDA involves inspecting data without any assumptions
– Mostly using information graphics
Univariate non-graphical EDA
Categorical data
The only useful univariate non-graphical technique for categorical variables
is some form of tabulation of the frequencies, usually along with
calculation of the fraction (or percent) of data that falls in each
category
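The tabulation described above can be sketched in a few lines. This is a minimal illustration, not the tool used to produce the slides; the function name `frequency_table` and the sample values are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} ordered by descending count."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100.0 * n / total) for cat, n in counts.most_common()}

# Hypothetical sample of one categorical variable (not data from the slides).
occupations = ["Farming", "Labour", "Farming", "Service",
               "Farming", "Business", "Labour", "Farming"]

table = frequency_table(occupations)
for cat, (n, pct) in table.items():
    print(f"{cat:<10}{n:>4}{pct:>8.1f}%")
```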
Quantitative data
Univariate non-graphical EDA generally focuses on measures of central
tendency (mean, median and mode), quartiles, spread (variance, standard
deviation and IQR), skewness and kurtosis
These descriptive statistics quantitatively describe the main features of the data
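As a rough sketch, the measures listed above can be computed with Python's standard library. The 20 values are the ones used in the stem-and-leaf example later in this deck; the function name `describe` is hypothetical. Skewness and kurtosis are computed here in their simple moment form (an assumption; statistical packages often apply small-sample corrections), and quartiles use the default method of `statistics.quantiles`, so other software may report slightly different values.

```python
import statistics as st

def describe(x):
    """Central tendency, quartiles, spread, skewness and excess kurtosis."""
    n = len(x)
    mean = st.fmean(x)
    sd = st.stdev(x)                     # sample SD (n - 1 denominator)
    q1, q2, q3 = st.quantiles(x, n=4)    # default "exclusive" method
    # Central moments for moment-based skewness and excess kurtosis.
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return {
        "n": n, "mean": mean, "median": q2, "modes": st.multimode(x),
        "variance": sd ** 2, "sd": sd,
        "q1": q1, "q3": q3, "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
summary = describe(data)
for name, value in summary.items():
    print(f"{name:<9} {value}")
```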
A typical output of descriptive statistics
Variable: Di (Development index)
N (valid)                  1201
N (missing)                0
Mean                       0.260333
Median                     0.261697
Mode                       0.214959
Std. Deviation             0.086778
Skewness                   0.086567
Std. Error of Skewness     0.070593
Kurtosis                   -0.88541
Std. Error of Kurtosis     0.14107
Percentiles   25           0.186396
              50           0.261697
              75           0.330004
• When data has outliers, the median is more robust than the mean
• When the data distribution is skewed, the median is more meaningful
• IQR = 75th percentile - 25th percentile = 0.330004 - 0.186396 = 0.143608
• IQR is also a robust measure of spread
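A small sketch of what "robust" means here: a single extreme value shifts the mean and standard deviation substantially but leaves the median and IQR unchanged. The values below are hypothetical and chosen only for illustration, and `iqr` is a helper defined for this example.

```python
import statistics as st

def iqr(x):
    """Interquartile range using the default method of statistics.quantiles."""
    q1, _, q3 = st.quantiles(x, n=4)
    return q3 - q1

# Hypothetical measurements; the second list replaces the largest value
# with an extreme one (e.g. a recording error).
clean = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]
contaminated = clean[:-1] + [240]

for label, x in (("clean", clean), ("contaminated", contaminated)):
    print(f"{label:<13} mean={st.fmean(x):6.2f} sd={st.stdev(x):6.2f} "
          f"median={st.median(x):5.1f} iqr={iqr(x):5.2f}")
```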
Univariate graphical EDA - Histogram
• Graphical display of frequency
distribution
– Counts of data falling in various ranges
(bins)
– Histogram for numeric data
• Bin size selection is important
– Too small – may show spurious
patterns
– Too large – may hide important
patterns
• Several variations possible
– Plot relative frequencies instead of raw frequencies
– Make the height of each bar equal to ‘relative frequency/width’,
so that the area under the histogram is 1
• When observations come from a continuous scale, histograms can be
approximated by continuous curves
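The binning and the two scalings above can be sketched without a plotting library. This is an illustrative helper (the name `histogram` and its parameters are assumptions), applied to the 20 values from the stem-and-leaf example in this deck, with equal-width bins and `start` assumed to be no larger than the smallest value.

```python
def histogram(data, bin_width, start=None):
    """Return (left_edge, count, relative_freq, density) for each bin.

    Bins are half-open intervals [left, left + bin_width).
    Assumes start <= min(data) when start is given.
    """
    lo = min(data) if start is None else start
    n_bins = int((max(data) - lo) // bin_width) + 1
    counts = [0] * n_bins
    for v in data:
        counts[int((v - lo) // bin_width)] += 1
    n = len(data)
    return [(lo + i * bin_width, c, c / n, c / n / bin_width)
            for i, c in enumerate(counts)]

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

bins = histogram(data, bin_width=10, start=10)
for left, count, rel, density in bins:
    print(f"[{left:>2}, {left + 10:>2})  {'*' * count:<8}  rel={rel:.2f}  density={density:.3f}")

# Density heights times bin width sum to 1 (up to floating-point rounding).
area = sum(density * 10 for _, _, _, density in bins)
```

Calling `histogram(data, bin_width=2)` or `histogram(data, bin_width=25)` on the same values gives a quick feel for how very small or very large bins change the picture.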
Stem and Leaf Plot
• This plot organizes data for
easy visual inspection
– Min and max values
– Data distribution
• Unlike descriptive statistics,
this plot shows all the data
– No information loss
– Individual values can be
inspected
• Structure of the plot
– Stem – the digits in the largest place (e.g. tens place)
– Leaves – the digits in the smallest place (e.g. ones place)
– Leaves are listed to the right of the stem, separated by ‘|’
• Possible to place leaves from another data set to the left of
the stem for comparing two data distributions
Data
29, 44, 12, 53, 21, 34, 39, 25,
48, 23, 17, 24, 27, 32, 34, 15,
42, 21, 28, 37
Stem and Leaf Plot
1 | 2 7 5
2 | 9 1 5 3 4 7 1 8
3 | 4 9 2 4 7
4 | 4 8 2
5 | 3
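The plot above can be reproduced with a short sketch. It assumes non-negative two-digit integers, uses the tens digit as the stem, and keeps leaves in input order as the slide does; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each tens-digit stem to its ones-digit leaves, in input order."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(sorted(stems.items()))

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Sorting each list of leaves before printing is a common variant that makes the shape of the distribution easier to read.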
Box Plot
• A five-number summary plot of the data
– Minimum, maximum
– Median
– 1st and 3rd quartiles
• Often used in conjunction with a histogram in EDA
• Structure of the plot
– Box represents the IQR (the middle 50% of values)
– The horizontal line in the box shows the median
– Vertical lines extend above and below the box
– Ends of the vertical lines, called whiskers, indicate the max and
min values if these fall within 1.5*IQR of the box
– Otherwise the whiskers stop at the most extreme values within
1.5*IQR of the box, and points beyond them are shown as outliers
above/below the whiskers
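The quantities a box plot draws can be sketched as follows. The helper name `box_plot_stats` and the multiplier parameter `k` are assumptions for illustration; the quartile method is the default of `statistics.quantiles`, and plotting libraries may use a slightly different one. The first call uses the 20 values from the stem-and-leaf example; the second appends a hypothetical extreme value of 95 to show an outlier.

```python
import statistics as st

def box_plot_stats(x, k=1.5):
    """Quartiles, whisker ends and outliers under the 1.5*IQR rule."""
    xs = sorted(x)
    q1, median, q3 = st.quantiles(xs, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in xs if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside[0],      # most extreme value inside the fences
        "whisker_high": inside[-1],
        "outliers": [v for v in xs if v < low_fence or v > high_fence],
    }

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

base = box_plot_stats(data)                 # whiskers reach min and max
with_outlier = box_plot_stats(data + [95])  # 95 is a hypothetical extra value
print(base)
print(with_outlier)
```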
Quantile-Normal plot
• Used to see how well a particular sample follows a particular
theoretical distribution (here, the normal distribution)
• Many statistical tests assume that the outcome for any set of
values of the explanatory variables is approximately normally
distributed, and that is why QN plots are useful: if the assumption
is grossly violated, the p-values and confidence intervals of
those tests are wrong
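A sketch of the points a quantile-normal plot displays: each sorted observation is paired with the standard normal quantile expected at its rank. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as is the function name `qn_points`. The sample is the 20 values from the stem-and-leaf example. If the sample is roughly normal, the points fall close to a straight line.

```python
from statistics import NormalDist

def qn_points(sample):
    """Return (theoretical normal quantile, observed value) pairs."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

points = qn_points(data)
for z, x in points:
    print(f"{z:6.2f}  {x}")
```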
Scatter Plot
• Scatter plots are two-dimensional graphs with
– Explanatory attribute plotted on the x-axis
– Response attribute plotted on the y-axis
• Useful for understanding the relationship between two attributes
• Features of the relationship
– Strength
– Shape (linear or curved)
– Direction
– Outliers
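Strength and direction of a linear relationship are often summarised numerically with Pearson's correlation coefficient, which is not named on the slide and is brought in here only as a companion to the plot. The sketch below uses hypothetical data and shows why the plot itself still matters: a clearly curved relationship can have a correlation near zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical explanatory values and two hypothetical responses.
x = [1, 2, 3, 4, 5, 6]
y_linear = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]   # roughly a straight line
y_curved = [(v - 3.5) ** 2 for v in x]        # symmetric U shape

print(f"linear: r = {pearson_r(x, y_linear):.3f}")
print(f"curved: r = {pearson_r(x, y_curved):.3f}")
```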
Scatter Plot Matrix
• When multiple
attributes need to be
visualized all at once
– Scatter plots are drawn
for every pair of
attributes and arranged
into a 2D matrix.
• Useful for spotting
relationships among
attributes
– Similar to a scatter plot
– Attributes are shown on
the diagonal
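The layout described above can be sketched as a grid of panel descriptions: attribute names on the diagonal and an (x attribute, y attribute) pair in every other cell. The function name `scatter_matrix_layout` and the attribute names are hypothetical; a plotting library would draw one scatter plot per off-diagonal cell.

```python
def scatter_matrix_layout(attributes):
    """k x k grid: name on the diagonal, (x_attr, y_attr) pair elsewhere."""
    return [
        [row if row == col else (col, row) for col in attributes]
        for row in attributes
    ]

# Hypothetical attribute names.
grid = scatter_matrix_layout(["a", "b", "c"])
for row in grid:
    print(row)
```

Each pair of attributes appears twice, once above and once below the diagonal, with the axes swapped.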
Cross tabulation
• For categorical data (and quantitative data with only a few
different values) an extension of tabulation called cross-
tabulation is very useful.
• For two variables, cross-tabulation is performed by making a
two-way table with column headings that match the levels of one
variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a
pair of levels.
• The two variables might be both explanatory, both outcome, or
one of each. Depending on the goals, row percentages (which add
to 100% for each row), column percentages (which add to 100%
for each column) and/or cell percentages (which add to 100%
over all cells) are also useful.
• Cross-tabulation can be extended to three (and sometimes
more) variables by making separate two-way tables for two
variables at each level of a third variable. Cross-tabulation is
the basic bivariate non-graphical EDA technique.
MainOccupation * Castehierarchy Crosstabulation

                                      Castehierarchy
MainOccupation                        SC/ST  Backward    OBC  Upper caste  Total
Labour    Count                         148        33     35           12    228
          % within MainOccupation      64.9      14.5   15.4          5.3  100.0
          % within Castehierarchy      42.2      26.2   15.0          2.4   19.0
          % of Total                   12.3       2.7    2.9          1.0   19.0
Business  Count                          20         6     26           26     78
          % within MainOccupation      25.6       7.7   33.3         33.3  100.0
          % within Castehierarchy       5.7       4.8   11.1          5.3    6.5
          % of Total                    1.7       0.5    2.2          2.2    6.5
Service   Count                          21         5      4           37     67
          % within MainOccupation      31.3       7.5    6.0         55.2  100.0
          % within Castehierarchy       6.0       4.0    1.7          7.6    5.6
          % of Total                    1.7       0.4    0.3          3.1    5.6
Farming   Count                         162        82    169          415    828
          % within MainOccupation      19.6       9.9   20.4         50.1  100.0
          % within Castehierarchy      46.2      65.1   72.2         84.7   68.9
          % of Total                   13.5       6.8   14.1         34.6   68.9
Total     Count                         351       126    234          490   1201
          % within MainOccupation      29.2      10.5   19.5         40.8  100.0
          % within Castehierarchy     100.0     100.0  100.0        100.0  100.0
          % of Total                   29.2      10.5   19.5         40.8  100.0
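The row, column and cell percentages in this table follow directly from the counts. The sketch below recomputes them from the four count rows alone, with caste columns referred to by position; the function name `crosstab_percentages` is hypothetical.

```python
def crosstab_percentages(counts):
    """Row, column and cell percentages for a two-way table of counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand = sum(row_totals)
    row_pct = [[100 * v / rt for v in row] for row, rt in zip(counts, row_totals)]
    col_pct = [[100 * v / ct for v, ct in zip(row, col_totals)] for row in counts]
    cell_pct = [[100 * v / grand for v in row] for row in counts]
    return row_totals, col_totals, grand, row_pct, col_pct, cell_pct

occupations = ["Labour", "Business", "Service", "Farming"]
counts = [
    [148, 33, 35, 12],     # Labour
    [20, 6, 26, 26],       # Business
    [21, 5, 4, 37],        # Service
    [162, 82, 169, 415],   # Farming
]

row_totals, col_totals, grand, row_pct, col_pct, cell_pct = crosstab_percentages(counts)
for name, pcts in zip(occupations, row_pct):
    print(f"{name:<9}", " ".join(f"{p:5.1f}" for p in pcts))
```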
Univariate statistics by category
• For one categorical variable (usually explanatory) and one
quantitative variable (usually outcome), it is common to produce
some of the standard univariate non-graphical statistics for the
quantitative variable separately for each level of the categorical
variable, and then compare the statistics across levels of the
categorical variable
Univariate statistics of Di by category
Statecode           Mean      SD  Median     Min     Max  Skewness  Kurtosis
Andhra Pradesh    0.1901  0.0592  0.1787  0.0947  0.3947    0.6399    0.1279
Assam             0.2080  0.0569  0.1970  0.0878  0.3664    0.2354   -0.5341
Haryana           0.2706  0.0684  0.2853  0.1135  0.3997   -0.4030   -0.7584
HP                0.3319  0.0617  0.3353  0.1559  0.4862   -0.1965    0.0755
Karnataka         0.1782  0.0586  0.1716  0.0781  0.4674    1.5890    4.5150
Maharashtra       0.2537  0.0778  0.2434  0.0975  0.4318    0.1728   -0.6878
Punjab            0.3342  0.0676  0.3346  0.1623  0.4694   -0.1837   -0.5428
Uttrakhand        0.3144  0.0552  0.3216  0.1864  0.4416   -0.3060   -0.6048
Total             0.2603  0.0868  0.2617  0.0781  0.4862    0.0866   -0.8854
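A table like this comes from grouping the quantitative values by level and summarising each group. The sketch below shows the idea on a handful of hypothetical (level, value) records; the level names "A" and "B", the values, and the function name `stats_by_level` are all made up for illustration and are not the data behind the table.

```python
import statistics as st
from collections import defaultdict

def stats_by_level(records):
    """Summarise a quantitative variable separately for each category level."""
    groups = defaultdict(list)
    for level, value in records:
        groups[level].append(value)
    return {
        level: {
            "n": len(values),
            "mean": st.fmean(values),
            "sd": st.stdev(values),       # needs at least two values per level
            "median": st.median(values),
            "min": min(values),
            "max": max(values),
        }
        for level, values in sorted(groups.items())
    }

# Hypothetical records: (category level, quantitative value).
records = [("A", 0.18), ("B", 0.30), ("A", 0.20), ("B", 0.34),
           ("A", 0.25), ("B", 0.36), ("B", 0.32)]

by_level = stats_by_level(records)
for level, stats in by_level.items():
    print(level, {k: round(v, 4) for k, v in stats.items()})
```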
EDA summary
• All the techniques presented so far are the
tools useful for EDA
• But without an understanding built from the
EDA, effective use of tools is not possible
• EDA helps to answer a lot of questions
– What is a typical value?
– What is the uncertainty of a typical value?
– What is a good distributional fit for the data?
– What are the relationships between two
attributes?
– etc
The greatest value of a picture is when it forces us to notice what we
never expected to see.
- John W. Tukey
The best thing about being a statistician is that you get to
play in everyone’s backyard.
- John W. Tukey
The obvious is that which is never seen until someone
expresses it simply.
- Kahlil Gibran