
- 1. Exploratory Data Analysis M. Srinath
- 2. Exploratory Data Analysis: Introduction • Exploratory data analysis was promoted by John Tukey in 1977 to encourage statisticians to examine their data sets visually and to formulate hypotheses that could then be tested on data sets • Exploratory data analysis (EDA) is an approach to analysing data that summarizes the main characteristics of the variables in an easy-to-understand form, often with graphs, without using a statistical model or having formulated a hypothesis beforehand • EDA techniques are generally graphical. They include scatter plots, stem-and-leaf plots, box plots, histograms, quantile plots, residual plots, and mean plots • EDA methods are generally cross-classified in two ways. First, each method is either non-graphical or graphical. Second, each method is either univariate or multivariate (usually just bivariate)
- 3. Exploratory Data Analysis • EDA offers several techniques to comprehend data • But EDA is more than a library of data analysis techniques • EDA is an approach to data analysis • EDA involves inspecting data without any assumptions – Mostly using information graphics
- 4. Exploratory Data Analysis: Univariate non-graphical EDA • Categorical data: the only useful univariate non-graphical technique for categorical variables is some form of tabulation of the frequencies, usually along with the fraction (or percent) of data that falls in each category • Quantitative data: univariate non-graphical EDA generally focuses on measures of central tendency (mean, median and mode), quartiles, spread (variance, SD and IQR), skewness and kurtosis • These descriptive statistics quantitatively describe the main features of the data
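The frequency tabulation described for categorical data can be sketched in a few lines of standard-library Python. The `frequency_table` helper and the `occupations` sample are hypothetical illustrations, not part of the original slides.

```python
from collections import Counter

def frequency_table(values):
    """Tabulate a categorical variable as (category, count, percent) rows."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() sorts categories by descending count
    return [(cat, n, 100.0 * n / total) for cat, n in counts.most_common()]

# Hypothetical observations, not taken from the slides
occupations = ["Farming", "Labour", "Farming", "Service",
               "Farming", "Business", "Labour", "Farming"]

table = frequency_table(occupations)
for cat, n, pct in table:
    print(f"{cat:<10}{n:>4}{pct:>7.1f}%")
```

The percent column is simply each count divided by the total number of observations, so the percentages add to 100 over all categories.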
- 5. Univariate non-graphical EDA: a typical output of descriptive statistics
  Variable: Di (Development index)

      Statistic                     Value
      N (valid)                      1201
      N (missing)                       0
      Mean                       0.260333
      Median                     0.261697
      Mode                       0.214959
      Std. Deviation             0.086778
      Skewness                   0.086567
      Std. Error of Skewness     0.070593
      Kurtosis                   -0.88541
      Std. Error of Kurtosis      0.14107
      Percentile 25              0.186396
      Percentile 50              0.261697
      Percentile 75              0.330004

  • When data has outliers, the median is more robust than the mean • When the data distribution is skewed, the median is more meaningful • IQR = 0.330004 - 0.186396 = 0.143608 • IQR is also a robust measure of spread
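A minimal sketch of how such descriptive statistics might be computed, using the 20-value data set from the stem-and-leaf slide. The `describe` helper is hypothetical. Quartiles use the default "exclusive" method of `statistics.quantiles`, and skewness and kurtosis are the plain moment-based versions, so a statistics package that applies small-sample corrections or another quartile convention will report slightly different values.

```python
from statistics import mean, median, multimode, quantiles, stdev

def describe(values):
    """Central tendency, spread, and shape of a numeric variable."""
    n = len(values)
    m = mean(values)
    # Central moments, used for moment-based skewness and excess kurtosis
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    q1, q2, q3 = quantiles(values, n=4)  # default method="exclusive"
    return {
        "n": n, "mean": m, "median": median(values),
        "modes": multimode(values), "sd": stdev(values),
        "q1": q1, "q3": q3, "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

# Data set shown on the stem-and-leaf slide
data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
summary = describe(data)
print(summary)

# IQR of Di from the percentiles in the output above: Q3 - Q1
slide_iqr = 0.330004 - 0.186396
print(round(slide_iqr, 6))
```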
- 6. Univariate graphical EDA: Histogram • Graphical display of a frequency distribution – Counts of data falling in various ranges (bins) – Used for numeric data • Bin size selection is important – Too small: may show spurious patterns – Too large: may hide important patterns • Several variations are possible – Plot relative frequencies instead of raw frequencies – Make the height of each bar equal to ‘relative frequency / bin width’, so that the area under the histogram is 1 • When observations come from a continuous scale, histograms can be approximated by continuous curves
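The three histogram variants (raw counts, relative frequencies, and density heights) can be sketched as follows. The `histogram` helper and the choice of bin edges are assumptions made for illustration; the data are the 20 values from the stem-and-leaf slide.

```python
def histogram(values, edges):
    """Return (counts, relative frequencies, densities) for the given bin edges.

    Bins are half-open [lo, hi), except the last bin, which also includes
    its upper edge. Values outside the edges are ignored.
    """
    nbins = len(edges) - 1
    counts = [0] * nbins
    for v in values:
        for i in range(nbins):
            is_last = i == nbins - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    n = sum(counts)
    widths = [edges[i + 1] - edges[i] for i in range(nbins)]
    rel = [c / n for c in counts]
    density = [r / w for r, w in zip(rel, widths)]  # relative frequency / width
    return counts, rel, density

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
edges = [10, 20, 30, 40, 50, 60]  # assumed bin edges, width 10

counts, rel, density = histogram(data, edges)
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(density))
print(counts, area)
```

Changing `edges` to much narrower or much wider bins is a quick way to see how bin size alters the apparent shape of the distribution.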
- 7. Stem and Leaf Plot • This plot organizes data for easy visual inspection – Min and max values – Data distribution • Unlike descriptive statistics, this plot shows all the data – No information loss – Individual values can be inspected • Structure of the plot – Stem: the digits in the largest place (e.g. tens place) – Leaves: the digits in the smallest place (e.g. ones place) – Leaves are listed to the right of the stem, separated by ‘|’ • Leaves from another data set can be placed to the left of the stem to compare two data distributions
  Data: 29, 44, 12, 53, 21, 34, 39, 25, 48, 23, 17, 24, 27, 32, 34, 15, 42, 21, 28, 37
  Stem and leaf plot:

      1 | 2 7 5
      2 | 9 1 5 3 4 7 1 8
      3 | 4 9 2 4 7
      4 | 4 8 2
      5 | 3
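The plot above can be reproduced with a short sketch. The `stem_and_leaf` helper is hypothetical and assumes non-negative two-digit integers, with the tens digit as the stem and the ones digit as the leaf. Leaves are kept in data order, as on the slide, rather than sorted.

```python
def stem_and_leaf(values):
    """Group two-digit integers by tens digit (stem), keeping ones digits (leaves)."""
    stems = {}
    for v in values:
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    return dict(sorted(stems.items()))

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```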
- 8. Stem and leaf plot: Di Stem-and-Leaf Plot

      Frequency   Stem &  Leaf
           1.00      0 .  &
          10.00      0 .  999&
          32.00      1 .  0000001111
          66.00      1 .  2222222222223333333333
          59.00      1 .  4444444445555555555
         104.00      1 .  66666666666666666777777777777777777
          81.00      1 .  888888888888899999999999999
          76.00      2 .  00000000000000011111111111
          82.00      2 .  2222222222222333333333333333
          82.00      2 .  444444444444444445555555555
          96.00      2 .  66666666666667777777777777777777
          91.00      2 .  888888888888888899999999999999
          79.00      3 .  00000000000000111111111111
          90.00      3 .  222222222222223333333333333333
          82.00      3 .  4444444444444555555555555555
          67.00      3 .  6666666666667777777777
          38.00      3 .  888888889999
          33.00      4 .  00000011111
          18.00      4 .  222233
           9.00      4 .  445
           4.00      4 .  6&
           1.00      4 .  &

      Stem width: .1000000
      Each leaf: 3 case(s)
      & denotes fractional leaves.
- 9. Box Plot • A five-value summary plot of data – Minimum, maximum – Median – 1st and 3rd quartiles • Often used in conjunction with a histogram in EDA • Structure of the plot – Box represents the IQR (the middle 50% of values) – The horizontal line in the box shows the median – Vertical lines (whiskers) extend above and below the box – The ends of the whiskers indicate the max and min values, provided these fall within 1.5*IQR of the box edges – Values beyond that range are shown as outliers above/below the whiskers
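The numbers behind a box plot can be sketched without any plotting library. The `box_plot_summary` helper is hypothetical; it uses the default quartile method of `statistics.quantiles`, and other tools may place the quartiles slightly differently. The value 95 is added to the slide 7 data purely to illustrate an outlier.

```python
from statistics import quantiles

def box_plot_summary(values, k=1.5):
    """Quartiles, whisker ends, and outliers using the k*IQR rule."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme values that are still inside the fences
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr,
            "whisker_low": inside[0], "whisker_high": inside[-1],
            "outliers": outliers}

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]
box = box_plot_summary(data + [95])  # 95 is a hypothetical extreme value
print(box)
```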
- 10. Quantile-Normal plot • Used to see how well a particular sample follows a particular theoretical distribution (here, the normal distribution) • Many statistical tests assume that the outcome for any set of values of the explanatory variables is approximately normally distributed, which is why QN plots are useful: if the assumption is grossly violated, the p-values and confidence intervals of those tests are unreliable
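The points of a quantile-normal plot can be computed as in this sketch: each sorted observation is paired with the quantile of a normal distribution fitted to the sample. The `quantile_normal_points` helper is hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several. The data are the 20 values from the stem-and-leaf slide.

```python
from statistics import NormalDist, mean, stdev

def quantile_normal_points(sample):
    """Pair each sorted observation with its theoretical normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i + 0.5) / n for i = 0..n-1 stay strictly inside (0, 1)
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, xs))

data = [29, 44, 12, 53, 21, 34, 39, 25, 48, 23,
        17, 24, 27, 32, 34, 15, 42, 21, 28, 37]

points = quantile_normal_points(data)
for theo, obs in points:
    print(f"{theo:8.2f}  {obs:3d}")
```

If the sample is approximately normal, the pairs fall close to the line observed = theoretical; systematic curvature suggests skewness or heavy tails.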
- 11. Scatter Plot • Scatter plots are two-dimensional graphs with – Explanatory attribute plotted on the x-axis – Response attribute plotted on the y-axis • Useful for understanding the relationship between two attributes • Features of the relationship – Strength – Shape (linear or curved) – Direction – Outliers
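The slide treats strength and direction visually. As a numeric companion, the Pearson correlation coefficient summarizes the strength and direction of a linear relationship; it is not mentioned on the slide and is shown here only as a sketch. The `pearson_r` helper and the `x`, `y` values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical explanatory (x) and response (y) attributes
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_up = pearson_r(x, y)          # increasing relationship
r_down = pearson_r(x, y[::-1])  # same values, decreasing relationship
print(r_up, r_down)
```

A coefficient near zero does not rule out a curved relationship, which is one reason the scatter plot itself remains necessary.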
- 12. Scatter Plot Matrix • Used when multiple attributes need to be visualized all at once – Scatter plots are drawn for every pair of attributes and arranged into a 2D matrix • Useful for spotting relationships among attributes – Each panel is read like an ordinary scatter plot – Attribute names are shown on the diagonal
- 13. Cross tabulation • For categorical data (and quantitative data with only a few different values) an extension of tabulation called cross-tabulation is very useful. • For two variables, cross-tabulation is performed by making a two-way table with column headings that match the levels of one variable and row headings that match the levels of the other variable, then filling in the counts of all subjects that share a pair of levels. • The two variables might be both explanatory, both outcome, or one of each. Depending on the goals, row percentages (which add to 100% for each row), column percentages (which add to 100% for each column) and/or cell percentages (which add to 100% over all cells) are also useful. • Cross-tabulation can be extended to three (and sometimes more) variables by making separate two-way tables for two variables at each level of a third variable. • Cross-tabulation is the basic bivariate non-graphical EDA technique.
- 14. Cross tabulation: MainOccupation * Castehierarchy Crosstabulation

      Columns SC/ST through Upper caste are levels of Castehierarchy.

      MainOccupation                        SC/ST  Backward    OBC  Upper caste  Total
      Labour    Count                         148        33     35           12    228
                % within MainOccupation      64.9      14.5   15.4          5.3  100.0
                % within Castehierarchy      42.2      26.2   15.0          2.4   19.0
                % of Total                   12.3       2.7    2.9          1.0   19.0
      Business  Count                          20         6     26           26     78
                % within MainOccupation      25.6       7.7   33.3         33.3  100.0
                % within Castehierarchy       5.7       4.8   11.1          5.3    6.5
                % of Total                    1.7       0.5    2.2          2.2    6.5
      Service   Count                          21         5      4           37     67
                % within MainOccupation      31.3       7.5    6.0         55.2  100.0
                % within Castehierarchy       6.0       4.0    1.7          7.6    5.6
                % of Total                    1.7       0.4    0.3          3.1    5.6
      Farming   Count                         162        82    169          415    828
                % within MainOccupation      19.6       9.9   20.4         50.1  100.0
                % within Castehierarchy      46.2      65.1   72.2         84.7   68.9
                % of Total                   13.5       6.8   14.1         34.6   68.9
      Total     Count                         351       126    234          490   1201
                % within MainOccupation      29.2      10.5   19.5         40.8  100.0
                % within Castehierarchy     100.0     100.0  100.0        100.0  100.0
                % of Total                   29.2      10.5   19.5         40.8  100.0
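The row, column, and cell percentages in the cross-tabulation can be recomputed from the counts alone, as in this sketch. The `crosstab_percentages` helper is hypothetical. The column labels "Backward" and "OBC" are an assumed reading of the fused header "BackwardOBC" in the original output, based on there being four count columns.

```python
def crosstab_percentages(counts):
    """Add row, column, and cell percentages to a two-way table of counts."""
    row_tot = {r: sum(cells.values()) for r, cells in counts.items()}
    col_tot = {}
    for cells in counts.values():
        for c, k in cells.items():
            col_tot[c] = col_tot.get(c, 0) + k
    n = sum(row_tot.values())
    table = {}
    for r, cells in counts.items():
        for c, k in cells.items():
            table[(r, c)] = {
                "count": k,
                "row_pct": 100.0 * k / row_tot[r],   # % within MainOccupation
                "col_pct": 100.0 * k / col_tot[c],   # % within Castehierarchy
                "cell_pct": 100.0 * k / n,           # % of Total
            }
    return table, row_tot, col_tot, n

# Counts from the slide; "Backward" and "OBC" labels are an assumed header split
counts = {
    "Labour":   {"SC/ST": 148, "Backward": 33, "OBC": 35,  "Upper caste": 12},
    "Business": {"SC/ST": 20,  "Backward": 6,  "OBC": 26,  "Upper caste": 26},
    "Service":  {"SC/ST": 21,  "Backward": 5,  "OBC": 4,   "Upper caste": 37},
    "Farming":  {"SC/ST": 162, "Backward": 82, "OBC": 169, "Upper caste": 415},
}

table, row_tot, col_tot, n = crosstab_percentages(counts)
cell = table[("Labour", "SC/ST")]
print(cell["count"], round(cell["row_pct"], 1),
      round(cell["col_pct"], 1), round(cell["cell_pct"], 1))
```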
- 15. Univariate statistics by category • For one categorical variable (usually explanatory) and one quantitative variable (usually outcome), it is common to produce some of the standard univariate non-graphical statistics for the quantitative variable separately for each level of the categorical variable, and then compare the statistics across levels of the categorical variable
  Univariate statistics of Di by category:

      Statecode         Mean      SD  Median     Min     Max  Skewness  Kurtosis
      Andhra Pradesh  0.1901  0.0592  0.1787  0.0947  0.3947    0.6399    0.1279
      Assam           0.2080  0.0569  0.1970  0.0878  0.3664    0.2354   -0.5341
      Haryana         0.2706  0.0684  0.2853  0.1135  0.3997   -0.4030   -0.7584
      HP              0.3319  0.0617  0.3353  0.1559  0.4862   -0.1965    0.0755
      Karnataka       0.1782  0.0586  0.1716  0.0781  0.4674    1.5890    4.5150
      Maharashtra     0.2537  0.0778  0.2434  0.0975  0.4318    0.1728   -0.6878
      Punjab          0.3342  0.0676  0.3346  0.1623  0.4694   -0.1837   -0.5428
      Uttarakhand     0.3144  0.0552  0.3216  0.1864  0.4416   -0.3060   -0.6048
      Total           0.2603  0.0868  0.2617  0.0781  0.4862    0.0866   -0.8854
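Computing univariate statistics separately for each level of a categorical variable amounts to a group-then-summarize step, sketched below. The `stats_by_category` helper and the `records` values are hypothetical; they are not the Di data behind the table above.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def stats_by_category(records):
    """Summarize a quantitative variable within each level of a categorical one."""
    groups = defaultdict(list)
    for category, value in records:
        groups[category].append(value)
    summary = {}
    for category, values in sorted(groups.items()):
        summary[category] = {
            "n": len(values),
            "mean": mean(values),
            # Sample SD needs at least two observations
            "sd": stdev(values) if len(values) > 1 else float("nan"),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary

# Hypothetical (category, value) observations
records = [("A", 0.19), ("A", 0.21), ("B", 0.30), ("B", 0.34), ("B", 0.32)]

by_cat = stats_by_category(records)
for category, s in by_cat.items():
    print(category, s)
```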
- 16. Univariate graph by category: bar plot (figure not reproduced in this transcript)
- 17. Univariate graph by category: box plot (figure not reproduced in this transcript)
- 18. EDA summary • All the techniques presented so far are tools for EDA • But without the understanding of the data that EDA builds, effective use of these tools is not possible • EDA helps to answer many questions – What is a typical value? – What is the uncertainty of a typical value? – What is a good distributional fit for the data? – What are the relationships between two attributes? – And so on
- 19. The greatest value of a picture is when it forces us to notice what we never expected to see. - John W. Tukey • The best thing about being a statistician is that you get to play in everyone’s backyard. - John W. Tukey • The obvious is that which is never seen until someone expresses it simply. - Kahlil Gibran
