EXPLORATORY DATA ANALYSIS
(EDA)
1
WHAT IS EDA?
• The analysis of datasets based on various numerical methods and
graphical tools.
• Exploring data for patterns, trends, underlying structure, deviations
from the trend, anomalies and strange structures.
• It facilitates discovering the unexpected as well as confirming the
expected.
• Another definition: An approach/philosophy for data analysis that
employs a variety of techniques (mostly graphical).
2
3
AIM OF THE EDA
• Maximize insight into a dataset
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Develop valid models
• Determine optimal factor settings (Xs)
4
AIM OF THE EDA
• The goal of EDA is to open-mindedly explore data.
• Tukey: EDA is detective work… Unless the detective finds the clues, the judge
or jury has nothing to consider.
• Here, the judge or jury is confirmatory data analysis.
• Tukey: Confirmatory data analysis goes further, assessing the
strengths of the evidence.
• With EDA, we can examine the data and try to understand the meaning of the
variables, including what the abbreviations stand for.
5
Exploratory vs Confirmatory Data Analysis
EDA
• No hypothesis at first
• Generates hypotheses
• Uses graphical methods (mostly)
CDA
• Starts with a hypothesis
• Tests the null hypothesis
• Uses statistical models
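A minimal sketch of this contrast, assuming R's built-in mtcars data as a stand-in: a boxplot is used to spot a possible difference, and a Welch t-test (chosen here only for illustration) then tests the resulting hypothesis.

```r
# Exploratory step: look at the data with no hypothesis in mind.
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        ylab = "Miles per Gallon")

# The plot suggests a hypothesis: mpg differs by transmission type.
# Confirmatory step: test the null hypothesis of equal mean mpg.
result <- t.test(mpg ~ am, data = mtcars)
print(result$p.value)
```

The exploratory plot only suggests the question; the test is what assesses the strength of the evidence.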
6
STEPS OF EDA
• Generate good research questions
• Data restructuring: You may need to make new variables from the existing ones.
• Instead of using two variables, obtaining rates or percentages of them
• Creating dummy variables for categorical variables
• Based on the research questions, use appropriate graphical tools and obtain
descriptive statistics. Try to understand the data structure, relationships, anomalies,
unexpected behaviors.
• Try to identify confounding variables, interaction relations and multicollinearity, if any.
• Handle missing observations
• Decide on the need for transformation (of the response and/or explanatory variables).
• Decide on the hypothesis based on your research questions
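The restructuring and missing-data steps above can be sketched in base R. This is only an illustration under assumptions: the built-in mtcars and airquality datasets stand in for real study data, and the names hp_per_wt, cyl_dummies, missing_counts and complete_air are made up for this example.

```r
cars <- mtcars

# New variable from two existing ones: a ratio instead of hp and wt separately.
cars$hp_per_wt <- cars$hp / cars$wt

# Dummy (indicator) variables for a categorical variable, one column per level.
cyl_dummies <- model.matrix(~ factor(cyl) - 1, data = cars)
head(cyl_dummies, 3)

# Count missing observations per column before deciding how to handle them.
missing_counts <- colSums(is.na(airquality))
print(missing_counts)

# One simple option (not always the best one): keep complete rows only.
complete_air <- na.omit(airquality)
nrow(complete_air)
```

Dropping incomplete rows is the simplest choice; whether it is appropriate depends on why the values are missing.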
7
AFTER EDA
• Confirmatory Data Analysis: Verify the hypothesis by statistical
analysis
• Draw conclusions and present your results clearly.
8
Classification of EDA*
• Exploratory data analysis is generally cross-classified in two ways. First,
each method is either non-graphical or graphical. And second, each
method is either univariate or multivariate (usually just bivariate).
• Non-graphical methods generally involve calculation of summary statistics,
while graphical methods obviously summarize the data in a diagrammatic
or pictorial way.
• Univariate methods look at one variable (data column) at a time, while
multivariate methods look at two or more variables at a time to explore
relationships. Usually our multivariate EDA will be bivariate (looking at
exactly two variables), but occasionally it will involve three or more
variables.
• It is almost always a good idea to perform univariate EDA on each of the
components of a multivariate EDA before performing the multivariate EDA.
*Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf
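As a rough illustration of the four combinations, here is one base R call for each cell of the cross-classification. The choice of the built-in mtcars data and of the mpg and wt columns is an assumption made for this sketch.

```r
# Univariate, non-graphical: summary statistics for one variable.
summary(mtcars$mpg)

# Univariate, graphical: distribution of one variable.
hist(mtcars$mpg, main = "Miles per Gallon", xlab = "mpg")

# Bivariate, non-graphical: a numerical summary of a relationship.
r <- cor(mtcars$mpg, mtcars$wt)
print(r)

# Bivariate, graphical: the same relationship as a picture.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon")
```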
9
EXAMPLE 1
Data from the Places Rated Almanac (Boyer and Savageau, 1985)
9 variables for 329 metropolitan areas in the USA
1. Climate mildness
2. Housing cost
3. Health care and environment
4. Crime
5. Transportation supply
6. Educational opportunities and effort
7. Arts and culture facilities
8. Recreational opportunities
9. Personal economic outlook
+ latitude and longitude of each city
Questions:
1. How is climate related to location?
2. Are there clusters in the data (excluding
location)?
3. Are nearby cities similar?
4. Any relation between economic outlook and
crime?
5. What else???
10
EXAMPLE 2
• In a breast cancer study, the main questions of interest might be
• Does any treatment method result in a higher survival rate? Can a
particular treatment be suggested to a woman with specific
characteristics?
• Is there any difference between patients in terms of survival rates
(e.g. Are white women more likely to survive than black
women if they are both at the same stage of disease?)
11
EXAMPLE 3
• In a project investigating the well-being of teenagers after an
economic hardship, the main questions can be
• Is there a positive (and significant) effect of economic problems on
distress?
• Which other factors can be most related to the distress of teenagers?
e.g. age, gender,…?
12
EXAMPLE 4*
New cancer cases in the U.S. based on a cancer registry
• The rows in the registry are called observations; they correspond to
individuals
• The columns are variables or data fields; they correspond to attributes
of the individuals
*https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf
13
Examples of Variables
• Identifier(s):
- patient number,
- visit # or measurement date (if measured more than once)
• Attributes at study start (baseline):
- enrollment date,
- demographics (age, BMI, etc.)
- prior disease history, labs, etc.
- assigned treatment or intervention group
- outcome variable
• Attributes measured at subsequent times
- any variables that may change over time
- outcome variable
14
Data Types and Measurement Scales
• Variables may be one of several types, and have a defined set of
valid values.
• Two main classes of variables are:
Continuous Variables: (Quantitative, numeric).
Continuous data can be rounded or binned to create categorical data.
Categorical Variables: (Discrete, qualitative).
Some categorical variables (e.g. counts) are sometimes treated as
continuous.
15
Categorical Data
• Unordered categorical data (nominal)
2 possible values (binary or dichotomous)
Examples: gender, alive/dead, yes/no.
Greater than 2 possible values - No order to categories
Examples: marital status, religion, country of birth, race.
• Ordered categorical data (ordinal)
Ratings or preferences
Cancer stage
Quality of life scales,
National Cancer Institute's NCI Common Toxicity Criteria
(severity grades 1-5)
Number of copies of a recessive gene (0, 1 or 2)
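In R, the nominal/ordinal distinction can be expressed with unordered and ordered factors. The values below are made up purely for illustration and do not come from any dataset in these slides.

```r
# Nominal: categories with no inherent order (illustrative values).
marital <- factor(c("single", "married", "widowed", "married"))

# Ordinal: an ordered factor, with the order given by `levels`.
stage <- factor(c("II", "I", "IV", "III", "II"),
                levels = c("I", "II", "III", "IV"),
                ordered = TRUE)

is.ordered(marital)   # unordered factor
is.ordered(stage)     # ordered factor

# Comparisons are only meaningful for ordinal data.
early <- stage < "III"
print(early)

# Frequency tables respect the level order.
table(stage)
```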
16
EDA Part 2: Summarizing Data With Tables
and Plots
Examine the entire data set using basic techniques before starting a
formal statistical analysis.
• Familiarize yourself with the data.
• Find possible errors and anomalies.
• Examine the distribution of values for each variable.
17
Summarizing Variables
• Categorical variables
Frequency tables - how many observations in each category?
Relative frequency table - percent in each category.
Bar chart and other plots.
• Continuous variables
Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as
ordered categorical.
Plots specific to continuous variables.
The goal for both categorical and continuous data is data reduction
while preserving/extracting key information about the process under
investigation.
18
Categorical Data Summaries
Tables
Cancer site is a variable taking 5 values
• categorical or continuous?
• ordered or unordered?
19
Frequency Table
• Frequency Table: Categories with counts
• Relative Frequency Table: Percentage in each category
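A minimal sketch of both tables in base R. The cancer-site data from the slides is not available in this transcript, so the number of cylinders in the built-in mtcars data is used as a stand-in categorical variable.

```r
# Frequency table: categories with counts.
freq <- table(mtcars$cyl)
print(freq)

# Relative frequency table: percentage in each category.
rel_freq <- round(100 * prop.table(freq), 1)
print(rel_freq)
```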
20
Graphing a Frequency Table - Bar Chart:
Plot the number of observations in each category:
21
Continuous Data - Tables
Example: Ages of 10 adult leukemia patients:
35; 40; 52; 27; 31; 42; 43; 28; 50; 35
One option is to group these ages into decades and create a categorical
age variable:
22
We can then create a frequency table for this new categorical age
variable.
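The grouping into decades can be sketched with base R's cut(). The ten ages are the ones listed above; the bin labels and the variable names ages and age_group are choices made for this example.

```r
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Group ages into decades: [20,30), [30,40), [40,50), [50,60).
age_group <- cut(ages,
                 breaks = c(20, 30, 40, 50, 60),
                 right = FALSE,
                 labels = c("20-29", "30-39", "40-49", "50-59"))

# Frequency table for the new categorical age variable.
age_table <- table(age_group)
print(age_table)
```

Setting right = FALSE makes each interval closed on the left, so an age of exactly 40 falls in the 40-49 group.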
23
Continuous data - plots
A histogram is a bar chart constructed using the frequencies or relative
frequencies of a grouped (or "binned") continuous variable.
It discards some information (the exact values), retaining only the
frequencies in each "bin".
24
Age histogram of 10 adult leukemia patients
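A histogram like this one could be drawn with base R's hist(), as sketched below. The decade-wide bins are an assumption carried over from the grouping example, not necessarily the bins used in the original figure.

```r
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Histogram with one bin per decade; right = FALSE puts 40 in the 40-50 bin.
h <- hist(ages,
          breaks = seq(20, 60, by = 10),
          right = FALSE,
          main = "Ages of 10 adult leukemia patients",
          xlab = "Age (years)")

# The histogram keeps only the bin counts, not the exact values.
print(h$counts)
```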
25
EXAMPLE 5: Motor Trend Car Road Tests
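The mtcars dataset ships with base R, so a first look at it might go as follows (a sketch; the exact calls shown on the original slide are not preserved in this transcript).

```r
# mtcars is part of the built-in `datasets` package.
data(mtcars)

# Size and structure: 32 cars (rows) and 11 variables (columns).
car_dim <- dim(mtcars)
print(car_dim)
str(mtcars)

# First few observations.
head(mtcars)
```

The help page ?mtcars describes each variable.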
26
27
Running individual summary functions
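A minimal sketch of what running individual summary functions can look like, assuming the mpg column of mtcars is the variable of interest:

```r
mpg <- mtcars$mpg

# Location
mpg_mean   <- mean(mpg)
mpg_median <- median(mpg)

# Spread
mpg_sd    <- sd(mpg)
mpg_range <- range(mpg)                    # minimum and maximum
mpg_quart <- quantile(mpg, c(0.25, 0.75))  # first and third quartiles

print(c(mean = mpg_mean, median = mpg_median, sd = mpg_sd))
print(mpg_range)
print(mpg_quart)
```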
28
Shortcut: the summary() function
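For a numeric vector, summary() returns the minimum, quartiles, median, mean and maximum in one call; for a data frame it summarizes every column. A short sketch, with the choice of columns being an assumption for this example:

```r
# Six-number summary of a single variable.
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Column-by-column summary of several variables at once.
summary(mtcars[, c("mpg", "hp", "wt")])
```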
29
Tabulate counts with table()
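With two arguments, table() cross-tabulates the counts. The sketch below assumes cylinders and gears as the two categorical variables; addmargins() appends row and column totals.

```r
# Two-way table of counts: cylinders (rows) by gears (columns).
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
print(tab)

# Same table with row and column totals added.
addmargins(tab)
```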
30
Table()
31
Plotting Functions
R has several distinct plotting systems
Base R functions
• hist()
• barplot()
• boxplot()
• plot()
lattice package
ggplot2 package
32
Boxplot
> boxplot(mtcars$mpg, main = "Miles per Gallon")
33
• The boxplot function can also take a formula as an argument: mpg ~ cyl,
read as "mpg conditional on cyl"
> boxplot(mpg ~ cyl,
+ data = mtcars,
+ main = "Miles per Gallon by Number of Cylinders",
+ xlab = "Cylinders",
+ ylab = "Miles per Gallon")
34
> # Expand the formula
> boxplot(mpg ~ cyl + am,
+ data = mtcars,
+ main = "MPG by Number of Cylinders & Transmissions")
35
Histogram
Takes a vector, and plots the distribution of values
> hist(mtcars$mpg)
36
Bar Chart
Use the table function to create a two-way frequency table, and
plotting options to group bars
> counts <- table(mtcars$cyl, mtcars$am)
> colnames(counts) <- c("Auto", "Manual")
> barplot(counts,
+ main = "Number of Cars by Transmission and Cylinders",
+ xlab = "Transmission",
+ beside = TRUE,
+ legend = rownames(counts))
37
Scatterplot
> plot(mtcars$mpg,
+ mtcars$hp,
+ xlab = "Miles per Gallon",
+ ylab = "Horsepower")
38
> # create a vector for conditional color coding
> colorcode <- ifelse(mtcars$am == 0, "red", "blue")
> plot(mtcars$mpg,
+ mtcars$hp,
+ xlab = "Miles per Gallon",
+ ylab = "Horsepower",
+ col = colorcode)
39
Lattice graphics*
• lattice is an add-on package that implements Trellis graphics (originally developed
for S and S-PLUS) in R. It is a powerful and elegant high-level data visualization
system, with an emphasis on multivariate data.
• To fix ideas, we start with a few simple examples. We use the Chem97 dataset
from the mlmRev package.
> library(mlmRev)
> data(Chem97, package = "mlmRev")
> head(Chem97)
  lea school student score gender age gcsescore   gcsecnt
1   1      1       1     4      F   3     6.625 0.3393157
2   1      1       2    10      F  -3     7.625 1.3393157
3   1      1       3    10      F  -4     7.250 0.9643157
4   1      1       4    10      F  -2     7.500 1.2143157
5   1      1       5     8      F  -1     6.444 0.1583157
6   1      1       6    10      F   4     7.750 1.4643157
40
*All notes related to lattice graphics: https://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
Variables in CHEM97 Data
• A data frame with 31022 observations on the following 8 variables.
• lea: Local Education Authority - a factor
• school: School identifier - a factor
• student: Student identifier - a factor
• score: Point score on A-level Chemistry in 1997
• gender: Student's gender
• age: Age in months, centred at 222 months or 18.5 years
• gcsescore: Average GCSE score of individual.
• gcsecnt: Average GCSE score of individual, centered at mean.
41
Lattice graphics
• The dataset records information on students appearing in the 1997 A-
level chemistry examination in Britain.
• We are only interested in the following variables:
• score: point score in the A-level exam, with six possible values (0, 2, 4, 6, 8, 10).
• gcsescore: average score in GCSE exams. This is a continuous score that may
be used as a predictor of the A-level score.
• gender: gender of the student.
• Using lattice, we can draw a histogram of all the gcsescore values
using
> library(lattice)
> histogram(~ gcsescore, data = Chem97)
42
Lattice graphics
histogram(~ gcsescore, data = Chem97)
This plot shows a reasonably symmetric unimodal distribution, but is
otherwise uninteresting. A more interesting display would be one
where the distribution of gcsescore is compared across different
subgroups, say those defined by the A-level exam score.
43
Lattice graphics
> histogram(~ gcsescore | factor(score), data = Chem97)
44
Lattice graphics
• More effective comparison is enabled by direct superposition. This is
hard to do with conventional histograms, but easier using kernel
density estimates. In the following example, we use the same
subgroups as before in the different panels, but additionally subdivide
the gcsescore values by gender within each panel.
45
Lattice graphics
> densityplot(~ gcsescore | factor(score), Chem97, groups = gender,
+ plot.points = FALSE, auto.key = TRUE)
46
Lattice graphics
• Several standard statistical graphics are intended to visualize the
distribution of a continuous random variable. We have already seen
histograms and density plots, which are both estimates of the probability
density function. Another useful display is the normal Q-Q plot, which is
related to the distribution function F(x) = P(X ≤ x). Normal Q-Q plots can be
produced by the lattice function qqmath().
• Normal Q-Q plots plot empirical quantiles of the data against quantiles of
the normal distribution (or some other theoretical distribution). They can
be regarded as an estimate of the distribution function F, with the
probability axis transformed by the normal quantile function. They are
designed to detect departures from normality; for a good fit, the points lie
approximately along a straight line. In the Q-Q plots of gcsescore, the systematic
convexity suggests that the distributions are left-skewed, and the change in
slopes suggests changing variance.
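To make the construction concrete, a normal Q-Q plot can be built by hand in base R: sort the data to get the empirical quantiles and pair them with normal quantiles at the probabilities given by ppoints(). This sketch uses mtcars$mpg as a stand-in variable (an assumption; it is not the Chem97 data) and base graphics rather than lattice.

```r
x <- mtcars$mpg
n <- length(x)

# Probability points at which both sets of quantiles are evaluated.
p <- ppoints(n)

# Theoretical quantiles of the standard normal distribution.
theoretical <- qnorm(p)

# Empirical quantiles: the sorted observations.
empirical <- sort(x)

plot(theoretical, empirical,
     xlab = "Normal quantiles", ylab = "Sample quantiles (mpg)",
     main = "Normal Q-Q plot built by hand")

# Reference line through the first and third quartiles.
qqline(x)

# The built-in qqnorm() computes the same coordinates.
builtin <- qqnorm(x, plot.it = FALSE)
```

Points that bend away from the reference line at one end indicate skewness or heavy tails.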
47
Lattice graphics
> qqmath(~ gcsescore | factor(score), Chem97, groups = gender,
+ f.value = ppoints(100), auto.key = list(columns = 2),
+ type = c("p", "g"), aspect = "xy")
48
Lattice graphics
The type argument adds a common reference grid to each panel that makes
it easier to see the upward shift in gcsescore across panels. The aspect
argument automatically computes an aspect ratio. Two-sample Q-Q plots
compare quantiles of two samples (rather than one sample and a theoretical
distribution). They can be produced by the lattice function qq(), with a
formula that has two primary variables. In the formula y ~ x, y needs to be a
factor with two levels, and the samples compared are the subsets of x for the
two levels of y. For example, we can compare the distributions of gcsescore
for males and females, conditioning on A-level score:
> qq(gender ~ gcsescore | factor(score), Chem97,
+ f.value = ppoints(100), type = c("p", "g"), aspect = 1)
49
50
The plot suggests that females do
better than males in the GCSE
exam for a given A-level score (in
other words, males tend to
improve more from the GCSE exam
to the A-level exam), and also have
smaller variance (except in the first
panel).
Lattice graphics
• A well-known graphical design that allows comparison between an
arbitrary number of samples is the comparative box-and-whisker plot.
• Box-and-whisker plots can be produced by the lattice function
bwplot().
> bwplot(factor(score) ~ gcsescore | gender, Chem97)
51
52
The decreasing lengths of the boxes
and whiskers suggest decreasing
variance, and the large number of
outliers on one side indicates heavier
left tails (characteristic of a left-
skewed distribution).
> bwplot(gcsescore ~ gender | factor(score), Chem97, layout = c(6, 1))
53

SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 

EXPLORATORY DATA ANALYSIS

  • 2. WHAT IS EDA? • The analysis of datasets based on various numerical methods and graphical tools. • Exploring data for patterns, trends, underlying structure, deviations from the trend, anomalies and strange structures. • It facilitates discovering the unexpected as well as confirming the expected. • Another definition: An approach/philosophy for data analysis that employs a variety of techniques (mostly graphical). 2
  • 3. 3
  • 4. AIM OF THE EDA • Maximize insight into a dataset • Uncover underlying structure • Extract important variables • Detect outliers and anomalies • Test underlying assumptions • Develop valid models • Determine optimal factor settings (Xs) 4
  • 5. AIM OF THE EDA • The goal of EDA is to open-mindedly explore data. • Tukey: EDA is detective work… Unless the detective finds the clues, the judge or jury has nothing to consider. • Here, the judge or jury is confirmatory data analysis. • Tukey: Confirmatory data analysis goes further, assessing the strength of the evidence. • With EDA, we can examine data and try to understand the meaning of the variables, e.g. what the abbreviations stand for. 5
  • 6. Exploratory vs Confirmatory Data Analysis EDA: • No hypothesis at first • Generate hypothesis • Uses graphical methods (mostly) CDA: • Start with hypothesis • Test the null hypothesis • Uses statistical models 6
  • 7. STEPS OF EDA • Generate good research questions • Data restructuring: You may need to make new variables from the existing ones. • Instead of using two variables, obtaining rates or percentages of them • Creating dummy variables for categorical variables • Based on the research questions, use appropriate graphical tools and obtain descriptive statistics. Try to understand the data structure, relationships, anomalies, unexpected behaviors. • Try to identify confounding variables, interactions and multicollinearity, if any. • Handle missing observations • Decide on the need for transformation (on response and/or explanatory variables). • Decide on the hypothesis based on your research questions 7
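The restructuring steps on this slide might look as follows in R. This is only a minimal sketch on the built-in mtcars data (the dataset the deck uses later); the derived names `power_to_weight` and `log_hp` are hypothetical and chosen purely for illustration.

```r
# Sketch of the data-restructuring steps, using the built-in mtcars data.
cars <- mtcars

# A ratio of two existing variables instead of the two variables themselves
# (power_to_weight is an illustrative name, not from the slides).
cars$power_to_weight <- cars$hp / cars$wt

# Dummy (indicator) variables for a categorical variable;
# the first column (the intercept) is dropped.
cyl_dummies <- model.matrix(~ factor(cyl), data = cars)[, -1]

# Count missing observations per variable before deciding how to handle them.
missing_per_column <- colSums(is.na(cars))

# A log transformation, one common choice for a positive, skewed variable.
cars$log_hp <- log(cars$hp)
```

Whether a ratio, a dummy coding or a transformation is appropriate depends on the research question; the calls above only show the mechanics.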
  • 8. AFTER EDA • Confirmatory Data Analysis: Verify the hypothesis by statistical analysis • Draw conclusions and present your results clearly. 8
  • 9. Classification of EDA* • Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. And second, each method is either univariate or multivariate (usually just bivariate). • Non-graphical methods generally involve calculation of summary statistics, while graphical methods obviously summarize the data in a diagrammatic or pictorial way. • Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. • It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA. *Seltman, H.J. (2015). Experimental Design and Analysis. http://www.stat.cmu.edu/~hseltman/309/Book/Book.pdf 9
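The cross-classification above can be illustrated with one call per cell. The sketch below assumes the built-in mtcars data and base R only; the choice of `mpg` and `wt` as example variables is arbitrary.

```r
# One illustrative call per cell of the cross-classification (mtcars data).
uni_nongraphical   <- summary(mtcars$mpg)         # univariate, non-graphical
multi_nongraphical <- cor(mtcars$mpg, mtcars$wt)  # bivariate, non-graphical

hist(mtcars$mpg)                                  # univariate, graphical
plot(mtcars$wt, mtcars$mpg)                       # bivariate, graphical
```

Running the univariate calls first follows the slide's advice to look at each variable on its own before looking at relationships.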
  • 10. EXAMPLE 1 Data from the Places Rated Almanac (Boyer and Savageau, 1985) 9 variables for 329 metropolitan areas in the USA 1. Climate mildness 2. Housing cost 3. Health care and environment 4. Crime 5. Transportation supply 6. Educational opportunities and effort 7. Arts and culture facilities 8. Recreational opportunities 9. Personal economic outlook + latitude and longitude of each city Questions: 1. How is climate related to location? 2. Are there clusters in the data (excluding location)? 3. Are nearby cities similar? 4. Any relation between economic outlook and crime? 5. What else??? 10
  • 11. EXAMPLE 2 • In breast cancer research, the main questions of interest might be • Does any treatment method result in a higher survival rate? Can a particular treatment be suggested to a woman with specific characteristics? • Is there any difference between patients in terms of survival rates (e.g. Are white women more likely to survive than black women if both are at the same stage of disease?) 11
  • 12. EXAMPLE 3 • In a project investigating the well-being of teenagers after an economic hardship, the main questions can be • Is there a positive (and significant) effect of economic problems on distress? • Which other factors are most related to the distress of teenagers, e.g. age, gender,…? 12
  • 13. EXAMPLE 4* New cancer cases in the U.S. based on a cancer registry • The rows in the registry are called observations; they correspond to individuals • The columns are variables or data fields; they correspond to attributes of the individuals *https://www.biostat.wisc.edu/~lindstro/2.EDA.9.10.pdf 13
  • 14. Examples of Variables • Identifier(s): - patient number, - visit # or measurement date (if measured more than once) • Attributes at study start (baseline): - enrollment date, - demographics (age, BMI, etc.) - prior disease history, labs, etc. - assigned treatment or intervention group - outcome variable • Attributes measured at subsequent times - any variables that may change over time - outcome variable 14
  • 15. Data Types and Measurement Scales • Variables may be one of several types, and have a defined set of valid values. • Two main classes of variables are: Continuous Variables: (Quantitative, numeric). Continuous data can be rounded or binned to create categorical data. Categorical Variables: (Discrete, qualitative). Some categorical variables (e.g. counts) are sometimes treated as continuous. 15
  • 16. Categorical Data • Unordered categorical data (nominal) 2 possible values (binary or dichotomous) Examples: gender, alive/dead, yes/no. Greater than 2 possible values - No order to categories Examples: marital status, religion, country of birth, race. • Ordered categorical data (ordinal) Ratings or preferences Cancer stage Quality of life scales, National Cancer Institute's NCI Common Toxicity Criteria (severity grades 1-5) Number of copies of a recessive gene (0, 1 or 2) 16
  • 17. EDA Part 2: Summarizing Data With Tables and Plots Examine the entire data set using basic techniques before starting a formal statistical analysis. • Familiarize yourself with the data. • Find possible errors and anomalies. • Examine the distribution of values for each variable. 17
  • 18. Summarizing Variables • Categorical variables Frequency tables - how many observations in each category? Relative frequency table - percent in each category. Bar chart and other plots. • Continuous variables Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical. Plots specific to continuous variables. The goal for both categorical and continuous data is data reduction while preserving/extracting key information about the process under investigation. 18
  • 19. Categorical Data Summaries Tables Cancer site is a variable taking 5 values • categorical or continuous? • ordered or unordered? 19
  • 20. Frequency Table • Frequency Table: Categories with counts • Relative Frequency Table: Percentage in each category 20
  • 21. Graphing a Frequency Table - Bar Chart: Plot the number of observations in each category: 21
  • 22. Continuous Data - Tables Example: Ages of 10 adult leukemia patients: 35; 40; 52; 27; 31; 42; 43; 28; 50; 35 One option is to group these ages into decades and create a categorical age variable: 22
  • 23. We can then create a frequency table for this new categorical age variable. 23
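The grouping and the resulting frequency table could be produced along these lines. The slide's own table is an image and its exact bin boundaries are not recoverable from the transcript, so the decade bins below (20-29, 30-39, 40-49, 50-59) are an assumption.

```r
# Ages of the 10 adult leukemia patients from the previous slide.
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Group into decades; right = FALSE makes the bins [20,30), [30,40), ...
# (bin boundaries and labels are assumed, not taken from the slide).
age_group <- cut(ages, breaks = c(20, 30, 40, 50, 60), right = FALSE,
                 labels = c("20-29", "30-39", "40-49", "50-59"))

freq     <- table(age_group)   # frequency table: counts per decade
rel_freq <- prop.table(freq)   # relative frequency table: proportions
```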
  • 24. Continuous data - plots A histogram is a bar chart constructed using the frequencies or relative frequencies of a grouped (or "binned") continuous variable. It discards some information (the exact values), retaining only the frequencies in each "bin". 24
  • 25. Age histogram of 10 adult leukemia patients 25
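A histogram like the one on this slide can be drawn with base R's `hist()`. This is a sketch only: the slide image is not available, so the decade-wide bins, title and axis label are assumptions.

```r
# Ages of the 10 adult leukemia patients (from slide 22).
ages <- c(35, 40, 52, 27, 31, 42, 43, 28, 50, 35)

# Decade-wide bins (assumed); right = FALSE puts 40 in the 40-50 bin.
h <- hist(ages, breaks = seq(20, 60, by = 10), right = FALSE,
          main = "Ages of 10 adult leukemia patients",
          xlab = "Age (years)")

# The object returned by hist() keeps only the bin counts,
# not the exact ages.
h$counts
```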
  • 26. EXAMPLE 5: Motor Trend Car Road Tests 26
  • 27. 27
  • 29. Shortcut: the summary() function 29
  • 30. Tabulate counts with table() 30
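The code from these two slides did not survive in the transcript. A minimal sketch of what calls to `summary()` and `table()` on the mtcars data look like is given below; the particular columns chosen (`mpg`, `cyl`, `am`) are an assumption.

```r
# mtcars ships with base R (datasets package).
data(mtcars)

# Five-number summary plus the mean for a continuous variable.
mpg_summary <- summary(mtcars$mpg)

# One-way counts for a categorical variable.
cyl_counts <- table(mtcars$cyl)

# Two-way counts: cylinders (rows) by transmission type (columns).
cyl_by_am <- table(mtcars$cyl, mtcars$am)
```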
  • 32. Plotting Functions R has several distinct plotting systems Base R functions • hist() • barplot() • boxplot() • plot() lattice package ggplot2 package 32
  • 33. Boxplot > boxplot(mtcars$mpg, main = "Miles per Gallon") 33
  • 34. • The boxplot function can also take a formula as an argument: mpg ~ cyl reads "mpg conditional on cyl" > boxplot(mpg ~ cyl, + data = mtcars, + main = "Miles per Gallon by Number of Cylinders", + xlab = "Cylinders", + ylab = "Miles per Gallon") 34
  • 35. > # Expand the formula > boxplot(mpg ~ cyl + am, + data = mtcars, + main = "MPG by Number of Cylinders & Transmissions") 35
  • 36. Histogram Takes a vector, and plots the distribution of values > hist(mtcars$mpg) 36
  • 37. Bar Chart Use the table function to create a two-way frequency table, and plotting options to group bars > counts <- table(mtcars$cyl, mtcars$am) > colnames(counts) <- c("Auto", "Manual") > barplot(counts, + main = "Number of Cars by Transmission and Cylinders", + xlab = "Transmission", + beside = TRUE, + legend = rownames(counts)) 37
  • 38. Scatterplot > plot(mtcars$mpg, + mtcars$hp, + xlab = "Miles per Gallon", + ylab = "Horsepower") 38
  • 39. > # create a vector for conditional color coding > colorcode <- ifelse(mtcars$am == 0, "red", "blue") > plot(mtcars$mpg, + mtcars$hp, + xlab = "Miles per Gallon", + ylab = "Horsepower", + col = colorcode) 39
  • 40. Lattice graphics* • lattice is an add-on package that implements Trellis graphics (originally developed for S and S-PLUS) in R. It is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data. • To fix ideas, we start with a few simple examples. We use the Chem97 dataset from the mlmRev package. > library(mlmRev) > data(Chem97, package = "mlmRev") > head(Chem97) lea school student score gender age gcsescore gcsecnt 1 1 1 1 4 F 3 6.625 0.3393157 2 1 1 2 10 F -3 7.625 1.3393157 3 1 1 3 10 F -4 7.250 0.9643157 4 1 1 4 10 F -2 7.500 1.2143157 5 1 1 5 8 F -1 6.444 0.1583157 6 1 1 6 10 F 4 7.750 1.4643157 40 *All notes related to lattice graphics: https://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
  • 41. Variables in Chem97 Data • A data frame with 31022 observations on the following 8 variables. • lea: Local Education Authority - a factor • school: School identifier - a factor • student: Student identifier - a factor • score: Point score on A-level Chemistry in 1997 • gender: Student's gender • age: Age in months, centred at 222 months or 18.5 years • gcsescore: Average GCSE score of individual. • gcsecnt: Average GCSE score of individual, centred at mean. 41
  • 42. Lattice graphics • The dataset records information on students appearing in the 1997 A-level chemistry examination in Britain. • We are only interested in the following variables: • score: point score in the A-level exam, with six possible values (0, 2, 4, 6, 8, 10). • gcsescore: average score in GCSE exams. This is a continuous score that may be used as a predictor of the A-level score. • gender: gender of the student. • Using lattice, we can draw a histogram of all the gcsescore values using > library(lattice) > histogram(~ gcsescore, data = Chem97) 42
  • 43. Lattice graphics histogram(~ gcsescore, data = Chem97) This plot shows a reasonably symmetric unimodal distribution, but is otherwise uninteresting. A more interesting display would be one where the distribution of gcsescore is compared across different subgroups, say those defined by the A-level exam score. 43
  • 44. Lattice graphics > histogram(~ gcsescore | factor(score), data = Chem97) 44
  • 45. Lattice graphics • More effective comparison is enabled by direct superposition. This is hard to do with conventional histograms, but easier using kernel density estimates. In the following example, we use the same subgroups as before in the different panels, but additionally subdivide the gcsescore values by gender within each panel. 45
  • 46. Lattice graphics > densityplot(~ gcsescore | factor(score), Chem97, groups = gender, plot.points = FALSE, auto.key = TRUE) 46
  • 47. Lattice graphics • Several standard statistical graphics are intended to visualize the distribution of a continuous random variable. We have already seen histograms and density plots, which are both estimates of the probability density function. Another useful display is the normal Q-Q plot, which is related to the distribution function F(x) = P(X ≤ x). Normal Q-Q plots can be produced by the lattice function qqmath(). • Normal Q-Q plots plot empirical quantiles of the data against quantiles of the normal distribution (or some other theoretical distribution). They can be regarded as an estimate of the distribution function F, with the probability axis transformed by the normal quantile function. They are designed to detect departures from normality; for a good fit, the points lie approximately along a straight line. In the resulting plot, the systematic convexity suggests that the distributions are left-skewed, and the change in slopes suggests changing variance. 47
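To make the construction concrete, a normal Q-Q plot can be assembled by hand in base R: sort the data to get the empirical quantiles and pair them with normal quantiles at the same probability points. The sketch below uses `mtcars$mpg` rather than the Chem97 data so that it runs without the mlmRev package; that substitution is an assumption made for illustration.

```r
# Any numeric sample will do; mpg is used here for convenience.
x <- mtcars$mpg
n <- length(x)

# Empirical quantiles: the sorted data.
empirical <- sort(x)

# Matching theoretical quantiles of the standard normal distribution,
# evaluated at the plotting positions given by ppoints().
theoretical <- qnorm(ppoints(n))

# Points close to a straight line indicate approximate normality.
plot(theoretical, empirical,
     xlab = "Normal quantiles", ylab = "Sample quantiles")
qqline(x)  # reference line through the first and third quartiles
```

The same pairing is what `qqnorm()` computes; `qqmath()` in lattice adds conditioning and grouping on top of it.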
  • 48. Lattice graphics > qqmath(~ gcsescore | factor(score), Chem97, groups = gender, + f.value = ppoints(100), auto.key = list(columns = 2), + type = c("p", "g"), aspect = "xy") 48
  • 49. Lattice graphics The type argument adds a common reference grid to each panel that makes it easier to see the upward shift in gcsescore across panels. The aspect argument automatically computes an aspect ratio. Two-sample Q-Q plots compare quantiles of two samples (rather than one sample and a theoretical distribution). They can be produced by the lattice function qq(), with a formula that has two primary variables. In the formula y ~ x, y needs to be a factor with two levels, and the samples compared are the subsets of x for the two levels of y. For example, we can compare the distributions of gcsescore for males and females, conditioning on A-level score: > qq(gender ~ gcsescore | factor(score), Chem97, + f.value = ppoints(100), type = c("p", "g"), aspect = 1) 49
  • 50. 50 The plot suggests that females do better than males in the GCSE exam for a given A-level score (in other words, males tend to improve more from the GCSE exam to the A-level exam), and also have smaller variance (except in the first panel).
  • 51. Lattice graphics • A well-known graphical design that allows comparison between an arbitrary number of samples is the comparative box-and-whisker plot. • Box-and-whisker plots can be produced by the lattice function bwplot(). > bwplot(factor(score) ~ gcsescore | gender, Chem97) 51
  • 52. 52 The decreasing lengths of the boxes and whiskers suggest decreasing variance, and the large number of outliers on one side indicates heavier left tails (characteristic of a left-skewed distribution).
  • 53. > bwplot(gcsescore ~ gender | factor(score), Chem97, layout = c(6, 1)) 53