Exploratory data analysis

The basics
Presented by Professor Peter Reimann
Centre for Research on Learning and Cognition
The University of Sydney

EDA is an inquiry cycle
– Generate questions
– Search for answers in the data: visualize, transform, model the data
– Refine questions, then repeat
EDA is an important component of theory-driven, problem-driven, and curiosity-driven research.

Where do questions come from?
An important source of questions about data is hypotheses derived from theory:
Data ← Hypotheses ← Theory
Another source is problems:
Data ← Questions ← Problem(s)
A third source is the data themselves:
Data ← Questions ← Data

Models of data
EDA plays a role in all three scenarios.
– Theories do not get compared with data as such, but with models of data:
Data → (EDA) → Data model(s) ← Hypotheses ← Theory
And similarly for the other cases:
Data → (EDA) → Data model(s) ← Questions ← Problem(s)
Data → (EDA) → Data model(s) ← Questions ← Data

Data are not “objective”
– Measurements and observations are not theory- or assumption-free;
– There’s more than one way to build a (statistical) model of any data set;
– While the data may support a theory, they likely support many other theories as well;
– While a data set may support a theory, it could also contain relations that contradict the theory.
Hence, even if your data are carefully selected and measured, and you think you know them well, it is important to look for the unexpected!

The exploratory perspective
Key assumption: The more one knows about the data, the more effectively the data can be used to
– develop, test and refine theory,
– solve problems, and
– ask interesting questions.
To maximise what is learned from data, one needs to adhere to two principles:
– scepticism, and
– openness.
One should be sceptical, for instance about the assumption that specific
statistical parameters (i.e., summaries of data, such as the mean) reflect data
faithfully, and open to different interpretations of what the data say.
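As a minimal sketch of why a single summary can fail to reflect the data faithfully, consider two small made-up samples (the vectors `a` and `b` are hypothetical illustrations, not from the slides) that share the same mean but look nothing alike:

```r
# Two hypothetical samples with the same mean
a <- c(48, 49, 50, 51, 52)   # values clustered tightly around 50
b <- c(10, 10, 10, 10, 210)  # four small values and one extreme value

mean_a <- mean(a)
mean_b <- mean(b)

# The median is barely affected by the extreme value in b
median_a <- median(a)
median_b <- median(b)

print(c(mean_a = mean_a, mean_b = mean_b))
print(c(median_a = median_a, median_b = median_b))
```

The means agree, yet the medians differ widely, which is a hint that the mean alone says little about how the values in `b` are distributed.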

Be sceptical! Be open!
One reason to be sceptical about statistics in particular is Anscombe’s Quartet:
– Four datasets with (almost) identical statistics, but very different shapes.
Figure: Anscombe’s Quartet. By Anscombe, https://commons.wikimedia.org/w/index.php?curid=9838454
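The quartet can be inspected directly, since base R ships it as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). The following is a sketch, not part of the original slides; the choice of which statistics to compare is an assumption made for illustration:

```r
# anscombe is part of the datasets package that is loaded with base R
data(anscombe)

# Compute the same summary statistics for each of the four x/y pairs
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), var_x = var(x), cor_xy = cor(x, y))
})
colnames(stats) <- paste0("set", 1:4)

# One column per dataset; the columns are almost identical
print(round(stats, 2))
```

Plotting each pair (for example with `plot(anscombe$x1, anscombe$y1)`) then shows how different the four shapes are despite the matching numbers.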

Be sceptical! Be open! (cont.)
– Statistics (= summative accounts of data) can be misleading
– Data analysis is not identical with statistics:
– Visual analysis should precede statistical analysis
Stay open to multiple interpretations!
– The confirmatory, or hypothesis-testing, approach to data analysis can keep one from seeing what other patterns might exist in the data.
In addition to asking:
– Do these data confirm or disconfirm my hypothesis about x?
Ask:
– What can these data tell me about x?

Model and outliers
The basic way of thinking about data:
Data = pattern + deviations
(model + outliers)
(smooth + rough)
Data analysis, including statistical analysis, means partitioning data into patterns/models/smooths and deviations/outliers/roughs.
For any given data, there are in principle many ways to do this partitioning, and there is no logical reason to prefer one over the others a priori. Hence the analysis process is incremental, not a single hypothesis-testing step.
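To make the "data = smooth + rough" idea concrete, here is a small sketch on synthetic data (the data and both model choices are assumptions made for illustration). It partitions the same values in two different ways: a straight-line fit and a running median. In both cases the smooth and the rough add back up to the data.

```r
set.seed(1)                       # synthetic, hypothetical data
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)

# Partition 1: smooth = fitted straight line, rough = residuals
fit     <- lm(y ~ x)
smooth1 <- unname(fitted(fit))
rough1  <- unname(residuals(fit))

# Partition 2: smooth = running median over 5 points, rough = what is left
smooth2 <- as.numeric(runmed(y, k = 5))
rough2  <- y - smooth2

# Both partitions reproduce the data exactly
print(all.equal(smooth1 + rough1, y))
print(all.equal(smooth2 + rough2, y))
```

Neither partition is the correct one as such; which is more useful depends on the question being asked of the data.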

Our tools for EDA
– dplyr: selecting, filtering, summarising data
– ggplot2: visualising data, patterns, trends.

Data selection with dplyr
dplyr is made up of 5 verbs, which operate on a data frame of observations (rows) and variables (columns):
(1) select variables
(2) filter on values
(3) arrange rows
(4) mutate: create new variables
(5) summarize over values

              | Variable A | (…) | Variable v
Observation 1 | Value 1A   | (…) | Value 1v
Observation 2 | Value 2A   | (…) | Value 2v
(…)           | (…)        | (…) | (…)
Observation o | Value oA   | (…) | Value ov

“Sentences” in dplyr
General format: verb(data frame, parameters)
– The result is a new data frame: new_frame <- verb(data, parameter).
Examples:
– filter(flights, month == 1, day == 1)
– arrange(flights, year, month, day)
– select(flights, year, month, day)
– mutate(flights, gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
– summarize(flights, delay = mean(dep_delay))
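The five sentences above can be tried out on a small stand-in table. This sketch assumes dplyr is installed; the four-row `flights` data frame below is a hypothetical miniature that only borrows the column names used in the examples, not the real data set.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data frame
flights <- data.frame(
  year      = 2013,
  month     = c(1, 1, 2, 1),
  day       = c(1, 2, 1, 1),
  dep_delay = c(2, 130, -3, 10),
  arr_delay = c(11, 150, -10, 5),
  distance  = c(1400, 1416, 1089, 762),
  air_time  = c(227, 227, 160, 116)
)

jan1      <- filter(flights, month == 1, day == 1)    # keep rows for 1 January
sorted    <- arrange(flights, year, month, day)       # reorder rows by date
dates     <- select(flights, year, month, day)        # keep three columns
with_gain <- mutate(flights,
                    gain  = arr_delay - dep_delay,    # new column from existing ones
                    speed = distance / air_time * 60)
overall   <- summarize(flights, delay = mean(dep_delay))  # collapse to one row

print(jan1)
print(overall)
```

Each verb returns a new data frame and leaves `flights` unchanged. If `flights` is the table from the nycflights13 package (an assumption; the slides do not name the source), `dep_delay` contains missing values there, so `mean(dep_delay, na.rm = TRUE)` would be needed to get a number rather than `NA`.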

Boolean operations are supported for filtering
and selecting
! is “not”, | is “or”, & is “and”
These two return the same observations:
– filter(flights, !(arr_delay > 120 | dep_delay > 120))
– filter(flights, arr_delay <= 120, dep_delay <= 120)
For more on these commands, see for instance https://www.youtube.com/watch?v=aywFompr1F4
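The equivalence of the two filters can be checked on a small example. This is a sketch assuming dplyr is installed; the `flights` data frame below is a hypothetical five-row stand-in (with an `id` column added for illustration), including one missing value to show that both forms drop such rows.

```r
library(dplyr)

# Hypothetical stand-in data; id is added only to track which rows survive
flights <- data.frame(
  id        = 1:5,
  arr_delay = c(10, 130, 10, 120, NA),
  dep_delay = c(5, 5, 200, 120, 5)
)

# "Not (late arrival or late departure)"
a <- filter(flights, !(arr_delay > 120 | dep_delay > 120))

# "On-time arrival and on-time departure"; comma-separated conditions are combined with &
b <- filter(flights, arr_delay <= 120, dep_delay <= 120)

print(a$id)
print(identical(a$id, b$id))
```

Rows 1 and 4 are kept by both filters. Row 5 is dropped by both because the condition evaluates to `NA`, and `filter()` only keeps rows where the condition is `TRUE`.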

Workbook
– The rest of this module is mainly in the workbook.
