The University of Sydney Page 1
Exploratory data analysis
The basics
Presented by Professor Peter Reimann
Centre for Research on Learning and Cognition

EDA is an inquiry cycle
(1) Generate questions
(2) Search for answers in the data: visualize, transform, model the data
(3) Refine questions, then go through the cycle again
EDA is an important component of theory-driven, problem-driven, and curiosity-driven research.

Where do questions come from?
An important source of questions about data is hypotheses derived from theory:
(Diagram: Data, Hypotheses, Theory)
Another source is problems:
(Diagram: Data, Questions, Problem(s))
A third source is the data themselves:
(Diagram: Data, Questions, Data)

Models of data
EDA plays a role in all three scenarios.
– Theories do not get compared with data as such, but with models of data:
(Diagram: Data, EDA, Data model(s), Hypotheses, Theory)
And similarly for the other cases:
(Diagram: Data, EDA, Data model(s), Questions, Problem(s))
(Diagram: Data, EDA, Data model(s), Questions, Data)

Data are not “objective”
– Measurements and observations are not theory- or assumption-free;
– There is more than one way to build a (statistical) model of any data set;
– While the data may support a theory, they likely support many other theories as well;
– While a data set may support a theory, it could also contain relations that contradict the theory.
Hence, even if your data are carefully selected and measured, and you think you know them well, it is important to look for the unexpected!

The exploratory perspective
Key assumption: The more one knows about the data, the more effectively the data can be used to
– develop, test and refine theory,
– solve problems, and
– ask interesting questions.
To maximise what is learned from data, one needs to adhere to two principles:
– scepticism, and
– openness.
One should be sceptical, for instance, about the assumption that specific statistical parameters (i.e., summaries of data, such as the mean) reflect the data faithfully, and open to different interpretations of what the data say.

Be sceptical! Be open!
One reason to be sceptical about statistics in particular is Anscombe’s Quartet:
– Four datasets with (almost) identical summary statistics, but very different shapes.
By Anscombe https://commons.wikimedia.org/w/index.php?curid=9838454
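
The claim about near-identical statistics can be checked directly. The following is a minimal sketch, assuming the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`); the name `stats` and the choice of summaries are illustrative, not taken from the slides.

```r
# Summarise each of the four Anscombe datasets with the same statistics.
# `anscombe` is part of base R's datasets package; no extra packages needed.
stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    mean_x  = mean(x),
    mean_y  = mean(y),
    var_x   = var(x),
    var_y   = var(y),
    cor_xy  = cor(x, y)
  )
}))

# Rounded to two decimals, the four rows are very hard to tell apart.
print(round(stats, 2))
```

The summaries alone give no hint that the four scatterplots look completely different, which is the reason for the scepticism urged here.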

Be sceptical! Be open! (cont.)
– Statistics (= summary accounts of data) can be misleading
– Data analysis is not identical with statistics:
– Visual analysis should precede statistical analysis
Stay open to multiple interpretations!
– The confirmatory, or hypothesis-testing, mode of data analysis can keep one from seeing what other patterns might exist in the data.
In addition to asking:
– Do these data confirm or disconfirm my hypothesis about x?
Ask:
– What can these data tell me about x?
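
Looking at the data before summarising them might be sketched as follows. This uses base R graphics and the built-in `anscombe` data frame as assumptions made for the sake of a dependency-free example (the deck's own plotting tool is ggplot2); `plot_quartet` is a hypothetical helper name.

```r
# Hypothetical helper: draw the four Anscombe datasets side by side,
# each with its least-squares line, and return the fitted slopes.
plot_quartet <- function() {
  op <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
  on.exit(par(op))             # restore the graphics settings afterwards
  slopes <- numeric(4)
  for (i in 1:4) {
    x <- anscombe[[paste0("x", i)]]
    y <- anscombe[[paste0("y", i)]]
    fit <- lm(y ~ x)
    # Same axis limits in every panel so the shapes are comparable
    plot(x, y, main = paste("Dataset", i), xlim = c(3, 20), ylim = c(2, 14))
    abline(fit)
    slopes[i] <- unname(coef(fit)[2])
  }
  slopes
}

slopes <- plot_quartet()
round(slopes, 2)
```

The fitted lines are practically the same in all four panels, while the point clouds are not; only the plot reveals that.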

Model and outliers
The basic way of thinking about data:
Data = pattern + deviations
(model + outliers)
(smooth + rough)
Data analysis, including statistical analysis, means partitioning data into patterns/models/smooths and deviations/outliers/roughs.
For any given data, there are in principle many ways to do this partitioning, and there is no logical reason to prefer one over the other a priori. Hence, the analysis process is incremental, not a single hypothesis-testing step.
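
The partitioning idea can be made concrete with a small sketch. It assumes the first dataset of base R's built-in `anscombe` data frame and compares two of the many possible partitions; the variable names are illustrative only.

```r
x <- anscombe$x1
y <- anscombe$y1

# Partition A: the pattern is a single number, the mean of y.
pattern_a   <- rep(mean(y), length(y))
deviation_a <- y - pattern_a

# Partition B: the pattern is a straight line fitted by least squares.
fit         <- lm(y ~ x)
pattern_b   <- unname(fitted(fit))
deviation_b <- unname(residuals(fit))

# In both cases: data = pattern + deviations
all.equal(pattern_a + deviation_a, y)
all.equal(pattern_b + deviation_b, y)

# The two partitions leave different amounts of "rough" behind.
c(mean_only = sum(deviation_a^2), line = sum(deviation_b^2))
```

Both partitions reproduce the data exactly; they differ only in how much is attributed to the pattern and how much to the deviations, which is why the choice between them is a matter of judgement and iteration.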
Our tools for EDA
– dplyr: selecting, filtering, summarising data
– ggplot2: visualising data, patterns, trends.

Data selection with dplyr
dplyr is made up of 5 verbs that operate on a table of observations (rows) and variables (columns):

               Variable A   (...)   Variable v
Observation 1  Value 1A     (...)   Value 1v
Observation 2  Value 2A     (...)   Value 2v
(...)          (...)        (...)   (...)
Observation o  Value oA     (...)   Value ov

(1) select variables
(2) filter on values
(3) arrange rows
(4) mutate: create new variables
(5) summarize over values

“Sentences” in dplyr
General format: verb(data frame, parameters)
– The result is a new data frame: new_frame <- verb(data, parameters).
Examples (flights is the data frame from the nycflights13 package):
– filter(flights, month == 1, day == 1)
– arrange(flights, year, month, day)
– select(flights, year, month, day)
– mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60)
– summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
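
The verb(data frame, parameters) pattern can be tried without downloading any data. The sketch below assumes the dplyr package is installed and uses a small made-up data frame, `trips`, whose columns and values are hypothetical stand-ins for a real data set.

```r
library(dplyr)

# Hypothetical toy data; units (km, minutes) are assumptions for illustration.
trips <- data.frame(
  id       = 1:5,
  month    = c(1, 1, 2, 2, 3),
  distance = c(100, 250, 80, 400, 150),
  minutes  = c(60, 120, 50, 200, 90)
)

sel        <- select(trips, id, distance)                    # (1) keep two variables
jan        <- filter(trips, month == 1)                      # (2) keep rows by value
by_dist    <- arrange(trips, desc(distance))                 # (3) reorder rows
with_speed <- mutate(trips, speed = distance / minutes * 60) # (4) add a variable
overall    <- summarize(trips, mean_distance = mean(distance)) # (5) collapse to one row

# Each verb returned a new data frame; the original is unchanged.
names(trips)
```

Because every verb returns a new data frame rather than modifying its input, intermediate results can be inspected one at a time, which suits the incremental style of exploration.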

Boolean operations are supported for filtering and selecting
! is “not”, | is “or”, & is “and”
These two return the same observations:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
For more on these commands, see for instance https://www.youtube.com/watch?v=aywFompr1F4
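
The equivalence of the two filters can be checked on a small example. This is a sketch assuming dplyr is installed; the data frame `delays` and its values are made up, and a missing value is included on purpose because filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical delays in minutes; flight D has a missing arrival delay.
delays <- data.frame(
  flight    = c("A", "B", "C", "D", "E"),
  arr_delay = c(10, 130, 90, NA, 200),
  dep_delay = c(5, 20, 150, 30, 180)
)

# "not (late arrival or late departure)"
not_or <- filter(delays, !(arr_delay > 120 | dep_delay > 120))

# "on-time arrival and on-time departure"; a comma between conditions means "and"
and_and <- filter(delays, arr_delay <= 120, dep_delay <= 120)

identical(not_or$flight, and_and$flight)
```

Both calls keep only flight A here. Flight D is dropped by both, since a comparison with NA yields NA and filter() keeps only rows where the condition is TRUE.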
Workbook
– The rest of this module is mainly in the workbook.

Editor's Notes

  1. https://en.wikipedia.org/wiki/Anscombe's_quartet One reason for this is that many statistics are very sensitive to outliers. See in particular datasets 3 and 4.
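
The sensitivity to outliers mentioned in this note can be illustrated with dataset 3. The sketch assumes base R's built-in `anscombe` data frame; treating the point with the largest absolute residual as "the outlier" is a simplification chosen for this example, not a general rule.

```r
x <- anscombe$x3
y <- anscombe$y3

# Locate the point that deviates most from the least-squares line.
fit     <- lm(y ~ x)
outlier <- which.max(abs(residuals(fit)))

# Correlation with and without that single point
cor_all     <- cor(x, y)
cor_trimmed <- cor(x[-outlier], y[-outlier])
round(c(all_points = cor_all, without_outlier = cor_trimmed), 3)

# The mean is pulled by the outlier more than the median is.
c(mean = mean(y), median = median(y))
```

A single observation out of eleven changes the correlation from a moderate value to an almost perfect one, which is the kind of effect a summary statistic on its own cannot reveal.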