Data exploration, also known as exploratory data analysis (EDA), is the initial step in the data analysis process. It involves examining and understanding the characteristics, patterns, and relationships within a dataset to gain insights and guide further analysis. Data exploration helps in identifying trends, outliers, missing values, and potential problems in the data.
Here are some key aspects of data exploration:
Descriptive Statistics: Descriptive statistics summarize and describe the main characteristics of the dataset. This includes measures such as mean, median, mode, standard deviation, minimum, maximum, and quartiles. Descriptive statistics provide a high-level overview of the data distribution, central tendencies, and variability.
Data Visualization: Data visualization techniques, such as plots, charts, and graphs, help in visually representing the data to identify patterns, trends, and relationships. Common visualizations include histograms, scatter plots, box plots, bar charts, and heatmaps. Visualization allows for a better understanding of the data and can reveal hidden insights that may not be apparent from raw numbers alone.
Data Cleaning and Preprocessing: Data exploration involves assessing the quality and integrity of the dataset. It includes handling missing values, dealing with outliers, correcting data inconsistencies, and addressing other data quality issues. Data cleaning and preprocessing ensure that the dataset is suitable for further analysis.
Feature Exploration: In data exploration, it is important to understand the individual features (variables) within the dataset. This includes examining the distributions, ranges, and potential relationships between features. It can involve analyzing continuous variables, categorical variables, or a combination of both.
Correlation Analysis: Data exploration often involves exploring the correlations between variables. Correlation analysis helps identify relationships and dependencies between different features in the dataset. It provides insights into which variables may be related and can guide further analysis or modeling.
Hypothesis Generation: During data exploration, researchers or analysts may generate hypotheses or formulate initial assumptions about the data. These hypotheses can guide further analysis and provide a starting point for hypothesis testing or more in-depth investigations.
Data exploration is an iterative process that informs subsequent analysis steps, such as modeling, hypothesis testing, and predictive analytics. It helps in identifying important variables, selecting appropriate analysis techniques, and ensuring the validity and reliability of the data.
Exploratory data analysis is crucial in understanding the characteristics of the dataset, uncovering insights, and making informed decisions based on the data. It allows researchers, analysts, and data scientists to gain a comprehensive understanding of the data and formulate appropriate strategies
Blooming Together_ Growing a Community Garden Worksheet.docx
03-Data-Exploration.pptx
1. Chapter 03
Data Exploration
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
2. Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
3. Goal of Data Exploration
• Goal:
• Understand the basic characteristics of the data
• Examples for characteristics:
• Structure
• Size
• Completeness
• Relationships
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
4. Methods for Data Exploration
• Usually interactive and semi-automated
• Text editors, system calls (head/more/less), etc. to look at raw
data directly
• Helps to understand the structure
• Statistics and visualizations to learn about distributions and
relationships
• Exploration should also include meta data
• Feature names, trace links, etc.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
5. Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
6. Descriptive Statistics
• Summarize data through single value
• Do not predict anything about the data ( inductive statistics)
• Common statistics covered in this course
• Central tendency (mean/median/mode)
• Variability (standard deviation, interquartile range)
• Range of data (min/max)
• Other important statistics
• Kurtosis and skewness for the shape of distributions
• More measures for central tendency, e.g., trimmed means, harmonic
mean
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
7. Central Tendency
• „Typical“ value of the data
• Arithmetic mean
• 𝑚𝑒𝑎𝑛 𝑥 =
1
𝑛 𝑖=1
𝑛
𝑥𝑖 with 𝑥 = 𝑥1, … , 𝑥𝑛 ∈ ℝ𝑛
• Median
• The value that separates the higher half from the data of the lower half
• Mode
• The value that appears most in the data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
8. Variability
• Measure for the spread of the data
• Also called dispersion
• Standard deviation
• Measure for the difference of observation to the arithmetic mean
• 𝑠𝑑 𝑥 = 𝑖=1
𝑛 𝑥𝑖−𝑚𝑒𝑎𝑛 𝑥
2
𝑛−1
• Interquartile Range (IQR)
• Percentile: value below which a given percentage falls
• Difference between the 75% percentile and the 25% percentile
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
The median is the 50% percentile
9. Range of data
• Range for which values are observed
• Can be infinite!
• Minimum
• Smallest observed value
• Maximum
• Largest observed value
• May be strongly distorted by invalid data
• Makes it also a good tool to discover invalid data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
10. Example
• Random typing on the keypad
• 𝑥 =
(1,2,1,1,3,4,5,2,3,4,5,1,3,2,1,6,5,4,9,4,3,6,1,5,6,8,4,6,5,1,3,2,1,6,8,7,6,1,3,1,6,8,4,7,6,4,3,5,4,9,7,4,3,1,4,6,8,7,9,1,4,6,1,3,8,6,7,4,9,6,5,1,3,6,8,7)
• central tendency:
• mean: 4.46052631579
• median: 4.0
• mode (count): 1 (14)
• variability
• sd: 2.41944311488
• IQR: 3.0
• range
• min: 1
• max: 9
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
11. Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
12. A Picture Says More than 1000 Words
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Numbers are made up and pie charts should actually be avoided
13. DescriptiveDeceptive Statistics
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
i
X y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68
ii
x y
10.00 9.14
8.00 8.14
13.00 8.74
9.00 8.77
11.00 9.26
14.00 8.10
6.00 6.13
4.00 3.10
12.00 9.13
7.00 7.26
5.00 4.74
iii
x y
10.00 7.46
8.00 6.77
13.00 12.74
9.00 7.11
11.00 7.81
14.00 8.84
6.00 6.08
4.00 5.39
12.00 8.15
7.00 6.42
5.00 5.73
iv
x y
8.00 6.58
8.00 5.76
8.00 7.71
8.00 8.84
8.00 8.47
8.00 7.04
8.00 5.25
19.00 12.50
8.00 5.56
8.00 7.91
8.00 6.89
Have the same
• Mean
• standard
deviation
• correlation
between x and y
• linear regression
Anscombe‘s Quartet
14. Exploring Single Features
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Plots of the Boston house prices data set
http://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Extremley skewed
Mixture of two
normals after
taking the
logarithm
Looks like an artificially high value
Groups all higher incomes
15. Boxplots
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Median
75%
percentile
25%
percentile
Range of
data except
outliers
Outlier
The outlier definition can change. We
used „more than 1.5 times the IQR
away from the 25%/75% percentile.”
You should always check this in the
package you use.
16. Pairwise Scatterplots with Regressions
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
No correlation
visible
Strong linear
correlation
Histogram of data in
the column
17. Pairwise Plots with Classes
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Good separation
of blue, but green
and orange are
overlapping
Good
separation of
all three
classes
Density plots of
data in the
column
separated by
classes
18. Correlation Heatmap
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Colors show
strength of
correlation
Correlation
between
premiums and
losses
Correlation
between reasons
for accidents
There are different correlation
coefficients. We used Pearson‘s
coefficient, which measures linear
correlations.
19. Hexbin Plots for Many Instances
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Cannot see
structure due to
amount of data
Hexagonal bins
reveal the structure
20. Line Plots for Timeseries
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Linear trend
Regular noise pattern
Seasonal?
Range of values
21. Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
22. Summary
• Important to understand the data available
• Summary statistics provide a good overview
• Can be deceptive!
• Visualization is a powerful way to understand data
• Understanding of meta data and how domain experts
understand data equally important!
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science