SlideShare a Scribd company logo
1 of 22
Chapter 03
Data Exploration
Dr. Steffen Herbold
herbold@cs.uni-goettingen.de
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Goal of Data Exploration
• Goal:
• Understand the basic characteristics of the data
• Examples for characteristics:
• Structure
• Size
• Completeness
• Relationships
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Methods for Data Exploration
• Usually interactive and semi-automated
• Text editors, system calls (head/more/less), etc. to look at raw
data directly
• Helps to understand the structure
• Statistics and visualizations to learn about distributions and
relationships
• Exploration should also include meta data
• Feature names, trace links, etc.
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Descriptive Statistics
• Summarize data through single value
• Do not predict anything about the data ( inductive statistics)
• Common statistics covered in this course
• Central tendency (mean/median/mode)
• Variability (standard deviation, interquartile range)
• Range of data (min/max)
• Other important statistics
• Kurtosis and skewness for the shape of distributions
• More measures for central tendency, e.g., trimmed means, harmonic
mean
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Central Tendency
• „Typical“ value of the data
• Arithmetic mean
• 𝑚𝑒𝑎𝑛 𝑥 =
1
𝑛 𝑖=1
𝑛
𝑥𝑖 with 𝑥 = 𝑥1, … , 𝑥𝑛 ∈ ℝ𝑛
• Median
• The value that separates the higher half from the data of the lower half
• Mode
• The value that appears most in the data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Variability
• Measure for the spread of the data
• Also called dispersion
• Standard deviation
• Measure for the difference of observation to the arithmetic mean
• 𝑠𝑑 𝑥 = 𝑖=1
𝑛 𝑥𝑖−𝑚𝑒𝑎𝑛 𝑥
2
𝑛−1
• Interquartile Range (IQR)
• Percentile: value below which a given percentage falls
• Difference between the 75% percentile and the 25% percentile
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
The median is the 50% percentile
Range of data
• Range for which values are observed
• Can be infinite!
• Minimum
• Smallest observed value
• Maximum
• Largest observed value
• May be strongly distorted by invalid data
• Makes it also a good tool to discover invalid data
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Example
• Random typing on the keypad
• 𝑥 =
(1,2,1,1,3,4,5,2,3,4,5,1,3,2,1,6,5,4,9,4,3,6,1,5,6,8,4,6,5,1,3,2,1,6,8,7,6,1,3,1,6,8,4,7,6,4,3,5,4,9,7,4,3,1,4,6,8,7,9,1,4,6,1,3,8,6,7,4,9,6,5,1,3,6,8,7)
• central tendency:
• mean: 4.46052631579
• median: 4.0
• mode (count): 1 (14)
• variability
• sd: 2.41944311488
• IQR: 3.0
• range
• min: 1
• max: 9
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
A Picture Says More than 1000 Words
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Numbers are made up and pie charts should actually be avoided
DescriptiveDeceptive Statistics
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
i
X y
10.00 8.04
8.00 6.95
13.00 7.58
9.00 8.81
11.00 8.33
14.00 9.96
6.00 7.24
4.00 4.26
12.00 10.84
7.00 4.82
5.00 5.68
ii
x y
10.00 9.14
8.00 8.14
13.00 8.74
9.00 8.77
11.00 9.26
14.00 8.10
6.00 6.13
4.00 3.10
12.00 9.13
7.00 7.26
5.00 4.74
iii
x y
10.00 7.46
8.00 6.77
13.00 12.74
9.00 7.11
11.00 7.81
14.00 8.84
6.00 6.08
4.00 5.39
12.00 8.15
7.00 6.42
5.00 5.73
iv
x y
8.00 6.58
8.00 5.76
8.00 7.71
8.00 8.84
8.00 8.47
8.00 7.04
8.00 5.25
19.00 12.50
8.00 5.56
8.00 7.91
8.00 6.89
Have the same
• Mean
• standard
deviation
• correlation
between x and y
• linear regression
Anscombe‘s Quartet
Exploring Single Features
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Plots of the Boston house prices data set
http://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Extremley skewed
Mixture of two
normals after
taking the
logarithm
Looks like an artificially high value
 Groups all higher incomes
Boxplots
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Median
75%
percentile
25%
percentile
Range of
data except
outliers
Outlier
The outlier definition can change. We
used „more than 1.5 times the IQR
away from the 25%/75% percentile.”
You should always check this in the
package you use.
Pairwise Scatterplots with Regressions
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
No correlation
visible
Strong linear
correlation
Histogram of data in
the column
Pairwise Plots with Classes
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Good separation
of blue, but green
and orange are
overlapping
Good
separation of
all three
classes
Density plots of
data in the
column
separated by
classes
Correlation Heatmap
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Colors show
strength of
correlation
Correlation
between
premiums and
losses
Correlation
between reasons
for accidents
There are different correlation
coefficients. We used Pearson‘s
coefficient, which measures linear
correlations.
Hexbin Plots for Many Instances
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Cannot see
structure due to
amount of data
Hexagonal bins
reveal the structure
Line Plots for Timeseries
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Linear trend
Regular noise pattern 
Seasonal?
Range of values
Outline
• Overview
• Summary Statistics
• Visualization for Data Exploration
• Summary
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science
Summary
• Important to understand the data available
• Summary statistics provide a good overview
• Can be deceptive!
• Visualization is a powerful way to understand data
• Understanding of meta data and how domain experts
understand data equally important!
Introduction to Data Science
https://sherbold.github.io/intro-to-data-science

More Related Content

Similar to 03-Data-Exploration.pptx

Data Science 101
Data Science 101Data Science 101
Data Science 101ideatoipo
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Dmitry Grapov
 
grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 adrianheilbut
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesMark Kromer
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptxPallabiSahoo5
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisDmitry Grapov
 
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsElena Sügis
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...tboubez
 
Esoteric Data structures
Esoteric Data structures Esoteric Data structures
Esoteric Data structures Mugisha Moses
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesElsevier
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsLeighton Pritchard
 
Statistis, Row Counts, Execution Plans and Query Tuning
Statistis, Row Counts, Execution Plans and Query TuningStatistis, Row Counts, Execution Plans and Query Tuning
Statistis, Row Counts, Execution Plans and Query TuningGrant Fritchey
 
Machine Learning From Raw Data To The Predictions
Machine Learning From Raw Data To The PredictionsMachine Learning From Raw Data To The Predictions
Machine Learning From Raw Data To The PredictionsLuca Zavarella
 

Similar to 03-Data-Exploration.pptx (20)

Statistics
StatisticsStatistics
Statistics
 
Data Science 101
Data Science 101Data Science 101
Data Science 101
 
Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)Metabolomic Data Analysis Workshop and Tutorials (2014)
Metabolomic Data Analysis Workshop and Tutorials (2014)
 
grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013 grizzly - informal overview - pydata boston 2013
grizzly - informal overview - pydata boston 2013
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptx
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Multivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysisMultivarite and network tools for biological data analysis
Multivarite and network tools for biological data analysis
 
Daming
DamingDaming
Daming
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
Data preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in BioinformaticsData preprocessing and unsupervised learning methods in Bioinformatics
Data preprocessing and unsupervised learning methods in Bioinformatics
 
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
 
Esoteric Data structures
Esoteric Data structures Esoteric Data structures
Esoteric Data structures
 
Week_2_Lecture.pdf
Week_2_Lecture.pdfWeek_2_Lecture.pdf
Week_2_Lecture.pdf
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Statistis, Row Counts, Execution Plans and Query Tuning
Statistis, Row Counts, Execution Plans and Query TuningStatistis, Row Counts, Execution Plans and Query Tuning
Statistis, Row Counts, Execution Plans and Query Tuning
 
Machine Learning From Raw Data To The Predictions
Machine Learning From Raw Data To The PredictionsMachine Learning From Raw Data To The Predictions
Machine Learning From Raw Data To The Predictions
 
Data visualization
Data visualizationData visualization
Data visualization
 

More from Shree Shree

More from Shree Shree (20)

10 rectangular options.pptx
10 rectangular options.pptx10 rectangular options.pptx
10 rectangular options.pptx
 
1732002B.pptx
1732002B.pptx1732002B.pptx
1732002B.pptx
 
9D52BDFC.pptx
9D52BDFC.pptx9D52BDFC.pptx
9D52BDFC.pptx
 
865FC50.pptx
865FC50.pptx865FC50.pptx
865FC50.pptx
 
9DE46AD6.pptx
9DE46AD6.pptx9DE46AD6.pptx
9DE46AD6.pptx
 
20DE6DE0.pptx
20DE6DE0.pptx20DE6DE0.pptx
20DE6DE0.pptx
 
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
2108-EN-TP-PhenomEssentialTalentAcquisitionToolkit.pdf
 
2C0A56C5.pptx
2C0A56C5.pptx2C0A56C5.pptx
2C0A56C5.pptx
 
1EF5E174.pptx
1EF5E174.pptx1EF5E174.pptx
1EF5E174.pptx
 
1BB40E18.pptx
1BB40E18.pptx1BB40E18.pptx
1BB40E18.pptx
 
8D0ECBB0.pptx
8D0ECBB0.pptx8D0ECBB0.pptx
8D0ECBB0.pptx
 
D1586BA6.pptx
D1586BA6.pptxD1586BA6.pptx
D1586BA6.pptx
 
FB09958.pptx
FB09958.pptxFB09958.pptx
FB09958.pptx
 
13072EF.pptx
13072EF.pptx13072EF.pptx
13072EF.pptx
 
E65B78F3.pptx
E65B78F3.pptxE65B78F3.pptx
E65B78F3.pptx
 
355C162.pptx
355C162.pptx355C162.pptx
355C162.pptx
 
795589F6.pptx
795589F6.pptx795589F6.pptx
795589F6.pptx
 
AD177DF3.pptx
AD177DF3.pptxAD177DF3.pptx
AD177DF3.pptx
 
961F1B3A.pptx
961F1B3A.pptx961F1B3A.pptx
961F1B3A.pptx
 
5F2D461.pptx
5F2D461.pptx5F2D461.pptx
5F2D461.pptx
 

Recently uploaded

Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerunnathinaik
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxAnaBeatriceAblay2
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxRaymartEstabillo3
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxUnboundStockton
 

Recently uploaded (20)

Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
internship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developerinternship ppt on smartinternz platform as salesforce developer
internship ppt on smartinternz platform as salesforce developer
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptxENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
ENGLISH5 QUARTER4 MODULE1 WEEK1-3 How Visual and Multimedia Elements.pptx
 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Blooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docxBlooming Together_ Growing a Community Garden Worksheet.docx
Blooming Together_ Growing a Community Garden Worksheet.docx
 

03-Data-Exploration.pptx

  • 1. Chapter 03 Data Exploration Dr. Steffen Herbold herbold@cs.uni-goettingen.de Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 2. Outline • Overview • Summary Statistics • Visualization for Data Exploration • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 3. Goal of Data Exploration • Goal: • Understand the basic characteristics of the data • Examples for characteristics: • Structure • Size • Completeness • Relationships Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 4. Methods for Data Exploration • Usually interactive and semi-automated • Text editors, system calls (head/more/less), etc. to look at raw data directly • Helps to understand the structure • Statistics and visualizations to learn about distributions and relationships • Exploration should also include meta data • Feature names, trace links, etc. Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 5. Outline • Overview • Summary Statistics • Visualization for Data Exploration • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 6. Descriptive Statistics • Summarize data through single value • Do not predict anything about the data ( inductive statistics) • Common statistics covered in this course • Central tendency (mean/median/mode) • Variability (standard deviation, interquartile range) • Range of data (min/max) • Other important statistics • Kurtosis and skewness for the shape of distributions • More measures for central tendency, e.g., trimmed means, harmonic mean Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 7. Central Tendency • „Typical“ value of the data • Arithmetic mean • 𝑚𝑒𝑎𝑛 𝑥 = 1 𝑛 𝑖=1 𝑛 𝑥𝑖 with 𝑥 = 𝑥1, … , 𝑥𝑛 ∈ ℝ𝑛 • Median • The value that separates the higher half from the data of the lower half • Mode • The value that appears most in the data Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 8. Variability • Measure for the spread of the data • Also called dispersion • Standard deviation • Measure for the difference of observation to the arithmetic mean • 𝑠𝑑 𝑥 = 𝑖=1 𝑛 𝑥𝑖−𝑚𝑒𝑎𝑛 𝑥 2 𝑛−1 • Interquartile Range (IQR) • Percentile: value below which a given percentage falls • Difference between the 75% percentile and the 25% percentile Introduction to Data Science https://sherbold.github.io/intro-to-data-science The median is the 50% percentile
  • 9. Range of data • Range for which values are observed • Can be infinite! • Minimum • Smallest observed value • Maximum • Largest observed value • May be strongly distorted by invalid data • Makes it also a good tool to discover invalid data Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 10. Example • Random typing on the keypad • 𝑥 = (1,2,1,1,3,4,5,2,3,4,5,1,3,2,1,6,5,4,9,4,3,6,1,5,6,8,4,6,5,1,3,2,1,6,8,7,6,1,3,1,6,8,4,7,6,4,3,5,4,9,7,4,3,1,4,6,8,7,9,1,4,6,1,3,8,6,7,4,9,6,5,1,3,6,8,7) • central tendency: • mean: 4.46052631579 • median: 4.0 • mode (count): 1 (14) • variability • sd: 2.41944311488 • IQR: 3.0 • range • min: 1 • max: 9 Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 11. Outline • Overview • Summary Statistics • Visualization for Data Exploration • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 12. A Picture Says More than 1000 Words Introduction to Data Science https://sherbold.github.io/intro-to-data-science Numbers are made up and pie charts should actually be avoided
  • 13. DescriptiveDeceptive Statistics Introduction to Data Science https://sherbold.github.io/intro-to-data-science i X y 10.00 8.04 8.00 6.95 13.00 7.58 9.00 8.81 11.00 8.33 14.00 9.96 6.00 7.24 4.00 4.26 12.00 10.84 7.00 4.82 5.00 5.68 ii x y 10.00 9.14 8.00 8.14 13.00 8.74 9.00 8.77 11.00 9.26 14.00 8.10 6.00 6.13 4.00 3.10 12.00 9.13 7.00 7.26 5.00 4.74 iii x y 10.00 7.46 8.00 6.77 13.00 12.74 9.00 7.11 11.00 7.81 14.00 8.84 6.00 6.08 4.00 5.39 12.00 8.15 7.00 6.42 5.00 5.73 iv x y 8.00 6.58 8.00 5.76 8.00 7.71 8.00 8.84 8.00 8.47 8.00 7.04 8.00 5.25 19.00 12.50 8.00 5.56 8.00 7.91 8.00 6.89 Have the same • Mean • standard deviation • correlation between x and y • linear regression Anscombe‘s Quartet
  • 14. Exploring Single Features Introduction to Data Science https://sherbold.github.io/intro-to-data-science Plots of the Boston house prices data set http://archive.ics.uci.edu/ml/machine-learning-databases/housing/ Extremley skewed Mixture of two normals after taking the logarithm Looks like an artificially high value  Groups all higher incomes
  • 15. Boxplots Introduction to Data Science https://sherbold.github.io/intro-to-data-science Median 75% percentile 25% percentile Range of data except outliers Outlier The outlier definition can change. We used „more than 1.5 times the IQR away from the 25%/75% percentile.” You should always check this in the package you use.
  • 16. Pairwise Scatterplots with Regressions Introduction to Data Science https://sherbold.github.io/intro-to-data-science No correlation visible Strong linear correlation Histogram of data in the column
  • 17. Pairwise Plots with Classes Introduction to Data Science https://sherbold.github.io/intro-to-data-science Good separation of blue, but green and orange are overlapping Good separation of all three classes Density plots of data in the column separated by classes
  • 18. Correlation Heatmap Introduction to Data Science https://sherbold.github.io/intro-to-data-science Colors show strength of correlation Correlation between premiums and losses Correlation between reasons for accidents There are different correlation coefficients. We used Pearson‘s coefficient, which measures linear correlations.
  • 19. Hexbin Plots for Many Instances Introduction to Data Science https://sherbold.github.io/intro-to-data-science Cannot see structure due to amount of data Hexagonal bins reveal the structure
  • 20. Line Plots for Timeseries Introduction to Data Science https://sherbold.github.io/intro-to-data-science Linear trend Regular noise pattern  Seasonal? Range of values
  • 21. Outline • Overview • Summary Statistics • Visualization for Data Exploration • Summary Introduction to Data Science https://sherbold.github.io/intro-to-data-science
  • 22. Summary • Important to understand the data available • Summary statistics provide a good overview • Can be deceptive! • Visualization is a powerful way to understand data • Understanding of meta data and how domain experts understand data equally important! Introduction to Data Science https://sherbold.github.io/intro-to-data-science