Week 04: Exploratory Data Analysis
ICT 706 Data Analytics
Learning Outcomes
• Understand key elements in exploratory data analysis (EDA)
• Explain and use common summary statistics for EDA
• Plot and explain common graphs and charts for EDA
Data Analytics
• What is Data Analytics?
• Study data to answer questions
• Why?
• More organizations are collecting more data than ever before
• Business, Government, Healthcare, Charity – everything!
• This data holds many insights into their operations and beyond
• Data Analytics is required to extract these insights
• Requires analytical, statistical and computational skills
To Answer Key Questions
• What movies (or books) would customers like to watch (or read)?
• What movies should we order from the studio, and how many?
• Who are our best customers?
• When is there a flu epidemic in a region of the country?
• Which customers are most likely not to have an accident?
• When is a customer likely to jump ship and go to a competitor?
• What is the one item you want to have in your store in case of a hurricane?
• What is the one thing that will improve a lawyer’s chance of winning a case?
• What are some questions one can answer with a loyalty card?
• What is the number one reason for the success of a cricket player?
• … …
Data Analytic Tasks
• Exploratory Data Analysis (EDA): our focus this week!
• A preliminary exploration of the data to better understand its characteristics
• Descriptive Modelling
• Find human-interpretable patterns that describe the data. E.g.
• Association Rule Discovery
• Clustering
• Predictive Modelling
• Use some attributes to predict unknown or future values of other attributes. E.g.
• Classification
• Regression
EDA
• If you are going to find out anything about a data set, you must first understand the data
• but most interesting data sets are too big to inspect manually
• EDA:
• Helps to select the right tool for preprocessing or analysis
• Makes use of humans’ abilities to recognize patterns
• Focuses on:
• Summary statistics
• Visualization
• Clustering and anomaly detection
1. Summary Statistics
• Getting an overall sense of the data set
• Numbers that summarize properties of the data
• Frequency
• frequency, mode, percentiles
• Location
• Min & Max, Mean & Median
• Spread
• Variance & Standard Deviation, Distribution
[Image-only slides 9 to 24. Surviving slide titles: "Credit card company"; "Mean or Median?"; "Question: Mean or Median?"]
Frequency and Mode
• Frequency: the number of times the value occurs
• Relative frequency: the proportion (often shown as a percentage) of observations that take the value
• Mode: the most frequent attribute value
• E.g. the mode of (3, 5, 3, 10, 4) is 3
• Frequency and mode are typically used with categorical data
Category            Apple  Orange  Melon  Kiwi  Plum  Total
Frequency              10      15      8    10     7     50
Relative Frequency    20%     30%    16%   20%   14%   100%
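The three definitions above can be sketched in a few lines of Python using only the standard library. The fruit counts come from the table; the variable names are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Fruit counts from the table above, expanded into one entry per observation.
counts = {"Apple": 10, "Orange": 15, "Melon": 8, "Kiwi": 10, "Plum": 7}
observations = [fruit for fruit, n in counts.items() for _ in range(n)]

frequency = Counter(observations)                # value -> number of occurrences
total = sum(frequency.values())                  # total number of observations
relative = {k: v / total for k, v in frequency.items()}  # proportions in [0, 1]
mode, mode_count = frequency.most_common(1)[0]   # most frequent value and its count

for fruit in counts:
    print(f"{fruit:<7} {frequency[fruit]:>3} {relative[fruit]:>5.0%}")
print("Mode:", mode)
```

`Counter.most_common(1)` returns only one value even when several values tie for the highest count, so a tie check is worth adding for multi-modal data.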
Percentiles
• For continuous data, a percentile is more useful.
• Given an ordinal or continuous attribute x, its p-th percentile (p in [0, 100]) is a value x_p such that p% of the observed values are less than x_p.
• E.g. the 20th percentile is the value below which 20% of the observed values are found.
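A minimal sketch of the percentile idea, assuming the nearest-rank convention (one of several conventions in use; the slides do not say which one they intend). The function names and the sample list are illustrative.

```python
import math

def percent_below(values, x):
    """Percentage of observed values that are strictly less than x."""
    return 100 * sum(v < x for v in values) / len(values)

def nearest_rank_percentile(values, p):
    """Smallest observed value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]
print(nearest_rank_percentile(data, 20))
print(percent_below(data, 6))
```

Statistical libraries usually interpolate between neighbouring values instead, so their percentiles can differ slightly from the nearest-rank result on small data sets.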
Quartiles
• Values that divide a list of numbers into quarters Q1, Q2, Q3
• These are the 25th, 50th, and 75th percentiles
• Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
• Put the list of numbers in order
• Then cut the list into four equal parts
• If a quartile is half way, use the average
Quartile 1 (Q1) = 3
Quartile 2 (Q2) = 5.5
Quartile 3 (Q3) = 7
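The procedure above (sort, split into halves, take medians, average when a cut falls between two values) can be sketched as follows. The function name `quartiles` is an assumption made for this example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # values below the median position
    upper = ordered[(n + 1) // 2 :]    # values above it (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([1, 3, 3, 4, 5, 6, 6, 7, 8, 8])
print(q1, q2, q3)
```

Library routines such as `statistics.quantiles` interpolate differently and may return slightly different Q1 and Q3 for the same list, so it is worth stating which method was used when reporting quartiles.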
Mean and Median
• Measure of the locations of a set of points
• Mean
• average of all values
• very sensitive to outliers
• Median
• Q2 or 50th percentile: the middle value of the ordered set of values
• median(1, 20, 25, 30, 31) = ?
• median(1, 2, 5, 92) = ?
$\operatorname{mean}(1, 2, 5, 92) = \frac{1 + 2 + 5 + 92}{4} = 25$
Extreme example
• Income in small town of 6 people
• $25,000 $27,000 $29,000
• $35,000 $37,000 $38,000
• Mean is about $31,833 and median is $32,000
• Bill Gates moves to town
• $25,000 $27,000 $29,000
• $35,000 $37,000 $38,000 $40,000,000
• Mean is $5,741,571 and median is $35,000
• The mean is pulled by the outlier while the median is not. The median is
a better measure of center for data with extreme values.
• E.g. house prices, incomes, student exam marks, ...
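The income example can be checked with Python's standard `statistics` module. This is only a sketch of the comparison; the variable names are illustrative.

```python
from statistics import mean, median

incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_outlier = incomes + [40_000_000]   # the newcomer with an extreme income

for label, data in [("before", incomes), ("after", with_outlier)]:
    print(f"{label}: mean = {mean(data):,.0f}, median = {median(data):,.0f}")
```

A single extreme value moves the mean by millions, while the median only shifts to the next value in the ordered list.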
Range and Variance
• Measures of data spread
• Range: the difference between the max and min
• (1, 2, 5, 70, 92): range = 92 – 1 = 91
• Variance: The average of the squared differences from the Mean
• Standard deviation: the square root of Variance
• Sensitive to outliers
• Inter-quartile range: Q3 – Q1
$\operatorname{variance}(x) = s_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
Value   2  3  3  4   Mean = 3
Diff    1  0  0  1
Diff²   1  0  0  1   Variance = 2/4 = 0.5
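The worked example can be reproduced with the standard library. The sketch below assumes the divide-by-n (population) form of the variance, which is what the formula and the 2/4 calculation above use.

```python
from statistics import pvariance, pstdev, variance

values = [2, 3, 3, 4]

data_range = max(values) - min(values)   # max minus min
pop_variance = pvariance(values)         # divides by n, as in the formula above
pop_std_dev = pstdev(values)             # square root of the population variance
sample_variance = variance(values)       # divides by n - 1 instead

print(data_range, pop_variance, pop_std_dev, sample_variance)
```

Many tools default to the n - 1 (sample) form, so their variance for the same four values will be 2/3 rather than 0.5.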
2. Visualization
• Convert data into a visual or tabular format for easier understanding
• One of the most powerful and appealing techniques for data
exploration
• Humans have a well-developed ability to analyze large amounts of
information that is presented visually
• Can detect general patterns and trends
• Can detect outliers and unusual patterns
Visualization
• Data objects, their attributes, and the relationships are translated into
graphical elements such as points, lines, shapes, and colors
• Example:
• Objects are often represented as points
• Their attribute values can be represented as the position of the points or the
characteristics of the points, e.g., color, size, and shape
• If position is used, then the relationships of points, i.e., whether they form
groups or a point is an outlier, are easily perceived.
Visualization Example
• Sea Surface Temperature for July 1982
• Tens of thousands of data points are summarized in a single figure
EDA Strategy
• Examine variables one by one, then look at the relationships among
the different variables
• Start with graphs, then add numerical summaries of specific aspects
of the data
• Be aware of attribute types
• Categorical vs. Numeric
EDA: One Variable
• Summary statistics
• Categorical: frequency table
• Numeric: mean, median, standard deviation, range, etc.
• Graphical displays
• Categorical: bar chart, pie chart, etc.
• Numeric: histogram, boxplot, etc.
• Probability models
• Categorical: Binomial distribution (won’t cover)
• Numeric: Normal curve (won’t cover)
Example Scatter Chart: Age (in years)
Summary Statistics
• Maximum = 84
• Minimum = 2
• Mode = 28 (occurs twice)
• Mean = 33
• Median = 36
Example Scatter Charts 1 and 2: Age (in years), one with SD = 7.6 and the other with SD = 20.4
Frequency Table
Highest Degree Obtained (Class)   Number of CEOs (Frequency)   Proportion (Relative Frequency)
None                                1                            0.04
Bachelors                           7                            0.28
Masters                            11                            0.44
Doctorate / Law                     6                            0.24
Totals                             25                            1.00
Bar Graph
• The bar graph quickly compares the
degrees of the four groups
• The heights of the four bars show the
counts for the four degree categories
Pie Chart
• A pie chart helps us see what part of
the whole each group forms
• To make a pie chart, you must
include all the categories that make
up a whole
• WARNING: pie charts are often a
poor choice – they show only relative
data. A bar chart is often better.
Histogram
• Split data range into equal-sized bins.
• Count the number of points in each bin.
Histograms
The bins are:
3.0 ≤ rate < 4.0
4.0 ≤ rate < 5.0
5.0 ≤ rate < 6.0
6.0 ≤ rate < 7.0
7.0 ≤ rate < 8.0
8.0 ≤ rate < 9.0
9.0 ≤ rate < 10.0
10.0 ≤ rate < 11.0
11.0 ≤ rate < 12.0
12.0 ≤ rate < 13.0
13.0 ≤ rate < 14.0
14.0 ≤ rate < 15.0
Histograms
The bins are:
2.0 ≤ rate < 4.0
4.0 ≤ rate < 6.0
6.0 ≤ rate < 8.0
8.0 ≤ rate < 10.0
10.0 ≤ rate < 12.0
12.0 ≤ rate < 14.0
14.0 ≤ rate < 16.0
16.0 ≤ rate < 18.0
Histograms
• Where did the bins come from?
• They were chosen rather arbitrarily
• Does choosing other bins change the picture?
• Yes!! And sometimes dramatically
• What do we do about this?
• Some pretty smart people have come up with some “optimal” bin widths
• we will rely on their suggestions
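To see how the choice of bins changes the picture, the sketch below counts the same values into the two bin layouts listed on the earlier slides (width 1 starting at 3.0, and width 2 starting at 2.0). The `rates` values are hypothetical, made up for illustration. Sturges' rule is shown as one well-known bin-count suggestion; the slides do not say which rule they rely on.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - start) / width)
        if 0 <= i < n_bins:          # values outside the binned range are ignored
            counts[i] += 1
    return counts

def sturges_bins(n):
    """Sturges' rule: suggested number of bins for n observations."""
    return math.ceil(math.log2(n)) + 1

rates = [3.2, 4.5, 4.8, 5.1, 5.5, 5.9, 6.3, 7.7, 9.0, 14.2]  # hypothetical data

narrow = bin_counts(rates, start=3.0, width=1.0, n_bins=12)  # 3.0 <= rate < 15.0
wide = bin_counts(rates, start=2.0, width=2.0, n_bins=8)     # 2.0 <= rate < 18.0

print(narrow)
print(wide)
print(sturges_bins(len(rates)))
```

The narrow bins spread the central values over several small bars, while the wide bins collapse most of them into a single tall bar, which is the effect the slide warns about.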
Frequency Distribution Example 2: distribution for variable AGE (x-axis: Age (years), y-axis: Frequency)
Histogram
• The histogram can illustrate features related to
the distribution of the data, e.g.,
• Shape
• Symmetric, skewed right, skewed left, bell shaped
• presence of outliers
• presence of multiple modes
• "a camel with two humps" is a
common grade distribution for
ICT courses!
Box Plots
• Box Plots
• Show data distribution
• A graph of five numbers (often called the
five number summary)
• Minimum
• 1st quartile
• Median
• 3rd quartile
• Maximum
• Various definitions for min-max
• Min & Max of all data
• 9th and 91st percentile
• 2nd and 98th percentile
• 1.5 × inter-quartile range beyond the quartiles (Q1 − 1.5 × IQR and Q3 + 1.5 × IQR)
Box plot figure labels: outlier; 9th, 25th, 50th, 75th, and 91st percentiles
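A box plot is drawn from a handful of numbers, which can be computed directly. The sketch below assumes the median-of-halves method for the quartiles and the 1.5 inter-quartile range definition of the whisker limits listed above. The data and function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]          # lower half
    upper = ordered[(n + 1) // 2 :]    # upper half (middle value skipped if n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def iqr_outliers(values):
    """Values lying more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8, 30]   # hypothetical; 30 is a planted extreme value
print(five_number_summary(data))
print(iqr_outliers(data))
```

Under the percentile-based definitions (9th/91st or 2nd/98th), the whiskers would end at those percentiles instead and a different set of points could be drawn as outliers.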
EDA: Two Variables
• Looking at TWO variables at once helps us to explore relationships
between different attributes.
• sometimes we can see clear relationships and clusters
• next week we will look at clustering in more detail
• this week we focus on exploratory analysis of two variables
• Three combinations of variables
• 2 categorical variables
• Contingency tables
• 1 categorical and 1 numeric variable
• Side-by-side box plots, counts, etc.
• 2 numeric variables
• Scatter plots, correlations, regressions
Contingency Tables
• Show the frequency distribution of one variable in rows and another in
columns
• E.g. a study of sex differences in handedness with 100 samples
• Sex (male, female) and Handedness (right, left)
Contingency table figure, annotated with the marginal totals and the grand total
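A contingency table with its marginal totals and grand total can be assembled as below. The cell counts are illustrative assumptions for a 100-person sample and are not taken from the slide's table.

```python
from collections import Counter

# Illustrative cell counts (assumed, not from the slide): (sex, handedness) -> count
cells = {
    ("male", "right"): 43, ("male", "left"): 9,
    ("female", "right"): 44, ("female", "left"): 4,
}

row_totals = Counter()   # marginal totals per sex
col_totals = Counter()   # marginal totals per handedness
for (sex, hand), n in cells.items():
    row_totals[sex] += n
    col_totals[hand] += n
grand_total = sum(cells.values())

print(f"{'':<8}{'right':>7}{'left':>7}{'total':>7}")
for sex in ("male", "female"):
    print(f"{sex:<8}{cells[(sex, 'right')]:>7}{cells[(sex, 'left')]:>7}{row_totals[sex]:>7}")
print(f"{'total':<8}{col_totals['right']:>7}{col_totals['left']:>7}{grand_total:>7}")
```

The row and column totals must each add up to the grand total, which is a quick consistency check on any contingency table.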
Side-by-side box plots
• Side-by-side box plots are graphical summaries of data when one
variable is categorical and the other numeric
• Used to compare the distributions associated with the numeric variable
across the levels of the categorical variable
Scatter Plots
• One variable on x-axis and another on y-axis
• Attribute values determine the position
• Arrays of scatter plots can compactly summarize the relationships of pairs of
attributes
Describing scatter plots
• Form
• Linear, quadratic, exponential
• Direction
• Positive association
• An increase in one variable is accompanied by an
increase in the other
• Negative association
• A decrease in one variable is accompanied by an
increase in the other
• Strength
• How closely the points follow a clear form
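For a linear form, direction and strength can be summarized with the Pearson correlation coefficient r: its sign gives the direction and its magnitude (between 0 and 1) gives the strength. The sketch below is a hand-written version with hypothetical data; it says nothing about quadratic or exponential forms, which need to be judged from the plot itself.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 8.0, 9.8]      # hypothetical: rises with x
y_down = [9.8, 8.0, 6.2, 3.9, 2.1]    # hypothetical: falls as x rises

print(pearson_r(x, y_up))
print(pearson_r(x, y_down))
```

Values of r near +1 or -1 indicate a strong linear association, and values near 0 indicate a weak one.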
• Form: Linear
• Direction: Positive
• Strength: Strong
Course Reflection (in census week)
• How is your learning journey going?
• Your feedback
• Areas of improvement
51

More Related Content

What's hot (20)

ForecastIT 3. Simple Exponential Smoothing
ForecastIT 3. Simple Exponential SmoothingForecastIT 3. Simple Exponential Smoothing
ForecastIT 3. Simple Exponential Smoothing
 
Data Cleaning Techniques
Data Cleaning TechniquesData Cleaning Techniques
Data Cleaning Techniques
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
Bi assignment
Bi assignmentBi assignment
Bi assignment
 
Descriptive statistics ii
Descriptive statistics iiDescriptive statistics ii
Descriptive statistics ii
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
 
01 intro
01 intro01 intro
01 intro
 
Data Mining: Concepts and Techniques — Chapter 2 —
Data Mining:  Concepts and Techniques — Chapter 2 —Data Mining:  Concepts and Techniques — Chapter 2 —
Data Mining: Concepts and Techniques — Chapter 2 —
 
Multivariate analysis
Multivariate analysisMultivariate analysis
Multivariate analysis
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
R Datatypes
R DatatypesR Datatypes
R Datatypes
 
Descriptive Statistics
Descriptive StatisticsDescriptive Statistics
Descriptive Statistics
 
Data cube
Data cubeData cube
Data cube
 
Data warehousing
Data warehousingData warehousing
Data warehousing
 
Introduction To Analytics
Introduction To AnalyticsIntroduction To Analytics
Introduction To Analytics
 
Data management overview
Data management overviewData management overview
Data management overview
 
Descriptive Statistics and Data Visualization
Descriptive Statistics and Data VisualizationDescriptive Statistics and Data Visualization
Descriptive Statistics and Data Visualization
 
Scatter Plots
Scatter PlotsScatter Plots
Scatter Plots
 
Introduction to data mining technique
Introduction to data mining techniqueIntroduction to data mining technique
Introduction to data mining technique
 
Data analysis
Data analysisData analysis
Data analysis
 

Similar to Exploratory Data Analysis week 4

Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptxgina458018
 
Lecture 4 - Charts and graphs.pptx
Lecture 4 - Charts and graphs.pptxLecture 4 - Charts and graphs.pptx
Lecture 4 - Charts and graphs.pptxaazmeerrahman
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statisticsAmira Talic
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingTony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessingHarry Potter
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptxImXaib
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISBabasID2
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptxjeyanthisivakumar
 
Biostatistics CH Lecture Pack
Biostatistics CH Lecture PackBiostatistics CH Lecture Pack
Biostatistics CH Lecture PackShaun Cochrane
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statisticsAnand Thokal
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptxPallabiSahoo5
 
Statistics for analytics
Statistics for analyticsStatistics for analytics
Statistics for analyticsdeepika4721
 
datamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdfdatamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdfssuser3e6464
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptxCalvinAdorDionisio
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysiskompellark
 

Similar to Exploratory Data Analysis week 4 (20)

Exploring Data (1).pptx
Exploring Data (1).pptxExploring Data (1).pptx
Exploring Data (1).pptx
 
Lecture 4 - Charts and graphs.pptx
Lecture 4 - Charts and graphs.pptxLecture 4 - Charts and graphs.pptx
Lecture 4 - Charts and graphs.pptx
 
Eda sri
Eda sriEda sri
Eda sri
 
Introduction to statistics
Introduction to statisticsIntroduction to statistics
Introduction to statistics
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data_Preparation.pptx
Data_Preparation.pptxData_Preparation.pptx
Data_Preparation.pptx
 
EXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSISEXPLORATORY DATA ANALYSIS
EXPLORATORY DATA ANALYSIS
 
EDA.ppt
EDA.pptEDA.ppt
EDA.ppt
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptx
 
Biostatistics CH Lecture Pack
Biostatistics CH Lecture PackBiostatistics CH Lecture Pack
Biostatistics CH Lecture Pack
 
2. basic statistics
2. basic statistics2. basic statistics
2. basic statistics
 
IV STATISTICS I.pdf
IV STATISTICS I.pdfIV STATISTICS I.pdf
IV STATISTICS I.pdf
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptx
 
Statistics for analytics
Statistics for analyticsStatistics for analytics
Statistics for analytics
 
datamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdfdatamining_LectureTwo(Data Pipeline) 2.pdf
datamining_LectureTwo(Data Pipeline) 2.pdf
 
Intro to Statistics.pptx
Intro to Statistics.pptxIntro to Statistics.pptx
Intro to Statistics.pptx
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptx
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 

More from Manzur Ashraf

2022-03-22 15.25.49.pdf
2022-03-22 15.25.49.pdf2022-03-22 15.25.49.pdf
2022-03-22 15.25.49.pdfManzur Ashraf
 
Dhumlin upohar ramadan
Dhumlin upohar ramadanDhumlin upohar ramadan
Dhumlin upohar ramadanManzur Ashraf
 
Java related basic tutorial
Java related basic tutorialJava related basic tutorial
Java related basic tutorialManzur Ashraf
 
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?Manzur Ashraf
 
Can Science Come Back to Islam? from New Scientist 23 Oct 1980
Can Science Come Back to Islam?  from New Scientist 23 Oct 1980Can Science Come Back to Islam?  from New Scientist 23 Oct 1980
Can Science Come Back to Islam? from New Scientist 23 Oct 1980Manzur Ashraf
 
Kakadu report-web-optimization
Kakadu report-web-optimizationKakadu report-web-optimization
Kakadu report-web-optimizationManzur Ashraf
 
Speed and Performance Optimization Technique of a website
Speed and Performance Optimization Technique of a websiteSpeed and Performance Optimization Technique of a website
Speed and Performance Optimization Technique of a websiteManzur Ashraf
 
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015Manzur Ashraf
 
website status and seo assessment altec 2015
website status and seo assessment altec 2015website status and seo assessment altec 2015
website status and seo assessment altec 2015Manzur Ashraf
 
Hajj: an experience and a guide from Dr Manzur Ashraf
Hajj: an experience and a guide from Dr Manzur AshrafHajj: an experience and a guide from Dr Manzur Ashraf
Hajj: an experience and a guide from Dr Manzur AshrafManzur Ashraf
 
Selective Duas from the book muslim-devotion
Selective Duas from the book muslim-devotionSelective Duas from the book muslim-devotion
Selective Duas from the book muslim-devotionManzur Ashraf
 
SEO Marketing / Competitive Analysis Report (sample)
  SEO Marketing / Competitive Analysis Report (sample)  SEO Marketing / Competitive Analysis Report (sample)
SEO Marketing / Competitive Analysis Report (sample)Manzur Ashraf
 
Characteristics of a Muslim (Bangla)
Characteristics of a Muslim (Bangla)Characteristics of a Muslim (Bangla)
Characteristics of a Muslim (Bangla)Manzur Ashraf
 
Bani israel 14_principles
Bani israel 14_principlesBani israel 14_principles
Bani israel 14_principlesManzur Ashraf
 
Seo basis-seminar-2015
Seo basis-seminar-2015Seo basis-seminar-2015
Seo basis-seminar-2015Manzur Ashraf
 

More from Manzur Ashraf (18)

2022-03-22 15.25.49.pdf
2022-03-22 15.25.49.pdf2022-03-22 15.25.49.pdf
2022-03-22 15.25.49.pdf
 
Dhumlin upohar ramadan
Dhumlin upohar ramadanDhumlin upohar ramadan
Dhumlin upohar ramadan
 
Java related basic tutorial
Java related basic tutorialJava related basic tutorial
Java related basic tutorial
 
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?
‘WHAT ARE JAMAAYA, LOYALTY TO LEADERS & THEIR LINK TO RELIGIOUS EXTREMISM?
 
Can Science Come Back to Islam? from New Scientist 23 Oct 1980
Can Science Come Back to Islam?  from New Scientist 23 Oct 1980Can Science Come Back to Islam?  from New Scientist 23 Oct 1980
Can Science Come Back to Islam? from New Scientist 23 Oct 1980
 
Kakadu report-web-optimization
Kakadu report-web-optimizationKakadu report-web-optimization
Kakadu report-web-optimization
 
Speed and Performance Optimization Technique of a website
Speed and Performance Optimization Technique of a websiteSpeed and Performance Optimization Technique of a website
Speed and Performance Optimization Technique of a website
 
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015
SEO for Businesses - BUET ALUMNI Melbourne Chapter Talk Nov 2015
 
website status and seo assessment altec 2015
website status and seo assessment altec 2015website status and seo assessment altec 2015
website status and seo assessment altec 2015
 
Hajj: an experience and a guide from Dr Manzur Ashraf
Hajj: an experience and a guide from Dr Manzur AshrafHajj: an experience and a guide from Dr Manzur Ashraf
Hajj: an experience and a guide from Dr Manzur Ashraf
 
Selective Duas from the book muslim-devotion
Selective Duas from the book muslim-devotionSelective Duas from the book muslim-devotion
Selective Duas from the book muslim-devotion
 
SEO Marketing / Competitive Analysis Report (sample)
  SEO Marketing / Competitive Analysis Report (sample)  SEO Marketing / Competitive Analysis Report (sample)
SEO Marketing / Competitive Analysis Report (sample)
 
Cclc manzur
Cclc manzurCclc manzur
Cclc manzur
 
Characteristics of a Muslim (Bangla)
Characteristics of a Muslim (Bangla)Characteristics of a Muslim (Bangla)
Characteristics of a Muslim (Bangla)
 
Sura Mulk
Sura MulkSura Mulk
Sura Mulk
 
Bani israel 14_principles
Bani israel 14_principlesBani israel 14_principles
Bani israel 14_principles
 
Sura kahaf-2015
Sura kahaf-2015Sura kahaf-2015
Sura kahaf-2015
 
Seo basis-seminar-2015
Seo basis-seminar-2015Seo basis-seminar-2015
Seo basis-seminar-2015
 

Recently uploaded

SBP-Market-Operations and market managment
SBP-Market-Operations and market managmentSBP-Market-Operations and market managment
SBP-Market-Operations and market managmentfactical
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Roomdivyansh0kumar0
 
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthUnveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthShaheen Kumar
 
Classical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithClassical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithAdamYassin2
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdfAdnet Communications
 
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...First NO1 World Amil baba in Faisalabad
 
Attachment Of Assets......................
Attachment Of Assets......................Attachment Of Assets......................
Attachment Of Assets......................AmanBajaj36
 
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一S SDS
 
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxhiddenlevers
 
Quantitative Analysis of Retail Sector Companies
Quantitative Analysis of Retail Sector CompaniesQuantitative Analysis of Retail Sector Companies
Quantitative Analysis of Retail Sector Companiesprashantbhati354
 
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawl
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service AizawlVip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawl
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawlmakika9823
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyTyöeläkeyhtiö Elo
 
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130Suhani Kapoor
 
Andheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot ModelsAndheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot Modelshematsharma006
 
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130  Available With RoomVIP Kolkata Call Girl Serampore 👉 8250192130  Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Roomdivyansh0kumar0
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingAggregage
 
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

SBP-Market-Operations and market managment
SBP-Market-Operations and market managmentSBP-Market-Operations and market managment
SBP-Market-Operations and market managment
 
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With RoomVIP Kolkata Call Girl Jodhpur Park 👉 8250192130  Available With Room
VIP Kolkata Call Girl Jodhpur Park 👉 8250192130 Available With Room
 
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net WorthUnveiling the Top Chartered Accountants in India and Their Staggering Net Worth
Unveiling the Top Chartered Accountants in India and Their Staggering Net Worth
 
Classical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam SmithClassical Theory of Macroeconomics by Adam Smith
Classical Theory of Macroeconomics by Adam Smith
 
20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf20240417-Calibre-April-2024-Investor-Presentation.pdf
20240417-Calibre-April-2024-Investor-Presentation.pdf
 
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...
Authentic No 1 Amil Baba In Pakistan Authentic No 1 Amil Baba In Karachi No 1...
 
Attachment Of Assets......................
Attachment Of Assets......................Attachment Of Assets......................
Attachment Of Assets......................
 
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
(办理学位证)加拿大萨省大学毕业证成绩单原版一比一
 
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptxOAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
OAT_RI_Ep19 WeighingTheRisks_Apr24_TheYellowMetal.pptx
 
Quantitative Analysis of Retail Sector Companies
Quantitative Analysis of Retail Sector CompaniesQuantitative Analysis of Retail Sector Companies
Quantitative Analysis of Retail Sector Companies
 
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawl
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service AizawlVip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawl
Vip B Aizawl Call Girls #9907093804 Contact Number Escorts Service Aizawl
 
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance CompanyInterimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
Interimreport1 January–31 March2024 Elo Mutual Pension Insurance Company
 
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130
VIP Call Girls Service Dilsukhnagar Hyderabad Call +91-8250192130
 
Andheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot ModelsAndheri Call Girls In 9825968104 Mumbai Hot Models
Andheri Call Girls In 9825968104 Mumbai Hot Models
 
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130  Available With RoomVIP Kolkata Call Girl Serampore 👉 8250192130  Available With Room
VIP Kolkata Call Girl Serampore 👉 8250192130 Available With Room
 
Monthly Economic Monitoring of Ukraine No 231, April 2024
Monthly Economic Monitoring of Ukraine No 231, April 2024Monthly Economic Monitoring of Ukraine No 231, April 2024
Monthly Economic Monitoring of Ukraine No 231, April 2024
 
🔝+919953056974 🔝young Delhi Escort service Pusa Road
🔝+919953056974 🔝young Delhi Escort service Pusa Road🔝+919953056974 🔝young Delhi Escort service Pusa Road
🔝+919953056974 🔝young Delhi Escort service Pusa Road
 
Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024Commercial Bank Economic Capsule - April 2024
Commercial Bank Economic Capsule - April 2024
 
How Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of ReportingHow Automation is Driving Efficiency Through the Last Mile of Reporting
How Automation is Driving Efficiency Through the Last Mile of Reporting
 
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in  Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Nand Nagri (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 

Exploratory Data Analysis week 4

  • 1. Week04 : Exploratory Data Analysis ICT 706 Data Analytics
  • 3. Learning Outcomes • Understand key elements in exploratory data analysis (EDA) • Explain and use common summary statistics for EDA • Plot and explain common graphs and charts for EDA
  • 4. Data Analytics • What is Data Analytics? • Study data to answer questions • Why? • More organizations are collecting more data than ever before • Business, Government, Healthcare, Charity – everything! • This data holds many insights into their operations and beyond • Data Analytics is required to extract these insights • Requires analytical, statistical and computational skills
  • 5. To Answer Key Questions • What movies (or books) would customers like to watch (or read)? • What movies should we order from the studio, and how many? • Who are our best customers? • When is there a flu epidemic in a region of the country? • Which customers are most likely not to have an accident? • When is a customer likely to jump ship and go to a competitor? • What is the one item you want to have in your store in case of a hurricane? • What is the one thing that will improve a lawyer’s chance of winning a case? • What are some questions one can answer with a loyalty card? • What is the number one reason for the success of a cricket player? • …
  • 6. Data Analytic Tasks • Exploratory Data Analysis (EDA): our focus this week! • A preliminary exploration of the data to better understand its characteristics • Descriptive Modelling • Find human-interpretable patterns that describe the data, e.g. Association Rule Discovery, Clustering • Predictive Modelling • Use some attributes to predict unknown or future values of other attributes, e.g. Classification, Regression
  • 7. EDA • If you are going to find out anything about a data set you must first understand the data • but most interesting data sets are too big to inspect manually • EDA: • Helps to select the right tool for preprocessing or analysis • Makes use of humans’ abilities to recognize patterns • Focuses on: • Summary statistics • Visualization • Clustering and anomaly detection
  • 8. 1. Summary Statistics • Getting an overall sense of the data set • Numbers that summarize properties of the data • Frequency: frequency, mode, percentiles • Location: min and max, mean and median • Spread: variance and standard deviation, distribution
  • 21. Mean or Median?
  • 22. Question: Mean or Median?
  • 25. Frequency and Mode • Frequency: the number of times the value occurs • Relative frequency: the percentage of observations in which the value occurs • Mode: the most frequent attribute value, e.g. the mode of (3, 5, 3, 10, 4) is 3 • Frequency and mode are typically used with categorical data
      Category             Apple  Orange  Melon  Kiwi  Plum  Total
      Frequency               10      15      8    10     7     50
      Relative Frequency     20%     30%    16%   20%   14%   100%
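The frequency table above can be reproduced in a few lines. This is a minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative and not part of the slides.

```python
from collections import Counter

# Counts copied from the fruit table on the slide.
fruit_counts = Counter({"Apple": 10, "Orange": 15, "Melon": 8, "Kiwi": 10, "Plum": 7})

total = sum(fruit_counts.values())  # number of observations

# Relative frequency = count / total, expressed as a percentage.
relative_frequency = {fruit: 100 * n / total for fruit, n in fruit_counts.items()}

# The mode is the category with the highest frequency.
mode_category, mode_count = fruit_counts.most_common(1)[0]

# Counter can also tally raw observations, e.g. the list from the slide.
values = [3, 5, 3, 10, 4]
value_mode = Counter(values).most_common(1)[0][0]

for fruit, n in fruit_counts.items():
    print(f"{fruit:<8}{n:>4}{relative_frequency[fruit]:>7.0f}%")
print("Mode category:", mode_category)
print("Mode of", values, "is", value_mode)
```

Note that `most_common` breaks ties arbitrarily by insertion order, so a data set with two equally frequent values needs an explicit check if all modes matter.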
  • 26. Percentiles • For continuous data, a percentile is more useful. • Given an ordinal or continuous attribute x and a number p in [0, 100], the pth percentile is a value x_p such that p% of the observed values are less than x_p. • E.g. the 20th percentile is the value below which 20% of all observed values may be found.
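To make the definition concrete, the sketch below checks what percentage of observations fall below a candidate value. The helper `percent_below` and the `scores` data are hypothetical, invented only for illustration.

```python
def percent_below(values, x):
    """Return the percentage of observations strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical exam scores (not from the slides).
scores = [55, 60, 62, 70, 71, 75, 80, 85, 90, 95]

# Two of the ten scores are below 62, so 62 satisfies the definition
# of a 20th percentile for this data set.
print(percent_below(scores, 62))

# Five of the ten scores are below 75, so 75 is a 50th percentile.
print(percent_below(scores, 75))
```

Under this definition a percentile is generally not unique (any value between 71 and 75 also has 50% of the scores below it), which is why statistical software picks one value by an interpolation convention.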
  • 27. Quartiles • Values Q1, Q2, Q3 that divide an ordered list of numbers into quarters • They are the 25th, 50th and 75th percentiles • Example: 1, 3, 3, 4, 5, 6, 6, 7, 8, 8 • Put the list of numbers in order • Then cut the list into four equal parts • If a quartile falls halfway between two values, use their average • Quartile 1 (Q1) = 3, Quartile 2 (Q2) = 5.5, Quartile 3 (Q3) = 7
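The quartile procedure on this slide can be sketched as "median of each half". The function `quartiles` below is an illustrative helper, not a library call, and it assumes the convention that for an odd-length list the middle value is left out of both halves.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)          # put the list in order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the median position
    # For odd n, skip the middle value so both halves have equal size.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]  # example list from the slide
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)
```

Other quartile conventions exist and can give slightly different values on the same data, so results from a statistics package may not match a hand calculation exactly.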
  • 28. Mean and Median • Measures of the location of a set of points • Mean • average of all values • very sensitive to outliers • Median • Q2 or 50th percentile: the value that lies in the middle of the ordered values • median(1, 20, 25, 30, 31) = ? • median(1, 2, 5, 92) = ? • $\mathrm{mean}(1, 2, 5, 92) = \frac{1 + 2 + 5 + 92}{4} = 25$
  • 29. Extreme example • Incomes in a small town of 6 people • $25,000 $27,000 $29,000 • $35,000 $37,000 $38,000 • Mean is about $31,833 and median is $32,000 • Bill Gates moves to town • $25,000 $27,000 $29,000 • $35,000 $37,000 $38,000 $40,000,000 • Mean is $5,741,571 and median is $35,000 • The mean is pulled by the outlier while the median is not. The median is a better measure of center for data with extreme values. • E.g. house prices, incomes, student exam marks, ...
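The income example is easy to check with Python's standard `statistics` module. This sketch uses the figures from the slide; only the variable names are invented.

```python
from statistics import mean, median

# Incomes of the six townspeople, from the slide.
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# The same town after one extreme earner moves in.
with_outlier = incomes + [40_000_000]

for label, data in [("before", incomes), ("after", with_outlier)]:
    print(f"{label}: mean = {mean(data):,.0f}, median = {median(data):,.0f}")
```

The mean jumps from roughly 32 thousand to several million, while the median moves only from the midpoint of 29,000 and 35,000 to 35,000.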
  • 30. Range and Variance • Measures of data spread • Range: the difference between the max and min • (1, 2, 5, 70, 92): range = 92 – 1 = 91 • Variance: the average of the squared differences from the mean • Standard deviation: the square root of the variance • Sensitive to outliers • Inter-quartile range: Q3 – Q1 • $\mathrm{variance}(x) = s_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
      Value     2  3  3  4    Mean = 3
      |Diff|    1  0  0  1
      Diff²     1  0  0  1    Variance = 2/4 = 0.5
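The worked variance example divides by n, which is the population variance. The sketch below recomputes it by hand and with `statistics.pvariance`; note that `statistics.variance` (without the `p`) divides by n − 1 and would give a different number.

```python
from statistics import pstdev, pvariance

values = [2, 3, 3, 4]  # worked example from the slide

# By hand: average of the squared differences from the mean.
mean_x = sum(values) / len(values)
manual_variance = sum((x - mean_x) ** 2 for x in values) / len(values)

# Same quantities from the standard library (population versions).
variance = pvariance(values)
std_dev = pstdev(values)

# Range of the other example on the slide.
spread_example = [1, 2, 5, 70, 92]
value_range = max(spread_example) - min(spread_example)

print("variance:", manual_variance, variance)
print("standard deviation:", round(std_dev, 4))
print("range:", value_range)
```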
  • 31. 2. Visualization • Convert data into a visual or tabular format for easier understanding • One of the most powerful and appealing techniques for data exploration • Humans have a well-developed ability to analyze large amounts of information that is presented visually • Can detect general patterns and trends • Can detect outliers and unusual patterns
  • 32. Visualization • Data objects, their attributes, and the relationships are translated into graphical elements such as points, lines, shapes, and colors • Example: • Objects are often represented as points • Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape • If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
  • 33. Visualization Example • Sea Surface Temperature for July 1982 • Tens of thousands of data points are summarized in a single figure
  • 34. EDA Strategy • Examine variables one by one, then look at the relationships among the different variables • Start with graphs, then add numerical summaries of specific aspects of the data • Be aware of attribute types • Categorical vs. Numeric
  • 35. EDA: One Variable • Summary statistics • Categorical: frequency table • Numeric: mean, median, standard deviation, range, etc. • Graphical displays • Categorical: bar chart, pie chart, etc. • Numeric: histogram, boxplot, etc. • Probability models • Categorical: Binomial distribution (won’t cover) • Numeric: Normal curve (won’t cover)
  • 36. Example Scatter Chart: Age (in years) • Summary statistics: Maximum = 84, Minimum = 2, Mode = 28 (occurs twice), Mean = 33, Median = 36 • [Figures: Example Scatter Chart 1 and Example Scatter Chart 2 of Age (in years); the two SD labels shown are SD = 7.6 and SD = 20.4]
  • 37. Frequency Table
      Highest Degree Obtained (Class)   Number of CEOs (Frequency)   Proportion (Relative Frequency)
      None                                1                            0.04
      Bachelors                           7                            0.28
      Masters                            11                            0.44
      Doctorate / Law                     6                            0.24
      Totals                             25                            1.00
  • 38. Bar Graph • The bar graph quickly compares the degrees of the four groups • The heights of the four bars show the counts for the four degree categories
  • 39. Pie Chart • A pie chart helps us see what part of the whole each group forms • To make a pie chart, you must include all the categories that make up a whole • WARNING: pie charts are often a poor choice – they show only relative data. A bar chart is often better.
  • 40. Histogram • Split data range into equal-sized bins. • Count the number of points in each bin. • The bins have width 1.0: 3.0 ≤ rate < 4.0, 4.0 ≤ rate < 5.0, ..., 14.0 ≤ rate < 15.0 (12 bins)
  • 41. Histograms • The bins have width 1.0: 3.0 ≤ rate < 4.0, 4.0 ≤ rate < 5.0, ..., 14.0 ≤ rate < 15.0 (12 bins)
  • 42. Histograms • The bins have width 2.0: 2.0 ≤ rate < 4.0, 4.0 ≤ rate < 6.0, ..., 16.0 ≤ rate < 18.0 (8 bins)
  • 43. Histograms • Where did the bins come from? • They were chosen rather arbitrarily • Does choosing other bins change the picture? • Yes, and sometimes dramatically • What do we do about this? • Some pretty smart people have come up with some “optimal” bin widths • we will rely on their suggestions
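The effect of bin choice can be seen by counting the same data into the two bin layouts used on the histogram slides. This is a minimal sketch: `bin_counts` is an illustrative helper and the `rates` values are hypothetical, since the slides' underlying data is not reproduced here.

```python
import math

def bin_counts(values, start, width, n_bins):
    """Count values into n_bins half-open bins [lo, lo + width) from start."""
    counts = [0] * n_bins
    for v in values:
        index = math.floor((v - start) / width)
        if 0 <= index < n_bins:       # ignore values outside the binned range
            counts[index] += 1
    return counts

# Hypothetical rates (not the data behind the slides).
rates = [3.2, 4.1, 4.8, 5.5, 5.9, 6.3, 6.4, 7.7, 9.0, 12.6, 14.9]

narrow = bin_counts(rates, start=3.0, width=1.0, n_bins=12)  # 3.0 to 15.0
wide = bin_counts(rates, start=2.0, width=2.0, n_bins=8)     # 2.0 to 18.0

print("width 1.0:", narrow)
print("width 2.0:", wide)
```

With the narrow bins the gaps in the upper range are visible as empty bins; with the wide bins most of them disappear and the data looks like a single peak with a thin tail.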
  • 44. Frequency Distribution Example 2 • [Figure: histogram "Distribution for variable AGE"; x-axis: Age (years), y-axis: Frequency] • The histogram can illustrate features related to the distribution of the data, e.g., • Shape • Symmetric, skewed right, skewed left, bell shaped • presence of outliers • presence of multiple modes • "a camel with two humps" is a common grade distribution for ICT courses!
  • 45. Box Plots • Show data distribution • A graph of five numbers (often called the five number summary) • Minimum • 1st quartile • Median • 3rd quartile • Maximum • Various definitions for min-max: • Min & Max of all data • 9th and 91st percentile • 2nd and 98th percentile • 1.5 × inter-quartile range • [Figure: box plot labelled with an outlier and the 9th, 25th, 50th, 75th and 91st percentiles]
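The five number summary and the 1.5 × inter-quartile range rule can be sketched as follows. The helper name, the median-of-halves quartile convention and the `ages` data are assumptions made for illustration; plotting libraries may use a different quartile convention.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, the middle value is excluded from both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# Hypothetical ages in years (not the data behind the slides).
ages = [2, 21, 25, 28, 28, 33, 36, 38, 41, 45, 84]

summary = five_number_summary(ages)
minimum, q1, med, q3, maximum = summary

# 1.5 x IQR rule: points beyond these fences are drawn as outliers.
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < low_fence or a > high_fence]

print("five number summary:", summary)
print("fences:", low_fence, high_fence)
print("outliers:", outliers)
```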
  • 46. EDA: Two Variables • Looking at TWO variables at once helps us to explore relationships between different attributes. • sometimes we can see clear relationships and clusters • next week we will look at clustering in more detail • this week we focus on exploratory analysis of two variables • Three combinations of variables • 2 categorical variables • Contingency tables • 1 categorical and 1 numeric variable • Side-by-side box plots, counts, etc. • 2 numeric variables • Scatter plots, correlations, regressions
  • 47. Contingency Tables • Show the frequency distribution of one variable in rows and another in columns • E.g. a study of sex differences in handedness with 100 samples • Sex (male, female) and Handedness (right, left) • [Table figure with marginal totals and grand total not reproduced]
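A contingency table is a tally of value pairs plus row, column and grand totals. The sketch below builds one with `collections.Counter`; the cell counts are hypothetical (chosen to sum to 100 samples) because the table from the slide is not reproduced in the text.

```python
from collections import Counter

# Hypothetical (sex, handedness) observations summing to 100 samples.
observations = (
    [("male", "right")] * 43
    + [("male", "left")] * 9
    + [("female", "right")] * 44
    + [("female", "left")] * 4
)

cells = Counter(observations)                            # joint frequencies
row_totals = Counter(sex for sex, _ in observations)     # marginal totals by sex
col_totals = Counter(hand for _, hand in observations)   # marginal totals by hand
grand_total = len(observations)

# Print the table with marginal totals and the grand total.
print(f"{'':<8}{'right':>7}{'left':>7}{'total':>7}")
for sex in ("male", "female"):
    print(f"{sex:<8}{cells[(sex, 'right')]:>7}{cells[(sex, 'left')]:>7}{row_totals[sex]:>7}")
print(f"{'total':<8}{col_totals['right']:>7}{col_totals['left']:>7}{grand_total:>7}")
```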
  • 48. Side-by-side box plots • Side-by-side box plots are graphical summaries of data when one variable is categorical and the other numeric • Used to compare the distributions associated with the numeric variable across the levels of the categorical variable
  • 49. Scatter Plots • One variable on x-axis and another on y-axis • Attribute values determine the position • Arrays of scatter plots can compactly summarize the relationships of pairs of attributes
  • 50. Describing scatter plots • Form • Linear, quadratic, exponential • Direction • Positive association: an increase in one variable is accompanied by an increase in the other • Negative association: an increase in one variable is accompanied by a decrease in the other • Strength • How closely the points follow a clear form • Example plot: Form: Linear • Direction: Positive • Strength: Strong
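For two numeric variables with a roughly linear form, direction and strength are commonly summarized by the Pearson correlation coefficient r: its sign gives the direction and its magnitude (between 0 and 1) the strength. The sketch below computes r from its definition; the `hours` and `marks` data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: study hours and exam marks for six students.
hours = [1, 2, 3, 4, 5, 6]
marks = [52, 55, 61, 64, 70, 73]

r_positive = pearson_r(hours, marks)                   # marks rise with hours
r_negative = pearson_r(hours, list(reversed(marks)))   # marks fall with hours

print(f"positive association: r = {r_positive:.3f}")
print(f"negative association: r = {r_negative:.3f}")
```

Because r only measures linear association, a strong quadratic or exponential pattern can still produce a small r, so the scatter plot should be inspected before the number is trusted.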
  • 51. Course Reflection (in census week) • How is your learning journey going? • Your feedback • Areas of improvement