Data Mining and Data Warehousing
CSE-4107
Md. Manowarul Islam
Assistant Professor, Dept. of CSE
Jagannath University
Md. Manowarul Islam, Dept. Of CSE, JnU
Chapter 2: Data Preprocessing
 Why preprocess the data?
 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary
What is Data Mining?
 Data mining is the use of efficient techniques for the analysis
of very large collections of data and the extraction of useful
and possibly unexpected patterns in data.
 “Data mining is the analysis of (often large) observational data
sets to find unsuspected relationships and to summarize the
data in novel ways that are both understandable and useful to
the data analyst” (Hand, Mannila, Smyth)
 “Data mining is the discovery of models for data” (Rajaraman,
Ullman)
 We can have the following types of models
 Models that explain the data (e.g., a single function)
 Models that predict the future data instances.
 Models that summarize the data
 Models that extract the most prominent features of the data.
Why do we need data mining?
 Really huge amounts of complex data generated from
multiple sources and interconnected in different ways
 Scientific data from different disciplines
 Huge text collections
 Transaction data
 Behavioral data
 Networked data
 All these types of data can be combined in many ways
 We need to analyze this data to extract knowledge
 Knowledge can be used for commercial or scientific
purposes.
 Our solutions should scale to the size of the data
The data analysis pipeline
 Mining is not the only step in the analysis process
 Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning
is required to make sense of the data
 Techniques: Sampling, Dimensionality Reduction, Feature selection.
 Dirty work, but often the most important step of the analysis.
 Post-Processing: Make the data actionable and useful to the user
 Statistical analysis of importance
 Visualization.
 Pre- and Post-processing are often data mining tasks as well
[Figure: pipeline of three stages: Data Preprocessing → Data Mining → Result Post-processing]
Data Quality
 Examples of data quality problems:
 Noise and outliers
 Missing values
 Duplicate data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes
6    No      NULL            60K             No
7    Yes     Divorced        220K            NULL
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No
 Tid 5 (10000K): a mistake or a millionaire?
 Tid 6 and Tid 7: missing values
 Tid 9 (two rows): inconsistent duplicate entries
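Two of the problems marked in the table can be checked mechanically. The following is a minimal sketch, assuming the table has been transcribed by hand into a list of dictionaries with NULL stored as None; the field names and the helpers `find_missing` and `find_duplicate_ids` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Rows transcribed from the Data Quality table; None stands for NULL.
# The field names are illustrative and not part of the original slide.
records = [
    {"tid": 1, "refund": "Yes", "marital": "Single", "income_k": 125, "cheat": "No"},
    {"tid": 2, "refund": "No", "marital": "Married", "income_k": 100, "cheat": "No"},
    {"tid": 3, "refund": "No", "marital": "Single", "income_k": 70, "cheat": "No"},
    {"tid": 4, "refund": "Yes", "marital": "Married", "income_k": 120, "cheat": "No"},
    {"tid": 5, "refund": "No", "marital": "Divorced", "income_k": 10000, "cheat": "Yes"},
    {"tid": 6, "refund": "No", "marital": None, "income_k": 60, "cheat": "No"},
    {"tid": 7, "refund": "Yes", "marital": "Divorced", "income_k": 220, "cheat": None},
    {"tid": 8, "refund": "No", "marital": "Single", "income_k": 85, "cheat": "Yes"},
    {"tid": 9, "refund": "No", "marital": "Married", "income_k": 90, "cheat": "No"},
    {"tid": 9, "refund": "No", "marital": "Single", "income_k": 90, "cheat": "No"},
]


def find_missing(rows):
    """Return (tid, field) pairs whose value is missing (None)."""
    return [(row["tid"], field)
            for row in rows
            for field, value in row.items()
            if value is None]


def find_duplicate_ids(rows):
    """Return the Tid values that appear on more than one row."""
    counts = Counter(row["tid"] for row in rows)
    return sorted(tid for tid, count in counts.items() if count > 1)


print(find_missing(records))        # [(6, 'marital'), (7, 'cheat')]
print(find_duplicate_ids(records))  # [9]
```

The 10000K income passes both checks; deciding whether it is a mistake or a millionaire needs domain knowledge or a separate outlier rule.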
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking
certain attributes of interest, or containing
only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes
or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
 A data warehouse needs consistent integration of quality
data
 Data extraction, cleaning, and transformation make up
the majority of the work of building a data warehouse
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data transformation
 Normalization and aggregation
 Data reduction
 Obtains reduced representation in volume but produces the same
or similar analytical results
 Data discretization
 Part of data reduction but with particular importance, especially
for numerical data
Forms of Data Preprocessing
Mining Data Descriptive Characteristics
 Motivation
 To better understand the data: central tendency, variation
and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Central Tendency
 A measure of central tendency is a value at
the center or middle of a data set.
 Mean, median, mode
Terminology
 Population
 A collection of items of interest in research
 A complete set of things
 A group that you wish to generalize your research to
 An example – All the trees in Battle Park
 Sample
 A subset of a population
 Its size is smaller than the size of the population
 An example – 100 trees randomly selected from Battle Park
Sample vs. Population
[Figure: diagram labeled Population and Sample]
Measures of Central Tendency – Mean
 Mean – Most commonly used measure of central tendency
 Average of all observations
 The sum of all the scores divided by the number of scores
 Note: Assuming that each observation is equally significant
Measures of Central Tendency – Mean
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Measures of Central Tendency – Mean
 Example I
– Data: 8, 4, 2, 6, 10
$\bar{x} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{8 + 4 + 2 + 6 + 10}{5} = 6$
 Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
$\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{9.8 + 10.2 + \dots + 24.5}{10} = 14.38$
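The two mean examples can be reproduced with a few lines of Python. This is a minimal sketch; the function name `mean` and the variable names are chosen here for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    if not values:
        raise ValueError("mean of empty data")
    return sum(values) / len(values)


example_1 = [8, 4, 2, 6, 10]
# Tree diameters in inches from Example II.
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

print(mean(example_1))            # 6.0
print(round(mean(diameters), 2))  # 14.38
```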
Weighted Mean
 We can also calculate a weighted mean using some
weighting factor:
e.g., what is the average income of all
people in cities A, B, and C?
City  Avg. Income  Population
A     $23,000      100,000
B     $20,000      50,000
C     $25,000      150,000
Here, population is the weighting factor and the average income is the
variable of interest
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
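Applied to the three cities, the weighted mean works out as follows. This is a minimal sketch; `weighted_mean` is a name chosen for illustration.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


avg_income = [23_000, 20_000, 25_000]    # cities A, B, C
population = [100_000, 50_000, 150_000]  # weighting factor

print(weighted_mean(avg_income, population))  # 23500.0
```

The unweighted average of the three city figures would be about $22,667; weighting by population pulls the result toward city C, which has the most people.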
Measures of Central Tendency – Median
 Median – This is the value of a variable such that half of
the observations are above and half are below this value,
i.e., this value divides the distribution into two groups of
equal size
 When the number of observations is odd, the median is
simply equal to the middle value
 When the number of observations is even, we take the
median to be the average of the two values in the middle
of the distribution
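Measures of Central Tendency – Median
With the observations sorted so that $x_1 \le x_2 \le \dots \le x_n$:
$\tilde{x} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases}$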
Measures of Central Tendency – Median
 Example I
– Data: 8, 4, 2, 6, 10 (mean: 6)
– Sorted: 2, 4, 6, 8, 10
– Median: 6
 Example II
– Sample: 10 trees randomly selected from Battle Park
– Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
(mean: 14.38)
– Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
– Median: (13.9 + 14.5) / 2 = 14.2
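The odd and even cases of the median translate directly into code. This is a minimal sketch that sorts the data first and then picks the middle value or averages the two middle values; the function name `median` is chosen for illustration.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle values


example_1 = [8, 4, 2, 6, 10]
diameters = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]

print(median(example_1))            # 6
print(round(median(diameters), 2))  # 14.2
```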
Measures of Central Tendency – Median
 For a continuous (grouped) frequency distribution, the median is
calculated with the following formula:
$\text{Median} = L_1 + \frac{N/2 - cf}{f} \times i$
L1 = lower bound of the median class
N = total number of observations
f = frequency of the median class
cf = cumulative frequency of the class preceding the median class
i = class interval (width of the median class)
Measures of Central Tendency – Median
Age Group  Frequency (f)  Cumulative frequency (cf)
0-20       15             15
20-40      32             47
40-60      54             101
60-80      30             131
80-100     19             150
Total      150
Measures of Central Tendency – Median
 Find the value of N/2
 Find the median class (the first class whose cumulative frequency reaches N/2)
and the cumulative frequency cf of the class before it
 Substitute L1, N, f, cf and i into the median formula
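For the age-group table, the steps can be sketched in Python. This assumes the standard grouped-data interpolation formula, Median = L1 + ((N/2 - cf) / f) × i, which matches the symbols defined on the slide; the function name `grouped_median` and the tuple layout are hypothetical.

```python
def grouped_median(classes):
    """Estimate the median of grouped data.

    classes: list of (lower_bound, upper_bound, frequency) in ascending order.
    """
    total = sum(freq for _, _, freq in classes)  # N
    if total == 0:
        raise ValueError("empty frequency table")
    half = total / 2                             # N/2
    cumulative = 0                               # cf of the classes seen so far
    for lower, upper, freq in classes:
        if freq > 0 and cumulative + freq >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("median class not found")


age_groups = [(0, 20, 15), (20, 40, 32), (40, 60, 54), (60, 80, 30), (80, 100, 19)]

print(round(grouped_median(age_groups), 2))  # 50.37
```

Here N/2 = 75, the median class is 40-60 (cumulative frequency 101), cf = 47, f = 54 and i = 20, so the estimate is 40 + (28 / 54) × 20 ≈ 50.37.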
Measures of Central Tendency – Mode
 Mode – The mode is the most frequent value or score in the
distribution.
 It is the value of the item in a series that occurs most often.
 Example I
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110
111 115 119 120 127 128 131 131 140 162
Mode: 109 (it occurs three times)
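Counting occurrences is enough to find the mode of raw data. This is a minimal sketch using the standard-library `collections.Counter`; the helper name `modes` is chosen for illustration, and it returns a list because data can have more than one mode.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        raise ValueError("mode of empty data")
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


scores = [80, 87, 89, 93, 93, 96, 97, 98, 102, 103, 105, 106, 109, 109, 109, 110,
          111, 115, 119, 120, 127, 128, 131, 131, 140, 162]

print(modes(scores))  # [109]
```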
Measures of Central Tendency – Mode
 For grouped data, the value of the mode can be obtained with the
following formula:
$\text{Mode} = L_1 + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times i$
L1 = lower bound of the modal class
f1 = frequency of the modal class
f0 = frequency of the class before the modal class
f2 = frequency of the class after the modal class
i = class interval (width of the modal class)
Measures of Central Tendency – Mode
Monthly rent (Rs)  Number of Libraries (f)
500-1000           5
1000-1500          10
1500-2000          8
2000-2500          16
2500-3000          14
3000 & Above       12
Total              65
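For the monthly-rent table, the grouped mode can be sketched as below. This assumes the standard interpolation formula, Mode = L1 + ((f1 - f0) / (2·f1 - f0 - f2)) × i, which matches the symbols defined on the slide, and equal class widths; the function name `grouped_mode` is hypothetical, and a missing neighbouring class is treated as frequency 0.

```python
def grouped_mode(lower_bounds, frequencies, width):
    """Estimate the mode of grouped data with equal class widths."""
    if not frequencies or len(lower_bounds) != len(frequencies):
        raise ValueError("need one lower bound per frequency")
    k = frequencies.index(max(frequencies))  # position of the modal class
    f1 = frequencies[k]
    f0 = frequencies[k - 1] if k > 0 else 0                     # class before
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0  # class after
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is not defined for a flat distribution")
    return lower_bounds[k] + (f1 - f0) * width / denominator


rent_lower = [500, 1000, 1500, 2000, 2500, 3000]  # last class is "3000 & Above"
libraries = [5, 10, 8, 16, 14, 12]

print(grouped_mode(rent_lower, libraries, 500))  # 2400.0
```

The modal class is 2000-2500 with f1 = 16, f0 = 8 and f2 = 14, so the estimate is 2000 + (8 / 10) × 500 = 2400.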
Measures of Central Tendency – Mode
 Value that occurs most frequently in the data
 Empirical formula:
$\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
Symmetric vs. Skewed Data
 Median, mean and mode of
symmetric, positively and
negatively skewed data
Data Skewed Right
• Here we see that the data is skewed to the right
and the position of the Mean is to the right of the
Median.
– One may surmise that a few large values spread the
data out at the high end, pulling the mean upward.
Data Skewed Left
 Here we see that the data is skewed to the left and
the position of the Mean is to the left of the
Median.
 One may surmise that a few small values spread the
data out at the low end, pulling the mean downward.
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, M, Q3, max
 Boxplot: ends of the box are the quartiles, the median is marked, whiskers extend from the box, and
outliers are plotted individually
 Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
 Standard deviation s (or σ) is the square root of variance s² (or σ²)
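The two forms of the sample variance (the definition and the algebraic shortcut that needs only running sums) can be checked against each other. This is a minimal sketch on the data 8, 4, 2, 6, 10 from the mean example; the function names are chosen for illustration.

```python
import math


def sample_variance(values):
    """s^2 from the definition: squared deviations from the mean over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """s^2 from the sums of x and x^2 only, so it can be computed in one pass."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total * total / n) / (n - 1)


def population_variance(values):
    """sigma^2 as the mean of x^2 minus the squared mean."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one observation")
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu * mu


data = [8, 4, 2, 6, 10]

print(sample_variance(data))                    # 10.0
print(sample_variance_shortcut(data))           # 10.0
print(population_variance(data))                # 8.0
print(round(math.sqrt(sample_variance(data)), 3))  # 3.162
```

The shortcut form is convenient for large or distributed data because the sums can be accumulated piecewise, though with floating-point numbers it can lose precision when the mean is large relative to the spread.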
Summary Measures
Describing Data Numerically
 Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
 Quartiles
 Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
 Shape: Skewness
Quartiles
 Quartiles split the ranked data into 4 segments
with an equal number of values per segment
[Figure: ranked data divided into four 25% segments by Q1, Q2 and Q3]
 The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
 Q2 is the same as the median (50% are smaller, 50%
are larger)
 Only 25% of the observations are greater than the
third quartile
Quartiles
 Find a quartile by determining the value in the
appropriate position in the ranked data, where
 First quartile position : Q1 at (n+1)/4
 Second quartile position : Q2 at (n+1)/2 (median)
 Third quartile position : Q3 at 3(n+1)/4
where n is the number of observed values
Interquartile Range
 Can eliminate some outlier problems by using the
interquartile range
 Eliminate some high- and low-valued observations and
calculate the range from the remaining values
 Interquartile range = 3rd quartile – 1st quartile
= Q3 – Q1
Quartiles
Example 1: Find the median and quartiles for the data below.
12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Order the data:
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile Q1 = 5½
Median Q2 = 8
Upper Quartile Q3 = 9
Inter-Quartile Range = 9 - 5½ = 3½
Quartiles
Example 2: Find the median and quartiles for the data below.
6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
Order the data:
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile Q1 = 4
Median Q2 = 8
Upper Quartile Q3 = 10
Inter-Quartile Range = 10 - 4 = 6
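Several conventions exist for locating quartiles, and they can give slightly different values. The results in Examples 1 and 2 agree with the "median of halves" convention: Q1 is the median of the values below the middle position and Q3 is the median of the values above it. The sketch below assumes that convention; the function name `quartiles` is chosen for illustration.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]      # values below the middle position
    upper = ordered[n - half:]  # values above the middle position
    return median(lower), median(ordered), median(upper)


example_1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example_2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

q1, q2, q3 = quartiles(example_1)
print(q1, q2, q3, q3 - q1)  # 5.5 8.0 9.0 3.5

q1, q2, q3 = quartiles(example_2)
print(q1, q2, q3, q3 - q1)  # 4 8 10 6
```

When the number of values is odd, the middle value itself is left out of both halves, as in Example 2.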
Range
 Simplest measure of variation
 Difference between the largest and the smallest
observations:
Range = Xlargest – Xsmallest
 Disadvantages: ignores the distribution of the data and is
sensitive to outliers
Boxplot Analysis
 Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is the IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to
Minimum and Maximum
Drawing a Box Plot
Example 1: Draw a Box plot for the data below
4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower Quartile Q1 = 5½, Median Q2 = 8, Upper Quartile Q3 = 9
[Figure: box plot drawn on an axis from 4 to 12]
Drawing a Box Plot
Example 2: Draw a Box plot for the data below
3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
Lower Quartile Q1 = 4, Median Q2 = 8, Upper Quartile Q3 = 10
[Figure: box plot drawn on an axis from 3 to 15]
Drawing a Box Plot
Question: Stuart recorded the heights in cm of boys in his
class as shown below. Draw a box plot for this data.
137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
Lower Quartile QL = 158, Median Q2 = 171, Upper Quartile Qu = 180
[Figure: box plot drawn on an axis from 130 to 190 cm]
Drawing a Box Plot
Question: Gemma recorded the heights in cm of girls in the same class and
constructed a box plot from the data. The box plots for both boys and girls
are shown below. Use the box plots to choose some correct statements
comparing heights of boys and girls in the class. Justify your answers.
[Figure: box plots for Boys and Girls on a shared axis from 130 to 190 cm]
1. The boys are taller on average.
2. The smallest person is a girl.
3. The tallest person is a boy.
Outliers
 Outliers – Sometimes there are extreme values that are
separated from the rest of the data. These extreme
values are called outliers. Outliers affect the mean.
 The 1.5 × IQR Rule for Outliers
 Call an observation an outlier if it falls more than 1.5 ×
IQR above the third quartile or below the first quartile.
 X < Q1 – 1.5 × IQR
 X > Q3 + 1.5 × IQR
Outliers
 In the New York travel time data, we found Q1 = 15
minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
 For these data, 1.5 × IQR = 1.5(27.5) = 41.25
 Q1 – 1.5 × IQR = 15 – 41.25 = –26.25
 Q3 + 1.5 × IQR = 42.5 + 41.25 = 83.75
 A travel time cannot be negative, so nothing falls below the
lower limit; any travel time longer than 83.75 minutes is
considered an outlier.
Boxplots and outliers
 Consider our NY travel times data. Construct a boxplot.
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
Sorted:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
The maximum, 85, is an outlier by the 1.5 x IQR rule
[Figure: boxplot of the NY travel times]
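The 1.5 × IQR rule can be applied to the NY travel times in a few lines. This is a minimal sketch that assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces Q1 = 15 and Q3 = 42.5 above; the function name `iqr_outliers` and the parameter `k` are chosen for illustration.

```python
from statistics import median


def iqr_outliers(values, k=1.5):
    """Return ((lower_limit, upper_limit), outliers) under the k x IQR rule."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two observations")
    half = len(ordered) // 2
    q1 = median(ordered[:half])                 # median of the lower half
    q3 = median(ordered[len(ordered) - half:])  # median of the upper half
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in ordered if x < low or x > high]
    return (low, high), outliers


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

limits, outliers = iqr_outliers(travel_times)
print(limits)    # (-26.25, 83.75)
print(outliers)  # [85]
```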
Thank you

More Related Content

Similar to Lecture_03_04_preprocessing_aadaasdasdas.ppt

Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inferenceBhavik A Shah
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxMuhammadNafees42
 
MLPA for health care presentation smc
MLPA for health care presentation   smcMLPA for health care presentation   smc
MLPA for health care presentation smcShaun Comfort
 
Statistical ProcessesCan descriptive statistical processes b.docx
Statistical ProcessesCan descriptive statistical processes b.docxStatistical ProcessesCan descriptive statistical processes b.docx
Statistical ProcessesCan descriptive statistical processes b.docxdarwinming1
 
Ppt for 1.1 introduction to statistical inference
Ppt for 1.1 introduction to statistical inferencePpt for 1.1 introduction to statistical inference
Ppt for 1.1 introduction to statistical inferencevasu Chemistry
 
Research Method for Business chapter 11-12-14
Research Method for Business chapter 11-12-14Research Method for Business chapter 11-12-14
Research Method for Business chapter 11-12-14Mazhar Poohlah
 
Computing Descriptive Statistics © 2014 Argos.docx
 Computing Descriptive Statistics     © 2014 Argos.docx Computing Descriptive Statistics     © 2014 Argos.docx
Computing Descriptive Statistics © 2014 Argos.docxaryan532920
 
Computing Descriptive Statistics © 2014 Argos.docx
Computing Descriptive Statistics     © 2014 Argos.docxComputing Descriptive Statistics     © 2014 Argos.docx
Computing Descriptive Statistics © 2014 Argos.docxAASTHA76
 
probability and statistics-4.pdf
probability and statistics-4.pdfprobability and statistics-4.pdf
probability and statistics-4.pdfhabtamu292245
 
Data science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxData science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxswapnaraghav
 
Statistics pres 10 27 2015 roy sabo
Statistics pres 10 27 2015   roy saboStatistics pres 10 27 2015   roy sabo
Statistics pres 10 27 2015 roy sabotjcarter
 
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmMining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmWaqas Tariq
 
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmMining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmWaqas Tariq
 
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...Damian R. Mingle, MBA
 

Similar to Lecture_03_04_preprocessing_aadaasdasdas.ppt (20)

Unit 1 Introduction
Unit 1 IntroductionUnit 1 Introduction
Unit 1 Introduction
 
Sampling and statistical inference
Sampling and statistical inferenceSampling and statistical inference
Sampling and statistical inference
 
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptxSTATISTICAL PROCEDURES (Discriptive Statistics).pptx
STATISTICAL PROCEDURES (Discriptive Statistics).pptx
 
MLPA for health care presentation smc
MLPA for health care presentation   smcMLPA for health care presentation   smc
MLPA for health care presentation smc
 
Statistical ProcessesCan descriptive statistical processes b.docx
Statistical ProcessesCan descriptive statistical processes b.docxStatistical ProcessesCan descriptive statistical processes b.docx
Statistical ProcessesCan descriptive statistical processes b.docx
 
Ppt for 1.1 introduction to statistical inference
Ppt for 1.1 introduction to statistical inferencePpt for 1.1 introduction to statistical inference
Ppt for 1.1 introduction to statistical inference
 
Research Method for Business chapter 11-12-14
Research Method for Business chapter 11-12-14Research Method for Business chapter 11-12-14
Research Method for Business chapter 11-12-14
 
Presentation of BRM.pptx
Presentation of BRM.pptxPresentation of BRM.pptx
Presentation of BRM.pptx
 
Computing Descriptive Statistics © 2014 Argos.docx
 Computing Descriptive Statistics     © 2014 Argos.docx Computing Descriptive Statistics     © 2014 Argos.docx
Computing Descriptive Statistics © 2014 Argos.docx
 
Computing Descriptive Statistics © 2014 Argos.docx
Computing Descriptive Statistics     © 2014 Argos.docxComputing Descriptive Statistics     © 2014 Argos.docx
Computing Descriptive Statistics © 2014 Argos.docx
 
probability and statistics-4.pdf
probability and statistics-4.pdfprobability and statistics-4.pdf
probability and statistics-4.pdf
 
Intro to Biostat. ppt
Intro to Biostat. pptIntro to Biostat. ppt
Intro to Biostat. ppt
 
Data science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptxData science notes for ASDS calicut 2.pptx
Data science notes for ASDS calicut 2.pptx
 
Statistics pres 10 27 2015 roy sabo
Statistics pres 10 27 2015   roy saboStatistics pres 10 27 2015   roy sabo
Statistics pres 10 27 2015 roy sabo
 
Data Analysis
Data Analysis Data Analysis
Data Analysis
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
 
Statistics
StatisticsStatistics
Statistics
 
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmMining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
 
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth AlgorithmMining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
Mining of Prevalent Ailments in a Health Database Using Fp-Growth Algorithm
 
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
A discriminative-feature-space-for-detecting-and-recognizing-pathologies-of-t...
 

Recently uploaded

chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptDineshKumar4165
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projectssmsksolar
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueBhangaleSonal
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...SUHANI PANDEY
 

Recently uploaded (20)

chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
(INDIRA) Call Girl Bhosari Call Now 8617697112 Bhosari Escorts 24x7
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play

Lecture_03_04_preprocessing_aadaasdasdas.ppt

  • 1. Data Mining and Data Warehousing CSE-4107 Md. Manowarul Islam Assistant Professor, Dept. of CSE Jagannath University
  • 2. Chapter 2: Data Preprocessing (Md. Manowarul Islam, Dept. Of CSE, JnU)
      • Why preprocess the data?
      • Descriptive data summarization
      • Data cleaning
      • Data integration and transformation
      • Data reduction
      • Discretization and concept hierarchy generation
      • Summary
  • 3. What is Data Mining?
      • Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data.
      • “Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst” (Hand, Mannila, Smyth)
      • “Data mining is the discovery of models for data” (Rajaraman, Ullman)
      • We can have the following types of models:
          • Models that explain the data (e.g., a single function)
          • Models that predict future data instances
          • Models that summarize the data
          • Models that extract the most prominent features of the data
  • 4. Why do we need data mining?
      • Really huge amounts of complex data are generated from multiple sources and interconnected in different ways
          • Scientific data from different disciplines
          • Huge text collections
          • Transaction data
          • Behavioral data
          • Networked data
          • All these types of data can be combined in many ways
      • We need to analyze this data to extract knowledge
          • Knowledge can be used for commercial or scientific purposes
      • Our solutions should scale to the size of the data
  • 5. The data analysis pipeline
      • Mining is not the only step in the analysis process
      • Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data
          • Techniques: sampling, dimensionality reduction, feature selection
          • It is dirty work, but it is often the most important step of the analysis
      • Post-processing: make the data actionable and useful to the user
          • Statistical analysis of importance
          • Visualization
      • Pre- and post-processing are often data mining tasks as well
      • [Figure: Data Preprocessing → Data Mining → Result Post-processing]
  • 6. Data Quality
      • Examples of data quality problems:
          • Noise and outliers
          • Missing values
          • Duplicate data
      • Example table (annotations from the slide shown in the last column):
          Tid  Refund  Marital Status  Taxable Income  Cheat  Annotation
          1    Yes     Single          125K            No
          2    No      Married         100K            No
          3    No      Single          70K             No
          4    Yes     Married         120K            No
          5    No      Divorced        10000K          Yes    A mistake or a millionaire?
          6    No      NULL            60K             No     Missing values
          7    Yes     Divorced        220K            NULL   Missing values
          8    No      Single          85K             Yes
          9    No      Married         90K             No     Inconsistent duplicate entries
          9    No      Single          90K             No     Inconsistent duplicate entries
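The three problems flagged in the table on slide 6 can be checked mechanically. The following is a minimal sketch, not the lecturer's method: the function names are hypothetical, `None` stands in for the slide's NULL entries, and the "more than 10 times the median income" threshold is an arbitrary assumption chosen only to illustrate a noise check.

```python
from collections import Counter
from statistics import median

# Rows copied from the slide: (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
# None stands in for the NULL entries shown on the slide.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 10000, "Yes"),
    (6, "No", None, 60, "No"),
    (7, "Yes", "Divorced", 220, None),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 90, "No"),
    (9, "No", "Single", 90, "No"),
]


def rows_with_missing(rows):
    """Return the Tid of every row that has at least one missing attribute."""
    return [tid for tid, *rest in rows if any(value is None for value in rest)]


def duplicate_ids(rows):
    """Return every Tid that appears more than once."""
    counts = Counter(row[0] for row in rows)
    return sorted(tid for tid, count in counts.items() if count > 1)


def suspicious_incomes(rows, factor=10):
    """Flag incomes above factor * median income (hypothetical threshold)."""
    incomes = [row[3] for row in rows if row[3] is not None]
    limit = factor * median(incomes)
    return [row[0] for row in rows if row[3] is not None and row[3] > limit]


print("missing values in Tid:", rows_with_missing(rows))
print("duplicated Tid:", duplicate_ids(rows))
print("suspicious income in Tid:", suspicious_incomes(rows))
```

Whether row 5 is an error or a genuine high earner cannot be decided by the check itself; it only marks the row for review, which is the point the slide makes.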
  • 7. Why Data Preprocessing?
      • Data in the real world is dirty
          • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
              • e.g., occupation=“ ”
          • noisy: containing errors or outliers
              • e.g., Salary=“-10”
          • inconsistent: containing discrepancies in codes or names
              • e.g., Age=“42” Birthday=“03/07/1997”
              • e.g., Was rating “1,2,3”, now rating “A, B, C”
              • e.g., discrepancy between duplicate records
  • 8. Why Is Data Dirty?
      • Incomplete data may come from
          • “Not applicable” data value when collected
          • Different considerations between the time when the data was collected and when it is analyzed
          • Human/hardware/software problems
      • Noisy data (incorrect values) may come from
          • Faulty data collection instruments
          • Human or computer error at data entry
          • Errors in data transmission
      • Inconsistent data may come from
          • Different data sources
          • Functional dependency violation (e.g., modifying some linked data)
      • Duplicate records also need data cleaning
  • 9. Why Is Data Preprocessing Important?
      • No quality data, no quality mining results!
          • Quality decisions must be based on quality data
              • e.g., duplicate or missing data may cause incorrect or even misleading statistics
          • A data warehouse needs consistent integration of quality data
      • Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
  • 10. Major Tasks in Data Preprocessing
      • Data cleaning
          • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
      • Data integration
          • Integration of multiple databases, data cubes, or files
      • Data transformation
          • Normalization and aggregation
      • Data reduction
          • Obtains a representation that is reduced in volume but produces the same or similar analytical results
      • Data discretization
          • Part of data reduction but with particular importance, especially for numerical data
  • 11. Forms of Data Preprocessing (figure)
  • 12. Chapter 2: Data Preprocessing
      • Why preprocess the data?
      • Descriptive data summarization
      • Data cleaning
      • Data integration and transformation
      • Data reduction
      • Discretization and concept hierarchy generation
      • Summary
  • 13. Mining Data Descriptive Characteristics
      • Motivation
          • To better understand the data: central tendency, variation and spread
      • Data dispersion characteristics
          • median, max, min, quantiles, outliers, variance, etc.
      • Numerical dimensions correspond to sorted intervals
          • Data dispersion: analyzed with multiple granularities of precision
          • Boxplot or quantile analysis on sorted intervals
      • Dispersion analysis on computed measures
          • Folding measures into numerical dimensions
          • Boxplot or quantile analysis on the transformed cube
  • 14. Central Tendency
      • A measure of central tendency is a value at the center or middle of a data set.
      • Mean, median, mode
  • 15. Terminology
      • Population
          • A collection of items of interest in research
          • A complete set of things
          • A group that you wish to generalize your research to
          • An example: all the trees in Battle Park
      • Sample
          • A subset of a population
          • Its size is smaller than the size of the population
          • An example: 100 trees randomly selected from Battle Park
  • 16. Sample vs. Population (figure: a sample drawn as a subset of the population)
  • 17. Measures of Central Tendency – Mean
      • Mean: the most commonly used measure of central tendency
      • Average of all observations
      • The sum of all the scores divided by the number of scores
      • Note: this assumes that each observation is equally significant
  • 18. Measures of Central Tendency – Mean
      • Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
      • Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
  • 19. Measures of Central Tendency – Mean
      • Example I: data 8, 4, 2, 6, 10
        $\bar{x} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{8 + 4 + 2 + 6 + 10}{5} = 6$
      • Example II: sample of 10 trees randomly selected from Battle Park
          • Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
        $\bar{x} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{9.8 + 10.2 + \dots + 24.5}{10} = 14.38$
  • 20. Weighted Mean
      • We can also calculate a weighted mean using some weighting factor, e.g., what is the average income of all people in cities A, B, and C?
          City  Avg. Income  Population
          A     $23,000      100,000
          B     $20,000      50,000
          C     $25,000      150,000
      • Here, population is the weighting factor $w_i$ and the average income is the variable of interest $x_i$:
        $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
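As a sketch of the weighted mean formula applied to the city table on slide 20 (the function name `weighted_mean` is an illustrative choice, not something from the slides), note that the ordinary mean of slides 18 and 19 is the special case in which every weight equals 1:

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# City table from the slide: average income weighted by population.
incomes = [23_000, 20_000, 25_000]
populations = [100_000, 50_000, 150_000]
print(weighted_mean(incomes, populations))  # 23500.0

# Equal weights reduce to the ordinary mean (Example I and Example II).
example_1 = [8, 4, 2, 6, 10]
trees = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]
print(weighted_mean(example_1, [1] * len(example_1)))  # 6.0
print(round(weighted_mean(trees, [1] * len(trees)), 2))  # 14.38
```

The weighted result ($23,500) differs from the unweighted average of the three city averages (about $22,667) because city C, with the highest income, also has the largest population.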
  • 21. Measures of Central Tendency – Median
      • Median: the value of a variable such that half of the observations are above it and half are below it, i.e., this value divides the distribution into two groups of equal size
      • When the number of observations is odd, the median is simply equal to the middle value
      • When the number of observations is even, we take the median to be the average of the two values in the middle of the distribution
  • 22. Measures of Central Tendency – Median
      • For observations sorted in ascending order, $x_1 \le x_2 \le \dots \le x_n$:
        $\tilde{x} = \begin{cases} x_{(n+1)/2}, & n \text{ odd} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases}$
  • 23. Measures of Central Tendency – Median
      • Example I: data 8, 4, 2, 6, 10 (mean: 6)
          • Sorted: 2, 4, 6, 8, 10
          • Median: 6
      • Example II: sample of 10 trees randomly selected from Battle Park
          • Diameter (inches): 9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5 (mean: 14.38)
          • Sorted: 7.8, 9.8, 10.1, 10.2, 13.9, 14.5, 15.5, 17.5, 20.0, 24.5
          • Median: (13.9 + 14.5) / 2 = 14.2
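The odd and even cases of the median can be sketched as follows. This is a plain illustration (Python's standard library already offers `statistics.median`); the name `median_of` is hypothetical, and the function sorts a copy of the input because the rule only applies to ordered data.

```python
def median_of(values):
    """Median of a non-empty sequence: middle value, or mean of the two middle values."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value (1-based position (n + 1) / 2).
        return data[mid]
    # Even n: average of the values at 1-based positions n / 2 and n / 2 + 1.
    return (data[mid - 1] + data[mid]) / 2


example_1 = [8, 4, 2, 6, 10]
trees = [9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5]
print(median_of(example_1))        # middle of 2, 4, 6, 8, 10
print(round(median_of(trees), 2))  # average of 13.9 and 14.5
```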
  • 24. Measures of Central Tendency – Median
      • The median of a continuous (grouped) frequency distribution is calculated with the following formula:
        $\text{Median} = L_1 + \frac{N/2 - cf}{f} \times i$
          • $L_1$ = lower bound of the median class
          • $N$ = total population (sum of all frequencies)
          • $f$ = frequency of the median class
          • $cf$ = cumulative frequency of the class preceding the median class
          • $i$ = class interval (class width)
  • 25. Measures of Central Tendency – Median
          Age Group  Frequency (f)  Cumulative frequency (cf)
          0-20       15             15
          20-40      32             47
          40-60      54             101
          60-80      30             131
          80-100     19             150
          Total      150
  • 26. Measures of Central Tendency – Median
      • Using the age-group table from slide 25:
          • Find the value of N/2: 150 / 2 = 75
          • The median class is the first class whose cumulative frequency reaches N/2: 40-60 (cf = 101)
          • Find cf for the class preceding the median class: 47
          • So $L_1$ = 40, $f$ = 54, $cf$ = 47 and $i$ = 20
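The grouped-median steps can be sketched in code. This assumes the standard interpolation formula Median = L1 + ((N/2 − cf) / f) × i, with the symbols defined on the slide; the function name and the `(lower, upper, frequency)` tuple layout are illustrative choices.

```python
def grouped_median(classes):
    """Median of grouped data.

    classes: list of (lower_bound, upper_bound, frequency), sorted by lower bound.
    """
    total = sum(freq for _, _, freq in classes)
    if total == 0:
        raise ValueError("total frequency must be positive")
    half = total / 2  # N / 2
    cumulative = 0    # cf of the classes before the current one
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            # First class whose cumulative frequency reaches N / 2: the median class.
            width = upper - lower  # class interval i
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("no median class found")


age_groups = [(0, 20, 15), (20, 40, 32), (40, 60, 54), (60, 80, 30), (80, 100, 19)]
print(round(grouped_median(age_groups), 2))
```

For the age-group table this evaluates 40 + ((75 − 47) / 54) × 20, which is approximately 50.37.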
  • 27. Measures of Central Tendency – Mode
      • Mode: the most frequent value or score in the distribution
      • It is defined as the value of the item in a series that occurs most often
      • Example I: 80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
          • Mode: 109 (it occurs three times, more than any other value)
  • 28. Measures of Central Tendency – Mode
      • For grouped data, the value of the mode can be obtained with the following formula:
        $\text{Mode} = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
          • $L_1$ = lower bound of the modal class
          • $f_1$ = frequency of the modal class
          • $f_0$ = frequency of the previous class
          • $f_2$ = frequency of the next class
          • $i$ = class interval (class width)
  • 29. Measures of Central Tendency – Mode
          Monthly rent (Rs)  Number of Libraries (f)
          500-1000           5
          1000-1500          10
          1500-2000          8
          2000-2500          16
          2500-3000          14
          3000 & Above       12
          Total              65
  • 30. Measures of Central Tendency – Mode
      • Using the monthly-rent table from slide 29, the modal class is 2000-2500 (highest frequency, 16)
      • So $L_1$ = 2000, $f_1$ = 16, $f_0$ = 8, $f_2$ = 14 and $i$ = 500
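A minimal sketch of the grouped-mode calculation, assuming the standard formula Mode = L1 + ((f1 − f0) / (2·f1 − f0 − f2)) × i with the symbols defined on the slide. The function name, the `(lower_bound, frequency)` layout, and treating a missing neighbouring class as frequency 0 are assumptions made for illustration.

```python
def grouped_mode(classes, width):
    """Mode of grouped data with equal class width.

    classes: list of (lower_bound, frequency), sorted by lower bound.
    width:   common class interval i.
    """
    freqs = [freq for _, freq in classes]
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    f1 = freqs[k]
    # Assumption: a missing neighbour (modal class first or last) counts as frequency 0.
    f0 = freqs[k - 1] if k > 0 else 0
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula undefined: modal class equals both neighbours")
    return classes[k][0] + (f1 - f0) * width / denominator


# Monthly rent table from the slide; "3000 & Above" is entered by its lower bound.
rent_classes = [(500, 5), (1000, 10), (1500, 8), (2000, 16), (2500, 14), (3000, 12)]
print(grouped_mode(rent_classes, width=500))
```

For the rent table this evaluates 2000 + ((16 − 8) / (32 − 8 − 14)) × 500 = 2400.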
  • 31. Measures of Central Tendency – Mode
      • Value that occurs most frequently in the data
      • Empirical formula: $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
  • 32. Symmetric vs. Skewed Data
      • Median, mean and mode of symmetric, positively and negatively skewed data (figure)
  • 33. Data Skewed Right
      • Here we see that the data is skewed to the right and the position of the mean is to the right of the median.
          • One may surmise that there are values spreading the data out at the high end, thereby pulling the mean upward.
  • 34. Data Skewed Left
      • Here we see that the data is skewed to the left and the position of the mean is to the left of the median.
          • One may surmise that there are values spreading the data out at the low end, thereby pulling the mean downward.
  • 35. Measuring the Dispersion of Data
      • Quartiles, outliers and boxplots
          • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
          • Inter-quartile range: IQR = Q3 − Q1
          • Five-number summary: min, Q1, M (median), Q3, max
          • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
          • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
      • Variance and standard deviation (sample: s, population: σ)
          • Variance (algebraic, scalable computation):
            $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
            $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
          • Standard deviation s (or σ) is the square root of variance $s^2$ (or $\sigma^2$)
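The two forms of the sample variance (deviations from the mean, and the "sum of squares" shortcut) should agree, which the sketch below checks on the data 8, 4, 2, 6, 10 used in the earlier mean example. The function names are illustrative; Python's standard library provides equivalents in `statistics.variance` and `statistics.pvariance`.

```python
import math


def sample_variance(xs):
    """s^2 = sum((x - mean)^2) / (n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1); one pass over the data."""
    n = len(xs)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = sum((x - mu)^2) / N."""
    n = len(xs)
    if n == 0:
        raise ValueError("population variance needs at least one observation")
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


data = [8, 4, 2, 6, 10]
print(sample_variance(data))             # 10.0
print(sample_variance_shortcut(data))    # 10.0
print(population_variance(data))         # 8.0
print(math.sqrt(sample_variance(data)))  # sample standard deviation s
```

The shortcut form only needs running totals of x and x², which is why the slide calls the computation algebraic and scalable; with floating-point data of large magnitude it can lose precision, so the deviation form is the safer default.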
  • 36. Summary Measures: Describing Data Numerically
      • Central Tendency: Arithmetic Mean, Median, Mode, Geometric Mean
      • Quartiles
      • Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
      • Shape: Skewness
  • 37. Quartiles
      • Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% | Q1 | 25% | Q2 | 25% | Q3 | 25%)
      • The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
      • Q2 is the same as the median (50% are smaller, 50% are larger)
      • Only 25% of the observations are greater than the third quartile, Q3
  • 38. Quartiles
      • Find a quartile by determining the value in the appropriate position in the ranked data, where n is the number of observed values:
          • First quartile position: Q1 at (n+1)/4
          • Second quartile position: Q2 at (n+1)/2 (the median)
          • Third quartile position: Q3 at 3(n+1)/4
  • 39. Interquartile Range
      • Can eliminate some outlier problems by using the interquartile range
      • Eliminate some high- and low-valued observations and calculate the range from the remaining values
      • Interquartile range = 3rd quartile − 1st quartile = Q3 − Q1
  • 40. Quartiles
      • Example 1: Find the median and quartiles for the data below.
          • Data: 12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
          • Order the data: 4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
          • Lower quartile Q1 = 5½
          • Median Q2 = 8
          • Upper quartile Q3 = 9
          • Inter-quartile range = 9 − 5½ = 3½
  • 41. Quartiles
      • Example 2: Find the median and quartiles for the data below.
          • Data: 6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10
          • Order the data: 3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
          • Lower quartile Q1 = 4
          • Median Q2 = 8
          • Upper quartile Q3 = 10
          • Inter-quartile range = 10 − 4 = 6
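Several quartile conventions exist, and they can give different answers on small data sets. The worked examples on slides 40 and 41 appear to use the "median of each half" convention (the median itself is left out of both halves when n is odd), so the sketch below assumes that convention; the function name `quartiles` is an illustrative choice.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


example_1 = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
example_2 = [6, 3, 9, 8, 4, 10, 8, 4, 15, 8, 10]

for name, data in (("Example 1", example_1), ("Example 2", example_2)):
    q1, q2, q3 = quartiles(data)
    print(name, "Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", q3 - q1)
```

Library routines such as `statistics.quantiles` interpolate between positions by default and may therefore report slightly different quartiles for the same data.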
  • 42. Range
      • Simplest measure of variation
      • Difference between the largest and the smallest observations: Range = X_largest − X_smallest
      • Disadvantages: it ignores the distribution of the data and is sensitive to outliers
  • 43. Boxplot Analysis
      • Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
      • Boxplot
          • Data is represented with a box
          • The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
          • The median is marked by a line within the box
          • Whiskers: two lines outside the box extend to Minimum and Maximum
  • 44. Drawing a Box Plot
      • Example 1: Draw a box plot for the data below.
          • Data: 4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
          • Lower quartile Q1 = 5½, median Q2 = 8, upper quartile Q3 = 9
          • [Figure: box plot on an axis from 4 to 12]
  • 45. Drawing a Box Plot
      • Example 2: Draw a box plot for the data below.
          • Data: 3, 4, 4, 6, 8, 8, 8, 9, 10, 10, 15
          • Lower quartile Q1 = 4, median Q2 = 8, upper quartile Q3 = 10
          • [Figure: box plot on an axis from 3 to 15]
  • 46. Drawing a Box Plot
      • Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.
          • Data: 137, 148, 155, 158, 165, 166, 166, 171, 171, 173, 175, 180, 184, 186, 186
          • Lower quartile QL = 158, median Q2 = 171, upper quartile QU = 180
          • [Figure: box plot on an axis from 130 to 190 cm]
  • 47. Drawing a Box Plot
      • Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct statements comparing heights of boys and girls in the class. Justify your answers.
          • [Figure: box plots for boys and girls on an axis from 130 to 190 cm]
          • 1. The boys are taller on average.
          • 2. The smallest person is a girl.
          • 3. The tallest person is a boy.
  • 48. Outliers
      • Outliers: sometimes there are extreme values that are separated from the rest of the data. These extreme values are called outliers. Outliers affect the mean.
      • The 1.5 × IQR rule for outliers
          • Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile:
              • X < Q1 − 1.5 × IQR
              • X > Q3 + 1.5 × IQR
  • 49. Outliers
      • In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
      • For these data, 1.5 × IQR = 1.5(27.5) = 41.25
          • Q1 − 1.5 × IQR = 15 − 41.25 = −26.25
          • Q3 + 1.5 × IQR = 42.5 + 41.25 = 83.75
      • Travel times cannot be negative, so no observation can fall below the lower limit of −26.25 minutes. Any travel time longer than 83.75 minutes is considered an outlier.
  • 50. Boxplots and outliers
      • Consider our NY travel times data. Construct a boxplot.
          • Data: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
          • Sorted: 5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
          • Min = 5, Q1 = 15, M = 22.5, Q3 = 42.5, Max = 85
          • Max = 85 is an outlier by the 1.5 × IQR rule
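The 1.5 × IQR rule can be applied to the travel-time data in a few lines. This is a sketch under the assumption that quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces the Q1 = 15 and Q3 = 42.5 quoted on the slides; the function names are illustrative.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) for the k * IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return lower_fence, upper_fence, outliers


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(travel_times)
low, high, outliers = iqr_outliers(travel_times)
print("five-number summary:", min(travel_times), q1, m, q3, max(travel_times))
print("fences:", low, high)
print("outliers:", outliers)
```

When drawing the boxplot, a common choice is to end the upper whisker at the largest value inside the fence (65 here) and plot 85 as an individual point.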
  • 51. Thank you