SlideShare a Scribd company logo
1
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., occupation=― ‖
noisy: containing errors or outliers
e.g., Salary=―-10‖
inconsistent: containing discrepancies in codes or names
e.g., Age=―42‖ Birthday=―03/07/1997‖
e.g., Was rating ―1,2,3‖, now rating ―A, B, C‖
e.g., discrepancy between duplicate records

2
 Data Cleaning / Data Cleansing
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies

 Data Integration
Integration of multiple databases, data cubes, or files
 Data Transformation
Normalization and aggregation
 Data Reduction
Obtains reduced representation in volume but produces
the same or similar analytical results

3
DESCRIPTIVE DATA
SUMMARIZATION

Measuring the Central Tendency

Distributive
Measure

Algebraic
Measure

Holistic
Measure

Measuring the Dispersion of Data

Range, Quartiles,
Outliers and
Boxplots

Variance
and
Standard
Deviation

Graphic
Displays

4
The variance of N observations, x1, x2, ....xN is

1
2 =
N

N

( xi x)
i1

2

1
N

2
i

x

1
( xi )2
N

5
Where

x

is the mean value of observations

The Standard deviation is 
The  of the observations is the square root of the

variance,  2

6
The basic properties of the standard deviation ,
  measures spread about the mean and should be used only
when the mean is chosen as the measure of center
  = 0 only when there is no spread, that is, when all
observations have the same value. Otherwise  > 0

7


Apart from the bar charts, pie charts and line graphs, there

are other popular types of graphs for the display of data
summaries and distributions.

Histograms

Quantile plots
q-q plots
Scatter plots
Curves
8
 Plotting histograms, or frequency histograms, is a graphical method
for summarizing the distribution of a given attribute.
 A histogram for an attribute A partitions into subsets or buckets.
 The width of each bucket is uniform
 Each bucket is represented by a rectangle
 The resulting graph is referred to as Bar chart

9
14
12
10
8
6
4
2
0

Series 3
Series 2

Series 1

10
 It is a simple and effective way for a univariate data distribution
 First, it displays all of the data for the given attribute
 Second, it plots quantile information
 The mechanism is slightly differ from percentile computation
 fi = i – 0.5 / N

11
12
 A quantile-quantile plot, or q-q plot graphs the quantiles of one

variate distribution against the corresponding quantiles of another

 It is a powerful visualization tool that allows the user to view
whether there is a shift in going from one distribution to another.

13
6

5
4
3

Series 1

2

Series 2

1

Series 3

0
Category Category Category Category
1

2

3

4

14
 A Scatter plot used for determining if there appears to be a

relationship, pattern, or trend between two numerical attributes
 To construct a scatter plot, each pair of values is treated as a pair of
coordinates in an algebraic sense and plotted as points in the plane
 Bivariate data to see clusters of points and outliers or correlation
relationships

15
Y-Values
3.5
3
2.5
2
1.5

Y-Values

1
0.5

0
0

1

2

3

16
Y-Values
3.5
3
2.5
2
1.5

Y-Values

1
0.5

0
0

1

2

3

17
Y-Values
3.5
3
2.5
2
1.5

Y-Values

1
0.5

0
0

1

2

3

18
 It is another graphic aid that adds a smooth curve to a scatter plot in order

to provide better perception of the pattern of dependence
 The word loess is short for ―local regression‖

 There are two values are needed that is smoothing parameter () and , the
degree of the polynomials that are fitted by regression

19
Y-Values
3.5
3
2.5
2
1.5

Y-Values

1
0.5

0
0

1

2

3

20
 Descriptive data summaries provide valuable insight into the
overall behavior of our data
 By helping to identify noise and outliers, they are especially useful
for data cleaning

21
 Data Cleaning or Data Cleansing routines attempt to fill in missing
values,

smooth

out

noise

while

identifying

outliers,

and

correct

inconsistencies in the data.
 Handling missing values

 Data Smoothing techniques
 Data Cleaning as a process

22
Many tuples have no recorded value for several attributes
Methods:

1. Ignore the tuple :
This is usually done when the class label is missing
It is not very effective
2. Fill in the missing value manually:

It is time-consuming and may not be feasible in large data set
3. Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant

23
4. Use the attribute mean to fill in the missing value:
Use the mean value to replace the missing value for particular
attribute
5. Use the attribute mean for all samples belonging to the same class as the
given tuple:
if classifying customers according to credit_risk, replace the missing
value with the average income value for customers
6. Use the most probable value to fill in the missing value:
This may be determined with regression, inference-based tools using
a Bayesian formalism, or decision tree induction.

24
 Methods 3 to 6 bias the data. The filled-in value may not be correct

 Method 6 is a popular strategy to predict missing values
 By considering the other values of the other attributes in its estimation
of the missing value
 In some cases, a missing value may not imply an error in the data!
Ex: Phone number in some application form  NULL

25
 What is Noise?

 Noise is a random error or variance in a measured variable.
 Data Smoothing techniques:
Binning
Regression
Clustering

26
 Binning

first sort data and partition into (equal-frequency) bins then
one can smooth by bin means, smooth by bin median, smooth by bin
boundaries, etc.
 Regression
smooth by fitting the data into regression functions
 Clustering
detect and remove outliers

27
Combines data from multiple sources into a coherent store
 Schema integration:
Integrate metadata from different sources
 Entity identification problem:
Identify real world entities from multiple data sources,
e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
For the same real world entity, attribute values from different
sources are different

28
Redundant data occur often when integration of multiple databases
Object identification: The same attribute or object may have

different names in different databases
Derivable data: One attribute may be a ―derived‖ attribute in
another table, e.g., annual revenue
Redundant attributes may be able to be detected by correlation analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality

29
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization

z-score normalization
normalization by decimal scaling

30


Min-max normalization: to [new_minA, new_maxA]

v'

v minA
(new _ maxA new _ minA) new _ minA
maxA minA

◦ Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0].
Then $73,000 is mapped to 73,600 12,000 (1.0 0) 0 0.716
98,000 12 ,000





Z-score normalization (μ: mean, σ: standard deviation):
v
A
v'
73,600 54 ,000
A
1.225
◦ Ex. Let μ = 54,000, σ = 16,000. Then
Normalization by decimal scaling

v
v'
10 j

16 ,000

Where j is the smallest integer such that Max(|ν’|) < 1

31
Why data reduction?
 A database/data warehouse may store terabytes of data
 Complex data analysis/mining may take a very long time to run on
the complete data set
Data reduction
Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same)
analytical results
Data reduction strategies
o Data cube aggregation:
o Dimensionality reduction — e.g., remove unimportant attributes
o Data Compression
o Numerosity reduction — e.g., fit data into models
o Discretization and concept hierarchy generation

32
 Data preparation or preprocessing is a big issue for both data
warehousing and data mining
 Discriptive data summarization is need for quality data preprocessing

 Data preparation includes,

 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 A lot a methods have been developed but data preprocessing still an active
area of research

33
34

More Related Content

What's hot

Random forest
Random forestRandom forest
Random forest
Musa Hawamdah
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
kayathri02
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
AnjaliSoorej
 
Lda
LdaLda
Data preparation
Data preparationData preparation
Data preparation
Harry Potter
 
Data PreProcessing
Data PreProcessingData PreProcessing
Data PreProcessing
tdharmaputhiran
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
tdharmaputhiran
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
Learnbay Datascience
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Harry Potter
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
Krish_ver2
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
R A Akerkar
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
Usha Vijay
 
Data reduction
Data reductionData reduction
Data reduction
GowriLatha1
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Jason Rodrigues
 

What's hot (14)

Random forest
Random forestRandom forest
Random forest
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Reduction Stratergies
Data Reduction StratergiesData Reduction Stratergies
Data Reduction Stratergies
 
Lda
LdaLda
Lda
 
Data preparation
Data preparationData preparation
Data preparation
 
Data PreProcessing
Data PreProcessingData PreProcessing
Data PreProcessing
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 
Principal Component Analysis and Clustering
Principal Component Analysis and ClusteringPrincipal Component Analysis and Clustering
Principal Component Analysis and Clustering
 
Data reduction
Data reductionData reduction
Data reduction
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Viewers also liked

Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
Georgios Ath. Kounis
 
Trade & development report 2013overview en
Trade & development report 2013overview enTrade & development report 2013overview en
Trade & development report 2013overview en
UnitedPac Saint Lucia (Conservative Movement)
 
教案2
教案2教案2
教案2Amy Li
 
Powerpoint tgz akhir peng.pendidikan
Powerpoint tgz akhir peng.pendidikanPowerpoint tgz akhir peng.pendidikan
Powerpoint tgz akhir peng.pendidikan
nhiiyylhakirei
 
Lorain
LorainLorain
教學簡報
教學簡報教學簡報
教學簡報Amy Li
 
Uwp Constitution 1998 Edition
Uwp Constitution 1998 EditionUwp Constitution 1998 Edition
Uwp Constitution 1998 Edition
UnitedPac Saint Lucia (Conservative Movement)
 
Conc chm กสพท54
Conc chm กสพท54Conc chm กสพท54
Conc chm กสพท54noeiinoii
 
Plano para desenvolver projeto com blog
Plano para desenvolver  projeto com blogPlano para desenvolver  projeto com blog
Plano para desenvolver projeto com blog
Elizangela Soares
 
The Trinity Twist
The Trinity TwistThe Trinity Twist
The Trinity Twist
Gregory Walker
 
Torch presentation
Torch presentationTorch presentation
дефиле
дефиледефиле
дефиле
Chalenko
 
06 e
06 e06 e
06 e
noeiinoii
 
งานโครงงานคอม นำเสนอ
งานโครงงานคอม นำเสนอ งานโครงงานคอม นำเสนอ
งานโครงงานคอม นำเสนอ
noeiinoii
 
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
UnitedPac Saint Lucia (Conservative Movement)
 
Nuovi arrivi in biblioteca6
Nuovi arrivi in biblioteca6Nuovi arrivi in biblioteca6
Nuovi arrivi in biblioteca6Federica Pucci
 
the deluded dragon
the deluded dragon the deluded dragon
the deluded dragon
Vevi WulanSari
 
Freedom from IT: How to Give Power Back to Marketing and Merchandising Teams
Freedom from IT: How to Give Power Back to Marketing and Merchandising TeamsFreedom from IT: How to Give Power Back to Marketing and Merchandising Teams
Freedom from IT: How to Give Power Back to Marketing and Merchandising Teams
Mozu
 
Weekly mcx newsletter 23 dec 2013
Weekly mcx newsletter 23 dec 2013Weekly mcx newsletter 23 dec 2013
Weekly mcx newsletter 23 dec 2013
Rakhi Tips Provider
 
Hari kantin sekolah sek men keb tanjung dawai
Hari kantin sekolah sek men keb tanjung dawaiHari kantin sekolah sek men keb tanjung dawai
Hari kantin sekolah sek men keb tanjung dawai
Fasha Rashada
 

Viewers also liked (20)

Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
Θερμικές Κάμερες MOBOTIX - Αρχές λειτουργίας & Εφαρμογές.
 
Trade & development report 2013overview en
Trade & development report 2013overview enTrade & development report 2013overview en
Trade & development report 2013overview en
 
教案2
教案2教案2
教案2
 
Powerpoint tgz akhir peng.pendidikan
Powerpoint tgz akhir peng.pendidikanPowerpoint tgz akhir peng.pendidikan
Powerpoint tgz akhir peng.pendidikan
 
Lorain
LorainLorain
Lorain
 
教學簡報
教學簡報教學簡報
教學簡報
 
Uwp Constitution 1998 Edition
Uwp Constitution 1998 EditionUwp Constitution 1998 Edition
Uwp Constitution 1998 Edition
 
Conc chm กสพท54
Conc chm กสพท54Conc chm กสพท54
Conc chm กสพท54
 
Plano para desenvolver projeto com blog
Plano para desenvolver  projeto com blogPlano para desenvolver  projeto com blog
Plano para desenvolver projeto com blog
 
The Trinity Twist
The Trinity TwistThe Trinity Twist
The Trinity Twist
 
Torch presentation
Torch presentationTorch presentation
Torch presentation
 
дефиле
дефиледефиле
дефиле
 
06 e
06 e06 e
06 e
 
งานโครงงานคอม นำเสนอ
งานโครงงานคอม นำเสนอ งานโครงงานคอม นำเสนอ
งานโครงงานคอม นำเสนอ
 
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
Grynberg pulls ahead in case against kenny anthony and the stlucia labour par...
 
Nuovi arrivi in biblioteca6
Nuovi arrivi in biblioteca6Nuovi arrivi in biblioteca6
Nuovi arrivi in biblioteca6
 
the deluded dragon
the deluded dragon the deluded dragon
the deluded dragon
 
Freedom from IT: How to Give Power Back to Marketing and Merchandising Teams
Freedom from IT: How to Give Power Back to Marketing and Merchandising TeamsFreedom from IT: How to Give Power Back to Marketing and Merchandising Teams
Freedom from IT: How to Give Power Back to Marketing and Merchandising Teams
 
Weekly mcx newsletter 23 dec 2013
Weekly mcx newsletter 23 dec 2013Weekly mcx newsletter 23 dec 2013
Weekly mcx newsletter 23 dec 2013
 
Hari kantin sekolah sek men keb tanjung dawai
Hari kantin sekolah sek men keb tanjung dawaiHari kantin sekolah sek men keb tanjung dawai
Hari kantin sekolah sek men keb tanjung dawai
 

Similar to Data mining

Pre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_techniquePre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_technique
Bhushan134837
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessing
Krish_ver2
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
DHIVYADEVAKI
 
Data processing
Data processingData processing
Data processing
Sania Shoaib
 
1234
12341234
Data preperation
Data preperationData preperation
Data preperation
Hoang Nguyen
 
Data preperation
Data preperationData preperation
Data preperation
Fraboni Ec
 
Data preperation
Data preperationData preperation
Data preperation
Luis Goldster
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
ImXaib
 
Data preparation
Data preparationData preparation
Data preparation
Young Alista
 
Data preparation
Data preparationData preparation
Data preparation
James Wong
 
Data preparation
Data preparationData preparation
Data preparation
Tony Nguyen
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
Chandrika Sweety
 
Data Mining
Data MiningData Mining
Data Mining
Jay Nagar
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Tony Nguyen
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
James Wong
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Young Alista
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Luis Goldster
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Fraboni Ec
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Hoang Nguyen
 

Similar to Data mining (20)

Pre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_techniquePre_processing_the_data_using_advance_technique
Pre_processing_the_data_using_advance_technique
 
1.6.data preprocessing
1.6.data preprocessing1.6.data preprocessing
1.6.data preprocessing
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data processing
Data processingData processing
Data processing
 
1234
12341234
1234
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
Data preperation
Data preperationData preperation
Data preperation
 
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
prvg4sczsginx3ynyqlc-signature-b84f0cf1da1e7d0fde4ecfab2a28f243cfa561f9aa2c9b...
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
Data preparation
Data preparationData preparation
Data preparation
 
Datapreprocessing
DatapreprocessingDatapreprocessing
Datapreprocessing
 
Data Mining
Data MiningData Mining
Data Mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 

Recently uploaded

Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
MJDuyan
 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
deepaannamalai16
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
haiqairshad
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
mulvey2
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
BoudhayanBhattachari
 
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxBeyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
EduSkills OECD
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Henry Hollis
 
Wound healing PPT
Wound healing PPTWound healing PPT
Wound healing PPT
Jyoti Chand
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdfREASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
giancarloi8888
 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
PsychoTech Services
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
TechSoup
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
History of Stoke Newington
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
Himanshu Rai
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Denish Jangid
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
Steve Thomason
 

Recently uploaded (20)

Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) CurriculumPhilippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
Philippine Edukasyong Pantahanan at Pangkabuhayan (EPP) Curriculum
 
Standardized tool for Intelligence test.
Standardized tool for Intelligence test.Standardized tool for Intelligence test.
Standardized tool for Intelligence test.
 
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skillsspot a liar (Haiqa 146).pptx Technical writhing and presentation skills
spot a liar (Haiqa 146).pptx Technical writhing and presentation skills
 
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptxC1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
C1 Rubenstein AP HuG xxxxxxxxxxxxxx.pptx
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
 
B. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdfB. Ed Syllabus for babasaheb ambedkar education university.pdf
B. Ed Syllabus for babasaheb ambedkar education university.pdf
 
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptxBeyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
Beyond Degrees - Empowering the Workforce in the Context of Skills-First.pptx
 
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.pptLevel 3 NCEA - NZ: A  Nation In the Making 1872 - 1900 SML.ppt
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.ppt
 
Wound healing PPT
Wound healing PPTWound healing PPT
Wound healing PPT
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdfREASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
REASIGNACION 2024 UGEL CHUPACA 2024 UGEL CHUPACA.pdf
 
Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...Gender and Mental Health - Counselling and Family Therapy Applications and In...
Gender and Mental Health - Counselling and Family Therapy Applications and In...
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
Walmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdfWalmart Business+ and Spark Good for Nonprofits.pdf
Walmart Business+ and Spark Good for Nonprofits.pdf
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
 
The History of Stoke Newington Street Names
The History of Stoke Newington Street NamesThe History of Stoke Newington Street Names
The History of Stoke Newington Street Names
 
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem studentsRHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
RHEOLOGY Physical pharmaceutics-II notes for B.pharm 4th sem students
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
Chapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptxChapter wise All Notes of First year Basic Civil Engineering.pptx
Chapter wise All Notes of First year Basic Civil Engineering.pptx
 
A Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two HeartsA Visual Guide to 1 Samuel | A Tale of Two Hearts
A Visual Guide to 1 Samuel | A Tale of Two Hearts
 

Data mining

  • 1. 1
  • 2. Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=― ‖ noisy: containing errors or outliers e.g., Salary=―-10‖ inconsistent: containing discrepancies in codes or names e.g., Age=―42‖ Birthday=―03/07/1997‖ e.g., Was rating ―1,2,3‖, now rating ―A, B, C‖ e.g., discrepancy between duplicate records 2
  • 3.  Data Cleaning / Data Cleansing Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies  Data Integration Integration of multiple databases, data cubes, or files  Data Transformation Normalization and aggregation  Data Reduction Obtains reduced representation in volume but produces the same or similar analytical results 3
  • 4. DESCRIPTIVE DATA SUMMARIZATION Measuring the Central Tendency Distributive Measure Algebraic Measure Holistic Measure Measuring the Dispersion of Data Range, Quartiles, Outliers and Boxplots Variance and Standard Deviation Graphic Displays 4
  • 5. The variance of N observations, x1, x2, ....xN is 1 2 = N N ( xi x) i1 2 1 N 2 i x 1 ( xi )2 N 5
  • 6. Where x is the mean value of observations The Standard deviation is  The  of the observations is the square root of the variance,  2 6
  • 7. The basic properties of the standard deviation ,   measures spread about the mean and should be used only when the mean is chosen as the measure of center   = 0 only when there is no spread, that is, when all observations have the same value. Otherwise  > 0 7
  • 8.  Apart from the bar charts, pie charts and line graphs, there are other popular types of graphs for the display of data summaries and distributions. Histograms Quantile plots q-q plots Scatter plots Curves 8
  • 9.  Plotting histograms, or frequency histograms, is a graphical method for summarizing the distribution of a given attribute.  A histogram for an attribute A partitions into subsets or buckets.  The width of each bucket is uniform  Each bucket is represented by a rectangle  The resulting graph is referred to as Bar chart 9
  • 11.  It is a simple and effective way for a univariate data distribution  First, it displays all of the data for the given attribute  Second, it plots quantile information  The mechanism is slightly differ from percentile computation  fi = i – 0.5 / N 11
  • 12. 12
  • 13.  A quantile-quantile plot, or q-q plot graphs the quantiles of one variate distribution against the corresponding quantiles of another  It is a powerful visualization tool that allows the user to view whether there is a shift in going from one distribution to another. 13
  • 14. 6 5 4 3 Series 1 2 Series 2 1 Series 3 0 Category Category Category Category 1 2 3 4 14
  • 15.  A Scatter plot used for determining if there appears to be a relationship, pattern, or trend between two numerical attributes  To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane  Bivariate data to see clusters of points and outliers or correlation relationships 15
  • 19.  It is another graphic aid that adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence  The word loess is short for ―local regression‖  There are two values are needed that is smoothing parameter () and , the degree of the polynomials that are fitted by regression 19
  • 21.  Descriptive data summaries provide valuable insight into the overall behavior of our data  By helping to identify noise and outliers, they are especially useful for data cleaning 21
  • 22.  Data Cleaning or Data Cleansing routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.  Handling missing values  Data Smoothing techniques  Data Cleaning as a process 22
  • 23. Many tuples have no recorded value for several attributes Methods: 1. Ignore the tuple : This is usually done when the class label is missing It is not very effective 2. Fill in the missing value manually: It is time-consuming and may not be feasible in large data set 3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant 23
  • 24. 4. Use the attribute mean to fill in the missing value: Use the mean value to replace the missing value for particular attribute 5. Use the attribute mean for all samples belonging to the same class as the given tuple: if classifying customers according to credit_risk, replace the missing value with the average income value for customers 6. Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. 24
  • 25.  Methods 3 to 6 bias the data. The filled-in value may not be correct  Method 6 is a popular strategy to predict missing values  By considering the other values of the other attributes in its estimation of the missing value  In some cases, a missing value may not imply an error in the data! Ex: Phone number in some application form  NULL 25
  • 26.  What is Noise?  Noise is a random error or variance in a measured variable.  Data Smoothing techniques: Binning Regression Clustering 26
  • 27.  Binning first sort data and partition into (equal-frequency) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.  Regression smooth by fitting the data into regression functions  Clustering detect and remove outliers 27
  • 28. Combines data from multiple sources into a coherent store  Schema integration: Integrate metadata from different sources  Entity identification problem: Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton  Detecting and resolving data value conflicts For the same real world entity, attribute values from different sources are different 28
  • 29. Redundant data occur often when integration of multiple databases Object identification: The same attribute or object may have different names in different databases Derivable data: One attribute may be a ―derived‖ attribute in another table, e.g., annual revenue Redundant attributes may be able to be detected by correlation analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality 29
  • 30. Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization normalization by decimal scaling 30
  • 31.  Min-max normalization: to [new_minA, new_maxA] v' v minA (new _ maxA new _ minA) new _ minA maxA minA ◦ Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to 73,600 12,000 (1.0 0) 0 0.716 98,000 12 ,000   Z-score normalization (μ: mean, σ: standard deviation): v A v' 73,600 54 ,000 A 1.225 ◦ Ex. Let μ = 54,000, σ = 16,000. Then Normalization by decimal scaling v v' 10 j 16 ,000 Where j is the smallest integer such that Max(|ν’|) < 1 31
  • 32. Why data reduction?  A database/data warehouse may store terabytes of data  Complex data analysis/mining may take a very long time to run on the complete data set Data reduction Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results Data reduction strategies o Data cube aggregation: o Dimensionality reduction — e.g., remove unimportant attributes o Data Compression o Numerosity reduction — e.g., fit data into models o Discretization and concept hierarchy generation 32
  • 33.  Data preparation or preprocessing is a big issue for both data warehousing and data mining  Discriptive data summarization is need for quality data preprocessing  Data preparation includes,  Data cleaning and data integration  Data reduction and feature selection  Discretization  A lot a methods have been developed but data preprocessing still an active area of research 33
  • 34. 34