SlideShare a Scribd company logo
1 of 25
Data Processing
1. Objectives....................................................................................2
2. Why Is Data Dirty?......................................................................2
3. Why Is Data Preprocessing Important?.......................................3
4. Major Tasks in Data Processing..................................................4
5. Forms of Data Processing:...........................................................5
6. Data Cleaning...............................................................................6
7. Missing Data................................................................................6
8. Noisy Data...................................................................................7
9. Simple Discretization Methods: Binning.....................................8
10. Cluster Analysis.......................................................................11
11. Regression................................................................................12
12. Data Integration.......................................................................13
13. Data Transformation................................................................15
14. Data reduction Strategies.........................................................16
15. Similarity and Dissimilarity.....................................................16
15.1. Similarity/Dissimilarity for Simple Attributes..................17
15.2. Euclidean Distance............................................................17
15.3. Minkowski Distance.........................................................18
15.4. Mahalanobis Distance.......................................................20
15.5. Common Properties of a Distance....................................22
15.6. Common Properties of a Similarity..................................22
15.7. Similarity Between Binary Vectors..................................23
15.8. Cosine Similarity..............................................................24
15.9. Extended Jaccard Coefficient (Tanimoto)........................24
15.10. Correlation......................................................................25
A. Bellaachia Page: 1
1. Objectives
• Incomplete:
o Lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data:
 e.g., occupation=“”
• Noisy:
o Containing errors or outliers
 e.g., Salary=“-10”
•
• Inconsistent:
o Containing discrepancies in codes or names
 e.g., Age=“42” Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
2. Why Is Data Dirty?
• Incomplete data comes from
o n/a data value when collected
o Different consideration between the time when the
data was collected and when it is analyzed.
o Human/hardware/software problems
• Noisy data comes from the process of data
o Collection
o Entry
A. Bellaachia Page: 2
o Transmission
• Inconsistent data comes from
o Different data sources
o Functional dependency violation
3. Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
 Quality decisions must be based on quality data
o e.g., duplicate or missing data may cause
incorrect or even misleading statistics.
o Data warehouse needs consistent integration
of quality data
• Data extraction, cleaning, and transformation comprise the
majority of the work of building a data warehouse. —Bill
Inmon
.
A. Bellaachia Page: 3
4. Major Tasks in Data Processing
• Data cleaning
 Fill in missing values, smooth noisy data, identify
or remove outliers, and resolve inconsistencies
• Data integration
 Integration of multiple databases, data cubes, or
files
• Data transformation
 Normalization and aggregation
• Data reduction
 Obtains reduced representation in volume but
produces the same or similar analytical results
• Data discretization
 Part of data reduction but with particular
importance, especially for numerical data
A. Bellaachia Page: 4
5. Forms of Data Processing:
A. Bellaachia Page: 5
6. Data Cleaning
• Importance
o “Data cleaning is one of the three biggest problems in
data warehousing”—Ralph Kimball
o “Data cleaning is the number one problem in data
warehousing”—DCI survey
• Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct inconsistent data
o Resolve redundancy caused by data integration
7. Missing Data
• Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
 Equipment malfunction
 Inconsistent with other recorded data and thus deleted
 Data not entered due to misunderstanding
 Certain data may not be considered important at the
time of entry
 Not register history or changes of the data
• Missing data may need to be inferred.
• How to Handle Missing Data?
A. Bellaachia Page: 6
o Ignore the tuple: usually done when class label is
missing (assuming the tasks in classification—not
effective when the percentage of missing values per
attribute varies considerably.
o Fill in the missing value manually: tedious +
infeasible?
o Fill in it automatically with
 A global constant: e.g., “unknown”, a new
class?!
 the attribute mean
 the attribute mean for all samples belonging
to the same class: smarter
 the most probable value: inference-based
such as Bayesian formula or decision tree
8. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitation
o inconsistency in naming convention
• Other data problems which requires data cleaning
o duplicate records
o incomplete data
o inconsistent data
A. Bellaachia Page: 7
• How to Handle Noisy Data?
o Binning method:
 first sort data and partition into (equi-depth)
bins
 then one can smooth by bin means, smooth
by bin median, smooth by bin boundaries,
etc.
o Clustering
 detect and remove outliers
o Combined computer and human inspection
 detect suspicious values and check by
human (e.g., deal with possible outliers)
o Regression
 smooth by fitting the data into regression
functions
9. Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
o Divides the range into N intervals of equal size:
uniform grid
o if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
o The most straightforward, but outliers may dominate
presentation
A. Bellaachia Page: 8
o Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
o Divides the range into N intervals, each containing
approximately same number of samples
o Good data scaling
o Managing categorical attributes can be tricky.
• Binning methods
o They smooth a sorted data value by consulting its
“neighborhood”, that is the values around it.
o The sorted values are partitioned into a number of
buckets or bins.
o Smoothing by bin means: Each value in the bin is
replaced by the mean value of the bin.
o Smoothing by bin medians: Each value in the bin is
replaced by the bin median.
o Smoothing by boundaries: The min and max values of
a bin are identified as the bin boundaries.
o Each bin value is replaced by the closest boundary
value.
• Example: Binning Methods for Data Smoothing
A. Bellaachia Page: 9
o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24,
25, 26, 28, 29, 34
o Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
o Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
o Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
A. Bellaachia Page: 10
10. Cluster Analysis
A. Bellaachia Page: 11
11. Regression
A. Bellaachia Page: 12
x
y
y = x + 1
X1
Y1
Y1’
12. Data Integration
• Data integration:
o Combines data from multiple sources into a coherent
store
• Schema integration
o Integrate metadata from different sources
o Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id º
B.cust-#
• Detecting and resolving data value conflicts
o For the same real world entity, attribute values from
different sources are different
o Possible reasons: different representations, different
scales, e.g., metric vs. British units
• Handling Redundancy in Data Integration
o Redundant data occur often when integration of
multiple databases
 The same attribute may have different names in
different databases
 One attribute may be a “derived” attribute in
another table, e.g., annual revenue
o Redundant data may be able to be detected by
correlational analysis
A. Bellaachia Page: 13
o Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and quality
A. Bellaachia Page: 14
13. Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
o min-max normalization:
o z-score normalization:
o normalization by decimal scaling
Where j is the smallest integer such that Max(|v’|)<1
• Attribute/feature construction
o New attributes constructed from the given ones
A. Bellaachia Page: 15
AminnewAminnewAmaxnew
AminAmax
Aminv
v _)__(' +−
−
−
=
devAstand
meanAv
v
_
'
−
=
j
v
v
10
'=
14. Data reduction Strategies
• A data warehouse may store terabytes of data
o Complex data analysis/mining may take a very long
time to run on the complete data set
• Data reduction
o Obtain a reduced representation of the data set that is
much smaller in volume but yet produce the same (or
almost the same) analytical results
• Data reduction strategies
o Data cube aggregation
o Dimensionality reduction—remove unimportant
attributes
•
o Data Compression
o Numerosity reduction—fit data into models
o Discretization and concept hierarchy generation
15. Similarity and Dissimilarity
• Similarity
o Numerical measure of how alike two data objects
are.
o Is higher when objects are more alike.
o Often falls in the range [0,1]
• Dissimilarity
o Numerical measure of how different are two data
objects
o Lower when objects are more alike
A. Bellaachia Page: 16
o Minimum dissimilarity is often 0
o Upper limit varies
• Proximity refers to a similarity or dissimilarity
15.1. Similarity/Dissimilarity for Simple Attributes
• p and q are the attribute values for two data objects.
15.2. Euclidean Distance
∑=
−=
n
k
kk qpdist
1
2
)(
Where n is the number of dimensions (attributes) and pk and qk are,
respectively, the kth attributes (components) or data objects p and
q.
A. Bellaachia Page: 17
Distance Matrix
15.3. Minkowski Distance
• Minkowski Distance is a generalization of Euclidean
Distance:
r
n
k
r
kk qpdist
1
1
)||( ∑
=
−=
A. Bellaachia Page: 18
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
0
1
2
3
0 1 2 3 4 5 6
p1
p2
p3 p4
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
Where r is a parameter, n is the number of dimensions
(attributes) and pk and qk are, respectively, the kth attributes
(components) or data objects p and q.
• Minkowski Distance: Examples
o r = 1. City block (Manhattan, taxicab, L1 norm)
distance.
 A common example of this is the Hamming
distance, which is just the number of bits that are
different between two binary vectors
o r = 2. Euclidean distance
o r → ∞. “supremum” (Lmax norm, L∞ norm) distance
 This is the maximum difference between any
component of the vectors
o Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
A. Bellaachia Page: 19
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
Distance Matrix
15.4. Mahalanobis Distance
T
qpqpqpsmahalanobi )()(),( 1
−∑−= −
Where
Σ is the covariance matrix of the input data X
If X is a column vector with n scalar random variable components,
and μk is the expected value of the kth element of X, i.e., μk =
E(Xk), then the covariance matrix is defined as:
A. Bellaachia Page: 20
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
∑ = E[(X-E[X]) (X-E[X])T] =












−−−−
−−−−
−−−−−−
=
=∑
)](X)E[(X......)](X)E[(X
............
...)](X)E[(X)](X)E[(X
)](X)E[(X...)](X)E[(X)](X)E[(X
]E[X])-(XE[X])-E[(X
nn11n
22221122
n1122111111
T
nnn
n
µµµµ
µµµµ
µµµµµµ
The (i,j) element is the covariance between Xi and Xj.
For red points, the Euclidean distance is 14.7, Mahalanobis
distance is 6.
• If the covariance matrix is the identity matrix, the
Mahalanobis distance reduces to the Euclidean distance. If
A. Bellaachia Page: 21
the covariance matrix is diagonal, then the resulting distance
measure is called the normalized Euclidean distance:
15.5. Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well
known properties.
1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data
objects), p and q.
• A distance that satisfies these properties is a metric
15.6. Common Properties of a Similarity
• Similarities, also have some well known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects),
p and q.
A. Bellaachia Page: 22
15.7. Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only binary
attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero
attributes values
= (M11) / (M01 + M10 + M11)
• SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) /
(2+1+0+7) = 0.7
A. Bellaachia Page: 23
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
15.8. Cosine Similarity
• If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
Where • indicates vector dot product and || d || is the length
of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 +
0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 =
(42) 0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 =
(6) 0.5 = 2.245
cos( d1, d2 ) = .3150
15.9. Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes
o Reduces to Jaccard for binary attributes
qpqp
qp
qpT
•−+
•
= 22
),(
A. Bellaachia Page: 24
15.10. Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q,
and then take their dot product
)(/))(( pstdpmeanpp kk −=′
)(/))(( qstdqmeanqq kk −=′
qpqpncorrelatio ′•′=),(
A. Bellaachia Page: 25

More Related Content

What's hot

L4 working with tables and data
L4 working with tables and dataL4 working with tables and data
L4 working with tables and dataBryan Corpuz
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsnehabsairam
 
Ecdl v5 module 5 print
Ecdl v5 module 5 printEcdl v5 module 5 print
Ecdl v5 module 5 printMichael Lew
 
Intro To DataBase
Intro To DataBaseIntro To DataBase
Intro To DataBaseDevMix
 
ECDL module 5: using databases [To be continued]
ECDL module 5: using databases [To be continued] ECDL module 5: using databases [To be continued]
ECDL module 5: using databases [To be continued] Hassan Ayad
 
Database Concepts and Terminologies
Database Concepts and TerminologiesDatabase Concepts and Terminologies
Database Concepts and TerminologiesOusman Faal
 
Microsoft Access
Microsoft AccessMicrosoft Access
Microsoft Accesstormeyj
 
SAP BW - Creation of hierarchies (time dependant hierachy structures)
SAP BW - Creation of hierarchies (time dependant hierachy structures)SAP BW - Creation of hierarchies (time dependant hierachy structures)
SAP BW - Creation of hierarchies (time dependant hierachy structures)Yasmin Ashraf
 
Chapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsChapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsnehabsairam
 
Databases and its representation
Databases and its representationDatabases and its representation
Databases and its representationRuhull
 
ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1Witoon Thammatuch-aree
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to databaseSuleman Memon
 

What's hot (20)

L4 working with tables and data
L4 working with tables and dataL4 working with tables and data
L4 working with tables and data
 
Chapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortalsChapter 8(designing of documnt databases)no sql for mere mortals
Chapter 8(designing of documnt databases)no sql for mere mortals
 
Ecdl v5 module 5 print
Ecdl v5 module 5 printEcdl v5 module 5 print
Ecdl v5 module 5 print
 
Intro To DataBase
Intro To DataBaseIntro To DataBase
Intro To DataBase
 
ECDL module 5: using databases [To be continued]
ECDL module 5: using databases [To be continued] ECDL module 5: using databases [To be continued]
ECDL module 5: using databases [To be continued]
 
Database Concepts and Terminologies
Database Concepts and TerminologiesDatabase Concepts and Terminologies
Database Concepts and Terminologies
 
Microsoft Access
Microsoft AccessMicrosoft Access
Microsoft Access
 
Access 2010
Access 2010Access 2010
Access 2010
 
Nota ict form 5
Nota ict form 5Nota ict form 5
Nota ict form 5
 
SAP BW - Creation of hierarchies (time dependant hierachy structures)
SAP BW - Creation of hierarchies (time dependant hierachy structures)SAP BW - Creation of hierarchies (time dependant hierachy structures)
SAP BW - Creation of hierarchies (time dependant hierachy structures)
 
Chapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortalsChapter 6(introduction to documnet databse) no sql for mere mortals
Chapter 6(introduction to documnet databse) no sql for mere mortals
 
Databases and its representation
Databases and its representationDatabases and its representation
Databases and its representation
 
ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1
 
B tree
B treeB tree
B tree
 
Data Dictionary
Data DictionaryData Dictionary
Data Dictionary
 
Introduction to database
Introduction to databaseIntroduction to database
Introduction to database
 
Introduction - Database (MS Access)
Introduction - Database (MS Access)Introduction - Database (MS Access)
Introduction - Database (MS Access)
 
MS Access Training
MS Access TrainingMS Access Training
MS Access Training
 
Olap
OlapOlap
Olap
 
001.general
001.general001.general
001.general
 

Viewers also liked

Downloadbrandequity doc of Arvinoor Siregar SH MH
Downloadbrandequity doc of Arvinoor Siregar SH MHDownloadbrandequity doc of Arvinoor Siregar SH MH
Downloadbrandequity doc of Arvinoor Siregar SH MHAndri Goodwood
 
The Importance of Images
The Importance of ImagesThe Importance of Images
The Importance of Imagesclaudiabove
 
YZM 2116 - Bölüm 3 - Listeler
YZM 2116 - Bölüm 3 - Listeler YZM 2116 - Bölüm 3 - Listeler
YZM 2116 - Bölüm 3 - Listeler Deniz KILINÇ
 
Yzm 2116 - Bölüm 2 (Algoritma Analizi)
Yzm 2116  - Bölüm 2 (Algoritma Analizi)Yzm 2116  - Bölüm 2 (Algoritma Analizi)
Yzm 2116 - Bölüm 2 (Algoritma Analizi)Deniz KILINÇ
 
Importance Performance Analysis for 7-eleven
Importance Performance Analysis for 7-elevenImportance Performance Analysis for 7-eleven
Importance Performance Analysis for 7-elevenLiena Wied
 
Building customer trust to drive your KPIs
Building customer trust to drive your KPIsBuilding customer trust to drive your KPIs
Building customer trust to drive your KPIsmext Consulting
 
similarity measure
similarity measure similarity measure
similarity measure ZHAO Sam
 
Brand Personality
Brand PersonalityBrand Personality
Brand PersonalityKing Julian
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataSalah Amean
 
Maggi noodles : ECONOMIC SURVEY
Maggi noodles : ECONOMIC SURVEYMaggi noodles : ECONOMIC SURVEY
Maggi noodles : ECONOMIC SURVEYMalvika Bansal
 
Market Research on Nestle Maggi
Market Research on Nestle MaggiMarket Research on Nestle Maggi
Market Research on Nestle MaggiSubhradeep Mitra
 
Brand personality
Brand personalityBrand personality
Brand personalityEkta Gupta
 
Ppt on brand personality
Ppt on brand personalityPpt on brand personality
Ppt on brand personalityyanbabu
 
Brand Image
Brand ImageBrand Image
Brand Imageair
 
Brand identity, brand personality & brand image
Brand identity, brand personality & brand imageBrand identity, brand personality & brand image
Brand identity, brand personality & brand imageSunny Bose
 
Brand Personality
Brand PersonalityBrand Personality
Brand PersonalitySj -
 

Viewers also liked (19)

Downloadbrandequity doc of Arvinoor Siregar SH MH
Downloadbrandequity doc of Arvinoor Siregar SH MHDownloadbrandequity doc of Arvinoor Siregar SH MH
Downloadbrandequity doc of Arvinoor Siregar SH MH
 
The Importance of Images
The Importance of ImagesThe Importance of Images
The Importance of Images
 
Artist With Hope
Artist With HopeArtist With Hope
Artist With Hope
 
YZM 2116 - Bölüm 3 - Listeler
YZM 2116 - Bölüm 3 - Listeler YZM 2116 - Bölüm 3 - Listeler
YZM 2116 - Bölüm 3 - Listeler
 
Importance of Brand image
Importance of Brand image Importance of Brand image
Importance of Brand image
 
Yzm 2116 - Bölüm 2 (Algoritma Analizi)
Yzm 2116  - Bölüm 2 (Algoritma Analizi)Yzm 2116  - Bölüm 2 (Algoritma Analizi)
Yzm 2116 - Bölüm 2 (Algoritma Analizi)
 
Importance Performance Analysis for 7-eleven
Importance Performance Analysis for 7-elevenImportance Performance Analysis for 7-eleven
Importance Performance Analysis for 7-eleven
 
Building customer trust to drive your KPIs
Building customer trust to drive your KPIsBuilding customer trust to drive your KPIs
Building customer trust to drive your KPIs
 
similarity measure
similarity measure similarity measure
similarity measure
 
Brand Personality
Brand PersonalityBrand Personality
Brand Personality
 
Data mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, dataData mining :Concepts and Techniques Chapter 2, data
Data mining :Concepts and Techniques Chapter 2, data
 
Maggi noodles : ECONOMIC SURVEY
Maggi noodles : ECONOMIC SURVEYMaggi noodles : ECONOMIC SURVEY
Maggi noodles : ECONOMIC SURVEY
 
Market Research on Nestle Maggi
Market Research on Nestle MaggiMarket Research on Nestle Maggi
Market Research on Nestle Maggi
 
Brand personality
Brand personalityBrand personality
Brand personality
 
Ppt on brand personality
Ppt on brand personalityPpt on brand personality
Ppt on brand personality
 
Brand Image
Brand ImageBrand Image
Brand Image
 
Brand identity, brand personality & brand image
Brand identity, brand personality & brand imageBrand identity, brand personality & brand image
Brand identity, brand personality & brand image
 
Brand Personality
Brand PersonalityBrand Personality
Brand Personality
 
Brand Personality
Brand PersonalityBrand Personality
Brand Personality
 

Similar to Guide to Data Processing Techniques and Concepts

Similar to Guide to Data Processing Techniques and Concepts (20)

Data pre processing
Data pre processingData pre processing
Data pre processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
data processing.pdf
data processing.pdfdata processing.pdf
data processing.pdf
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data processing
Data processingData processing
Data processing
 
Assignmentdatamining
AssignmentdataminingAssignmentdatamining
Assignmentdatamining
 
Preprocessing
PreprocessingPreprocessing
Preprocessing
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Preprocessing.ppt
Preprocessing.pptPreprocessing.ppt
Preprocessing.ppt
 
Datapreprocess
DatapreprocessDatapreprocess
Datapreprocess
 
Data Mining
Data MiningData Mining
Data Mining
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preprocessing ng
Data preprocessing   ngData preprocessing   ng
Data preprocessing ng
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Datapreprocessingppt
DatapreprocessingpptDatapreprocessingppt
Datapreprocessingppt
 
Pre-Processing and Data Preparation
Pre-Processing and Data PreparationPre-Processing and Data Preparation
Pre-Processing and Data Preparation
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 

Recently uploaded

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Guide to Data Processing Techniques and Concepts

  • 1. Data Processing 1. Objectives....................................................................................2 2. Why Is Data Dirty?......................................................................2 3. Why Is Data Preprocessing Important?.......................................3 4. Major Tasks in Data Processing..................................................4 5. Forms of Data Processing:...........................................................5 6. Data Cleaning...............................................................................6 7. Missing Data................................................................................6 8. Noisy Data...................................................................................7 9. Simple Discretization Methods: Binning.....................................8 10. Cluster Analysis.......................................................................11 11. Regression................................................................................12 12. Data Integration.......................................................................13 13. Data Transformation................................................................15 14. Data reduction Strategies.........................................................16 15. Similarity and Dissimilarity.....................................................16 15.1. Similarity/Dissimilarity for Simple Attributes..................17 15.2. Euclidean Distance............................................................17 15.3. Minkowski Distance.........................................................18 15.4. Mahalanobis Distance.......................................................20 15.5. Common Properties of a Distance....................................22 15.6. Common Properties of a Similarity..................................22 15.7. Similarity Between Binary Vectors..................................23 15.8. Cosine Similarity..............................................................24 15.9. Extended Jaccard Coefficient (Tanimoto)........................24 15.10. Correlation......................................................................25 A. Bellaachia Page: 1
  • 2. 1. Objectives • Incomplete: o Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data:  e.g., occupation=“” • Noisy: o Containing errors or outliers  e.g., Salary=“-10” • • Inconsistent: o Containing discrepancies in codes or names  e.g., Age=“42” Birthday=“03/07/1997”  e.g., Was rating “1,2,3”, now rating “A, B, C”  e.g., discrepancy between duplicate records 2. Why Is Data Dirty? • Incomplete data comes from o n/a data value when collected o Different consideration between the time when the data was collected and when it is analyzed. o Human/hardware/software problems • Noisy data comes from the process of data o Collection o Entry A. Bellaachia Page: 2
  • 3. o Transmission • Inconsistent data comes from o Different data sources o Functional dependency violation 3. Why Is Data Preprocessing Important? • No quality data, no quality mining results!  Quality decisions must be based on quality data o e.g., duplicate or missing data may cause incorrect or even misleading statistics. o Data warehouse needs consistent integration of quality data • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse. —Bill Inmon . A. Bellaachia Page: 3
  • 4. 4. Major Tasks in Data Processing • Data cleaning  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration  Integration of multiple databases, data cubes, or files • Data transformation  Normalization and aggregation • Data reduction  Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization  Part of data reduction but with particular importance, especially for numerical data A. Bellaachia Page: 4
  • 5. 5. Forms of Data Processing: A. Bellaachia Page: 5
  • 6. 6. Data Cleaning • Importance o “Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball o “Data cleaning is the number one problem in data warehousing”—DCI survey • Data cleaning tasks o Fill in missing values o Identify outliers and smooth out noisy data o Correct inconsistent data o Resolve redundancy caused by data integration 7. Missing Data • Data is not always available  E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to  Equipment malfunction  Inconsistent with other recorded data and thus deleted  Data not entered due to misunderstanding  Certain data may not be considered important at the time of entry  Not register history or changes of the data • Missing data may need to be inferred. • How to Handle Missing Data? A. Bellaachia Page: 6
  • 7. o Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. o Fill in the missing value manually: tedious + infeasible? o Fill in it automatically with  A global constant: e.g., “unknown”, a new class?!  the attribute mean  the attribute mean for all samples belonging to the same class: smarter  the most probable value: inference-based such as Bayesian formula or decision tree 8. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may due to o faulty data collection instruments o data entry problems o data transmission problems o technology limitation o inconsistency in naming convention • Other data problems which requires data cleaning o duplicate records o incomplete data o inconsistent data A. Bellaachia Page: 7
  • 8. • How to Handle Noisy Data? o Binning method:  first sort data and partition into (equi-depth) bins  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. o Clustering  detect and remove outliers o Combined computer and human inspection  detect suspicious values and check by human (e.g., deal with possible outliers) o Regression  smooth by fitting the data into regression functions 9. Simple Discretization Methods: Binning • Equal-width (distance) partitioning: o Divides the range into N intervals of equal size: uniform grid o if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N. o The most straightforward, but outliers may dominate presentation A. Bellaachia Page: 8
  • 9. o Skewed data is not handled well. • Equal-depth (frequency) partitioning: o Divides the range into N intervals, each containing approximately same number of samples o Good data scaling o Managing categorical attributes can be tricky. • Binning methods o They smooth a sorted data value by consulting its “neighborhood”, that is the values around it. o The sorted values are partitioned into a number of buckets or bins. o Smoothing by bin means: Each value in the bin is replaced by the mean value of the bin. o Smoothing by bin medians: Each value in the bin is replaced by the bin median. o Smoothing by boundaries: The min and max values of a bin are identified as the bin boundaries. o Each bin value is replaced by the closest boundary value. • Example: Binning Methods for Data Smoothing A. Bellaachia Page: 9
  • 10. o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 o Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 o Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 o Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 A. Bellaachia Page: 10
  • 11. 10. Cluster Analysis A. Bellaachia Page: 11
  • 12. 11. Regression A. Bellaachia Page: 12 x y y = x + 1 X1 Y1 Y1’
  • 13. 12. Data Integration • Data integration: o Combines data from multiple sources into a coherent store • Schema integration o Integrate metadata from different sources o Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id º B.cust-# • Detecting and resolving data value conflicts o For the same real world entity, attribute values from different sources are different o Possible reasons: different representations, different scales, e.g., metric vs. British units • Handling Redundancy in Data Integration o Redundant data occur often when integration of multiple databases  The same attribute may have different names in different databases  One attribute may be a “derived” attribute in another table, e.g., annual revenue o Redundant data may be able to be detected by correlational analysis A. Bellaachia Page: 13
  • 14. o Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality A. Bellaachia Page: 14
  • 15. 13. Data Transformation • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Generalization: concept hierarchy climbing • Normalization: scaled to fall within a small, specified range o min-max normalization: o z-score normalization: o normalization by decimal scaling Where j is the smallest integer such that Max(|v’|)<1 • Attribute/feature construction o New attributes constructed from the given ones A. Bellaachia Page: 15 AminnewAminnewAmaxnew AminAmax Aminv v _)__(' +− − − = devAstand meanAv v _ ' − = j v v 10 '=
  • 16. 14. Data reduction Strategies • A data warehouse may store terabytes of data o Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction o Obtain a reduced representation of the data set that is much smaller in volume but yet produce the same (or almost the same) analytical results • Data reduction strategies o Data cube aggregation o Dimensionality reduction—remove unimportant attributes • o Data Compression o Numerosity reduction—fit data into models o Discretization and concept hierarchy generation 15. Similarity and Dissimilarity • Similarity o Numerical measure of how alike two data objects are. o Is higher when objects are more alike. o Often falls in the range [0,1] • Dissimilarity o Numerical measure of how different are two data objects o Lower when objects are more alike A. Bellaachia Page: 16
  • 17. o Minimum dissimilarity is often 0 o Upper limit varies • Proximity refers to a similarity or dissimilarity 15.1. Similarity/Dissimilarity for Simple Attributes • p and q are the attribute values for two data objects. 15.2. Euclidean Distance ∑= −= n k kk qpdist 1 2 )( Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q. A. Bellaachia Page: 17
  • 18. Distance Matrix 15.3. Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance: r n k r kk qpdist 1 1 )||( ∑ = −= A. Bellaachia Page: 18 point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 0 1 2 3 0 1 2 3 4 5 6 p1 p2 p3 p4 p1 p2 p3 p4 p1 0 2.828 3.162 5.099 p2 2.828 0 1.414 3.162 p3 3.162 1.414 0 2 p4 5.099 3.162 2 0
  • 19. Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q. • Minkowski Distance: Examples o r = 1. City block (Manhattan, taxicab, L1 norm) distance.  A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors o r = 2. Euclidean distance o r → ∞. “supremum” (Lmax norm, L∞ norm) distance  This is the maximum difference between any component of the vectors o Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions. point x y p1 0 2 p2 2 0 p3 3 1 p4 5 1 A. Bellaachia Page: 19 L1 p1 p2 p3 p4 p1 0 4 4 6 p2 4 0 2 4 p3 4 2 0 2 p4 6 4 2 0
  • 20. Distance Matrix 15.4. Mahalanobis Distance T qpqpqpsmahalanobi )()(),( 1 −∑−= − Where Σ is the covariance matrix of the input data X If X is a column vector with n scalar random variable components, and μk is the expected value of the kth element of X, i.e., μk = E(Xk), then the covariance matrix is defined as: A. Bellaachia Page: 20 L2 p1 p2 p3 p4 p1 0 2.828 3.162 5.099 p2 2.828 0 1.414 3.162 p3 3.162 1.414 0 2 p4 5.099 3.162 2 0 L∞ p1 p2 p3 p4 p1 0 2 3 5 p2 2 0 1 3 p3 3 1 0 2 p4 5 3 2 0
  • 21. ∑ = E[(X-E[X]) (X-E[X])T] =             −−−− −−−− −−−−−− = =∑ )](X)E[(X......)](X)E[(X ............ ...)](X)E[(X)](X)E[(X )](X)E[(X...)](X)E[(X)](X)E[(X ]E[X])-(XE[X])-E[(X nn11n 22221122 n1122111111 T nnn n µµµµ µµµµ µµµµµµ The (i,j) element is the covariance between Xi and Xj. For red points, the Euclidean distance is 14.7, Mahalanobis distance is 6. • If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If A. Bellaachia Page: 21
  • 22. the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance: 15.5. Common Properties of a Distance • Distances, such as the Euclidean distance, have some well known properties. 1. d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness) 2. d(p, q) = d(q, p) for all p and q. (Symmetry) 3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. • A distance that satisfies these properties is a metric 15.6. Common Properties of a Similarity • Similarities, also have some well known properties. 1. s(p, q) = 1 (or maximum similarity) only if p = q. 2. s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q. A. Bellaachia Page: 22
  • 23. 15.7. Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes • Compute similarities using the following quantities M01 = the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00 = the number of attributes where p was 0 and q was 0 M11 = the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attributes values = (M11) / (M01 + M10 + M11) • SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01 = 2 (the number of attributes where p was 0 and q was 1) M10 = 1 (the number of attributes where p was 1 and q was 0) M00 = 7 (the number of attributes where p was 0 and q was 0) M11 = 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 A. Bellaachia Page: 23
  • 24. J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0 15.8. Cosine Similarity • If d1 and d2 are two document vectors, then cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| , Where • indicates vector dot product and || d || is the length of vector d. • Example: d1 = 3 2 0 5 0 0 0 2 0 0 d2 = 1 0 0 0 0 0 0 1 0 2 d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)0.5 = (42) 0.5 = 6.481 ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = 2.245 cos( d1, d2 ) = .3150 15.9. Extended Jaccard Coefficient (Tanimoto) • Variation of Jaccard for continuous or count attributes o Reduces to Jaccard for binary attributes qpqp qp qpT •−+ • = 22 ),( A. Bellaachia Page: 24
  • 25. 15.10. Correlation • Correlation measures the linear relationship between objects • To compute correlation, we standardize data objects, p and q, and then take their dot product )(/))(( pstdpmeanpp kk −=′ )(/))(( qstdqmeanqq kk −=′ qpqpncorrelatio ′•′=),( A. Bellaachia Page: 25