Data science
UNIT-I
Data objects and Attribute types
Similarity and Dissimilarity
Types of Data Sets
• Record
• Relational records
• Data matrix, e.g., numerical matrix, crosstabs
• Document data: text documents represented as term-frequency vectors
• Transaction data
• Graph and network
• World Wide Web
• Social or information networks
• Molecular Structures
• Ordered
• Video data: sequence of images
• Temporal data: time-series
• Sequential Data: transaction sequences
• Genetic sequence data
• Spatial, image and multimedia:
• Spatial data: maps
• Image data
• Video data
Document-term matrix (term frequencies)
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Important characteristics of Structured data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
Data:
1. Structured
2. Unstructured
3. Semi-structured
4. Quasi-structured
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimension, feature, variable): a data field, representing a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Discrete and continuous
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Types of attributes
Nominal Attributes: related to names
Similarity and Dissimilarity
Proximity: refers to either similarity or dissimilarity
Similarity and Dissimilarity Applications:
• Web search
• Computer vision
• Image processing
• Natural language processing
• Clustering and outlier detection
Similarity / Proximity Measures
• Nominal attributes
• Binary attributes
• Ordinal attributes
• Numeric attributes: 1. Interval-scaled 2. Ratio-scaled
• Mixed attributes
Data Matrix and Dissimilarity Matrix
Data Matrix
Proximity Measure for Nominal Attributes
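A minimal sketch of a nominal proximity measure, assuming the usual mismatch ratio d(i, j) = (p - m) / p, where p is the number of nominal attributes and m is the number of attributes on which the two objects match. The function name and the sample objects are hypothetical and only illustrate the idea.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Mismatch ratio d(i, j) = (p - m) / p for two tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                      # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


# Hypothetical objects described by two nominal attributes (colour, marital status).
objects = [("red", "single"), ("blue", "single"), ("red", "married")]

# Full dissimilarity matrix; similarity can be taken as 1 - d(i, j).
matrix = [[nominal_dissimilarity(a, b) for b in objects] for a in objects]
for row in matrix:
    print(row)
```

The diagonal is always 0 because every object matches itself on all attributes.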
Example: Nominal attribute
Example: Finding the similarity matrix for a nominal attribute
Example: Find the dissimilarity of the given data
Proximity Measure for Binary Attributes
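• A contingency table for binary data, where q, r, s, and t count the attributes on which objects i and j take each combination of values, and p = q + r + s + t:
              Object j
              1      0      sum
Object i  1   q      r      q + r
          0   s      t      s + t
        sum   q + s  r + t  p
• Distance measure for symmetric binary variables:
$d(i, j) = \frac{r + s}{q + r + s + t}$
• Distance measure for asymmetric binary variables:
$d(i, j) = \frac{r + s}{q + r + s}$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
$sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$
• Note: the Jaccard coefficient is the same as "coherence":
$coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q}$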
Proximity Measure for Binary Attributes: Example
Proximity Measure for Binary Attributes: Example: Find Dissimilarity matrix
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0
• Using only the asymmetric attributes, with $d(i, j) = \frac{r + s}{q + r + s}$:
$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$
$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$
$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
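The patient example can be checked with a short sketch. It assumes the standard contingency counts for two binary vectors (q: both 1, r: 1 then 0, s: 0 then 1, t: both 0) and encodes Y and P as 1 and N as 0 over the six asymmetric attributes (Fever, Cough, Test-1 to Test-4). The function names are hypothetical, and returning 0.0 when two objects share no positive attribute is an assumption made only to avoid division by zero.

```python
def binary_counts(x, y):
    """Return the contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s + t); both outcomes carry equal weight."""
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches are ignored."""
    q, r, s, _ = binary_counts(x, y)
    total = q + r + s
    return (r + s) / total if total else 0.0  # assumption: no positives -> 0.0


def jaccard_similarity(x, y):
    """sim(i, j) = q / (q + r + s) = 1 - asymmetric dissimilarity."""
    return 1.0 - asymmetric_dissimilarity(x, y)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim": [1, 1, 0, 0, 0, 0],
}

for a, b in [("jack", "mary"), ("jack", "jim"), ("jim", "mary")]:
    d = asymmetric_dissimilarity(patients[a], patients[b])
    print(f"d({a}, {b}) = {d:.2f}")
```

Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.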
Proximity Measure for Numeric Attributes: Standardizing Numeric Data
• Z-score:
$z = \frac{x - \mu}{\sigma}$
• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, positive when above
• An alternative way: calculate the mean absolute deviation
$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$
where
$m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
• standardized measure (z-score):
$z_{if} = \frac{x_{if} - m_f}{s_f}$
• Using mean absolute deviation is more robust than using standard deviation, because deviations are not squared and outliers are exaggerated less
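Both standardizations can be sketched in a few lines. The first function divides by the population standard deviation, the second by the mean absolute deviation of the attribute. The function names and the sample values are hypothetical; neither function guards against a constant attribute, whose spread would be zero.

```python
import statistics


def z_scores_std(values):
    """Standardize with the population mean and population standard deviation."""
    mu = sum(values) / len(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


# Hypothetical measurements of one numeric attribute f.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print([round(z, 2) for z in z_scores_std(values)])
print([round(z, 2) for z in z_scores_mad(values)])
```

For these values the mean is 5, the standard deviation is 2, and the mean absolute deviation is 1.5, so the largest value 9 gets z = 2.0 under the first scheme and about 2.67 under the second: the outlier stays more visible when the spread is not inflated by squaring.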


Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• E.g., the Hamming distance: the number of bits that are different between two binary vectors
• h = 2: Euclidean (L2 norm) distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• h → ∞: "supremum" (Lmax norm, L∞ norm) distance
• This is the maximum difference between any component (attribute) of the vectors
$d(i, j) = \max_{f} |x_{if} - x_{jf}|$
Example: Minkowski Distance
point attribute 1 attribute 2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Dissimilarity Matrices
Manhattan (L1)
L1 x1 x2 x3 x4
x1 0
x2 5 0
x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum (L∞)
L∞ x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
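The three matrices can be reproduced with one function, sketched here for the four points x1 = (1, 2), x2 = (3, 5), x3 = (2, 0), x4 = (4, 5). The order h = 1 gives Manhattan distance, h = 2 gives Euclidean distance, and h = infinity is handled as the maximum coordinate difference. The function names are hypothetical, and values are rounded to two decimals to match the tables.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def dissimilarity_matrix(points, h):
    """Lower-triangular dissimilarity matrix, rounded to two decimals."""
    return [
        [round(minkowski(points[i], points[j], h), 2) for j in range(i + 1)]
        for i in range(len(points))
    ]


points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1, x2, x3, x4

manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)
supremum = dissimilarity_matrix(points, math.inf)

for name, matrix in [("L1", manhattan), ("L2", euclidean), ("Lmax", supremum)]:
    print(name)
    for row in matrix:
        print(row)
```

Only the lower triangle is stored because every Minkowski distance is symmetric and the diagonal is zero.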
Example: Numeric attribute: find the distances
Example: Numeric attribute
Example: Ratio-scaled attributes: find the dissimilarity matrix
Example: Find the dissimilarity of the given numeric data
Proximity measure for Ordinal attributes
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace xif by its rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• compute the dissimilarity using methods for interval-scaled variables
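The rank mapping can be sketched as follows: each ordinal value is replaced by its rank r in {1, ..., M}, scaled to z = (r - 1) / (M - 1), and the scaled values are then compared with an ordinary numeric distance (absolute difference here). The level names, the sample column, and the function name are hypothetical, and the attribute is assumed to have at least two levels.

```python
def ordinal_to_unit_interval(values, ordered_levels):
    """Map ordinal values onto [0, 1] via z = (rank - 1) / (M - 1)."""
    rank = {level: k for k, level in enumerate(ordered_levels, start=1)}
    m_f = len(ordered_levels)  # number of ordered states, assumed >= 2
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal attribute with three ordered states.
levels = ["fair", "good", "excellent"]
grades = ["excellent", "fair", "good", "excellent"]

z = ordinal_to_unit_interval(grades, levels)

# Dissimilarity matrix on the scaled values, treated as interval-scaled.
dissimilarity = [[abs(a - b) for b in z] for a in z]

print(z)
for row in dissimilarity:
    print(row)
```

Objects with the same ordinal value get dissimilarity 0, and the two extreme states get dissimilarity 1.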
Example: Ordinal Attributes
Example: Find the dissimilarity matrix of the given ordinal attribute
Proximity measures of Mixed Attributes
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
where the indicator $\delta_{ij}^{(f)}$ is 0 if xif or xjf is missing, or if xif = xjf = 0 and f is asymmetric binary, and 1 otherwise
• f is binary or nominal: $d_{ij}^{(f)} = 0$ if xif = xjf, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat zif as interval-scaled
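A minimal sketch of the weighted combination for mixed attributes. Each attribute f contributes a per-attribute dissimilarity in [0, 1], and an indicator decides whether f is counted at all. The sketch makes several assumptions: missing values are encoded as None, the numeric "normalized distance" is taken to be |x_if - x_jf| divided by the attribute's range (max - min), ordinal values are already mapped to [0, 1], and all names and sample values are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average the per-attribute dissimilarities over the attributes that count."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue  # indicator = 0: a value is missing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # indicator = 0: 0-0 match on an asymmetric attribute
        if kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        elif kind in ("numeric", "ordinal"):
            low, high = ranges[f]  # assumed range of attribute f
            d = abs(a - b) / (high - low)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d
        denominator += 1.0
    return numerator / denominator if denominator else 0.0


# Hypothetical objects: (nominal code, ordinal grade already scaled, numeric score).
kinds = ["nominal", "ordinal", "numeric"]
ranges = {1: (0.0, 1.0), 2: (22, 64)}
o1 = ("A", 1.0, 45)
o2 = ("B", 0.0, 22)
o3 = ("C", 0.5, 64)

print(round(mixed_dissimilarity(o2, o1, kinds, ranges), 2))
print(round(mixed_dissimilarity(o3, o1, kinds, ranges), 2))
```

Because every per-attribute term lies in [0, 1], the combined value also lies in [0, 1] regardless of how many attribute types are mixed.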
Example: Find Dissimilarity of Mixed Attributes
Example: Mixed Attributes
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
• Other vector objects: gene features in micro-arrays, …
• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates vector dot product, ||d||: the length of vector d
Example: Find similarity of documents
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates vector dot product, ||d||: the length of vector d
• Ex: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
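The same calculation as a short sketch, using the two term-frequency vectors from the example. The function name is hypothetical, and the sketch does not handle an all-zero vector, whose length would make the denominator zero.

```python
import math


def cosine_similarity(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||) for equal-length vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

A value close to 1 means the two documents use their terms in nearly the same proportions, independent of document length.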

Unit-I Objects,Attributes,Similarity&Dissimilarity.ppt
