This document discusses distance and similarity measures that are commonly used for data mining and analytics tasks involving the comparison of objects. It defines similarity and dissimilarity, and notes that many measures involve representing objects as feature vectors and then computing distances between the vectors. Several specific distance measures for numeric and categorical data are described, including Euclidean distance, Manhattan distance, and cosine similarity. The document also discusses techniques like k-nearest neighbors classification that rely on computing distances between objects.
2. Distance or Similarity Measures
Many data mining and analytics tasks involve comparing objects and determining their similarities (or dissimilarities):
Clustering
Nearest-neighbor search, classification, and prediction
Characterization and discrimination
Automatic categorization
Correlation analysis
Many of today's real-world applications rely on the computation of similarities or distances among objects:
Personalization
Recommender systems
Document categorization
Information retrieval
Target marketing
3. Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
4. Distance or Similarity Measures
Measuring Distance
In order to group similar items, we need a way to measure the distance between objects (e.g., records)
Often requires the representation of objects as "feature vectors"
An Employee DB:
ID   Gender  Age  Salary
1    F       27   19,000
2    M       51   64,000
3    M       52   100,000
4    F       33   55,000
5    M       45   45,000

Term Frequencies for Documents:
      T1  T2  T3  T4  T5  T6
Doc1  0   4   0   0   0   2
Doc2  3   1   4   3   1   2
Doc3  3   0   0   0   3   0
Doc4  0   1   0   3   0   0
Doc5  2   2   2   3   1   4
Feature vector corresponding to Employee 2: <M, 51, 64000.0>
Feature vector corresponding to Document 4: <0, 1, 0, 3, 0, 0>
5. Distance or Similarity Measures
Properties of Distance Measures:
for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
for any object A, dist(A, A) = 0
dist(A, C) ≤ dist(A, B) + dist(B, C)
Representation of objects as vectors:
Each data object (item) can be viewed as an n-dimensional vector, where the dimensions are the attributes (features) in the data
Example (employee DB): Emp. ID 2 = <M, 51, 64000>
Example (Documents): DOC2 = <3, 1, 4, 3, 1, 2>
The vector representation allows us to compute distance or similarity between pairs of items using standard vector operations, e.g.,
Cosine of the angle between vectors
Manhattan distance
Euclidean distance
Hamming Distance
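Since cosine, Manhattan, and Euclidean distances get worked examples later in the deck, here is a minimal Python sketch of the remaining measure, Hamming distance, applied to the Doc2 and Doc4 term-frequency vectors above (the function name is ours):

```python
def hamming(x, y):
    """Number of positions at which two equal-length vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)

# Term-frequency vectors for Doc2 and Doc4 from the earlier table
doc2 = [3, 1, 4, 3, 1, 2]
doc4 = [0, 1, 0, 3, 0, 0]
print(hamming(doc2, doc4))  # -> 4 (they agree only on terms T2 and T4)
```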
6. Data Matrix and Distance Matrix
Data matrix
Conceptual representation of a table
Cols = features; rows = data objects
n data points with p dimensions
Each row in the matrix is the vector representation of a data object
Distance (or Similarity) Matrix
n data points, but indicates only the pairwise distance (or similarity)
A triangular matrix
Symmetric
Data matrix (n objects × p attributes):

[ x11  …  x1f  …  x1p ]
[  ⋮       ⋮       ⋮  ]
[ xi1  …  xif  …  xip ]
[  ⋮       ⋮       ⋮  ]
[ xn1  …  xnf  …  xnp ]

Distance matrix (lower triangular; symmetric entries omitted):

[ 0                        ]
[ d(2,1)  0                ]
[ d(3,1)  d(3,2)  0        ]
[   ⋮       ⋮      ⋮       ]
[ d(n,1)  d(n,2)  …     0  ]
7. Proximity Measure for Nominal Attributes
If object attributes are all nominal (categorical), then proximity measures such as the following are used to compare objects
Can take 2 or more states, e.g., red, yellow, blue, green
(generalization of a binary attribute)
Method 1: Simple matching
m: # of matches, p: total # of variables
d(i, j) = (p − m) / p
Method 2: Convert to Standard Spreadsheet format
For each attribute A, create M binary attributes for the M nominal states of A
Then use standard vector-based similarity or distance metrics
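A minimal Python sketch of both methods; the three-attribute records and helper names below are illustrative, not from the slides:

```python
def simple_matching_distance(i, j):
    """d(i, j) = (p - m) / p, where m = # matching attributes, p = total."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# Method 1: simple matching on two hypothetical records
rec1 = ("red", "circle", "small")
rec2 = ("red", "square", "small")
print(simple_matching_distance(rec1, rec2))  # -> 0.333... (1 mismatch of 3)

# Method 2: one-hot ("standard spreadsheet") encoding of a nominal attribute
def one_hot(value, states):
    return [1 if value == s else 0 for s in states]

states = ["red", "yellow", "blue", "green"]
print(one_hot("red", states))  # -> [1, 0, 0, 0]; then use any vector metric
```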
8. Proximity Measure for Binary Attributes
A contingency table for binary data: for objects i and j, let q = # of attributes where both are 1, r = # where i = 1 and j = 0, s = # where i = 0 and j = 1, t = # where both are 0
Distance measure for symmetric binary variables:  d(i, j) = (r + s) / (q + r + s + t)
Distance measure for asymmetric binary variables:  d(i, j) = (r + s) / (q + r + s)
Jaccard coefficient (similarity measure for asymmetric binary variables):  sim(i, j) = q / (q + r + s)
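Since these formulas appear only as images in the original slides, here is a minimal Python sketch of the symmetric distance and the Jaccard coefficient in terms of the contingency-table counts q, r, s, t (the example vectors are ours):

```python
def binary_counts(x, y):
    """Contingency-table counts q, r, s, t for two binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_distance(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def jaccard_similarity(x, y):
    q, r, s, _ = binary_counts(x, y)   # t (0/0 matches) is ignored
    return q / (q + r + s)

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(symmetric_binary_distance(x, y))  # -> 0.4
print(jaccard_similarity(x, y))         # -> 0.5
```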
9. Normalizing or Standardizing Numeric Data
Z-score:  z = (x − μ) / σ
x: raw value to be standardized, μ: mean of the population, σ: standard deviation
the distance between the raw score and the population mean, in units of the standard deviation
negative when the value is below the mean, positive when above
Min-Max Normalization:  x' = (x − min) / (max − min), rescaling each attribute to [0, 1]
Original data:
ID   Gender  Age  Salary
1    F       27   19,000
2    M       51   64,000
3    M       52   100,000
4    F       33   55,000
5    M       45   45,000

After normalization (Gender encoded as binary; Age and Salary min-max normalized):
ID   Gender  Age   Salary
1    1       0.00  0.00
2    0       0.96  0.56
3    0       1.00  1.00
4    1       0.24  0.44
5    0       0.72  0.32
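A minimal Python sketch of both normalizations; applying min-max to the Age column reproduces the normalized table above, and the z-score demo uses the population mean and standard deviation of the same column:

```python
def z_score(x, mu, sigma):
    """Distance of x from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

def min_max(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

ages = [27, 51, 52, 33, 45]
print([round(min_max(a, min(ages), max(ages)), 2) for a in ages])
# -> [0.0, 0.96, 1.0, 0.24, 0.72], matching the normalized table above

mu = sum(ages) / len(ages)
sigma = (sum((a - mu) ** 2 for a in ages) / len(ages)) ** 0.5
print(round(z_score(45, mu, sigma), 2))  # -> 0.34 (slightly above the mean)
```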
10. Common Distance Measures for Numeric Data
Consider two vectors X and Y (rows in the data matrix)
Common Distance Measures:
Manhattan distance:  dist(X, Y) = Σ |x_i − y_i|
Euclidean distance:  dist(X, Y) = √( Σ (x_i − y_i)² )
Distance can be defined as a dual of a similarity measure:
dist(X, Y) = 1 − sim(X, Y),  e.g. with  sim(X, Y) = Σ(x_i · y_i) / ( Σx_i² + Σy_i² − Σ(x_i · y_i) )
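A minimal sketch of the two distance measures, applied to the Doc2 and Doc4 vectors from the earlier term-frequency table:

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences (L2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

doc2 = [3, 1, 4, 3, 1, 2]
doc4 = [0, 1, 0, 3, 0, 0]
print(manhattan(doc2, doc4))            # -> 10
print(round(euclidean(doc2, doc4), 3))  # -> 5.477 (i.e., sqrt(30))
```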
12. Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
d(i, j) = ( |x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h )^(1/h)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L_h norm)
Note that Euclidean and Manhattan distances are special cases:
h = 1 (L1 norm), Manhattan distance:  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
h = 2 (L2 norm), Euclidean distance:  d(i, j) = √( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )
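A minimal sketch of the general L_h norm; setting h to 1 or 2 recovers the Manhattan and Euclidean results computed above:

```python
def minkowski(x, y, h):
    """L_h norm distance; h=1 gives Manhattan, h=2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

doc2 = [3, 1, 4, 3, 1, 2]
doc4 = [0, 1, 0, 3, 0, 0]
print(minkowski(doc2, doc4, 1))            # -> 10.0 (Manhattan)
print(round(minkowski(doc2, doc4, 2), 3))  # -> 5.477 (Euclidean)
```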
13. Vector-Based Similarity Measures
In some situations, distance measures provide a skewed view of data
E.g., when the data is very sparse and 0’s in the vectors are not significant
In such cases, typically vector-based similarity measures are used
Most common measure: Cosine similarity
For vectors X = <x_1, x_2, …, x_n> and Y = <y_1, y_2, …, y_n>:
Dot product of two vectors:  X · Y = Σ(x_i · y_i)
Cosine Similarity = normalized dot product
the norm of a vector X is:  ||X|| = √( Σ x_i² )
the cosine similarity is:  sim(X, Y) = (X · Y) / ( ||X|| · ||Y|| ) = Σ(x_i · y_i) / ( √(Σx_i²) · √(Σy_i²) )
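A minimal sketch of cosine similarity as a normalized dot product, again using the Doc2 and Doc4 vectors (the result is easy to verify by hand: the dot product is 10 and the norms are √40 and √10, so the similarity is exactly 0.5):

```python
import math

def cosine_similarity(x, y):
    """Normalized dot product of two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc2 = [3, 1, 4, 3, 1, 2]
doc4 = [0, 1, 0, 3, 0, 0]
print(round(cosine_similarity(doc2, doc4), 3))  # -> 0.5
```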
14. Vector-Based Similarity Measures
Why divide by the norm?
Example:  X = <2, 0, 3, 2, 1, 4>
||X|| = SQRT(4+0+9+4+1+16) = 5.83
X* = X / ||X|| = <0.343, 0, 0.514, 0.343, 0.171, 0.686>
Now, note that ||X*|| = 1
So, dividing a vector by its norm turns it into a unit-length vector
Cosine similarity measures the angle between two unit-length vectors (i.e., the magnitudes of the vectors are ignored).
where X = <x_1, x_2, …, x_n> and ||X|| = √( Σ x_i² )
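A short sketch reproducing the slide's normalization of X to the unit-length vector X*:

```python
import math

def to_unit_vector(x):
    """Divide a vector by its norm, yielding a unit-length vector."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

X = [2, 0, 3, 2, 1, 4]
print([round(v, 3) for v in to_unit_vector(X)])
# -> [0.343, 0.0, 0.514, 0.343, 0.171, 0.686], matching X* on the slide
```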
15. Example Application: Information Retrieval
Documents are represented as "bags of words"
Represented as vectors when used computationally
A vector is an array of floating-point values (or binary values in the case of bitmaps)
Has direction and magnitude
Each vector has a place for every term in the collection (most entries are sparse)
nova galaxy heat actor film role
A 1.0 0.5 0.3
B 0.5 1.0
C 1.0 0.8 0.7
D 0.9 1.0 0.5
E 1.0 1.0
F 0.7
G 0.5 0.7 0.9
H 0.6 1.0 0.3 0.2
I 0.7 0.5 0.3
(Rows A–I are document IDs; each row of the table is a document vector.)
16. Documents & Query in n-dimensional Space
Documents are represented as vectors in the term space
Typically, values in each dimension correspond to the frequency of the corresponding term in the document
Queries are represented as vectors in the same vector space
Cosine similarity between the query and documents is often used to rank retrieved documents
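A minimal sketch of cosine-based ranking; the two-document corpus and the query below are illustrative, not taken from the table on the previous slide:

```python
import math

def cosine(x, y):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

terms = ["nova", "galaxy", "heat", "actor", "film", "role"]
docs = {
    "D1": [1.0, 0.5, 0.3, 0.0, 0.0, 0.0],
    "D2": [0.0, 0.0, 0.0, 0.9, 1.0, 0.5],
}
query = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # a query about "nova galaxy"

# Rank documents by similarity to the query, most similar first
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # -> ['D1', 'D2']
```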
18. Correlation as Similarity
In cases where there can be high variance in the means across data objects (e.g., movie ratings), the Pearson correlation coefficient is the best option
Pearson Correlation:  corr(X, Y) = Σ( (x_i − x̄)(y_i − ȳ) ) / ( √(Σ(x_i − x̄)²) · √(Σ(y_i − ȳ)²) )
Often used in recommender systems based on Collaborative Filtering
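A minimal sketch of the Pearson coefficient; the Sally/Karen vectors anticipate the worked example on slide 27:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (math.sqrt(sum((a - mx) ** 2 for a in x))
           * math.sqrt(sum((b - my) ** 2 for b in y)))
    return num / den

sally = [7, 6, 3]  # Sally's ratings on the three movies Karen has also rated
karen = [7, 4, 3]
print(round(pearson(sally, karen), 2))  # -> 0.85, as in the slide 27 example
```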
19. Distance-Based Classification
Basic Idea: classify new instances based on their similarity to, or distance from, instances we have seen before
also called "instance-based learning"
Simplest form of MBR (memory-based reasoning): Rote Learning
learning by memorization
save all previously encountered instances; given a new instance, find the one from the memorized set that most closely "resembles" the new one; assign the new instance to the same class as this "nearest neighbor"
more general methods find the k nearest neighbors rather than just one
but, how do we define "resembles"?
MBR is "lazy"
defers all of the real work until a new instance is obtained; no attempt is made to learn a generalized model from the training set
less data preprocessing and model evaluation, but more work has to be done at classification time
20. Nearest Neighbor Classifiers
Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
(Diagram: given a test record, compute its distance to every training record, then choose the k "nearest" records.)
21. K-Nearest-Neighbor Strategy
Given object x, find the k most similar objects to x
The k nearest neighbors
A variety of distance or similarity measures can be used to identify and rank neighbors
Note that this requires comparing x with all objects in the database
Classification:
Find the class label for each of the k neighbors
Use a voting or weighted voting approach to determine the majority class
among the neighbors (a combination function)
Weighted voting means the closest neighbors count more
Assign the majority class label to x
Prediction:
Identify the value of the target attribute for the k neighbors
Return the weighted average as the predicted value of the target attribute for x
22. K-Nearest-Neighbor Strategy
(Diagram: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor regions around a test point x.)
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
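A minimal sketch of the k-nearest-neighbor strategy with majority voting and an agreement-based confidence; the 2-D training set is hypothetical, and math.dist assumes Python 3.8+:

```python
from collections import Counter
import math

def knn_classify(x, training, k):
    """training: list of (vector, label) pairs. Majority vote among the
    k records nearest to x; confidence = % of neighbors in agreement."""
    neighbors = sorted(training, key=lambda rec: math.dist(x, rec[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

# Hypothetical 2-D training data, for illustration only
training = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify((2, 2), training, k=3))  # -> ('A', 0.666...)
```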
23. Combination Functions
Voting: the "democracy" approach
poll the neighbors for the answer and use the majority vote
the number of neighbors (k) is often taken to be odd in order to avoid ties; this works when the number of classes is two
if there are more than two classes, take k to be the number of classes plus 1
Impact of k on predictions:
in general, different values of k affect the outcome of the classification
we can associate a confidence level with predictions (this can be the % of neighbors that are in agreement)
a problem is that no single category may get a majority vote
if there are strong variations in results for different choices of k, this is an indication that the training set is not large enough
24. Voting Approach - Example
Will a new customer respond to solicitation?

ID   Gender  Age  Salary   Respond?
1    F       27   19,000   no
2    M       51   64,000   yes
3    M       52   105,000  yes
4    F       33   55,000   yes
5    M       45   45,000   no
new  F       45   100,000  ?

Using the voting method without confidence:
          Neighbors   Answers    k=1  k=2  k=3  k=4  k=5
D_man     4,3,5,2,1   Y,Y,N,Y,N  yes  yes  yes  yes  yes
D_euclid  4,1,5,2,3   Y,N,N,Y,Y  yes  ?    no   ?    yes

Using the voting method with confidence:
          k=1        k=2        k=3       k=4       k=5
D_man     yes, 100%  yes, 100%  yes, 67%  yes, 75%  yes, 60%
D_euclid  yes, 100%  yes, 50%   no, 67%   yes, 50%  yes, 60%
25. Combination Functions
Weighted Voting: not so "democratic"
similar to voting, but the votes of some neighbors count more ("shareholder democracy?")
the question is: which neighbors' votes should count more?
How can weights be obtained?
Distance-based
closer neighbors get higher weights
“value” of the vote is the inverse of the distance (may need to add a small constant)
the weighted sum for each class gives the combined score for that class
to compute confidence, need to take weighted average
Heuristic
weight for each neighbor is based on domain-specific characteristics of that neighbor
Advantage of weighted voting
introduces enough variation to prevent ties in most cases
helps distinguish between competing neighbors
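A minimal sketch of distance-weighted voting, using the inverse-distance weights described above (eps is the "small constant" added to avoid division by zero; the training set is the same hypothetical one as before):

```python
from collections import defaultdict
import math

def weighted_knn_classify(x, training, k, eps=0.001):
    """Each of the k nearest neighbors votes with weight 1/(distance + eps);
    the weighted sum per class is that class's combined score."""
    neighbors = sorted(training, key=lambda rec: math.dist(x, rec[0]))[:k]
    scores = defaultdict(float)
    for vec, label in neighbors:
        scores[label] += 1.0 / (math.dist(x, vec) + eps)
    return max(scores, key=scores.get)

training = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(weighted_knn_classify((2, 2), training, k=3))  # -> 'A'
```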
26. KNN and Collaborative Filtering
Collaborative Filtering Example
A movie rating system
Ratings scale: 1 = “hate it”; 7 = “love it”
Historical DB of users includes ratings of movies by Sally, Bob, Chris, and Lynn
Karen is a new user who has rated 3 movies, but has not yet seen "Independence Day"; should we recommend it to her?
                  Sally  Bob  Chris  Lynn  Karen
Star Wars         7      7    3      4     7
Jurassic Park     6      4    7      4     4
Terminator II     3      4    7      6     3
Independence Day  7      6    2      2     ?

Will Karen like "Independence Day"?
27. Collaborative Filtering (k Nearest Neighbor Example)
        Star Wars  Jurassic Park  Terminator 2  Indep. Day  Average  Cosine  Manhattan  Euclid  Pearson
Sally   7          6              3             7           5.33     0.983   2          2.00    0.85
Bob     7          4              4             6           5.00     0.995   1          1.00    0.97
Chris   3          7              7             2           5.67     0.787   11         6.40    -0.97
Lynn    4          4              6             2           4.67     0.874   6          4.24    -0.69
Karen   7          4              3             ?           4.67     1.000   0          0.00    1.00

(Averages and all similarity/distance values are computed over the three movies Karen has rated.)
Example computation:
Pearson(Sally, Karen) = ( (7−5.33)·(7−4.67) + (6−5.33)·(4−4.67) + (3−5.33)·(3−4.67) ) / SQRT( ((7−5.33)² + (6−5.33)² + (3−5.33)²) · ((7−4.67)² + (4−4.67)² + (3−4.67)²) ) = 0.85

Prediction: k is the number of nearest neighbors (by Pearson) used to find the average predicted rating of Karen on Indep. Day.
k    Prediction
1    6
2    6.5
3    5
28. Collaborative Filtering (k Nearest Neighbor)
In practice, a more sophisticated approach is used to generate the predictions based on the nearest neighbors
To generate a prediction for a target user a on an item i:
r̄_a = mean rating for user a
u_1, …, u_k are the k nearest neighbors of a
r_{u,i} = rating of user u on item i
sim(a, u) = Pearson correlation between a and u
This is a weighted average of deviations from the neighbors' mean ratings (and closer neighbors count more)
p(a, i) = r̄_a + [ Σ_{u = u_1..u_k} sim(a, u) · (r_{u,i} − r̄_u) ] / [ Σ_{u = u_1..u_k} sim(a, u) ]
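A minimal sketch of this weighted-deviation prediction, plugged with the similarities and means from the slide 27 example. Note it yields about 5.98 for k = 2, rather than the 6.5 produced by the simple averaging used there, because it weights by similarity and subtracts each neighbor's mean:

```python
def predict_rating(r_a_mean, neighbors):
    """Weighted average of deviations from the neighbors' mean ratings.
    neighbors: list of (sim_to_target, neighbor_rating_on_item, neighbor_mean)."""
    num = sum(sim * (r_ui - r_u_mean) for sim, r_ui, r_u_mean in neighbors)
    den = sum(sim for sim, _, _ in neighbors)
    return r_a_mean + num / den

# Karen's mean rating (4.67) and her two nearest neighbors by Pearson:
# Sally (sim 0.85, rated Indep. Day 7, mean 5.33) and
# Bob   (sim 0.97, rated Indep. Day 6, mean 5.00)
print(round(predict_rating(4.67, [(0.85, 7, 5.33), (0.97, 6, 5.00)]), 2))
# -> 5.98
```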