3. Supervised vs. Unsupervised Learning
Supervised Learning
Classification: partition examples into groups according to pre-defined categories
Regression: assign a numeric value to each feature vector
Requires labeled data for training
Unsupervised Learning
Clustering: partition examples into groups when no pre-defined categories/classes are available
Novelty detection: find changes in data
Outlier detection: find unusual events (e.g. hackers)
Only instances are required, no labels
4. Clustering Concepts
The basic goal of cluster analysis is to discover groups in the data, such that objects in the same group are similar, while objects in different groups are as dissimilar as possible.
Partition unlabeled examples into disjoint subsets (clusters), such that:
Examples within a cluster are similar
Examples in different clusters are different
Discover new categories in an unsupervised manner (no sample category labels provided).
5. Clustering Concepts (2)
Applications are very numerous: for example, the classification of plants and animals; in the social sciences, the classification of people according to their customs and preferences; in marketing, the identification of groups of consumers with similar needs; etc.
Cluster retrieved documents (e.g. Teoma) to present more organized and understandable results to the user
Detecting near duplicates
Entity resolution
E.g. “Thorsten Joachims” == “Thorsten B Joachims”
Cheating detection
Exploratory data analysis
Automated (or semi-automated) creation of taxonomies
e.g. Yahoo-style
6. Clustering Concepts (3)
We will consider two types of clustering algorithms:
Partitioning methods: classify the data into k groups that must satisfy the requirements of a partition:
Each group must contain at least one object
Each object must belong to exactly one group.
Hierarchical methods:
Agglomerative: start with n clusters of one observation each; at each step two groups are merged, ending with a single cluster of n observations.
Divisive: start with a single cluster of n observations; at each step a group is split in two, ending with n clusters of one observation each.
7. K-Means Clustering Method
1. Ask the user how many clusters they’d like (e.g. k=5)
2. Randomly guess k cluster Center locations
3. For each datapoint, find out which Center it’s closest to (thus each Center “owns” a set of datapoints)
4. For each Center, find the centroid of the points it owns
5. …and the Center jumps there
6. …Repeat until terminated!
(Are we sure it will terminate? A minimal sketch of the loop follows below.)
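As a concrete reference, here is a minimal NumPy sketch of the loop just described. The function and variable names (kmeans, owner, n_iters) are illustrative choices, not from the lecture.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 2: randomly guess k Center locations (here: k distinct datapoints).
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Step 3: each datapoint finds the Center it is closest to.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            owner = dists.argmin(axis=1)
            # Steps 4-5: each Center jumps to the centroid of the points it owns.
            new_centers = np.array([
                X[owner == j].mean(axis=0) if np.any(owner == j) else centers[j]
                for j in range(k)
            ])
            # Step 6: repeat until nothing moves any more.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, owner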
8. K-Means Step by step (1 & 2)
1. Ask user how many clusters they’d like (e.g. k=5)
2. Randomly guess k cluster Center locations
9. K-Means Step by step (3)
1. Ask…
2. Randomly guess k cluster Center locations
3. For each datapoint find out which Center it’s closest to (thus each Center “owns” a set of datapoints)
10. K-Means Step by step (4)
1. Ask…
2. Randomly guess…
3. For each datapoint find out which Center it’s closest to (thus each Center “owns” a set of datapoints)
4. For each Center find the centroid of the points it owns
11. K-Means Step by step (5 & 6)
1. Ask…
2. Randomly guess…
3. For each datapoint…
4. For each Center find the centroid of the points it owns
5. …and the Center jumps there
6. …Repeat until terminated!
12. K-Means Q&A
What is it trying to optimize? (See the note after this list.)
Are we sure it will terminate?
Are we sure it will find an optimal clustering?
How should we start it?
How could we automatically choose the number of centers?
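On the first question: the quantity k-means implicitly optimizes is the distortion, the sum of squared distances from each datapoint to the Center that owns it. In LaTeX notation:

    \mathrm{Distortion} = \sum_{i} \left\| x_i - c_{\mathrm{owner}(i)} \right\|^2

Steps 3 and 4 of the algorithm can each only decrease this quantity (or leave it unchanged), which is what drives the termination argument on the slides below.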
13. K-Means Q&A (2)
This clustering method is simple and reasonably effective.
The final cluster centers do not represent a global minimum of the distortion, only a local one.
Completely different final clusters can arise from differences in the initial randomly chosen cluster centers.
14. K-Means Q&A (3)
Are we sure it will terminate?
There are only a finite number of ways of partitioning R records into k groups.
So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own.
If the configuration changes on an iteration, it must have improved the distortion.
So each time the configuration changes, it must move to a configuration it has never visited before.
So if it tried to go on forever, it would eventually run out of configurations.
15. K-Means Q&A (4)
Will we find the optimal configuration?
Can you invent a configuration that has converged, but does not have the minimum distortion?
17. K-Means Q&A (6)
Trying to find good optima
Idea 1: Be careful about where you start
Neat trick (sketched below):
Place the first center on top of a randomly chosen datapoint.
Place the second center on the datapoint that’s as far away as possible from the first center.
Place the j-th center on the datapoint that’s as far away as possible from the closest of Centers 1 through j-1.
Idea 2: Do many runs of k-means, each from a different random start configuration
Many other ideas are floating around.
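Here is a sketch of the “neat trick” (farthest-point initialization; the randomized k-means++ scheme is a close relative). The function name farthest_point_init is an illustrative choice, not from the lecture.

    import numpy as np

    def farthest_point_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        # Place the first center on a randomly chosen datapoint.
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # Distance from every point to its closest already-placed center.
            d = np.linalg.norm(
                X[:, None, :] - np.array(centers)[None, :, :], axis=2
            ).min(axis=1)
            # The next center is the datapoint farthest from all of them.
            centers.append(X[d.argmax()])
        return np.array(centers)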
18. K-Means Q&A (7)
Choosing the number of Centers
A difficult problem
The most common approach is to try to find the solution that minimizes the Schwarz Criterion (one possible form is sketched below)
Trying k from 2 to n !!
Incrementally (k=2, then do 2-means within each cluster, and so on…)
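A hedged sketch of the Schwarz-Criterion approach: score each k as distortion plus a complexity penalty and keep the k with the smallest score. The exact penalty form and the constant lam below are assumptions for illustration, not the lecture’s definition.

    import numpy as np
    from sklearn.cluster import KMeans

    def schwarz_score(X, k, lam=1.0, seed=0):
        # Score = distortion + lam * (number of parameters) * log(number of records).
        n, d = X.shape
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return km.inertia_ + lam * k * d * np.log(n)

    # Usage: try k = 2..9 and keep the k with the smallest score.
    X = np.random.default_rng(0).normal(size=(200, 2))
    best_k = min(range(2, 10), key=lambda k: schwarz_score(X, k))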
19. Common uses of K-means
Often used as an exploratory data analysis tool
In one dimension, a good way to quantize real-valued variables into k non-uniform buckets (sketched below)
Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
Also used for choosing color palettes on old-fashioned graphical display devices!
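A quick sketch of the one-dimensional quantization use, assuming scikit-learn’s KMeans; the variable names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    # 1000 real values -> k=4 non-uniform buckets.
    values = np.random.default_rng(0).normal(size=1000).reshape(-1, 1)
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(values)
    quantized = km.cluster_centers_[km.labels_]  # each value -> its bucket centroid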
21. Single Linkage Hierarchical Clustering (2)
1. Say “Every point is its own cluster”
2. Find “most similar” pair of clusters
22. Single Linkage Hierarchical Clustering (3)
1. Say “Every point is its own cluster”
2. Find “most similar” pair of clusters
3. Merge it into a parent cluster
23. Single Linkage Hierarchical Clustering (4)
1. Say “Every point is its own cluster”
2. Find “most similar” pair of clusters
3. Merge it into a parent cluster
4. Repeat… until you’ve merged the whole dataset into one cluster
24. Single Linkage Hierarchical Clustering (5)
1. Say “Every point is its
own cluster”
2. Find “Most similar” pair of
clusters
3. Merge it into a parent
cluster
4. Repeat... until you’ve
merged the whole dataset
into one cluster
25. Hierarchical Clustering Q&A
How do we define similarity between clusters?
Minimum distance between points in clusters (in which case we’re simply doing Euclidean Minimum Spanning Trees)
Maximum distance between points in clusters
Average distance between points in clusters
And more… (a SciPy sketch of these three options follows below)
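These three similarity definitions correspond to SciPy’s 'single', 'complete', and 'average' linkage methods, so the whole procedure can be sketched as follows (the data here is random, for illustration only):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(20, 2))
    # 'single' = minimum distance between clusters;
    # 'complete' = maximum distance; 'average' = average distance.
    Z = linkage(X, method='single')
    # To get k groups, cut the dendrogram into k clusters
    # (equivalently, cut the k-1 longest links).
    labels = fcluster(Z, t=4, criterion='maxclust')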
27. Hierarchical Clustering Q&A (2)
Single Linkage Comments
Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)
It’s nice that you get a hierarchy instead of an amorphous collection of groups
If you want k groups, just cut the (k-1) longest links
There’s no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.
28. Cluster Silhouettes
For each example i we define a(i), the average dissimilarity between i and the other objects of A, where A is the cluster assigned to i.
We then compute d(i, C), the average dissimilarity between i and the objects of each cluster C other than A.
We keep b(i), the smallest of these distances to another cluster. The cluster B for which this minimum is attained, i.e. d(i, B) = b(i), is called the neighbour of object i (its second choice of membership).
29. Cluster Silhouettes (2)
We now define s(i) as:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
To understand the meaning of s(i), consider the extreme situations:
When s(i) is close to 1, a(i), i.e. the average dissimilarity between i and the objects of its own cluster, is much smaller than b(i), the dissimilarity between i and the neighbouring cluster. We can therefore say that i is well classified.
When s(i) is close to 0, b(i) and a(i) are approximately equal, and it is not clear whether i should be assigned to A or to the neighbouring cluster. Object i is as far from one as from the other.
The worst situation arises when s(i) is close to −1: a(i) is much larger than b(i), so on average i is closer to the neighbouring cluster than to A.
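In practice silhouettes rarely need to be coded by hand; here is a sketch using scikit-learn’s built-ins (the random data is for illustration only):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.random.default_rng(0).normal(size=(100, 2))
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)    # one s(i) per object
    print(silhouette_score(X, labels))   # average silhouette width (SC)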
30. Cluster Silhouettes (3)
[Silhouette plot: silhouette widths of the objects (Li, J, Le, P, Ti, I, K, Ta) grouped into clusters C1 and C2; average silhouette width: 0.8]
SC           Interpretation
0.71 – 1.00  Strong structure
0.51 – 0.70  Reasonable structure
0.26 – 0.50  Weak structure; could be artificial
< 0.25       No substantial structure has been found