5. Supervised vs. Unsupervised Learning
• Supervised Learning:
– “Learn” a relationship from:
• [SL, SW, PL, PW] → Species (Setosa, Virginica, Versicolor)
• Unsupervised Learning:
– “Learn” something from:
• [SL,SW,PL,PW]
– Why?
6. How to compare?
      Sepal Length   Sepal Width   Petal Length   Petal Width
f1    5.1            3.4           1.4            0.2
f2    7.2            3.6           6.1            2.5
Differences: (5.1-7.2) = -2.1, (3.4-3.6) = -0.2, (1.4-6.1) = -4.7, (0.2-2.5) = -2.3
Squares: (5.1-7.2)^2 = 4.41, (3.4-3.6)^2 = 0.04, (1.4-6.1)^2 = 22.09, (0.2-2.5)^2 = 5.29
distance(f1,f2) = (4.41 + 0.04 + 22.09 + 5.29)^0.5 = 5.64
This is called the Euclidean distance
Bottom line: we now have a quantitative way of comparing two flowers (or any two objects)
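The computation above can be sketched in a few lines of Python (a minimal illustration; the feature values are the ones from the table):

```python
import math

def euclidean_distance(u, v):
    # Square the per-feature differences, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

f1 = [5.1, 3.4, 1.4, 0.2]  # [SL, SW, PL, PW]
f2 = [7.2, 3.6, 6.1, 2.5]
print(round(euclidean_distance(f1, f2), 2))  # 5.64
```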
7. The “average” flower
      Sepal Length   Sepal Width   Petal Length   Petal Width
f1    5.1            3.4           1.4            0.2
f2    7.2            3.6           6.1            2.5
Average of two numbers X and Y is (X+Y)/2
Average of the two flower vectors (rows) f1 and f2:
((5.1+7.2)/2, (3.4+3.6)/2, (1.4+6.1)/2, (0.2+2.5)/2) = (6.15, 3.5, 3.75, 1.35)
Does the “average” flower exist?
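A minimal Python sketch of the component-wise average, using the two rows above:

```python
def average_flower(flowers):
    # Component-wise mean of a list of equal-length feature vectors.
    n = len(flowers)
    return [sum(col) / n for col in zip(*flowers)]

f1 = [5.1, 3.4, 1.4, 0.2]
f2 = [7.2, 3.6, 6.1, 2.5]
print([round(x, 2) for x in average_flower([f1, f2])])  # [6.15, 3.5, 3.75, 1.35]
```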
8. Unsupervised Learning
• We want to find “three” (or K) averages in our data set.
• If we could use the species information, this would be easy.
• Chicken-and-egg problem: we need the clusters to compute the averages, but we need the averages to form the clusters!
10. Breaking the chicken-egg problem
• In Machine Learning we break the chicken-and-egg problem using a “random guess”
• Randomly select three (K) vectors: m1,m2,m3
• Assignment:
– Let C1 ={all data points nearest to m1}
– Let C2 = {all data points nearest to m2}
– Let C3 = {all data points nearest to m3}
• Update:
– m1 is average of C1
– m2 is average of C2
– m3 is average of C3
• Repeat
12. K-means algorithm
Let C = initial k cluster centroids (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Compute the centroids of the points assigned to each element of C.
    Update C as the set of new centroids.
    Mark C as stable or unstable by comparing with the previous set of centroids.
End While
Complexity: O(nkdI)
n:num of points; k: num of clusters; d: dimension; I: num of iterations
Take away: complexity is linear in n.
From W3-S14
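The pseudocode above can be sketched compactly in Python (random initialization, assignment step, update step, stop when no centroid moves; `math.dist` is the Euclidean distance from slide 6):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means, following the pseudocode above."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids, clusters = kmeans(pts, k=2)
```

Note that each iteration touches every point once per centroid, which is where the O(nkdI) complexity comes from.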
13. Example: 2 Clusters
Four points, symmetric about the origin (0,0): A(-1,2), B(1,2), C(-1,-2), D(1,-2)
K-means Problem: the optimal solution is centroids (0,2) and (0,-2), and the clusters are {A,B} and {C,D}.
K-means Algorithm: suppose the initial centroids are (-1,0) and (1,0); then {A,C} and {B,D} end up as the two clusters, so the algorithm converges to a local minimum.
From W3-S16
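The local minimum above can be checked by hand: starting from centroids (-1,0) and (1,0), one assignment/update round leaves both centroids exactly where they were, so k-means stops. A self-contained sketch:

```python
import math

A, B, C, D = (-1, 2), (1, 2), (-1, -2), (1, -2)
points = [A, B, C, D]
m1, m2 = (-1, 0), (1, 0)  # the initial centroids from the example

# Assignment step: each point goes to its nearest centroid.
c1 = [p for p in points if math.dist(p, m1) <= math.dist(p, m2)]
c2 = [p for p in points if math.dist(p, m1) > math.dist(p, m2)]
print(c1, c2)  # [(-1, 2), (-1, -2)] [(1, 2), (1, -2)]

# Update step: the new centroids equal the old ones, so the
# algorithm stops here, stuck in the {A,C} / {B,D} local minimum.
mean = lambda cl: tuple(sum(x) / len(cl) for x in zip(*cl))
print(mean(c1), mean(c2))  # (-1.0, 0.0) (1.0, 0.0)
```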
14. Clustering with Outlier Detection
In general, clustering algorithms are extremely sensitive to outliers: a single far-away point can pull a centroid a long way from its cluster
15. K-means-- algorithm
Input: Data Set, k (number of clusters), L (number of outliers)
Let C = initial k cluster means (centroids) (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Sort all points in descending order of distance to their nearest centroid.
    Remove the top L points.
    Compute the centroids of the remaining points assigned to each element of C.
    Update C as the set of new centroids.
    Mark C as stable or unstable by comparing with the previous set of centroids.
End While
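A Python sketch of the k-means-- loop above: in every iteration the L points farthest from their nearest centroid are set aside as outliers before the update step (initialization and stopping are as in plain k-means):

```python
import math
import random

def kmeans_mm(points, k, L, max_iter=100, seed=0):
    """Sketch of k-means--: drop the L farthest points each iteration."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Pair each point with its nearest centroid and that distance.
        assigned = [
            min((math.dist(p, c), j, p) for j, c in enumerate(centroids))
            for p in points
        ]
        # Sort in descending order of distance and drop the top L.
        assigned.sort(reverse=True)
        clusters = [[] for _ in range(k)]
        for _, j, p in assigned[L:]:
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable
            break
        centroids = new_centroids
    return centroids

pts = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (50.0, 50.0)]
print(kmeans_mm(pts, k=1, L=1))  # [(1.0, 1.0)]: the outlier (50, 50) never drags the centroid
```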
17. Association Discovery
• Motivation:
TID   Transaction
1     phone, adapter
2     phone, adapter, headphones, USB
3     adapter, charger, USB
4     phone, charger, USB
Definition: An itemset is a set of items
Definition: An itemset is frequent if the number of transactions in which it appears is greater than a pre-defined threshold T
Objective: Find all frequent itemsets
18. Example
TID Transaction
1 phone, adapter
2 phone, adapter, headphones, USB
3 adapter, charger, USB
4 phone, charger, USB
Support of {phone} = 3/4
Support of {phone, adapter} = 2/4
Support of {phone, adapter, USB} = 1/4
Support of {charger, USB} = 2/4
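The support values above can be checked with a few lines of Python (`itemset <= t` tests the subset relation on sets):

```python
transactions = [
    {"phone", "adapter"},
    {"phone", "adapter", "headphones", "USB"},
    {"adapter", "charger", "USB"},
    {"phone", "charger", "USB"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"phone"}))                    # 0.75
print(support({"phone", "adapter"}))         # 0.5
print(support({"phone", "adapter", "USB"}))  # 0.25
print(support({"charger", "USB"}))           # 0.5
```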
19. Association Discovery
• Brute Force Approach
– Let I be the set of items
– Then the number of possible subsets is 2^|I| - 1
– For each possible subset, check the percentage of transactions which contain it
• Not practical
– 1000 items: 2^1000 - 1 > number of atoms in the universe
20. Association Discovery
• Efficient Algorithm
• Key Observation:
– If Itemset1 is a subset of Itemset2, then support(Itemset1) ≥ support(Itemset2)
– Example:
• support({phone}) ≥ support({phone, adapter})
• How can we use this observation to design an algorithm?
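One classic answer is the Apriori idea: grow itemsets level by level and only extend those that are already frequent, since by the observation above any superset of an infrequent itemset must also be infrequent. A minimal sketch (not an optimized implementation):

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: only extend itemsets that
    are themselves frequent, pruning supersets of infrequent ones."""
    n = len(transactions)

    def support(s):
        return sum(s <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result = list(frequent)
    k = 2
    while frequent:
        # Candidates of size k: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.extend(frequent)
        k += 1
    return result

transactions = [
    {"phone", "adapter"},
    {"phone", "adapter", "headphones", "USB"},
    {"adapter", "charger", "USB"},
    {"phone", "charger", "USB"},
]
for s in apriori(transactions, min_support=0.5):
    print(sorted(s))
```

On the table from slide 17 with threshold 2/4 this finds four frequent single items and four frequent pairs, and never has to look at most of the 2^6 - 1 possible itemsets.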
21. Latent Dirichlet Allocation
• Suppose you have the following sentences:
1. Technology companies include Amazon, Google, and Facebook
2. Google applications shine
3. I bought pizza online from Amazon
4. Fresh pasta and pizza are delicious
• LDA makes it possible to automatically discover
topics from sentences
– Sentences 1 & 2 – 100% Topic A
– Sentence 3 – 50% Topic A; 50% Topic B
– Sentence 4 – 100% Topic B
• Topics:
• Topic A – 50% Google, 25% Amazon, …
• Topic B – 50% pizza, 25% pasta, …