5. Supervised vs. Unsupervised Learning
• Supervised Learning:
– “Learn” a relationship from:
• [SL, SW, PL, PW] → Species (Setosa, Virginica, Versicolor)
• Unsupervised Learning:
– “Learn” something from:
• [SL,SW,PL,PW]
– Why?
6. How to compare?
      Sepal Length   Sepal Width   Petal Length   Petal Width
f1    5.1            3.4           1.4            0.2
f2    7.2            3.6           6.1            2.5
Differences: (5.1-7.2) = -2.1, (3.4-3.6) = -0.2, (1.4-6.1) = -4.7, (0.2-2.5) = -2.3
Squares: (5.1-7.2)^2 = 4.41, (3.4-3.6)^2 = 0.04, (1.4-6.1)^2 = 22.09, (0.2-2.5)^2 = 5.29
distance(f1,f2) = (4.41 + 0.04 + 22.09 + 5.29)^0.5 = 5.64
This is called the Euclidean distance
Bottom line: we now have a quantitative way of comparing two flowers (or any two objects)
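The computation above can be sketched in a few lines of Python (a minimal illustration; the feature values are the ones from the table):

```python
import math

def euclidean_distance(u, v):
    # Square the per-feature differences, sum them, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

f1 = [5.1, 3.4, 1.4, 0.2]  # [SL, SW, PL, PW]
f2 = [7.2, 3.6, 6.1, 2.5]
print(round(euclidean_distance(f1, f2), 2))  # 5.64
```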
7. The “average” flower
      Sepal Length   Sepal Width   Petal Length   Petal Width
f1    5.1            3.4           1.4            0.2
f2    7.2            3.6           6.1            2.5
Average of two numbers X and Y is (X+Y)/2
Average of the two flower vectors (rows) f1 and f2:
((5.1+7.2)/2, (3.4+3.6)/2, (1.4+6.1)/2, (0.2+2.5)/2) = (6.15, 3.5, 3.75, 1.35)
Does the “average” flower exist?
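A minimal Python sketch of the component-wise average, using the two rows above:

```python
def average_flower(flowers):
    # Component-wise mean of a list of equal-length feature vectors.
    n = len(flowers)
    return [sum(col) / n for col in zip(*flowers)]

f1 = [5.1, 3.4, 1.4, 0.2]
f2 = [7.2, 3.6, 6.1, 2.5]
print([round(x, 2) for x in average_flower([f1, f2])])  # [6.15, 3.5, 3.75, 1.35]
```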
8. Unsupervised Learning
• We want to find “three” (or K) averages in our data set.
• If we could use the species information, this would be easy.
• Chicken-and-egg problem: we need the clusters to compute the averages, but we need the averages to form the clusters!
10. Breaking the chicken-egg problem
• In Machine Learning we break the chicken-and-egg problem using a “random guess”
• Randomly select three (K) vectors: m1,m2,m3
• Assignment:
– Let C1 ={all data points nearest to m1}
– Let C2 = {all data points nearest to m2}
– Let C3 = {all data points nearest to m3}
• Update:
– m1 is average of C1
– m2 is average of C2
– m3 is average of C3
• Repeat
12. K-means algorithm
Let C = initial k cluster centroids (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Compute the centroids of the points assigned to each element of C.
    Update C as the set of new centroids.
    Mark C as stable or unstable by comparing with the previous set of centroids.
End While
Complexity: O(nkdI)
n:num of points; k: num of clusters; d: dimension; I: num of iterations
Take away: complexity is linear in n.
From W3-S14
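The pseudocode above can be sketched compactly in Python (random initialization, assignment step, update step, stop when no centroid moves; `math.dist` is the Euclidean distance from slide 6):

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means, following the pseudocode above."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centroids, clusters = kmeans(pts, k=2)
```

Note that each iteration touches every point once per centroid, which is where the O(nkdI) complexity comes from.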
13. Example: 2 Clusters
Four points, symmetric about the origin (0,0): A(-1,2), B(1,2), C(-1,-2), D(1,-2)
K-means Problem: the optimal solution is centroids (0,2) and (0,-2), and the clusters are {A,B} and {C,D}.
K-means Algorithm: suppose the initial centroids are (-1,0) and (1,0); then {A,C} and {B,D} end up as the two clusters, so the algorithm converges to a local minimum.
From W3-S16
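The local minimum above can be checked by hand: starting from centroids (-1,0) and (1,0), one assignment/update round leaves both centroids exactly where they were, so k-means stops. A self-contained sketch:

```python
import math

A, B, C, D = (-1, 2), (1, 2), (-1, -2), (1, -2)
points = [A, B, C, D]
m1, m2 = (-1, 0), (1, 0)  # the initial centroids from the example

# Assignment step: each point goes to its nearest centroid.
c1 = [p for p in points if math.dist(p, m1) <= math.dist(p, m2)]
c2 = [p for p in points if math.dist(p, m1) > math.dist(p, m2)]
print(c1, c2)  # [(-1, 2), (-1, -2)] [(1, 2), (1, -2)]

# Update step: the new centroids equal the old ones, so the
# algorithm stops here, stuck in the {A,C} / {B,D} local minimum.
mean = lambda cl: tuple(sum(x) / len(cl) for x in zip(*cl))
print(mean(c1), mean(c2))  # (-1.0, 0.0) (1.0, 0.0)
```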
14. Clustering with Outlier Detection
In general, clustering algorithms are extremely sensitive to outliers: a single far-away point can pull a centroid a long way from its cluster
15. K-means-- algorithm
Input: Data Set, k (number of clusters), L (number of outliers)
Let C = initial k cluster means (centroids) (often selected randomly)
Mark C as unstable
While <C is unstable>
    Assign all data points to their nearest centroid in C.
    Sort all points in descending order of distance to their nearest centroid.
    Remove the top L points.
    Compute the centroids of the remaining points assigned to each element of C.
    Update C as the set of new centroids.
    Mark C as stable or unstable by comparing with the previous set of centroids.
End While
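A Python sketch of the k-means-- loop above: in every iteration the L points farthest from their nearest centroid are set aside as outliers before the update step (initialization and stopping are as in plain k-means):

```python
import math
import random

def kmeans_mm(points, k, L, max_iter=100, seed=0):
    """Sketch of k-means--: drop the L farthest points each iteration."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Pair each point with its nearest centroid and that distance.
        assigned = [
            min((math.dist(p, c), j, p) for j, c in enumerate(centroids))
            for p in points
        ]
        # Sort in descending order of distance and drop the top L.
        assigned.sort(reverse=True)
        clusters = [[] for _ in range(k)]
        for _, j, p in assigned[L:]:
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stable
            break
        centroids = new_centroids
    return centroids

pts = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0), (50.0, 50.0)]
print(kmeans_mm(pts, k=1, L=1))  # [(1.0, 1.0)]: the outlier (50, 50) never drags the centroid
```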
17. Association Discovery
• Motivation:
TID   Transaction
1     phone, adapter
2     phone, adapter, headphones, USB
3     adapter, charger, USB
4     phone, charger, USB
Definition: An itemset is a set of items
Definition: An itemset is frequent if the number of transactions in which it appears is greater than a pre-defined threshold T
Objective: Find all frequent itemsets
18. Example
TID Transaction
1 phone, adapter
2 phone, adapter, headphones, USB
3 adapter, charger, USB
4 phone, charger, USB
Support of {phone} = 3/4
Support of {phone, adapter} = 2/4
Support of {phone, adapter, USB} = 1/4
Support of {charger, USB} = 2/4
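The support values above can be checked with a few lines of Python (`itemset <= t` tests the subset relation on sets):

```python
transactions = [
    {"phone", "adapter"},
    {"phone", "adapter", "headphones", "USB"},
    {"adapter", "charger", "USB"},
    {"phone", "charger", "USB"},
]

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"phone"}))                    # 0.75
print(support({"phone", "adapter"}))         # 0.5
print(support({"phone", "adapter", "USB"}))  # 0.25
print(support({"charger", "USB"}))           # 0.5
```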
19. Association Discovery
• Brute Force Approach
– Let I be the set of items
– Then the number of possible subsets is 2^|I| - 1
– For each possible subset, check the percentage of transactions which contain it
• Not practical
– 1000 items: 2^1000 - 1 > number of atoms in the universe
20. Association Discovery
• Efficient Algorithm
• Key Observation:
– If Itemset1 is a subset of Itemset2, then support(Itemset1) ≥ support(Itemset2)
– Example:
• support({phone}) ≥ support({phone, adapter})
• How can we use this observation to design an algorithm?
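One classic answer is the Apriori idea: grow itemsets level by level and only extend those that are already frequent, since by the observation above any superset of an infrequent itemset must also be infrequent. A minimal sketch (not an optimized implementation):

```python
def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining: only extend itemsets that
    are themselves frequent, pruning supersets of infrequent ones."""
    n = len(transactions)

    def support(s):
        return sum(s <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    # Level 1: frequent single items.
    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]
    result = list(frequent)
    k = 2
    while frequent:
        # Candidates of size k: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        frequent = [c for c in candidates if support(c) >= min_support]
        result.extend(frequent)
        k += 1
    return result

transactions = [
    {"phone", "adapter"},
    {"phone", "adapter", "headphones", "USB"},
    {"adapter", "charger", "USB"},
    {"phone", "charger", "USB"},
]
for s in apriori(transactions, min_support=0.5):
    print(sorted(s))
```

On the table from slide 17 with threshold 2/4 this finds four frequent single items and four frequent pairs, and never has to look at most of the 2^6 - 1 possible itemsets.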
21. Latent Dirichlet Allocation
• Suppose you have the following sentences:
1. Technology companies include Amazon, Google, and Facebook
2. Google applications shine
3. I bought pizza online from Amazon
4. Fresh pasta and pizza are delicious
• LDA makes it possible to automatically discover
topics from sentences
– Sentences 1 & 2 – 100% Topic A
– Sentence 3 – 50% Topic A; 50% Topic B
– Sentence 4 – 100% Topic B
• Topics:
• Topic A – 50% Google, 25% Amazon, …
• Topic B – 50% pizza, 25% pasta, …