k-means++: The Advantages of Careful Seeding (Presentation Transcript)

    • K-means++: The Advantages of Careful Seeding. Sergei Vassilvitskii, David Arthur (Stanford University).
    • Clustering: given $n$ points in $\mathbb{R}^d$, split them into $k$ similar groups.
    • This talk, k-means clustering: find $k$ centers $C$ that minimize $\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$.
    • Why Means? Objective: find $k$ centers $C$ that minimize $\sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$. For one cluster: find $y$ that minimizes $\sum_{x \in X} \|x - y\|_2^2$. Easy! $y = \frac{1}{|X|} \sum_{x \in X} x$.
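A one-line check of the "Easy!" claim (my own filling-in, not on the slides): writing $\bar{X} = \frac{1}{|X|}\sum_{x \in X} x$ and expanding around the mean,

\[
\sum_{x \in X} \|x - y\|_2^2 \;=\; \sum_{x \in X} \|x - \bar{X}\|_2^2 \;+\; |X|\,\|\bar{X} - y\|_2^2,
\]

because the cross term $2(\bar{X} - y)^\top \sum_{x \in X}(x - \bar{X})$ vanishes; the right-hand side is minimized exactly at $y = \bar{X}$.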
    • Lloyd’s Method: k-means Initialize with random clusters
    • Lloyd’s Method: k-means Assign each point to nearest center
    • Lloyd’s Method: k-means Recompute optimum centers (means)
    • Lloyd’s Method: k-means Repeat: Assign points to nearest center
    • Lloyd’s Method: k-means Repeat: Recompute centers
    • Lloyd’s Method: k-means Repeat...
    • Lloyd’s Method: k-means Repeat...Until clustering does not change
    • Analysis: How good is this algorithm? It finds a local optimum, which can be arbitrarily worse than the optimal solution.
    • Approximating k-means: Mount et al. give a $(9 + \epsilon)$-approximation in time $O(n^3 / \epsilon^d)$; Har-Peled et al. a $(1 + \epsilon)$-approximation in time $O(n + k^{k+2} \epsilon^{-2dk} \log^k(n/\epsilon))$; Kumar et al. a $(1 + \epsilon)$-approximation in time $2^{(k/\epsilon)^{O(1)}} nd$.
    • Lloyd's method: worst-case time complexity $2^{\Omega(\sqrt{n})}$, smoothed complexity $n^{O(k)}$.
    • For example, on the Digit Recognition dataset (UCI), with n = 60,000 and d = 600, Lloyd's method converges to a local optimum in 60 iterations.
    • Challenge: Develop an approximation algorithm for k-means clustering that is competitive with the k-means method in speed and solution quality. Easiest line of attack: focus on the initial center positions. Classical k-means: pick $k$ points at random.
    • k-means on Gaussians
    • Easy Fix Select centers using a furthest point algorithm (2-approximation to k-Center clustering).
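A sketch of the furthest-point seeding ("Easy Fix") just described, i.e. the standard greedy 2-approximation heuristic for k-Center; the function name and the random choice of the first center are my own assumptions:

```python
import numpy as np

def furthest_point_seeding(X, k, seed=0):
    """Greedy k-Center seeding: each new center is the point furthest
    from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]            # first center at random
    # d2[i] = squared distance from X[i] to its nearest chosen center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = X[d2.argmax()]                       # furthest remaining point
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(centers)
```

As the next slide notes, this seeding is sensitive to outliers, since the furthest point is often an outlier.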
    • Sensitive to Outliers
    • k-means++: Interpolate between the two methods. Let $D(x)$ be the distance between $x$ and the nearest cluster center; $D^2(x)$ is the contribution of $x$ to the overall error. Sample $x$ proportionally to $(D(x))^\alpha = D^\alpha(x)$: $\alpha = 0$ is the original Lloyd's seeding, $\alpha = \infty$ is furthest point, and $\alpha = 2$ is k-means++.
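A minimal sketch of the $D^2$ sampling seeding just described ($\alpha = 2$); the names are mine, and the resulting centers would simply replace the random initialization in the Lloyd's sketch above:

```python
import numpy as np

def kmeanspp_seeding(X, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to D(x)^2, the squared distance from x to the
    nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # first center uniformly at random
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        probs = d2 / d2.sum()                  # the D^2 distribution
        idx = rng.choice(len(X), p=probs)
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```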
    • k-Means++
    • k-Means++ Theorem: k-means++ is $\Theta(\log k)$-approximate in expectation. Ostrovsky et al. [06]: a similar method is $O(1)$-approximate under some data-distribution assumptions.
    • Proof, 1st cluster: Fix an optimal clustering $C^*$. Pick the first center uniformly at random and bound the total error of that cluster.
    • Proof, 1st cluster: Let $A$ be the cluster; each point $a_0 \in A$ is equally likely to be the chosen center. Expected error: $E[\phi(A)] = \sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2 = 2 \sum_{a \in A} \|a - \bar{A}\|^2 = 2\phi^*(A)$.
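The middle equality can be filled in with the same expansion around the centroid $\bar{A}$ (my own spelled-out step, not on the slide):

\[
\sum_{a_0 \in A} \frac{1}{|A|} \sum_{a \in A} \|a - a_0\|^2
\;=\; \frac{1}{|A|} \sum_{a_0 \in A} \Bigl( \sum_{a \in A} \|a - \bar{A}\|^2 + |A|\,\|\bar{A} - a_0\|^2 \Bigr)
\;=\; 2 \sum_{a \in A} \|a - \bar{A}\|^2 \;=\; 2\phi^*(A).
\]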
    • Proof, other clusters: Suppose the next center comes from a new cluster in OPT. Bound the total error of that cluster.
    • Other clusters: Let $B$ be this cluster and $b_0$ the point selected. Then $E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2$. Key step (triangle inequality): $D(b_0) \le D(b) + \|b - b_0\|$.
    • Cont.: For any $b$, $D^2(b_0) \le 2 D^2(b) + 2 \|b - b_0\|^2$. Averaging over all $b$: $D^2(b_0) \le \frac{2}{|B|} \sum_{b \in B} D^2(b) + \frac{2}{|B|} \sum_{b \in B} \|b - b_0\|^2$; the second term, the same for all $b_0$, is the cost under uniform sampling.
    • Recall $E[\phi(B)] = \sum_{b_0 \in B} \frac{D^2(b_0)}{\sum_{b \in B} D^2(b)} \cdot \sum_{b \in B} \min(D(b), \|b - b_0\|)^2$. Substituting the bound gives $E[\phi(B)] \le \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 = 8\phi^*(B)$.
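Spelling out the substitution (my own filling-in): bound $\min(D(b), \|b - b_0\|)^2$ by $\|b - b_0\|^2$ against the first term of the bound on $D^2(b_0)$ and by $D^2(b)$ against the second, so that

\[
E[\phi(B)] \;\le\; \sum_{b_0 \in B} \frac{\tfrac{2}{|B|} \sum_{b} D^2(b)}{\sum_{b} D^2(b)} \sum_{b \in B} \|b - b_0\|^2
\;+\; \sum_{b_0 \in B} \frac{\tfrac{2}{|B|} \sum_{b} \|b - b_0\|^2}{\sum_{b} D^2(b)} \sum_{b \in B} D^2(b)
\;=\; \frac{4}{|B|} \sum_{b_0 \in B} \sum_{b \in B} \|b - b_0\|^2 \;=\; 8\phi^*(B),
\]

where the last equality reuses the $2\phi^*$ identity from the first-cluster step.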
    • Wrap Up: If clusters are well separated, and we always pick a center from a new optimal cluster, the algorithm is 8-competitive.
    • Intuition: if no points from a cluster are picked, then it probably does not contribute much to the overall error.
    • Formally, an inductive proof shows this method is $\Theta(\log k)$-competitive.
    • Experiments: Tested on several datasets. Synthetic: 10k points, 3 dimensions. Cloud Cover (UCI Repository): 10k points, 54 dimensions. Color Quantization: 16k points, 16 dimensions. Intrusion Detection (KDD Cup): 500k points, 35 dimensions.
    • Typical Run: KM++ vs. KM vs. KM-Hybrid (plot of error against stage for LLOYD, HYBRID, and KM++).
    • Experiments, total error (k-means / km-Hybrid / k-means++):
      Synthetic: 6.02 × 10^5 / 5.95 × 10^5 / 6.06 × 10^5
      Cloud Cover: 670 / 741 / 712
      Color: 0.016 / 0.015 / 0.014
      Intrusion: 32.9 × 10^3 / 3.4 × 10^3 / −
      Time: k-means++ about 1% slower due to initialization.
    • Final Message: Friends don't let friends use k-means.
    • Thank You. Any Questions?