Clustering large-scale data Buzzwords 2013
 

Talk I gave at Buzzwords 2013. This should also be recorded and online somewhere.

    Clustering large-scale data Buzzwords 2013: Presentation Transcript

    • Clustering data at scale
      Dan Filimon, Politehnica University Bucharest
      Ted Dunning, MapR Technologies
      Saturday, June 1, 13
    • whoami
      • Soon-to-be graduate of Politehnica University of Bucharest
      • Apache Mahout committer
      • This work is my senior project
      • Contact me at
        • dfilimon@apache.org
        • dangeorge.filimon@gmail.com
    • Agenda
      • Data
      • Clustering
      • k-means
      • Improvements
      • Large scale
      • k-means as a map-reduce
      • streaming k-means
      • MapReduce & Storm
      • Results
      Full version at http://goo.gl/n3n8S
    • Data
      • Real-valued vectors (or anything that can be encoded as such)
      • Think of rows in a database table
      • Can be: documents, web pages, images, videos, users, DNA
    • The problem
      Group n d-dimensional datapoints into k disjoint sets to minimize
      $$\sum_{c_1}^{c_k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$
      where
      • $X_i$ is the i-th cluster
      • $c_i$ is the centroid of the i-th cluster
      • $x_{ij}$ is the j-th point from the i-th cluster
      • $\mathrm{dist}(x, y) = \lVert x - y \rVert_2$
    • k-means: points to cluster
    • k-means: initial centroids
    • k-means: first iteration assignment
    • k-means: adjusted centroids
    • k-means: second iteration assignment
    • k-means: adjusted centroids
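The slide sequence above (initial centroids, assignment, adjusted centroids, repeat) can be sketched in a few lines of Python. This is a minimal illustration of the textbook loop, not the Mahout implementation. The function names, the stopping rule (stop when no centroid moves), and the handling of empty clusters are assumptions made for the example.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids):
    """Assignment step: index of the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points]


def adjust(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = adjust(points, labels, centroids)
        if new_centroids == centroids:  # assumption: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Two small, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

On this toy input the two groups end up in separate clusters whichever two points are drawn as initial centroids. Real data is less forgiving, which is why the talk goes on to discuss initialization.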
    • Details
      • Quality
        • How to initialize the centroids
        • When to stop iterating
        • How to deal with outliers
      • Speed
        • Complexity of cluster assignment
    • Quality?
      • Clustering is in the eye of the beholder
      • Total clustering cost and:
        • compact
        • well-separated
      • Dunn Index, Davies-Bouldin Index, etc.
    • Centroid initialization
      • Important for quick convergence and quality
      • Randomly select k points as centroids
      • Clustering fails if two centroids are in the same cluster
      • k-means++ addresses this
      [Figure labels: "2 seeds here", "1 seed here"]
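To make the k-means++ remark concrete, here is a minimal sketch of the usual seeding rule: the first seed is drawn uniformly, and each later seed is drawn with probability proportional to its squared distance from the nearest seed already chosen. The function names are made up for the example, and it assumes the input has at least k distinct points. It is not taken from the Mahout code.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, seed=0):
    """Pick k initial centroids; assumes at least k distinct points."""
    rng = random.Random(seed)
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Weight of a point = squared distance to its nearest chosen seed.
        # Points that are already seeds get weight 0 and cannot be redrawn.
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = kmeans_pp_seeds(points, k=3)
```

Because far-away points carry most of the weight, two seeds landing in the same tight group becomes unlikely, though not impossible.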
    • Outliers
      • Real data is messy
      • Outliers can affect k-means centroids
      [Figure: "Effect of outliers", scatter plot of y against x, with the better centroid marked]
    • Closest cluster: reducing k
      • Avoid computing the distance from a point to every cluster
      • Random Projections
        • projection search
        • locality-sensitive hashing search
    • Random Projections
      • Unit length vectors with normally distributed components
      [Figure: plot with X Axis and Y Axis]
    • Closest cluster: reducing d
      • Principal Component Analysis
        • compute SVD
      • Random Projections
        • multiply data by projection matrix
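The two random-projection slides can be illustrated together: build a matrix whose rows are unit-length vectors with normally distributed components, then multiply each data point by it to go from d dimensions down to m. The sketch below uses plain Python lists. The function names and the sizes (d=10, m=3) are arbitrary choices for the example, not values from the talk.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw d normally distributed components and scale to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def projection_matrix(d, m, seed=0):
    """m random unit vectors of dimension d, one per row."""
    rng = random.Random(seed)
    return [random_unit_vector(d, rng) for _ in range(m)]


def project(point, matrix):
    """Multiply a d-dimensional point by the m x d matrix: one dot product per row."""
    return [sum(r * x for r, x in zip(row, point)) for row in matrix]


matrix = projection_matrix(d=10, m=3)
low_dim = project([1.0] * 10, matrix)  # a 10-dimensional point becomes 3-dimensional
```

With m = 1, sorting points by their single projected value gives a cheap candidate ordering for nearest-neighbor lookups. That is one way to read the "projection search" bullet, though the talk does not spell out the details.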
    • k-means as a MapReduce
      • Can’t split k-means loop directly
      • Must express a single k-means step as a MapReduce
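One common way to phrase a single k-means step as a MapReduce is sketched below. The mapper emits the index of the closest centroid for each point, and the reducer averages the points that share an index. This is an in-memory stand-in with made-up function names, and it is not how Mahout's job is actually wired. A driver would rerun the step with the new centroids until they stop moving.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_point(point, centroids):
    """Mapper: emit (index of the closest centroid, (point, 1))."""
    closest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return closest, (point, 1)


def reduce_cluster(values):
    """Reducer: average all points emitted under the same centroid index."""
    count = sum(n for _, n in values)
    sums = [sum(col) for col in zip(*(p for p, _ in values))]
    return tuple(s / count for s in sums)


def kmeans_step(points, centroids):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Centroids that attracted no points simply do not appear in the output.
    return {key: reduce_cluster(values) for key, values in groups.items()}


new_centroids = kmeans_step([(0, 0), (0, 2), (10, 0), (10, 2)], [(1, 1), (9, 1)])
# new_centroids == {0: (0.0, 1.0), 1: (10.0, 1.0)}
```

Each iteration costs a full pass over the data, which is the overhead the streaming approach in the following slides tries to avoid.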
    • Now for something completely different
      • Attempt to build a “sketch” of the data in one pass
      • O(k log n) intermediate clusters
      • Can fit into memory
      • Ball k-means on the sketch for k final clusters
    • streaming k-means: first point
    • streaming k-means: becomes centroid
    • streaming k-means: second point, far away
    • streaming k-means: becomes centroid (far away)
    • streaming k-means: third point, close
    • streaming k-means: centroid is updated
    • streaming k-means: close, but becomes a new centroid
    • streaming k-means
    • streaming k-means
      • for each point p, with weight w
        • find the closest centroid to p, call it c, and let d be the distance between p and c
        • if an event with probability proportional to d * w / distanceCutoff occurs
          • create a new cluster with p as its centroid
        • else, merge p into c
      • if there are too many clusters, increase distanceCutoff and cluster recursively
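The pseudocode above translates fairly directly into Python. The sketch below is an illustration under stated assumptions rather than the Mahout `StreamingKMeans` class. The probability is taken as `d * w / cutoff` compared against a uniform random number. The cutoff grows by a fixed factor `beta` (a made-up parameter name and value). The "cluster recursively" step is written as a loop that re-feeds the weighted centroids through the same routine.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def add_point(centroids, p, w, cutoff, rng):
    """Either start a new cluster at p or merge p into its closest centroid."""
    if not centroids:
        centroids.append([list(p), w])  # the first point becomes a centroid
        return
    c = min(centroids, key=lambda other: dist(p, other[0]))
    d = dist(p, c[0])
    if rng.random() < d * w / cutoff:
        centroids.append([list(p), w])  # far (or unlucky): new centroid
    else:
        total = c[1] + w  # merge: weighted mean of c and p
        c[0] = [(ci * c[1] + pi * w) / total for ci, pi in zip(c[0], p)]
        c[1] = total


def streaming_kmeans(points, max_clusters, cutoff=1.0, beta=1.5, seed=0):
    """One pass over the points; returns the sketch and the final cutoff."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, cutoff, rng)
        while len(centroids) > max_clusters:
            # Too many clusters: raise the cutoff and re-cluster the centroids.
            cutoff *= beta
            old, centroids = centroids, []
            for vec, weight in old:
                add_point(centroids, vec, weight, cutoff, rng)
    return centroids, cutoff


data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (10, 10)] for _ in range(100)]
sketch, final_cutoff = streaming_kmeans(points, max_clusters=20)
```

The sketch is a list of `[centroid, weight]` pairs whose weights add up to the number of points seen. A final weighted clustering (ball k-means in the talk) would then reduce it to k clusters.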
    • The big picture: MapReduce
      • Can cluster all the points with just 1 MapReduce:
        • m mappers run streaming k-means
        • 1 reducer runs ball k-means to get k clusters
    • The big picture: Storm
      • Storm streaming topology
        • streaming k-means bolt
        • Release the sketch when notified (e.g. tick tuples)
      • Trident
        • streaming k-means partition aggregator
        • ball k-means aggregator
    • Results
    • Quality
      • Compared the quality on various small-medium UCI datasets
        • iris, seeds, movement, control, power
      • Computed the following quality measures:
        • Dunn Index (higher is better)
        • Davies-Bouldin Index (lower is better)
        • Adjusted Rand Index (higher is better)
        • Total cost (lower is better)
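As an example of one of these measures, here is a small sketch of the Davies-Bouldin index in its common form. For each cluster, take the worst ratio of (scatter of i + scatter of j) to the distance between centroids i and j, then average over clusters. Scatter is taken here as the mean distance of a cluster's points to its centroid. Variants of the definition exist, so treat this as an assumption rather than the exact formula behind the numbers in the talk.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def davies_bouldin(clusters):
    """Lower is better; needs at least two clusters with distinct centroids."""
    centroids = [centroid(c) for c in clusters]
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


clusters = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
db = davies_bouldin(clusters)  # scatter 1 and 1, centroid distance 4: (1 + 1) / 4 = 0.5
```

Compact clusters (small scatter) that sit far apart (large centroid distance) push the value down, which matches the "compact, well-separated" goal stated earlier.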
    • iris randplot-2
      [Figure: confusion matrix, run 2. Adjusted Rand Index: 1. Axes: Ball/Streaming k-means centroids against Mahout k-means centroids.]
    • iris randplot-4
      [Figure: confusion matrix, run 4. Adjusted Rand Index: 0.546899. Axes: Ball/Streaming k-means centroids against Mahout k-means centroids.]
    • seeds compareplot
      [Figure: "Clustering techniques compared all". x axis: clustering type / overall mean cluster distance (bskm 4.3116448, km 4.22126213333333). y axis: mean cluster distance.]
    • movement compareplot
      [Figure: "km vs bskm". x axis: cluster index/run. y axis: average cluster distance. Series: km distances, bskm distances, km mean, bskm mean. Mean lines labeled 1.403378 and 1.50878.]
    • control randplot-3
      [Figure: confusion matrix, run 3. Adjusted Rand Index: 0.724136. Axes: Ball/Streaming k-means centroids against Mahout k-means centroids.]
    • power allplot
      [Figure: "Clustering techniques compared all". x axis: clustering type / overall mean cluster distance (bskm 142.101769766667, km 149.3591672). y axis: mean cluster distance.]
    • power compareplot
      [Figure: "km vs bskm". x axis: cluster index/run. y axis: average cluster distance. Series: km distances, bskm distances, km mean, bskm mean. Mean lines labeled 149.3592 and 142.1018.]
    • Overall (avg. 5 runs)
      Dataset    Clustering  Avg. Dunn  Avg. DB  Avg. Cost  Avg. ARI
      iris       km          9.161      0.265    124.146    0.905
      iris       bskm        6.454      0.336    117.859    0.905
      seeds      km          7.432      0.453    909.875    0.980
      seeds      bskm        6.886      0.505    916.511    0.980
      movement   km          0.457      1.843    336.456    0.650
      movement   bskm        0.436      2.003    347.078    0.650
      control    km          0.553      1.700    1014313    0.630
      control    bskm        0.753      1.434    1004917    0.630
      power      km          0.107      1.380    73339083   0.605
      power      bskm        1.953      1.080    54422758   0.605
    • Speed (Threaded)
    • Speed (MapReduce)
      [Figure: chart for 20 clusters, 1 iteration, KMeans against StreamingKMeans. Annotations: "iteration times comparable", "~2.4x faster".]
    • Speed (MapReduce)
      [Figure: chart for 1000 clusters, 1 iteration, KMeans against StreamingKMeans. Annotations: "benefits from approximate nearest-neighbor search", "~8x faster".]
    • Code
      • Now available in Apache Mahout trunk
      • Prototype for Storm: http://github.com/dfilimon/streaming-storm
      Clustering algorithms:         BallKMeans, StreamingKMeans
      Fast nearest-neighbor search:  ProjectionSearch
      Quality metrics:               ClusteringUtils
      MapReduce classes:             StreamingKMeansMapper, StreamingKMeansReducer
      Storm classes:                 StreamingKMeansBolt
    • Thank you! Questions?
      dfilimon@apache.org
      dangeorge.filimon@gmail.com