Clustering large-scale data TU Berlin talk
 


Talk I gave at TU Berlin for Sebastian Schelter's data mining class.


    Presentation Transcript

    • Clustering large-scale data
      Dan Filimon, Politehnica University Bucharest
      Ted Dunning, MapR Technologies
      Buzzwords talk at http://goo.gl/9FxbN
      Saturday, June 1, 13
    • whoami
      • Soon-to-be graduate of Politehnica University of Bucharest
      • Apache Mahout committer
      • Contact me at
        • dfilimon@apache.org
        • dangeorge.filimon@gmail.com
    • The problem
      Group n d-dimensional data points into k disjoint sets to minimize
      $$\sum_{i=1}^{k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$
      where
      • $X_i$ is the i-th cluster
      • $c_i$ is the centroid of the i-th cluster
      • $x_{ij}$ is the j-th point from the i-th cluster
      • $\mathrm{dist}(x, y) = \lVert x - y \rVert_2$
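The objective on this slide can be evaluated directly for a given grouping. The following is a minimal sketch, assuming `dist` is the plain Euclidean distance and each centroid is the coordinate-wise mean of its cluster; the function names and the toy data are illustrative, not from the talk.

```python
import math

def dist(x, y):
    """Euclidean distance ||x - y||_2 between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def clustering_cost(clusters):
    """Sum over clusters X_i of dist(x_ij, c_i) for every point x_ij in X_i."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(x, c) for x in points)
    return total

# Hypothetical toy data: two clusters in 2 dimensions.
clusters = [
    [[0.0, 0.0], [0.0, 2.0]],    # centroid (0, 1), each point at distance 1
    [[10.0, 0.0], [14.0, 0.0]],  # centroid (12, 0), each point at distance 2
]
print(clustering_cost(clusters))  # 6.0
```

A lower cost means tighter clusters; k-means searches for centroids that make this sum small.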
    • k-means: points to cluster
    • k-means: initial centroids
    • k-means: first iteration assignment
    • k-means: adjusted centroids
    • k-means: second iteration assignment
    • k-means: adjusted centroids
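The sequence of slides above (assign points, adjust centroids, repeat) can be sketched as a small loop. This is a minimal illustration of the standard k-means iteration, assuming Euclidean distance and a "stop when the assignment no longer changes" rule; the function names, the stopping rule and the toy points are assumptions, not taken from the talk.

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points]

def adjust(points, assignment, centroids):
    """Move each centroid to the mean of its assigned points (keep it if empty)."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            new_centroids.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_centroids.append(list(old))
    return new_centroids

def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and adjustment until the assignment stops changing."""
    assignment = None
    for _ in range(max_iterations):
        new_assignment = assign(points, centroids)
        if new_assignment == assignment:
            break
        assignment = new_assignment
        centroids = adjust(points, assignment, centroids)
    return centroids, assignment

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
centroids, assignment = kmeans(points, [[0.0, 0.0], [10.0, 0.0]])
print(centroids)   # [[0.0, 0.5], [10.0, 0.5]]
print(assignment)  # [0, 0, 1, 1]
```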
    • Details
      • Quality
        • How to initialize the centroids
        • When to stop iterating
        • How to deal with outliers
      • Speed
        • Complexity of cluster assignment
    • k-means as a MapReduce
      • Can’t split k-means loop directly
      • Must express a single k-means step as a MapReduce
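One way to read "a single k-means step as a MapReduce" is: the mapper emits each point keyed by its nearest centroid, and the reducer averages the points for each key. The sketch below simulates that in plain Python, with no Hadoop involved; `map_point`, `reduce_cluster` and `kmeans_step` are hypothetical names, and a driver would have to call `kmeans_step` once per iteration.

```python
import math
from collections import defaultdict

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def map_point(point, centroids):
    """Mapper: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)

def reduce_cluster(values):
    """Reducer: average all points emitted for one centroid index."""
    count = sum(n for _, n in values)
    sums = [sum(coords) for coords in zip(*(p for p, _ in values))]
    return [s / count for s in sums]

def kmeans_step(points, centroids):
    """One MapReduce-style step: map, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Centroids that received no points are kept unchanged.
    return [reduce_cluster(groups[i]) if i in groups else list(c)
            for i, c in enumerate(centroids)]

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
new_centroids = kmeans_step(points, [[0.0, 0.0], [10.0, 0.0]])
print(new_centroids)  # [[0.0, 0.5], [10.0, 0.5]]
```

Because every iteration needs the centroids produced by the previous one, the loop itself stays in the driver, which is why each iteration costs a full pass over the data.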
    • Now for something completely different
      • Attempt to build a “sketch” of the data in one pass
      • O(k log n) intermediate clusters
      • Can fit into memory
      • Ball k-means on the sketch for k final clusters
    • streaming k-means: first point
    • streaming k-means: becomes centroid
    • streaming k-means: second point, far away
    • streaming k-means: becomes centroid (far away)
    • streaming k-means: third point, close
    • streaming k-means: centroid is updated
    • streaming k-means: close, but becomes a new centroid
    • streaming k-means
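The behaviour shown on these slides (far points become centroids, close points usually update a centroid but occasionally start a new one) can be sketched as a one-pass procedure. This is a simplified illustration, not the Mahout implementation: the rule "become a new centroid with probability min(1, distance / cutoff)", the parameters `cutoff`, `beta` and `max_centroids`, and the collapse step that re-clusters the sketch when it grows too large are all assumptions made for the example.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def add_point(centroids, point, weight, cutoff, rng):
    """Merge a weighted point into the sketch or start a new centroid."""
    if not centroids:
        centroids.append((list(point), weight))
        return
    # Linear scan for the nearest centroid (kept simple for the sketch).
    i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j][0]))
    center, w = centroids[i]
    d = dist(point, center)
    # Far points almost always become centroids; close ones only occasionally.
    if rng.random() < min(1.0, d / cutoff):
        centroids.append((list(point), weight))
    else:
        total = w + weight
        merged = [(c * w + p * weight) / total for c, p in zip(center, point)]
        centroids[i] = (merged, total)

def streaming_kmeans(points, max_centroids, cutoff=1.0, beta=1.5, seed=0):
    """Single pass over the points; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    centroids = []
    for point in points:
        add_point(centroids, point, 1.0, cutoff, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the cutoff and re-cluster the sketch itself.
            cutoff *= beta
            old, centroids = centroids, []
            for center, w in old:
                add_point(centroids, center, w, cutoff, rng)
    return centroids

# Hypothetical data: two Gaussian blobs, shuffled to imitate a stream.
data_rng = random.Random(42)
points = [[data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5)]
          for cx in (0.0, 20.0) for _ in range(200)]
data_rng.shuffle(points)
sketch = streaming_kmeans(points, max_centroids=20)
print(len(sketch), sum(w for _, w in sketch))
```

The weights record how many original points each sketch centroid stands for, so a later clustering step can treat the sketch as a small weighted dataset.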
    • The big picture: MapReduce
      • Can cluster all the points with just 1 MapReduce:
        • m mappers run streaming k-means
        • 1 reducer runs ball k-means to get k clusters
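The reducer side of this picture can be sketched as follows: the weighted centroids produced by the mappers are pooled and clustered once more. The slides do not spell out ball k-means, so this sketch assumes the common formulation in which each centroid is recomputed only from the points inside a ball around it, with a radius equal to a fraction of the distance to the nearest other centroid. `trim_fraction`, the fixed iteration count and the toy sketches are hypothetical, and the code needs at least two centroids.

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_mean(items):
    """Weighted mean of a non-empty list of (point, weight) pairs."""
    total = sum(w for _, w in items)
    dim = len(items[0][0])
    return [sum(p[d] * w for p, w in items) / total for d in range(dim)]

def ball_kmeans(weighted_points, centroids, trim_fraction=0.9, iterations=10):
    """k-means on weighted points that ignores points outside each centroid's ball."""
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for point, weight in weighted_points:
            i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
            # Ball radius: a fraction of the distance to the nearest other centroid.
            radius = trim_fraction * min(
                dist(centroids[i], c) for j, c in enumerate(centroids) if j != i)
            if dist(point, centroids[i]) <= radius:
                members[i].append((point, weight))
        # Centroids whose ball is empty stay where they are.
        centroids = [weighted_mean(m) if m else c
                     for m, c in zip(members, centroids)]
    return centroids

# Hypothetical output of m = 2 mappers: lists of (centroid, weight) pairs.
mapper_sketches = [
    [([0.0, 0.0], 3.0), ([10.0, 0.0], 2.0)],
    [([0.0, 1.0], 1.0), ([10.0, 2.0], 2.0), ([4.0, 50.0], 1.0)],  # last one is an outlier
]
pooled = [item for sketch in mapper_sketches for item in sketch]
final = ball_kmeans(pooled, [[0.0, 0.0], [10.0, 0.0]])
print(final)  # [[0.0, 0.25], [10.0, 1.0]]
```

In this toy run the outlier at (4, 50) falls outside every ball and therefore does not pull any final centroid towards it.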
    • Quality
      • Compared the quality on various small-to-medium UCI datasets
        • iris, seeds, movement, control, power
      • k-means and streaming k-means + ball k-means are comparable quality-wise
    • iris randplot-2 [figure: "Confusion matrix run 2", Adjusted Rand Index 1; axes: Ball/Streaming k-means centroids vs. Mahout k-means centroids]
    • seeds compareplot [figure: "Clustering techniques compared all"; x-axis: clustering type / overall mean cluster distance (bskm, km; 4.3116448, 4.22126213333333); y-axis: mean cluster distance]
    • movement compareplot [figure: "km vs bskm"; x-axis: cluster index/run; y-axis: average cluster distance; legend: km distances, bskm distances, km mean, bskm mean; annotated values 1.403378 and 1.50878]
    • Overall (avg. 5 runs)
      Dataset   Clustering  Avg. Dunn  Avg. DB  Avg. Cost  Avg. ARI
      iris      km          9.161      0.265    124.146    0.905
      iris      bskm        6.454      0.336    117.859    0.905
      seeds     km          7.432      0.453    909.875    0.980
      seeds     bskm        6.886      0.505    916.511    0.980
      movement  km          0.457      1.843    336.456    0.650
      movement  bskm        0.436      2.003    347.078    0.650
      control   km          0.553      1.700    1014313    0.630
      control   bskm        0.753      1.434    1004917    0.630
      power     km          0.107      1.380    73339083   0.605
      power     bskm        1.953      1.080    54422758   0.605
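The ARI column refers to the Adjusted Rand Index, which measures how well two clusterings of the same points agree (1 means identical up to relabelling, values near 0 mean chance-level agreement). The sketch below uses the standard pair-counting formula; it is an illustration with made-up labels, not the script used to produce the table.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency table n_ij
    rows = Counter(labels_a)                   # row sums a_i
    cols = Counter(labels_b)                   # column sums b_j
    sum_pairs = sum(comb(v, 2) for v in pairs.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        # Degenerate case (e.g. both labelings put everything in one cluster).
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)

# Same grouping with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# Groupings that cut across each other: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index ignores label names, it suits comparing k-means output with streaming + ball k-means output, where cluster numbering is arbitrary.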
    • Speed (Threaded)
    • Speed (MapReduce) [bar chart: KMeans vs StreamingKMeans, 20 clusters, 1 iteration; annotations: "iteration times comparable", "~2.4x faster"]
    • Speed (MapReduce) [bar chart: KMeans vs StreamingKMeans, 1000 clusters, 1 iteration; annotations: "benefits from approximate nearest-neighbor search", "~8x faster"]
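The last chart credits approximate nearest-neighbor search for the speedup at 1000 clusters. The slides do not say which search structure was used, so the following is only one possible illustration: points are sorted by their projection onto a random direction, and a query compares true distances only against a small window of neighbours in that ordering. `ProjectionSearch`, `window` and the test data are hypothetical.

```python
import bisect
import math
import random

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

class ProjectionSearch:
    """Approximate nearest-neighbor search using one random projection."""

    def __init__(self, points, window=4, seed=0):
        rng = random.Random(seed)
        direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
        norm = math.sqrt(sum(v * v for v in direction))
        self.direction = [v / norm for v in direction]
        self.window = window
        # Points sorted by their scalar projection onto the random direction.
        self.entries = sorted((self._project(p), p) for p in points)
        self.keys = [key for key, _ in self.entries]

    def _project(self, point):
        return sum(a * b for a, b in zip(point, self.direction))

    def nearest(self, query):
        """Best candidate among the points whose projections are closest."""
        pos = bisect.bisect_left(self.keys, self._project(query))
        lo = max(0, pos - self.window)
        hi = min(len(self.entries), pos + self.window)
        candidates = [p for _, p in self.entries[lo:hi]]
        return min(candidates, key=lambda p: dist(query, p))

data_rng = random.Random(1)
points = [[data_rng.uniform(0, 100) for _ in range(5)] for _ in range(1000)]
query = [50.0] * 5
approx = ProjectionSearch(points, window=50).nearest(query)
exact = min(points, key=lambda p: dist(query, p))
print(dist(query, approx), dist(query, exact))
```

Each query costs a binary search plus at most `2 * window` distance computations instead of one per centroid, at the price of sometimes missing the true nearest neighbour; a wider window trades speed for accuracy.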
    • My take: Contribute to open source!
    • My take: Respect language conventions!
    • My take: Learn math! Statistics and Linear Algebra especially
    • My take: Write tests! They let you know when you break things
    • My take: Profile your code! Especially when it’s oddly slow :)
    • My take: Use source control effectively! Merge often, have multiple branches
    • My take: Microbenchmarks are tricky in Java. Use frameworks like Caliper
    • My take: Ask lots of questions! Don’t be afraid of saying something stupid
    • Thank you! Questions?
      • dfilimon@apache.org
      • dangeorge.filimon@gmail.com