Fast and Accurate K-means for Large Datasets #nipsereading
 

    Presentation Transcript

    • Fast and Accurate K-means for Large Datasets
      Michael Shindler, Alex Wong, Adam Meyerson
      Presenter: Yoh Okuno #nipsreading
    • About Presenter
      • Name: Yoh Okuno
      • R&D Engineer at Yahoo! Japan
      • Interests: NLP (Natural Language Processing), Machine Learning, and Data Mining
      • Skills: C/C++, Java, Python, and Hadoop
      • Website: http://yoh.okuno.name/
    • Overview
      1. Recent Advances in K-means Clustering
        – Batch versus Streaming Settings
        – Related Work and Our Contribution
      2. Algorithm for Large-Scale K-means Clustering
        – Streaming + Mini-Batch + Smart Initialization
      3. Incorporating Approximate Nearest Neighbor Search
        – Based on Random Projection (Hashing)
      4. Evaluation and Discussion
    • 1. Recent Advances in K-means Clustering
    • Review of the Standard K-means Clustering
      • Iteratively minimize the cost function below:
        $$\min_{z,\,\mu} \; \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert^2$$
        where x_i is the i-th data point, z_i is the cluster number assigned to x_i, and μ_j is the centroid of the j-th cluster.
      1. Update z with μ fixed (assign cluster numbers)
      2. Update μ with z fixed (calculate averages)
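The two alternating steps on the slide above can be sketched as follows. This is a minimal illustration of the standard batch iteration, not the presenters' code: the function names (`assign`, `update`, `cost`, `kmeans`), the random choice of initial centroids, and the fixed iteration cap are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centroids):
    """Step 1: update z with mu fixed (nearest centroid index per point)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))
        for x in points
    ]

def update(points, z, centroids):
    """Step 2: update mu with z fixed (mean of the points in each cluster)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [x for x, zi in zip(points, z) if zi == j]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        new_centroids.append(
            [sum(x[d] for x in members) / len(members) for d in range(len(old))]
        )
    return new_centroids

def cost(points, z, centroids):
    """The objective from the slide: sum of squared distances to assigned centroids."""
    return sum(squared_distance(x, centroids[zi]) for x, zi in zip(points, z))

def kmeans(points, k, iterations=20, seed=0):
    """Alternate the two steps until assignments stop changing."""
    rng = random.Random(seed)
    centroids = [list(x) for x in rng.sample(points, k)]  # hypothetical initialization
    z = assign(points, centroids)
    for _ in range(iterations):
        centroids = update(points, z, centroids)
        new_z = assign(points, centroids)
        if new_z == z:
            break
        z = new_z
    return z, centroids
```

Each step can only lower or preserve the cost, which is why the loop stops once the assignments no longer change.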
    • Related Work and Our Contributions
      • The standard batch algorithm [Lloyd 1982]
      • Streaming approaches [Aggarwal 2007]
      • Mini-batch approaches [Sculley 2010]
      • Our work is based on a recent streaming approach [Braverman+ 2011]
      • Incorporated approximate nearest neighbor search
    • 2. Algorithm for Large-Scale K-means Clustering
    • Initialize, Streaming, Mini-Batch
    • Initialize clusters
      • Create clusters until the buffer is full
        – Run nearest neighbor search on each new data point
        – Add a new cluster at random, with a probability that depends on the point's distance to its nearest cluster
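The initialization described above can be sketched roughly as follows. The slide does not give the exact probability rule, so this example assumes the one used in online facility location: a new cluster is opened with probability min(1, d²/f), where d² is the squared distance to the nearest existing cluster and `facility_cost` (f) is a hypothetical tuning parameter. The names `fill_buffer` and `buffer_size` are likewise illustrative.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(x, centers):
    """Exact nearest neighbor search: (index, squared distance) of the closest center."""
    best = min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))
    return best, squared_distance(x, centers[best])

def fill_buffer(points, facility_cost, buffer_size, seed=0):
    """Consume points until `buffer_size` clusters exist or the input ends.

    Returns the cluster centers, the number of points absorbed by each
    center, and how many input points were consumed.
    """
    rng = random.Random(seed)
    centers, weights = [], []
    consumed = 0
    for x in points:
        if len(centers) >= buffer_size:
            break  # buffer is full; the streaming phase would take over here
        consumed += 1
        if not centers:
            centers.append(list(x))  # the first point always opens a cluster
            weights.append(1)
            continue
        j, d2 = nearest(x, centers)
        # Assumed rule: far-away points are more likely to open a new cluster.
        if rng.random() < min(1.0, d2 / facility_cost):
            centers.append(list(x))
            weights.append(1)
        else:
            weights[j] += 1  # otherwise the point is absorbed by its nearest cluster
    return centers, weights, consumed
```

The weights matter later: the final clustering step runs on these weighted centers rather than on the raw points.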
    • Streaming K-means Clustering
      • Update clusters randomly in the same way (same as the previous slide)
    • Ball k-means on weighted points
      • Run ball k-means on weighted points [Braverman+ 2011] [Ostrovsky+ 2006]
    • 3. Incorporating Approximate Nearest Neighbor Search
    • Bottleneck: nearest neighbor search among points
    • Approximate Nearest Neighbor Search
      • Use a simple random projection
      1. Choose ω ∈ R^d with each entry drawn uniformly at random from [0, 1)
      2. Calculate the inner product of ω with each cluster center
      3. Given a query x, calculate the inner product x・ω
      4. Find the cluster nearest to x by comparing these inner products
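The four steps above can be sketched as a small index over the cluster centers. This is an illustration under assumptions, not the authors' implementation: the class name `ProjectionIndex`, the sorted-array layout, and the binary search over projected values are choices made for the example.

```python
import bisect
import random

def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

class ProjectionIndex:
    """Approximate nearest-center lookup via a single random projection."""

    def __init__(self, centers, seed=0):
        rng = random.Random(seed)
        dim = len(centers[0])
        # Step 1: omega in R^d with entries drawn uniformly from [0, 1).
        self.omega = [rng.random() for _ in range(dim)]
        # Step 2: project every center onto omega and keep them sorted.
        projected = sorted((dot(c, self.omega), j) for j, c in enumerate(centers))
        self.keys = [p for p, _ in projected]
        self.ids = [j for _, j in projected]

    def query(self, x):
        """Return the index of the center whose projection is closest to x's."""
        # Step 3: project the query.
        q = dot(x, self.omega)
        # Step 4: binary search, then compare the two neighboring projections.
        pos = bisect.bisect_left(self.keys, q)
        candidates = [i for i in (pos - 1, pos) if 0 <= i < len(self.keys)]
        best = min(candidates, key=lambda i: abs(self.keys[i] - q))
        return self.ids[best]
```

A lookup costs one inner product plus a binary search instead of a distance computation per center. The answer is only approximate: two points with similar projections can still be far apart in the full d-dimensional space.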
    • 4. Evaluation and Discussion
    • Datasets
      • BigCross dataset:
        – Size: 11 million points in 55 dimensions
      • Census 1990: national survey
        – Size: 2 million points in 68 dimensions
      • Environment: C++ / Ubuntu / 2.9 GHz / 6 GB
    • Note: lower cost is better
    • Note: lower time is better
    • Conclusion
      • Proposed a fast, accurate k-means clustering method based on a streaming algorithm
      • Incorporated approximate nearest neighbor search into the proposed algorithm
      • Performs well both in practice and in theory
    • References
      • [Lloyd 1982] Least Squares Quantization in PCM. IEEE Transactions on Information Theory.
      • [Aggarwal 2007] Data Streams: Models and Algorithms. Springer.
      • [Braverman+ 2011] Streaming K-means on Well-Clusterable Data. SODA.
      • [Ackermann+ 2010] StreamKM++: A Clustering Algorithm for Data Streams. ALENEX.
      • [Sculley 2010] Web-Scale K-means Clustering. WWW.
    • Any Questions?