Boston hug-2012-07

Describes the state of Apache Mahout with special focus on the upcoming k-nearest neighbor and k-means clustering algorithms.

Boston hug-2012-07 Presentation Transcript

  • 1. Mahout, New and Improved
    Now with Super Fast Clustering
    ©MapR Technologies - Confidential
  • 2. Agenda
    What happened in Mahout 0.7
    – less bloat
    – simpler structure
    – general cleanup
  • 3. To Cut Out Bloat
  • 4. (image slide)
  • 5. Bloat is Leaving in 0.7
    Lots of abandoned code in Mahout
    – average code quality is poor
    – no users
    – no maintainers
    – why do we care?
    Examples
    – old LDA
    – old Naïve Bayes
    – genetic algorithms
    If you care, get on the mailing list
  • 6. Bloat is Leaving in 0.7 (continued)
    If you care, get on the mailing list
    – oops, too late since 0.7 is already released
  • 7. Integration of Collections
  • 8. Nobody Cares about Collections
    We need it; math is built on it
    Pull it into math
    Broke the build (battle of the code expanders)
    Fixed now (thanks to Grant)
  • 9. Pig Vector
  • 10. What is it?
    Supports access to Mahout functionality from Pig
    So far: text vectorization, classification, and model saving
  • 11. What is it? (continued)
    Kind of works (see pigML from Twitter for better function)
  • 12. Compile and Install
    Start by compiling and installing Mahout in your local repository:
      cd ~/Apache
      git clone https://github.com/apache/mahout.git
      cd mahout
      mvn install -DskipTests
    Then do the same with pig-vector:
      cd ~/Apache
      git clone git@github.com:tdunning/pig-vector.git
      cd pig-vector
      mvn package
  • 13. Tokenize and Vectorize Text
    Tokenization is done using a text encoder, which takes:
    – the dimension of the resulting vectors (typically 100,000–1,000,000)
    – a description of the variables to be included in the encoding
    – the schema of the tuples that Pig will pass, together with their data types
    Example:
      define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:word, z:text');
    You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
  • 14. The Formula
    Not normal arithmetic
    Describes which variables to use and whether an offset is included
    Also describes which interactions to use
  • 15. The Formula (continued)
    Also describes which interactions to use
    – but that doesn’t do anything yet!
  • 16. Load and Encode Data
    Load the data:
      a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',') as (x1:int, x2:int, x3:int);
    And encode it:
      b = foreach a generate 1 as key, EncodeVector(*) as v;
    Note that the true meaning of * is very subtle
    Now store it:
      store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
        '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
        '-c com.twitter.elephantbird.pig.util.GenericWritableConverter -t org.apache.mahout.math.VectorWritable');
  • 17. Train a Model
    Pass previously encoded data to a sequential model trainer:
      define train org.apache.mahout.pig.LogisticRegression(
        'iterations=5, inMemory=true, features=100000,
        categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
        comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
        comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc
        comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');
    Note that the argument is a string with its own syntax
  • 18. Reservations and Qualms
    Pig-vector isn’t done
    And it is ugly
    And it doesn’t quite work
    And it is hard to build
    But there seems to be promise
  • 19. Potential
    Add Naïve Bayes model?
    Somehow simplify the syntax?
    Try a recent version of elephant-bird?
    Switch to pigML?
  • 20. Large-scale k-Means Clustering
  • 21. Goals
    Cluster very large data sets
    Facilitate large nearest-neighbor search
    Allow a very large number of clusters
    Achieve good quality
    – low average distance to nearest centroid on held-out data
    Based on Mahout math
    Runs on a Hadoop (really MapR) cluster
    FAST
    – cluster tens of millions of points in minutes
  • 22. Non-goals
    Use map-reduce (but it is there)
    Minimize the number of clusters
    Support metrics other than L2
  • 23. Anti-goals
    Multiple passes over the original data
    Scaling as O(k n)
  • 24. Why?
  • 25. K-nearest Neighbor with Super Fast k-means
  • 26. What’s that?
    Find the k nearest training examples
    Use the average value of the target variable from them
    This is easy … but hard
    – easy because it is so conceptually simple and you have few knobs to turn or models to build
    – hard because of the stunning amount of math
    – also hard because we need the top 50,000 results, not just the single nearest
    Initial prototype was massively too slow
    – 3K queries x 200K examples took hours
    – needed 20M x 25M in the same time
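The prediction step above is simple enough to sketch directly. Here is a hypothetical brute-force baseline in Python (names like `knn_predict` are illustrative, not Mahout API); it shows exactly the one-distance-per-training-example cost that made the prototype too slow:

```python
import math

def knn_predict(train, query, k):
    """k-NN regression: average the target values of the k nearest examples.

    train: list of (vector, target) pairs; query: a vector.
    """
    # Brute force: one distance computation per training example --
    # the "massively too slow" baseline the talk starts from.
    nearest = sorted(train, key=lambda vt: math.dist(vt[0], query))[:k]
    return sum(target for _, target in nearest) / k
```

Sorting all n examples per query is what makes 20M queries against 25M examples hopeless without a smarter searcher.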
  • 27. Modeling with k-nearest Neighbors (figure, panels a, b, c)
  • 28. Subject to Some Limits
  • 29. Log Transform Improves Things
  • 30. Neighbors Depend on Good Presentation
  • 31. How We Did It
    2-week hackathon with 6 developers from a MapR customer (a bank)
    Agile-ish development
    To avoid IP issues
    – all code is Apache licensed (no ownership question)
    – all data is synthetic (no question of private data)
    – all development done on individual machines, hosted on GitHub
    – open is easier than closed (in this case)
    Goal is new open technology to facilitate new closed solutions
    Ambitious goal of ~1,000,000x speedup
  • 32. How We Did It (continued)
    Ambitious goal of ~1,000,000x speedup
    – well, really only 100–1000x after basic hygiene
  • 33. What We Did
    Mechanism for extending Mahout vectors
    – DelegatingVector, WeightedVector, Centroid
    Shared-memory matrix
    – FileBasedMatrix uses mmap to share very large dense matrices
    Searcher interface
    – Brute, ProjectionSearch, KmeansSearch, LshSearch
    Super-fast clustering
    – Kmeans, StreamingKmeans
  • 34. Projection Search
    java.lang.TreeSet!
  • 35. Projection Search
    Projection onto a line provides a total order on the data
    Nearby points stay nearby
    Some other points also wind up close
    Search points just before or just after the query point
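The steps on this slide can be sketched in a few lines of Python. This is a hedged illustration with hypothetical names (the caller supplies a projection direction, e.g. a random unit vector), not Mahout's actual ProjectionSearch class:

```python
import bisect

def project(v, direction):
    # Dot product: position of v along the projection line.
    return sum(a * b for a, b in zip(v, direction))

def build_index(points, direction):
    # Sorting by projection gives the total order the slide describes:
    # nearby points stay nearby, though some far points also land close.
    return sorted(points, key=lambda p: project(p, direction))

def search(index, direction, query, width):
    # Look only at points just before and just after the query's
    # projected position, then rank those few candidates exactly.
    keys = [project(p, direction) for p in index]
    i = bisect.bisect_left(keys, project(query, direction))
    candidates = index[max(0, i - width):i + width]
    return min(candidates,
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p, query)))
```

Because unrelated points can also project close together, real implementations use several random directions and merge the candidate sets (the next slide's question).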
  • 36. How Many Projections?
  • 37. K-means Search
    Simple idea
    – pre-cluster the data
    – to find the nearest points, search the nearest clusters
    Recursive application
    – to search a cluster, use a Searcher!
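A minimal sketch of that idea, assuming clusters are given as (centroid, members) pairs; the names are hypothetical and the real KmeansSearch differs in detail:

```python
import math

def brute_search(points, query):
    # Exact fallback search over a small candidate set.
    return min(points, key=lambda p: math.dist(p, query))

def kmeans_search(clusters, query, probes, inner=brute_search):
    """clusters: list of (centroid, member_points) pairs.

    Instead of scanning every point, look only inside the `probes`
    clusters whose centroids are nearest the query.  The inner search
    can itself be another searcher -- the recursive application the
    slide mentions.
    """
    nearest = sorted(clusters, key=lambda c: math.dist(c[0], query))[:probes]
    candidates = [p for _, members in nearest for p in members]
    return inner(candidates, query)
```

Probing more than one cluster trades speed for recall, since the true nearest neighbor can sit just across a cluster boundary.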
  • 38. (diagram slide)
  • 39. (diagram slide)
  • 40. (diagram slide)
  • 41. (diagram slide)
  • 42. (diagram slide)
  • 43. But This Requires k-means!
    Need a new k-means algorithm to get speed
    – Hadoop is very slow at iterative map-reduce
    – maybe Pregel clones like Giraph would be better
    – or maybe not
    Streaming k-means is
    – one pass (through the original data)
    – very fast (20 µs per data point with threads on one node)
    – very parallelizable
  • 44. Basic Method
    Use a single pass of k-means with very many clusters
    – output is a bad-ish clustering but a good surrogate
    Use weighted centroids from step 1 to do in-memory clustering
    – output is a good clustering with fewer clusters
  • 45. Algorithmic Details
    For each data point x_n:
    – compute the distance to the nearest centroid, δ
    – sample u; if u > δ/β, add x_n to the nearest centroid, else create a new centroid
    – if the number of centroids > k log n:
      – recursively cluster the centroids
      – set β = 1.5 β if the number of centroids did not decrease
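The recipe above can be rendered as a short Python sketch. This is a simplified, hypothetical version, not Mahout's StreamingKmeans: to keep the sketch well behaved it grows β when recursing, and it drops centroid weights when reclustering:

```python
import math
import random

def streaming_kmeans(points, k, beta=1.0):
    """One-pass sketch of the streaming k-means recipe on the slide.

    Returns weighted centroids as [vector, weight] pairs.
    """
    limit = max(k, int(k * math.log(max(len(points), 2))))
    centroids = []
    for x in points:
        if centroids:
            c = min(centroids, key=lambda ct: math.dist(ct[0], x))
            d = math.dist(c[0], x)
        if not centroids or random.random() <= d / beta:
            # Far from everything (or an unlucky draw): seed a new centroid.
            centroids.append([list(x), 1.0])
        else:
            # Merge x into the nearest centroid (weighted average).
            w = c[1]
            c[0] = [(a * w + b) / (w + 1) for a, b in zip(c[0], x)]
            c[1] = w + 1
        if len(centroids) > limit:
            before = len(centroids)
            # Recursively cluster the centroids themselves; the larger
            # beta makes merges more likely so the count shrinks.
            centroids = streaming_kmeans([ct[0] for ct in centroids], k,
                                         1.5 * beta)
            if len(centroids) >= before:
                beta *= 1.5
    return centroids
```

The output is the "bad-ish but good surrogate" clustering of the previous slide; a conventional in-memory k-means over these weighted centroids then produces the final clusters.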
  • 46. How It Works
    Result is a large set of centroids
    – these provide an approximation of the original distribution
    – we can cluster the centroids to get a close approximation of clustering the original data
    – or we can just use the result directly
  • 47. Parallel Speedup?
    [Chart: time per data point (µs) vs. number of threads; the non-threaded version sits near 200 µs/point, while the threaded version falls from ~100 µs/point at 2 threads toward ~10 µs/point at 16 threads, roughly tracking the perfect-scaling line.]
  • 48. Warning, Recursive Descent
    Inner loop requires finding the nearest centroid
    With lots of centroids, this is slow
    But wait, we have classes to accelerate that!
  • 49. Warning, Recursive Descent (continued)
    (Let’s not use the k-means searcher, though)
  • 50. Warning, Recursive Descent (continued)
    Empirically, projection search beats 64-bit LSH by a bit
    – more optimization may change this story
  • 51. Moving to Ultra Mega Super Scale
    Map-reduce implementation is nearly trivial
    Map: rough-cluster input data; output β and weighted centroids
    Reduce:
    – a single reducer gets all centroids
    – if too many centroids, merge using recursive clustering
    – optionally do the final clustering in memory
    A combiner is possible, but not important
  • 52. Contact:
    – tdunning@maprtech.com
    – @ted_dunning
    Slides and such:
    – http://info.mapr.com/ted-boston-2012-07
    Hash tags: #boston-hug #mahout #mapr