Boston hug-2012-07

Describes the state of Apache Mahout with special focus on the upcoming k-nearest neighbor and k-means clustering algorithms.

Transcript

  • 1. Mahout, New and Improved: Now with Super Fast Clustering
  • 2. Agenda: What happened in Mahout 0.7 – less bloat – simpler structure – general cleanup
  • 3. To Cut Out Bloat
  • 4. (Image slide: no text.)
  • 5. Bloat is Leaving in 0.7: Lots of abandoned code in Mahout – average code quality is poor – no users – no maintainers – why do we care? Examples – old LDA – old Naïve Bayes – genetic algorithms. If you care, get on the mailing list.
  • 6. Bloat is Leaving in 0.7: Lots of abandoned code in Mahout – average code quality is poor – no users – no maintainers – why do we care? Examples – old LDA – old Naïve Bayes – genetic algorithms. If you care, get on the mailing list – oops, too late, since 0.7 is already released.
  • 7. Integration of Collections
  • 8. Nobody Cares about Collections: We need it; math is built on it. So we pulled it into math. That broke the build (battle of the code expanders). Fixed now (thanks to Grant).
  • 9. Pig Vector
  • 10. What is it? Supports access to Mahout functionality from Pig. So far: text vectorization, classification, and model saving.
  • 11. What is it? Supports Pig access to Mahout functions. So far: text vectorization, classification, and model saving. Kind of works (see pigML from Twitter for better function).
  • 12. Compile and Install: Start by compiling and installing Mahout in your local repository:

        cd ~/Apache
        git clone https://github.com/apache/mahout.git
        cd mahout
        mvn install -DskipTests

    Then do the same with pig-vector:

        cd ~/Apache
        git clone git@github.com:tdunning/pig-vector.git
        cd pig-vector
        mvn package
  • 13. Tokenize and Vectorize Text: Tokenization is done using a text encoder. You specify: – the dimension of the resulting vectors (typically 100,000-1,000,000) – a description of the variables to be included in the encoding – the schema of the tuples that Pig will pass, together with their data types. Example:

        define EncodeVector org.apache.mahout.pig.encoders.EncodeVector
            ('10', 'x+y+1', 'x:numeric, y:word, z:text');

    You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier.
  • 14. The Formula: Not normal arithmetic. Describes which variables to use and whether an offset is included. Also describes which interactions to use.
  • 15. The Formula: Not normal arithmetic. Describes which variables to use and whether an offset is included. Also describes which interactions to use – but that doesn't do anything yet!
  • 16. Load and Encode Data: Load the data:

        a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
            as (x1:int, x2:int, x3:int);

    And encode it:

        b = foreach a generate 1 as key, EncodeVector(*) as v;

    Note that the true meaning of * is very subtle. Now store it:

        store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage (
            '-c com.twitter.elephantbird.pig.util.IntWritableConverter',
            '-c com.twitter.elephantbird.pig.util.GenericWritableConverter
             -t org.apache.mahout.math.VectorWritable');
  • 17. Train a Model: Pass previously encoded data to a sequential model trainer:

        define train org.apache.mahout.pig.LogisticRegression(
            'iterations=5, inMemory=true, features=100000,
             categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics
                 talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med
                 talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey
                 sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
                 soc.religion.christian talk.religion.misc');

    Note that the argument is a string with its own syntax.
  • 18. Reservations and Qualms: Pig-vector isn't done. And it is ugly. And it doesn't quite work. And it is hard to build. But there seems to be promise.
  • 19. Potential: Add a Naïve Bayes model? Somehow simplify the syntax? Try a recent version of elephant-bird? Switch to pigML?
  • 20. Large-scale k-Means Clustering
  • 21. Goals: Cluster very large data sets. Facilitate large nearest neighbor search. Allow a very large number of clusters. Achieve good quality – low average distance to nearest centroid on held-out data. Based on Mahout Math. Runs on a Hadoop (really MapR) cluster. FAST – cluster tens of millions of points in minutes.
  • 22. Non-goals: Use map-reduce (but it is there). Minimize the number of clusters. Support metrics other than L2.
  • 23. Anti-goals: Multiple passes over the original data. Scaling as O(k n).
  • 24. Why?
  • 25. K-nearest Neighbor with Super Fast k-means
  • 26. What's that? Find the k nearest training examples and use the average value of their target variable. This is easy … but hard – easy because it is so conceptually simple and you have few knobs to turn or models to build – hard because of the stunning amount of math – also hard because we need the top 50,000 results, not just the single nearest. The initial prototype was massively too slow – 3K queries x 200K examples took hours – we needed 20M x 25M in the same time.
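
    To make slide 26 concrete, here is a minimal brute-force k-NN regression sketch in plain Java. It is illustrative only – the class and method names are assumptions, and this linear scan is exactly the kind of thing the fast searchers described later replace.

        import java.util.Arrays;
        import java.util.Comparator;

        // Minimal brute-force k-NN regression sketch; illustrative, not Mahout code.
        public class KnnRegression {
            static double squaredDistance(double[] a, double[] b) {
                double sum = 0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    sum += d * d;
                }
                return sum;
            }

            // Predict the target at `query` as the mean target of its k nearest examples.
            static double predict(final double[][] examples, double[] targets,
                                  final double[] query, int k) {
                Integer[] order = new Integer[examples.length];
                for (int i = 0; i < order.length; i++) order[i] = i;
                Arrays.sort(order, new Comparator<Integer>() {
                    public int compare(Integer i, Integer j) {
                        return Double.compare(squaredDistance(examples[i], query),
                                              squaredDistance(examples[j], query));
                    }
                });
                double sum = 0;
                for (int i = 0; i < k; i++) sum += targets[order[i]];
                return sum / k;
            }
        }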
  • 27. Modeling with k-nearest Neighbors (figure with panels a, b, c)
  • 28. Subject to Some Limits
  • 29. Log Transform Improves Things
  • 30. Neighbors Depend on Good Presentation
  • 31. How We Did It: 2-week hackathon with 6 developers from a MapR customer. Agile-ish development. To avoid IP issues – all code is Apache licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosted on GitHub – open is easier than closed (in this case). The goal is new open technology to facilitate new closed solutions. Ambitious goal of ~1,000,000x speedup.
  • 32. How We Did It: 2-week hackathon with 6 developers from a customer bank. Agile-ish development. To avoid IP issues – all code is Apache licensed (no ownership question) – all data is synthetic (no question of private data) – all development done on individual machines, hosted on GitHub – open is easier than closed (in this case). The goal is new open technology to facilitate new closed solutions. Ambitious goal of ~1,000,000x speedup – well, really only 100-1000x after basic hygiene.
  • 33. What We Did: Mechanism for extending Mahout Vectors – DelegatingVector, WeightedVector, Centroid. Shared memory matrix – FileBasedMatrix uses mmap to share very large dense matrices. Searcher interface – Brute, ProjectionSearch, KmeansSearch, LshSearch. Super-fast clustering – Kmeans, StreamingKmeans.
  • 34. Projection Search: java.util.TreeSet!
  • 35. Projection Search: Projection onto a line provides a total order on the data. Nearby points stay nearby. Some other points also wind up close. Search the points just before or just after the query point.
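
    A minimal sketch of the projection-search idea above, assuming plain Java and illustrative names (this is not Mahout's ProjectionSearch): project each point onto one random direction, keep the points ordered by that projection, and examine only a window just before and just after the query's projection. A real implementation keeps several such indexes and merges their candidates before ranking by true distance.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.NavigableMap;
        import java.util.Random;
        import java.util.TreeMap;

        // Projection search sketch: a total order from one random projection.
        public class ProjectionIndex {
            private final double[] direction;
            private final NavigableMap<Double, double[]> byProjection =
                new TreeMap<Double, double[]>();

            public ProjectionIndex(int dim, Random rand) {
                direction = new double[dim];
                for (int i = 0; i < dim; i++) direction[i] = rand.nextGaussian();
            }

            private double project(double[] p) {
                double sum = 0;
                for (int i = 0; i < p.length; i++) sum += p[i] * direction[i];
                return sum;
            }

            public void add(double[] p) {
                byProjection.put(project(p), p);   // ties overwrite; ignored for brevity
            }

            // Candidates just before and just after the query along the projection.
            public List<double[]> candidates(double[] query, int windowSize) {
                double q = project(query);
                List<double[]> result = new ArrayList<double[]>();
                int n = 0;
                for (double[] p : byProjection.headMap(q, true).descendingMap().values()) {
                    if (n++ >= windowSize) break;
                    result.add(p);                 // nearest entries below the query
                }
                n = 0;
                for (double[] p : byProjection.tailMap(q, false).values()) {
                    if (n++ >= windowSize) break;
                    result.add(p);                 // nearest entries above the query
                }
                return result;
            }
        }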
  • 36. How Many Projections?
  • 37. K-means Search: Simple idea – pre-cluster the data – to find the nearest points, search the nearest clusters. Recursive application – to search a cluster, use a Searcher!
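
    A sketch of the cluster-probing idea above, with assumed names (not Mahout's KmeansSearch): given centroids and per-cluster membership from a prior clustering, probe only the few clusters nearest the query and return their members as candidates. Callers re-rank the candidates by true distance, and a big cluster can itself be searched with another Searcher, which is the recursion the slide mentions.

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.Comparator;
        import java.util.List;

        // K-means search sketch: look only inside the clusters nearest the query.
        public class ClusteredSearch {
            private final double[][] centroids;          // from a prior k-means run
            private final List<List<double[]>> members;  // points assigned to each centroid

            public ClusteredSearch(double[][] centroids, List<List<double[]>> members) {
                this.centroids = centroids;
                this.members = members;
            }

            static double squaredDistance(double[] a, double[] b) {
                double sum = 0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    sum += d * d;
                }
                return sum;
            }

            // Return the members of the `probes` clusters nearest the query.
            public List<double[]> candidates(final double[] query, int probes) {
                Integer[] order = new Integer[centroids.length];
                for (int i = 0; i < order.length; i++) order[i] = i;
                Arrays.sort(order, new Comparator<Integer>() {
                    public int compare(Integer i, Integer j) {
                        return Double.compare(squaredDistance(centroids[i], query),
                                              squaredDistance(centroids[j], query));
                    }
                });
                List<double[]> result = new ArrayList<double[]>();
                for (int c = 0; c < Math.min(probes, order.length); c++) {
                    result.addAll(members.get(order[c]));
                }
                return result;
            }
        }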
  • 38.-42. (Diagram slides: a query point x and the clusters probed around it, illustrating how k-means search narrows the search to nearby clusters.)
  • 43. But This Requires k-means! Need a new k-means algorithm to get speed – Hadoop is very slow at iterative map-reduce – maybe Pregel clones like Giraph would be better – or maybe not. Streaming k-means is – one pass (through the original data) – very fast (20 μs per data point with threads on one node) – very parallelizable.
  • 44. Basic Method: Use a single pass of k-means with very many clusters – the output is a bad-ish clustering but a good surrogate. Then use the weighted centroids from step 1 to do in-memory clustering – the output is a good clustering with fewer clusters.
  • 45. Algorithmic Details:

        for each data point x_n:
            compute the distance ∂ to the nearest centroid
            sample u uniformly from [0, 1]
            if u > ∂/ß:
                add x_n to the nearest centroid
            else:
                create a new centroid at x_n
            if the number of centroids > k log n:
                recursively cluster the centroids
                set ß = 1.5 ß if the number of centroids did not decrease
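
    Here is a compact Java reading of the pseudocode above. It is a simplified, illustrative sketch, not Mahout's StreamingKmeans: the linear nearest-centroid scan would really use one of the fast searchers, and the recursive collapse of centroids is left as a stub.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Random;

        // One-pass streaming k-means sketch following the slide's pseudocode.
        public class StreamingKmeansSketch {
            static class Centroid {
                double[] mean;
                double weight;
                Centroid(double[] point) { mean = point.clone(); weight = 1; }
                void add(double[] point) {
                    weight++;
                    for (int i = 0; i < mean.length; i++) {
                        mean[i] += (point[i] - mean[i]) / weight;  // running mean
                    }
                }
            }

            final List<Centroid> centroids = new ArrayList<Centroid>();
            final int k;
            double beta = 1;     // distance scale ß; grows when collapsing stalls
            long n = 0;          // points seen so far
            final Random rand = new Random();

            StreamingKmeansSketch(int k) { this.k = k; }

            static double distance(double[] a, double[] b) {
                double sum = 0;
                for (int i = 0; i < a.length; i++) {
                    double d = a[i] - b[i];
                    sum += d * d;
                }
                return Math.sqrt(sum);
            }

            void addPoint(double[] x) {
                n++;
                Centroid nearest = null;
                double best = Double.POSITIVE_INFINITY;
                for (Centroid c : centroids) {   // real code uses a fast Searcher here
                    double d = distance(c.mean, x);
                    if (d < best) { best = d; nearest = c; }
                }
                if (nearest != null && rand.nextDouble() > best / beta) {
                    nearest.add(x);                   // close enough: absorb the point
                } else {
                    centroids.add(new Centroid(x));   // too far (or first point): new centroid
                }
                if (centroids.size() > k * Math.log(n + 1)) {
                    collapse();
                }
            }

            void collapse() {
                int before = centroids.size();
                // ... recursively cluster the (weighted) centroids into fewer ones ...
                if (centroids.size() == before) beta *= 1.5;   // loosen ß if no progress
            }
        }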
  • 46. How It Works: The result is a large set of centroids – these provide an approximation of the original distribution – we can cluster the centroids to get a close approximation of clustering the original data – or we can just use the result directly.
  • 47. Parallel Speedup? (Chart: time per point in μs versus number of threads, comparing the non-threaded version, the threaded version at 2-16 threads, and perfect scaling.)
  • 48. Warning, Recursive Descent: The inner loop requires finding the nearest centroid. With lots of centroids, this is slow. But wait, we have classes to accelerate that!
  • 49. Warning, Recursive Descent: The inner loop requires finding the nearest centroid. With lots of centroids, this is slow. But wait, we have classes to accelerate that! (Let's not use the k-means searcher, though.)
  • 50. Warning, Recursive Descent: The inner loop requires finding the nearest centroid. With lots of centroids, this is slow. But wait, we have classes to accelerate that! (Let's not use the k-means searcher, though.) Empirically, projection search beats 64-bit LSH by a bit – more optimization may change this story.
  • 51. Moving to Ultra Mega Super Scale: The map-reduce implementation is nearly trivial. Map: rough-cluster the input data; output ß and the weighted centroids. Reduce: – a single reducer gets all centroids – if there are too many centroids, merge them using recursive clustering – optionally do the final clustering in memory. A combiner is possible, but not important.
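
    To show the shape of that layout, here is a plain-Java simulation of the map and reduce steps, reusing the StreamingKmeansSketch class from the earlier sketch. The method names and the single-reducer merge are illustrative assumptions; the real job runs on Hadoop with writable key/value types.

        import java.util.ArrayList;
        import java.util.List;

        // Map-reduce shape of streaming k-means: mappers rough-cluster their
        // splits, and one reducer merges all of the resulting weighted centroids.
        public class MapReduceShapeDemo {
            // Map side: rough-cluster one input split, emit its weighted centroids.
            static List<StreamingKmeansSketch.Centroid> mapSplit(List<double[]> split, int k) {
                StreamingKmeansSketch sketch = new StreamingKmeansSketch(k);
                for (double[] point : split) sketch.addPoint(point);
                return sketch.centroids;
            }

            // Reduce side: a single reducer gets every mapper's centroids; if there
            // are too many, merge them by recursively clustering the centroids
            // themselves, then optionally finish with an in-memory k-means.
            static List<StreamingKmeansSketch.Centroid> reduceAll(
                    List<List<StreamingKmeansSketch.Centroid>> perMapper, int k) {
                List<StreamingKmeansSketch.Centroid> all =
                    new ArrayList<StreamingKmeansSketch.Centroid>();
                for (List<StreamingKmeansSketch.Centroid> part : perMapper) {
                    all.addAll(part);
                }
                // ... recursively cluster `all` down to roughly k weighted centroids ...
                return all;
            }
        }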
  • 52. Contact: – tdunning@maprtech.com – @ted_dunning. Slides and such: – http://info.mapr.com/ted-boston-2012-07. Hash tags: #boston-hug #mahout #mapr