
What's new in Apache Mahout

Apache Mahout is changing radically. Here is a report on what is coming, notably an R-like domain-specific language that can run on multiple computational engines such as Spark.

Published in: Software, Technology, Education

  1. 1. © 2014 MapR Technologies 1
  2. 2. © 2014 MapR Technologies 2 What’s New in Apache Mahout: A Preview of Mahout 1.0 21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM Ted Dunning, Chief Applications Architect MapR Technologies Twitter @Ted_Dunning Email tdunning@mapr.com tdunning@apache.org
  3. 3. © 2014 MapR Technologies 3 There was just an explosion in Apache Mahout…
  4. 4. © 2014 MapR Technologies 4 Apache Mahout up to now… • Open source Apache project http://mahout.apache.org/ • Mahout version is 0.9 released Feb 2014; included Scala – Summary 0.9 blog at http://bit.ly/1rirUUL • Library of scalable algorithms for machine learning – Some run on Apache Hadoop distributions; others do not require Hadoop – Some can be run at small scale – Some are run in parallel; others are sequential • Includes the following main areas: – Clustering & related techniques – Classification – Recommendation – Mahout Math Library
  5. 5. © 2014 MapR Technologies 5 Roadmap to Mahout 1.0 • Say good-bye to MapReduce – New MR algorithms will not be accepted – Support for existing ones will continue for now • Support for Apache Spark – Under construction; some features already available • Support for h2o being explored • Support for Apache Stratosphere possibly in future
  6. 6. © 2014 MapR Technologies 6 Roadmap: Apache Mahout 1.0
  7. 7. © 2014 MapR Technologies 7 Apache Spark • Apache Spark http://spark.apache.org/ – Open source “fast and general engine for large scale data processing” – Especially fast in-memory – Became a top-level Apache project • Feb 2014 • over 100 committers – Original developers have started a company called Databricks (Berkeley CA) http://databricks.com/
  8. 8. © 2014 MapR Technologies 8 Mahout and Scala • Scala http://www.scala-lang.org/ – Open source; appeared in 2003 – Wiki describes as “object-functional programming and scripting language” • Scala provides functional style – Makes lazy evaluation much safer – Notationally compact – Minor syntax extensions allowed – Makes math much easier
  9. 9. © 2014 MapR Technologies 9 Here’s what DSL & Spark will mean for Mahout • Scala DSL provides convenient notation for expressing parallel machine learning • Spark (and other engines) provide execution environment • Overview of Scala and Apache Spark bindings in Mahout can be found at https://mahout.apache.org/users/sparkbindings/home.html
  10. 10. © 2014 MapR Technologies 10 What do clusters, Cap’n Crunch and Coco Puffs have in common?
  11. 11. © 2014 MapR Technologies 11 They’re part of the data in the new Mahout Spark shell tutorial…
  12. 12. © 2014 MapR Technologies 12 And you shouldn’t be eating them.
  13. 13. © 2014 MapR Technologies 13 Tutorial: Mahout-Spark Shell • Find it here http://bit.ly/RSTeMr • Early stage code - play with Mahout's Scala DSL for linear algebra and the Mahout-Spark shell – Uses a publicly available breakfast cereal data set – Challenge: Fit a linear model that infers customer ratings from ingredients – Toy data set, but loaded with Mahout to mimic a huge data set • Mahout's linear algebra DSL has an abstraction called DistributedRowMatrix (DRM) – models a matrix that is partitioned by rows and stored in the memory of a cluster of machines
  14. 14. © 2014 MapR Technologies 14 Dissecting the Model • Components – Cereal ingredients are the features – Ratings are the target variable • Linear regression assumes that target variable y is generated by a linear combination of feature matrix X with parameter vector β, plus noise ε: y = Xβ + ε • Goal: Find an estimate of parameter vector β that explains the data
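As a worked step that is not on the slide, the usual least-squares estimate follows from minimizing the squared residual. The hat notation is introduced here for illustration, and the last line assumes that XᵀX is invertible. The resulting normal equations are the linear system that the "Solve Linear System" slide hands to solve(XtX, Xty).

```latex
\begin{aligned}
\hat\beta &= \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \\
0 &= -2\, X^\top (y - X\hat\beta) \\
X^\top X \hat\beta &= X^\top y \\
\hat\beta &= (X^\top X)^{-1} X^\top y
\end{aligned}
```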
  15. 15. © 2014 MapR Technologies 15 What do you see in this matrix?
        val drmData = drmParallelize(dense(
          (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
          (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
          (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
          (2, 1, 11,   13, 32.207582),  // Froot Loops
          (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
          (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
          (6, 2, 17,   1,  50.764999),  // Cheerios
          (3, 2, 13,   7,  40.400208),  // Clusters
          (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
          numPartitions = 2);
  16. 16. © 2014 MapR Technologies 16 Add Bias Column
        val drmX1 = drmX.mapBlock(ncol = drmX.ncol + 1) {
          case(keys, block) =>
            // create a new block with an additional column
            val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
            // copy data from current block into the new block
            blockWithBiasColumn(::, 0 until block.ncol) := block
            // last column consists of ones
            blockWithBiasColumn(::, block.ncol) := 1
            keys -> blockWithBiasColumn
        }
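  17. 17. © 2014 MapR Technologies 17 Solve Linear System, Compute Error
        // normal equations: collect the small products to the driver
        val XtX = (drmX1.t %*% drmX1).collect
        val Xty = (drmX1.t %*% y).collect(::, 0)
        // solve in memory for the parameter vector
        val beta = solve(XtX, Xty)
        // fitted values and L2 norm of the residual
        val fittedY = (drmX1 %*% beta).collect(::, 0)
        val error = (y - fittedY).norm(2)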
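The same three steps (add a bias column, solve the normal equations, measure the residual) can be sketched without a cluster. This is a minimal pure-Python illustration on the nine cereal rows from the slides, not Mahout code: the `solve` helper and all variable names are hypothetical stand-ins written for this example, and everything runs in memory on one machine.

```python
import math

# Cereal data from the slides: four ingredient columns, then the rating.
rows = [
    (2, 2, 10.5, 10, 29.509541),  # Apple Cinnamon Cheerios
    (1, 2, 12, 12, 18.042851),    # Cap'n'Crunch
    (1, 1, 12, 13, 22.736446),    # Cocoa Puffs
    (2, 1, 11, 13, 32.207582),    # Froot Loops
    (1, 2, 12, 11, 21.871292),    # Honey Graham Ohs
    (2, 1, 16, 8, 36.187559),     # Wheaties Honey Gold
    (6, 2, 17, 1, 50.764999),     # Cheerios
    (3, 2, 13, 7, 40.400208),     # Clusters
    (3, 3, 13, 4, 45.811716),     # Great Grains Pecan
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Features with a trailing bias column of ones, and the target vector.
x1 = [list(r[:4]) + [1.0] for r in rows]
y = [r[4] for r in rows]
k = len(x1[0])

# Normal equations: (X1' X1) beta = X1' y
xtx = [[sum(row[i] * row[j] for row in x1) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(x1, y)) for i in range(k)]
beta = solve(xtx, xty)

# Fitted ratings and the L2 norm of the residual.
fitted = [sum(b * v for b, v in zip(beta, row)) for row in x1]
error = math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)))
print(beta, error)
```

The point of the distributed version is that only the small products XtX and Xty are collected to the driver, so the final solve is cheap however many rows the data has.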
  18. 18. © 2014 MapR Technologies 18 In R
        all = matrix(
          c(2, 2, 10.5, 10, 29.509541,
            1, 2, 12, 12, 18.042851,
            1, 1, 12, 13, 22.736446,
            2, 1, 11, 13, 32.207582,
            1, 2, 12, 11, 21.871292,
            2, 1, 16, 8, 36.187559,
            6, 2, 17, 1, 50.764999,
            3, 2, 13, 7, 40.400208,
            3, 3, 13, 4, 45.811716),
          byrow=T, ncol=5)
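  19. 19. © 2014 MapR Technologies 19 More R
        # a and y are not defined on the slide; assumed here to be
        # the feature columns and the rating column of `all`
        a = all[, 1:4]
        y = all[, 5]
        a1 = cbind(a, 1)
        ata = t(a1) %*% a1
        aty = t(a1) %*% y
        x1 = solve(a=ata, b=aty)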
  20. 20. © 2014 MapR Technologies 20 Well, Actually
        df = data.frame(all)
        m = lm(X5 ~ X1 + X2 + X3 + X4, df)
        plot(df$X5, predict(m))
        abline(lm(y ~ x, data.frame(x=df$X5, y=predict(m))), col='red')
  21. 21. © 2014 MapR Technologies 21 R Wins
  22. 22. © 2014 MapR Technologies 22 R Wins … For Now
  23. 23. © 2014 MapR Technologies 23 R Wins … For Now … at Small Scale
  24. 24. © 2014 MapR Technologies 24 Recommendation Behavior of a crowd helps us understand what individuals will do
  25. 25. © 2014 MapR Technologies 25 Recommendation Alice got an apple and a puppy. Charles got a bicycle. Bob got an apple.
  26. 26. © 2014 MapR Technologies 26 Recommendation Alice got an apple and a puppy. Charles got a bicycle. Bob got an apple. What else would Bob like?
  27. 27. © 2014 MapR Technologies 27 Recommendation Alice got an apple and a puppy. Charles got a bicycle. Bob: a puppy!
  28. 28. © 2014 MapR Technologies 28 You get the idea of how recommenders work…
  29. 29. © 2014 MapR Technologies 29 By the way, like me, Bob also wants a pony…
  30. 30. © 2014 MapR Technologies 30 Recommendation ? Alice Bob Charles Amelia What if everybody gets a pony? What else would you recommend for new user Amelia?
  31. 31. © 2014 MapR Technologies 31 Recommendation ? Alice Bob Charles Amelia If everybody gets a pony, it’s not a very good indicator of what else to predict... What we want is anomalous co-occurrence
  32. 32. © 2014 MapR Technologies 32 Get Useful Indicators from Behaviors • Use log files to build history matrix of users x items – Remember: this history of interactions will be sparse compared to all potential combinations • Transform to a co-occurrence matrix of items x items • Look for useful co-occurrence by looking for anomalous co-occurrences to make an indicator matrix – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference – ItemSimilarityJob in Apache Mahout uses LLR • (pony book said RowSimilarityJob, not as good)
  33. 33. © 2014 MapR Technologies 33 Model uses three matrices…
  34. 34. © 2014 MapR Technologies 34 History Matrix: Users x Items Alice Bob Charles ✔ ✔ ✔ ✔ ✔ ✔ ✔
  35. 35. © 2014 MapR Technologies 35 Co-Occurrence Matrix: Items x Items - 1 2 1 1 1 1 2 1 0 0 0 0 Use LLR test to turn co-occurrence into indicators of interesting co-occurrence
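The step from a users x items history matrix to an items x items co-occurrence matrix can be sketched as follows. The layout of the slide's own matrices did not survive extraction, so the history below is an assumed toy drawn from the Alice, Bob and Charles story (everybody gets a pony); the names `items`, `history` and `cooccurrence` are illustrative only.

```python
# Assumed toy history, loosely based on the story in the earlier slides.
items = ["apple", "puppy", "pony", "bicycle"]
history = {
    "Alice": {"apple", "puppy", "pony"},
    "Bob": {"apple", "pony"},
    "Charles": {"bicycle", "pony"},
}

# History matrix A: one row per user, one column per item, 1 = interaction.
a = [[1 if item in got else 0 for item in items] for got in history.values()]


def cooccurrence(matrix):
    """Return A'A: entry (i, j) counts users who interacted with both items."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


ata = cooccurrence(a)
for name, row in zip(items, ata):
    print(f"{name:8s}", row)
```

In this toy, pony co-occurs with every other item simply because everybody has one, which is why raw counts are filtered with a test such as LLR before being used as indicators.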
  36. 36. © 2014 MapR Technologies 36 Indicator Matrix: Anomalous Co-Occurrence ✔ ✔
  37. 37. © 2014 MapR Technologies 37 Which one is the anomalous co-occurrence?
                    A       not A
        B           13      1000
        not B       1000    100,000

                    A       not A
        B           1       0
        not B       0       10,000

                    A       not A
        B           10      0
        not B       0       100,000

                    A       not A
        B           1       0
        not B       0       2

        Scores shown on the slide: 0.90, 1.95, 4.52, 14.3
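One way to score such 2x2 tables is the log-likelihood ratio (G²) statistic, reported here as its square root. The sketch below is a plain-Python illustration using the standard formula, not Mahout's implementation; the function names are chosen for this example. Applied to the four tables in the order listed, the root LLR comes out at roughly 0.90, 4.52, 14.3 and 1.95, which matches the four scores on the slide.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    cells = x_log_x(k11) + x_log_x(k12) + x_log_x(k21) + x_log_x(k22)
    rows = x_log_x(k11 + k12) + x_log_x(k21 + k22)
    cols = x_log_x(k11 + k21) + x_log_x(k12 + k22)
    return 2.0 * (cells - rows - cols + x_log_x(total))


def root_llr(k11, k12, k21, k22):
    """Square root of G^2, clamped at zero to absorb rounding error."""
    return math.sqrt(max(0.0, llr(k11, k12, k21, k22)))


tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
for t in tables:
    print(t, round(root_llr(*t), 2))
```

The first table has the largest raw co-occurrence count (13) but the lowest score, because A and B are both common there; the third table, with 10 co-occurrences and nothing else, is the anomalous one.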
  38. 38. © 2014 MapR Technologies 38 Collection of Documents: Insert Meta-Data
        Search Technology, Item meta-data
        Document for “puppy”
          id: t4
          title: puppy
          desc: The sweetest little puppy ever.
          keywords: puppy, dog, pet
        Ingest easily via NFS
  39. 39. © 2014 MapR Technologies 39 A Quick Simplification • Users who do h also do Ah • User-centric recommendations: A^T (A h) • Item-centric recommendations: (A^T A) h
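The two groupings give the same vector because matrix multiplication is associative; the practical difference is that A^T A can be computed once, offline. A minimal check in plain Python, using an assumed 3 users x 4 items toy history (apple, puppy, pony, bicycle) and a hypothetical new user h who has only an apple:

```python
# Assumed toy history matrix A (rows: users, columns: items).
a = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
h = [1, 0, 0, 0]  # hypothetical new user: has interacted with item 0 only


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def mat_mul(x, y):
    y_t = transpose(y)
    return [[sum(p * q for p, q in zip(row, col)) for col in y_t] for row in x]


# User-centric: find users similar to h (A h), then sum up what they did.
user_centric = mat_vec(transpose(a), mat_vec(a, h))

# Item-centric: precompute item-item co-occurrence A'A, then apply it to h.
item_centric = mat_vec(mat_mul(transpose(a), a), h)

print(user_centric, item_centric)
```

Both expressions yield [2, 1, 2, 0] for this toy input.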
  40. 40. © 2014 MapR Technologies 40
        val drmA = sampleDownAndBinarize(
          drmARaw, randomSeed, maxNumInteractions).checkpoint()
        val numUsers = drmA.nrow.toInt
        // Compute number of interactions per thing in A
        val csums = drmBroadcast(drmA.colSums)
        // Compute co-occurrence matrix A'A
        val drmAtA = drmA.t %*% drmA
  41. 41. © 2014 MapR Technologies 41 What’s New in Apache Mahout: A Preview of Mahout 1.0 21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM Ted Dunning, Chief Applications Architect MapR Technologies Twitter @Ted_Dunning Email tdunning@mapr.com tdunning@apache.org
  42. 42. © 2014 MapR Technologies 42
  43. 43. © 2014 MapR Technologies 43 Sandbox
  44. 44. © 2014 MapR Technologies 44 Going Further: Multi-Modal Recommendation
  45. 45. © 2014 MapR Technologies 45 Going Further: Multi-Modal Recommendation
  46. 46. © 2014 MapR Technologies 46 Better Long-Term Recommendations • Anti-flood Avoid having too much of a good thing • Dithering “When making it worse makes it better”
  47. 47. © 2014 MapR Technologies 47 Why Use Dithering?
  48. 48. © 2014 MapR Technologies 48 What’s New in Apache Mahout? A Preview of Mahout 1.0 21 May 2014 #BDBDM Ted Dunning, Chief Applications Architect MapR Technologies Twitter @Ted_Dunning Email tdunning@mapr.com tdunning@apache.org Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  49. 49. © 2014 MapR Technologies 49 Sample Music Log Files
        Columns: Time, Event type, User ID, Artist ID, Track ID
        13 START 10113 2182654281
        23 BEACON 10113 2182654281
        24 START 10113 79600611935028
        34 BEACON 10113 79600611935028
        44 BEACON 10113 79600611935028
        54 BEACON 10113 79600611935028
        64 BEACON 10113 79600611935028
        74 BEACON 10113 79600611935028
        84 BEACON 10113 79600611935028
        94 BEACON 10113 79600611935028
        104 BEACON 10113 79600611935028
        109 FINISH 10113 79600611935028
        111 START 10113 58999912011972
        121 BEACON 10113 58999912011972
  50. 50. © 2014 MapR Technologies 50 Documents for Music Recommendation
        id: 1710
        mbid: 592a3b6d-c42b-4567-99c9-ecf63bd66499
        name: Chuck Berry
        area: United States
        gender: Male
        indicator_artists: 386685,875994,637954,3418,1344,789739,1460, …

        id: 541902
        mbid: 983d4f8f-473e-4091-8394-415c105c4656
        name: Charlie Winston
        area: United Kingdom
        gender: None
        indicator_artists: 997727,815,830794,59588,900,2591,1344,696268, …
  51. 51. © 2014 MapR Technologies 51 Practical Machine Learning: Innovations in Recommendation 28 April 2014 NoSQL Matters Conference #NoSQLMatters Ted Dunning, Chief Applications Architect MapR Technologies Twitter @Ted_Dunning Email tdunning@mapr.com tdunning@apache.org Apache Mahout https://mahout.apache.org/ Twitter @ApacheMahout
  52. 52. © 2014 MapR Technologies 52
