What's new in Apache Mahout

© 2014 MapR Technologies
What’s New in Apache Mahout:
A Preview of Mahout 1.0
21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM
Ted Dunning, Chief Applications Architect MapR Technologies
Twitter @Ted_Dunning
Email tdunning@mapr.com tdunning@apache.org

There was just an explosion
in Apache Mahout…

Apache Mahout up to now…
• Open source Apache project http://mahout.apache.org/
• Current Mahout version is 0.9, released Feb 2014; included Scala
– Summary 0.9 blog at http://bit.ly/1rirUUL
• Library of scalable algorithms for machine learning
– Some run on Apache Hadoop distributions; others do not require Hadoop
– Some can be run at small scale
– Some are run in parallel; others are sequential
• Includes the following main areas:
– Clustering & related techniques
– Classification
– Recommendation
– Mahout Math Library

Roadmap to Mahout 1.0
• Say good-bye to MapReduce
– New MR algorithms will not be accepted
– Support for existing ones will continue for now
• Support for Apache Spark
– Under construction; some features already available
• Support for H2O being explored
• Support for Apache Stratosphere possibly in future

Roadmap: Apache Mahout 1.0

Apache Spark
• Apache Spark http://spark.apache.org/
– Open source “fast and general engine for large scale data processing”
– Especially fast in-memory
– Became a top-level Apache project
• Feb 2014
• over 100 committers
– Original developers have started company called Databricks (Berkeley CA)
http://databricks.com/

Mahout and Scala
• Scala http://www.scala-lang.org/
– Open source; appeared in 2003
– Wiki describes it as an “object-functional programming and scripting language”
• Scala provides functional style
– Makes lazy evaluation much safer
– Notationally compact
– Minor syntax extensions allowed
– Makes math much easier

Here’s what DSL & Spark will mean for Mahout
• Scala DSL provides convenient notation for expressing parallel
machine learning
• Spark (and other engines) provide execution environment
• Overview of Scala and Apache Spark bindings in Mahout can be
found at
https://mahout.apache.org/users/sparkbindings/home.html

What do clusters, Cap’n Crunch
and Cocoa Puffs have in common?

They’re part of the data in the
new Mahout Spark shell tutorial…

And you shouldn’t be eating them.

Tutorial: Mahout-Spark Shell
• Find it here http://bit.ly/RSTeMr
• Early-stage code: play with Mahout's Scala DSL for linear
algebra and the Mahout-Spark shell
– Uses publicly available breakfast cereal data set
– Challenge: Fit a linear model that infers customer ratings from ingredients
– Toy data set, but loaded with Mahout to mimic a huge data set
• Mahout's linear algebra DSL has an abstraction called
DistributedRowMatrix (DRM)
– models a matrix that is partitioned by rows and stored in the memory of
a cluster of machines
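
To make the row-partitioning idea concrete, here is a minimal sketch in plain Python, not Mahout code. Lists of rows stand in for the blocks a DRM would hold on different machines, and the helper names (partition_rows, gram_of_block, distributed_gram) are hypothetical. It shows why partitioning by rows is convenient: a product such as XᵀX is simply the sum of the per-block products.

```python
# Illustrative only: plain Python lists stand in for a cluster.
# All function names here are hypothetical, not part of Mahout.

def partition_rows(rows, num_partitions):
    """Split a list of rows into contiguous blocks of near-equal size."""
    size, extra = divmod(len(rows), num_partitions)
    blocks, start = [], 0
    for p in range(num_partitions):
        end = start + size + (1 if p < extra else 0)
        blocks.append(rows[start:end])
        start = end
    return blocks

def gram_of_block(block, ncol):
    """Compute B^T B for a single block of rows."""
    g = [[0.0] * ncol for _ in range(ncol)]
    for row in block:
        for i in range(ncol):
            for j in range(ncol):
                g[i][j] += row[i] * row[j]
    return g

def distributed_gram(blocks, ncol):
    """Sum the per-block results: X^T X equals the sum of B^T B over blocks."""
    total = [[0.0] * ncol for _ in range(ncol)]
    for block in blocks:
        partial = gram_of_block(block, ncol)
        for i in range(ncol):
            for j in range(ncol):
                total[i][j] += partial[i][j]
    return total

# Feature columns of the first five cereals from the tutorial data.
rows = [
    [2, 2, 10.5, 10],
    [1, 2, 12, 12],
    [1, 1, 12, 13],
    [2, 1, 11, 13],
    [1, 2, 12, 11],
]
blocks = partition_rows(rows, 2)
gram = distributed_gram(blocks, 4)
```

Each block only needs its own rows to produce its partial result, so the expensive part can run where the data lives and only small ncol x ncol matrices need to be combined.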

Dissecting the Model
• Components
– Cereal ingredients are the features
– Ratings are the target variables
• Linear regression assumes that target variable y is generated by
linear combination of feature matrix X with parameter vector β
plus the noise ε
y = Xβ + ε
• Goal: Find estimate of parameter vector β that explains data
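
One standard way to obtain that estimate is ordinary least squares. The derivation below is a sketch under two assumptions not stated on the slide: that the estimate minimizes the squared residual, and that XᵀX is invertible. It produces the two quantities, XᵀX and Xᵀy, that the tutorial code computes.

```latex
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2
\quad\Longrightarrow\quad
X^\top X \,\hat{\beta} = X^\top y
\quad\Longrightarrow\quad
\hat{\beta} = (X^\top X)^{-1} X^\top y
```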

What do you see in this matrix?
val drmData = drmParallelize(dense(
(2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios
(1, 2, 12, 12, 18.042851), // Cap'n'Crunch
(1, 1, 12, 13, 22.736446), // Cocoa Puffs
(2, 1, 11, 13, 32.207582), // Froot Loops
(1, 2, 12, 11, 21.871292), // Honey Graham Ohs
(2, 1, 16, 8, 36.187559), // Wheaties Honey Gold
(6, 2, 17, 1, 50.764999), // Cheerios
(3, 2, 13, 7, 40.400208), // Clusters
(3, 3, 13, 4, 45.811716)), // Great Grains Pecan
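numPartitions = 2);
// features: the first four columns
val drmX = drmData(::, 0 until 4)
// target: the fifth column, collected into an in-core vector
val y = drmData.collect(::, 4)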

Add Bias Column
val drmX1 = drmX.mapBlock(ncol = drmX.ncol + 1) {
case(keys, block) =>
// create a new block with an additional column
val blockWithBiasColumn =
block.like(block.nrow, block.ncol + 1)
// copy data from current block into the new block
blockWithBiasColumn(::, 0 until block.ncol) := block
// last column consists of ones
blockWithBiasColumn(::, block.ncol) := 1
keys -> blockWithBiasColumn
}

Solve Linear System, Compute Error
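// normal equations: solve (X'X) beta = X'y
val XtX = (drmX1.t %*% drmX1).collect
val Xty = (drmX1.t %*% y).collect(::, 0)
val beta = solve(XtX, Xty)
// fitted values and the L2 norm of the residual
val fittedY = (drmX1 %*% beta).collect(::, 0)
val error = (y - fittedY).norm(2)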
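
For readers without a Mahout-Spark shell at hand, the same steps can be sketched in plain Python on the cereal data. This is an illustration, not Mahout's implementation: the helpers transpose, matmul and solve are hypothetical stand-ins written for this example, and solve uses Gaussian elimination with partial pivoting.

```python
import math

# Cereal data from the slides: four ingredient columns, then the rating.
data = [
    [2, 2, 10.5, 10, 29.509541],
    [1, 2, 12, 12, 18.042851],
    [1, 1, 12, 13, 22.736446],
    [2, 1, 11, 13, 32.207582],
    [1, 2, 12, 11, 21.871292],
    [2, 1, 16, 8, 36.187559],
    [6, 2, 17, 1, 50.764999],
    [3, 2, 13, 7, 40.400208],
    [3, 3, 13, 4, 45.811716],
]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

# Features plus a bias column of ones, and the target vector.
x1 = [row[:4] + [1.0] for row in data]
y = [row[4] for row in data]

# Normal equations: (X'X) beta = X'y
xt = transpose(x1)
xtx = matmul(xt, x1)
xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in xt]
beta = solve(xtx, xty)

# Fitted values and the L2 norm of the residual.
fitted = [sum(b * v for b, v in zip(beta, row)) for row in x1]
residual = [yi - fi for yi, fi in zip(y, fitted)]
error = math.sqrt(sum(r * r for r in residual))
```

A quick sanity check for any least-squares fit is that the residual is orthogonal to every column of the design matrix, bias column included.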

In R
all = matrix(
c(2, 2, 10.5, 10, 29.509541,
1, 2, 12, 12, 18.042851,
1, 1, 12, 13, 22.736446,
2, 1, 11, 13, 32.207582,
1, 2, 12, 11, 21.871292,
2, 1, 16, 8, 36.187559,
6, 2, 17, 1, 50.764999,
3, 2, 13, 7, 40.400208,
3, 3, 13, 4, 45.811716), byrow=T, ncol=5)

More R
a = all[, 1:4]
y = all[, 5]
a1 = cbind(a, 1)
ata = t(a1) %*% a1
aty = t(a1) %*% y
x1 = solve(a=ata, b=aty)

Well, Actually
df = data.frame(all)
m = lm(X5 ~ X1 + X2 + X3 + X4, df)
plot(df$X5, predict(m))
abline(lm(y ~ x,
data.frame(x=df$X5, y=predict(m))), col='red')

R Wins

R Wins … For Now

R Wins … For Now … at Small Scale

Recommendation
Behavior of a crowd
helps us understand
what individuals will do

Recommendation
Alice got an apple and a puppy
Charles got a bicycle
Bob got an apple

Recommendation
Alice got an apple and a puppy
Bob got an apple. What else would Bob like?

Recommendation
Alice got an apple and a puppy
Bob: A puppy!

You get the idea of how
recommenders work…

By the way, like me, Bob also
wants a pony…

Recommendation
Users: Alice, Bob, Charles, Amelia
What if everybody gets a pony?
What else would you recommend for new user Amelia?

Recommendation
Users: Alice, Bob, Charles, Amelia
If everybody gets a pony, it’s not a very good indicator of what else to predict...
What we want is anomalous co-occurrence

Get Useful Indicators from Behaviors
• Use log files to build history matrix of users x items
– Remember: this history of interactions will be sparse compared to all
potential combinations
• Transform to a co-occurrence matrix of items x items
• Look for useful co-occurrence by looking for anomalous
co-occurrences to make an indicator matrix
– Log Likelihood Ratio (LLR) can be helpful to judge which
co-occurrences can with confidence be used as indicators of preference
– ItemSimilarityJob in Apache Mahout uses LLR
• (The pony book said RowSimilarityJob, which is not as good)
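
The first two steps above can be sketched in a few lines of plain Python. This is a toy illustration, not ItemSimilarityJob: the (user, item) events are hypothetical, loosely based on the Alice, Bob and Charles story, and the pony entries are an assumption added so that one item is shared by everybody.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical interaction log as (user, item) pairs.
# The pony rows are an assumption made for this example.
log = [
    ("Alice", "apple"), ("Alice", "puppy"), ("Alice", "pony"),
    ("Bob", "apple"), ("Bob", "pony"),
    ("Charles", "bicycle"), ("Charles", "pony"),
]

def build_history(events):
    """History matrix in sparse form: user -> set of items interacted with."""
    history = defaultdict(set)
    for user, item in events:
        history[user].add(item)
    return history

def cooccurrence(history):
    """Count, for each unordered item pair, how many users have both items.

    These counts are the off-diagonal entries of A^T A for a binary
    users x items matrix A.
    """
    counts = defaultdict(int)
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
    return dict(counts)

history = build_history(log)
cooc = cooccurrence(history)
```

In this toy data the pony co-occurs with every other item, which is exactly why raw counts are not enough and a test such as LLR is needed to pick out the anomalous pairs.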

Model uses three matrices…

History Matrix: Users x Items
Alice: ✔ ✔ ✔
Bob: ✔ ✔
Charles: ✔ ✔

Co-Occurrence Matrix: Items x Items
[Figure: items x items matrix of co-occurrence counts]
Use LLR test to turn co-occurrence into indicators of
interesting co-occurrence

Indicator Matrix: Anomalous Co-Occurrence
✔
✔

Which one is the anomalous co-occurrence?

        A       not A
B       13      1000
not B   1000    100,000
Root LLR: 0.90

        A       not A
B       1       0
not B   0       10,000
Root LLR: 4.52

        A       not A
B       10      0
not B   0       100,000
Root LLR: 14.3

        A       not A
B       1       0
not B   0       2
Root LLR: 1.95
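
The four scores on this slide can be reproduced with a short sketch of the LLR test for a 2x2 contingency table. The entropy-based formulation below is one common way to write it; the function names are chosen for this example, and the score is taken here as the unsigned square root of the LLR.

```python
import math

def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table.

    k11: A and B together, k12: B without A,
    k21: A without B,      k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by rounding.
    return max(0.0, 2.0 * (row + col - mat))

def root_llr(k11, k12, k21, k22):
    """Unsigned square root of the LLR, which is easier to read as a score."""
    return math.sqrt(llr(k11, k12, k21, k22))

tables = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]
scores = [root_llr(*t) for t in tables]
```

The table with 13 co-occurrences scores lowest because A and B are both common, so 13 joint events is about what chance predicts. Ten co-occurrences of two otherwise unseen events in 100,000 trials is far more surprising.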

Collection of Documents: Insert Meta-Data
• Item meta-data goes into the search technology as documents
• Document for “puppy”
– id: t4
– title: puppy
– desc: The sweetest little puppy ever.
– keywords: puppy, dog, pet
• Ingest easily via NFS

A Quick Simplification
• Users who do h: Ah
• Also do: Aᵀ(Ah)
– User-centric recommendations
• Reorder the multiplication: (AᵀA)h
– Item-centric recommendations
val drmA = sampleDownAndBinarize(
drmARaw, randomSeed, maxNumInteractions).checkpoint()
val numUsers = drmA.nrow.toInt
// Compute number of interactions per thing in A
val csums = drmBroadcast(drmA.colSums)
// Compute co-occurrence matrix A'A
val drmAtA = drmA.t %*% drmA


Sandbox

Going Further: Multi-Modal Recommendation

Better Long-Term Recommendations
• Anti-flood
– Avoid having too much of a good thing
• Dithering
– “When making it worse makes it better”
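
Dithering can be sketched as re-ranking with a little random noise, so that items below the first page occasionally surface and generate fresh training data. The formulation below is an assumption, not taken from these slides: each item gets a synthetic score of log(rank) plus Gaussian noise with standard deviation log(epsilon), and the function name and the default epsilon are hypothetical.

```python
import math
import random

def dither(ranked_items, epsilon=2.0, rng=None):
    """Return a noisy re-ordering of an already ranked list.

    ranked_items: items in their original order, best first.
    epsilon: noise level, assumed >= 1; epsilon = 1 leaves the order unchanged.
    rng: optional random.Random instance for reproducible results.
    """
    rng = rng or random.Random()
    sigma = math.log(epsilon)
    scored = []
    for rank, item in enumerate(ranked_items, start=1):
        # Log of the rank keeps the top results mostly stable while
        # letting deeper results swap places more freely.
        score = math.log(rank) + rng.gauss(0.0, sigma)
        scored.append((score, rank, item))
    # Lower synthetic score means a better position; rank breaks ties.
    scored.sort()
    return [item for _, _, item in scored]

recommendations = ["item%d" % i for i in range(1, 21)]
shuffled = dither(recommendations, epsilon=2.0, rng=random.Random(42))
```

Using a fresh random seed per session or per time window would show each user a slightly different list on each visit while keeping the strongest recommendations near the top.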

Why Use Dithering?

What’s New in Apache Mahout?
21 May 2014 #BDBDM
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout

Sample Music Log Files
Fields: Time, Event type, User ID, Artist ID, Track ID
13 START 10113 2182654281
23 BEACON 10113 2182654281
24 START 10113 79600611935028
34 BEACON 10113 79600611935028
44 BEACON 10113 79600611935028
54 BEACON 10113 79600611935028
64 BEACON 10113 79600611935028
74 BEACON 10113 79600611935028
84 BEACON 10113 79600611935028
94 BEACON 10113 79600611935028
104 BEACON 10113 79600611935028
109 FINISH 10113 79600611935028
111 START 10113 58999912011972
121 BEACON 10113 58999912011972

Documents for Music Recommendation
id 1710
mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499
name Chuck Berry
area United States
gender Male
indicator_artists 386685,875994,637954,3418,1344,789739,1460, …

id 541902
mbid 983d4f8f-473e-4091-8394-415c105c4656
name Charlie Winston
area United Kingdom
gender None
indicator_artists 997727,815,830794,59588,900,2591,1344,696268, …
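
With documents like these, recommending becomes a search: the artists a user has recently listened to form the query, and it is matched against the indicator_artists field. The sketch below is a stand-in for a real search engine. It only uses the indicator IDs visible on the slide (the lists are truncated there), it ranks by a simple overlap count rather than a relevance score, and the recommend function and the sample history are hypothetical.

```python
# Artist documents, restricted to the indicator IDs visible on the slide.
documents = [
    {
        "id": 1710,
        "name": "Chuck Berry",
        "indicator_artists": [386685, 875994, 637954, 3418, 1344, 789739, 1460],
    },
    {
        "id": 541902,
        "name": "Charlie Winston",
        "indicator_artists": [997727, 815, 830794, 59588, 900, 2591, 1344, 696268],
    },
]

def recommend(history, docs):
    """Rank artists by how many of their indicators appear in the history.

    history: artist IDs the user has recently listened to (the query).
    docs: artist documents with an 'indicator_artists' field.
    """
    recent = set(history)
    scored = []
    for doc in docs:
        if doc["id"] in recent:
            continue  # do not recommend what the user already plays
        hits = len(recent.intersection(doc["indicator_artists"]))
        if hits:
            scored.append((hits, doc["name"]))
    # Most matching indicators first; name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]

# Hypothetical listening history: two artist IDs taken from the lists above.
result = recommend([1344, 3418], documents)
```

A production search engine would weight rare indicators more heavily than common ones, which a plain overlap count does not do.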

Practical Machine Learning:
Innovations in Recommendation
28 April 2014 NoSQL Matters Conference #NoSQLMatters
Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
