Matrix methods for Hadoop
1. Matrix Methods with Hadoop
Slides bit.ly/10SIe1A
Code github.com/dgleich/matrix-hadoop-tutorial
DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY
David Gleich · Purdue
bit.ly/10SIe1A
5. Matrix computations
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \vdots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
Operations: Ax
Linear systems: Ax = b
Least squares: \min \|Ax - b\|
Eigenvalues: Ax = \lambda x
6. Outcomes
Recognize relationships between matrix methods and things you’ve already been doing.
Example: SQL queries as matrix computations
Understand how to use Hadoop to compute these matrix methods at scale for BigData.
Example: Recommenders with social network info
Understand some of the issues that could arise.
7. Ideal outcomes
How to use techniques from matrix computations in order to solve your problems quickly!
1986
8. Taking the red pill …
Image from rockysprings, deviantart, CC share-alike
9. Matrix computations
Physics
Databases
Statistics
Machine learning
Engineering
Information retrieval
Graphics
Computer vision
Bioinformatics
Social networks
11. A SQL statement as a matrix computation
How do I find the average rating for each product?
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
12. A SQL statement as a matrix computation
How do I find the average rating for each product?
http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
SELECT
    p.product_id,
    p.name,
    AVG(pr.rating) AS rating_average
FROM products p
INNER JOIN product_ratings pr
    ON pr.product_id = p.product_id
GROUP BY p.product_id
ORDER BY rating_average DESC
13. This SQL statement is a matrix computation!
Image from rockysprings, deviantart, CC share-alike
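19. … but there is a linear operator hiding …
P = \begin{bmatrix}
A_{1,1} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & A_{1,2} / \sum_j \text{“}A_{1,j} \neq 0\text{”} & \cdots \\
A_{2,1} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & A_{2,2} / \sum_j \text{“}A_{2,j} \neq 0\text{”} & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
avg(A) = Pe
e is the vector of all ones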
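To make avg(A) = Pe concrete, here is a minimal pure-Python sketch. It assumes A is a products-by-users ratings matrix in which 0 means "no rating"; the matrix values and the helper names `row_normalize` and `matvec` are made up for illustration.

```python
# Hypothetical ratings matrix: rows are products, columns are users,
# and 0 means "this user did not rate this product".
A = [
    [5, 0, 3, 4],
    [0, 2, 0, 0],
    [1, 1, 1, 0],
]

def row_normalize(A):
    """Build P by dividing each row of A by its number of nonzero entries."""
    P = []
    for row in A:
        nnz = sum(1 for a in row if a != 0)
        P.append([a / nnz if nnz else 0.0 for a in row])
    return P

def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(m * xk for m, xk in zip(row, x)) for row in M]

P = row_normalize(A)
e = [1.0] * len(A[0])      # the vector of all ones
avg = matvec(P, e)         # one average rating per product
```

Each entry of `avg` matches what `AVG(pr.rating) ... GROUP BY p.product_id` would return for that product, up to floating-point rounding.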
21. MapReduce
22. The MapReduce Framework
Originated at Google for indexing web pages and computing PageRank.
Express algorithms in “data-local operations”.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.
Data scalable
Fault-tolerance by design
Input stored in triplicate
Map output persisted to disk before shuffle
Reduce input/output on disk
[Figure: map tasks (M) send their output through the shuffle to reduce tasks (R).]
23. wordcount is a matrix computation too
map(document) :
    for word in document
        emit (word, 1)
reduce(word, counts) :
    emit (word, sum(counts))
[Figure: documents D1 to D5 each emit pairs such as (matrix,1), (bigdata,1), (hadoop,1).]
24. wordcount is a matrix computation too
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \vdots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
Rows of A are doc1, doc2, …, docm.
word count = colsum(A) = A^T e
e is the vector of all ones
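As a small check of word count = A^T e, the sketch below builds a document-term count matrix for three made-up documents and compares its column sums with a direct count. The documents and variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical corpus: one string per document.
docs = ["matrix bigdata hadoop", "bigdata hadoop", "hadoop matrix hadoop"]

vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# A[i][j] = number of times term j appears in document i.
A = [[0] * len(vocab) for _ in docs]
for i, d in enumerate(docs):
    for w in d.split():
        A[i][col[w]] += 1

e = [1] * len(docs)  # the vector of all ones
# (A^T e)_j = sum_i A[i][j] * e[i], i.e. the j-th column sum of A.
colsum = [sum(A[i][j] * e[i] for i in range(len(docs)))
          for j in range(len(vocab))]
word_count = dict(zip(vocab, colsum))

# The same numbers computed without any matrix.
direct_count = dict(Counter(" ".join(docs).split()))
```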
25. inverted index is a matrix computation too
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \vdots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
Rows of A are doc1, doc2, …, docm.
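26. inverted index is a matrix computation too
A^T = \begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \vdots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix}
Rows of A^T are term1, term2, …, termn.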
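A minimal sketch of the inverted index as a transpose, assuming the document-term matrix is stored sparsely as a dict of dicts. The data and the helper name `transpose` are hypothetical.

```python
# Hypothetical sparse document-term matrix: doc -> {term: count}.
A = {
    "doc1": {"matrix": 1, "hadoop": 2},
    "doc2": {"bigdata": 1, "hadoop": 1},
    "doc3": {"matrix": 3},
}

def transpose(sparse):
    """Swap the roles of rows and columns in a dict-of-dicts sparse matrix."""
    out = {}
    for row, entries in sparse.items():
        for colname, value in entries.items():
            out.setdefault(colname, {})[row] = value
    return out

# term -> {doc: count}: each row of A^T is a posting list.
inverted_index = transpose(A)
```

Transposing twice returns the original matrix, which is a quick sanity check that no entries are lost.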
29. A recommender system with social info
product_ratings (R)
pid8 uid2 4
pid9 uid9 1
pid2 uid9 5
pid9 uid5 5
pid6 uid8 4
pid1 uid2 4
pid3 uid4 4
pid5 uid9 2
pid9 uid8 4
pid9 uid9 1
friends_links (S)
uid6 uid1
uid8 uid9
uid7 uid7
uid7 uid4
uid6 uid2
uid7 uid1
uid3 uid1
uid1 uid8
uid7 uid3
uid9 uid1
30. A recommender system with social info
Recommend each item based on the average rating of all trusted users, with something that is almost a matrix-matrix product: “X = S R^T”
[Figure: R has rows pid1, pid2, …; S has rows uid1, uid2, …]
X_{uid,pid} = \left( \sum_{uid2} S_{uid,uid2} R_{uid2,pid} \right) \cdot \left( \sum_{uid2} \text{“}S_{uid,uid2} \text{ and } R_{uid2,pid} \neq 0\text{”} \right)^{-1}
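The formula can be evaluated directly on small data. The sketch below assumes S is stored as "user -> set of trusted users" (so every stored S entry is 1) and R as "user -> {product: rating}"; the data and the function name `trusted_average` are illustrative only.

```python
# Hypothetical data. S[u] is the set of users that u trusts (S_{u,u2} = 1);
# R[u2] maps product ids to the rating given by user u2.
S = {"uid1": {"uid2", "uid3"}, "uid2": {"uid3"}}
R = {
    "uid2": {"pid1": 4, "pid2": 2},
    "uid3": {"pid1": 2},
}

def trusted_average(S, R):
    """X[(uid, pid)] = sum of trusted users' ratings / number of such ratings."""
    sums, counts = {}, {}
    for u, trusted in S.items():
        for u2 in trusted:
            for pid, rating in R.get(u2, {}).items():
                sums[(u, pid)] = sums.get((u, pid), 0) + rating
                counts[(u, pid)] = counts.get((u, pid), 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

X = trusted_average(S, R)
```

The numerator is an ordinary sparse matrix-matrix product; only the division by the count of contributing users makes it "almost" one.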
31. Tools I like
hadoop streaming
dumbo
mrjob
hadoopy
C++
32. Tools I don’t use but other people seem to like …
pig
java
hbase
mahout
Eclipse
Cassandra
Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there.
I’m a low-level guy
33. hadoop streaming
the map function is a program
(key,value) pairs are sent via stdin
output (key,value) pairs go to stdout
the reduce function is a program
(key,value) pairs are sent via stdin
keys are grouped
output (key,value) pairs go to stdout
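The contract above can be sketched without a cluster. This assumes Hadoop streaming's default text convention of one tab-separated key and value per line, with the reducer receiving lines sorted by key; the function names and the `run_locally` helper are hypothetical, and a real job would put the mapper and reducer in separate scripts that read `sys.stdin` and print to stdout.

```python
import itertools

def mapper(lines):
    """Map program: read raw text lines, write 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Reduce program: input lines arrive sorted, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

def run_locally(lines):
    """Stand-in for the framework: sorting plays the role of the shuffle."""
    return list(reducer(sorted(mapper(lines))))

streaming_demo = run_locally(["matrix bigdata hadoop", "bigdata hadoop", "Hadoop"])
```

Because the reducer only sees a sorted stream, it has to detect key boundaries itself, which is what `itertools.groupby` does here.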
34. mrjob from Yelp
a wrapper around hadoop streaming for map and reduce functions in python
from mrjob.job import MRJob

class MRWordFreqCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # counts iterates over all the 1s emitted for this word
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
35. How can Hadoop streaming possibly be fast?
Synthetic data test: 100,000,000-by-500 matrix (~500GB)
Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+Atlas
Custom C++ TypedBytes reader/writer with Atlas
500 GB matrix. Computing the R in a QR factorization.
See my next talk!
New non-streaming Java implementation too

           Iter 1 QR   Iter 1 Total   Iter 2 Total   Overall Total
           (secs.)     (secs.)        (secs.)        (secs.)
Dumbo      67725       960            217            1177
Hadoopy    70909       612            118            730
C++        15809       350            37             387
Java                   436            66             502

C++ in streaming beats a native Java implementation.
All timing results from the Hadoop job tracker
David Gleich (Sandia) MapReduce 2011 16/22
mrjob could be faster if it used typedbytes for intermediate storage; see https://github.com/Yelp/mrjob/pull/447
Example available from github.com/dgleich/mrtsqr for verification
36. Ax = y
Matrix-vector product
y_i = \sum_k A_{ik} x_k
Follow along! matrix-hadoop/codes/smatvec.py
37. Where do matrix-vector products arise?
Google’s PageRank
Computing cosine-similarity between one document and all other documents
Predictions from kernel methods
Computing averages (the example above)
38. Ax = y
Matrix-vector product
y_i = \sum_k A_{ik} x_k
Follow along! matrix-hadoop/codes/smatvec.py
A is stored by row
$ head samples/smat_5_5.txt
0 0 0.125 3 1.024 4 0.121
1 0 0.597
2 2 1.247
3 4 -1.45
4 2 0.061
x is stored entry-wise
$ head samples/vec_5.txt
0 0.241
1 -0.98
2 0.237
3 -0.32
4 0.080
39. Matrix-vector product Ax = y (in pictures)
y_i = \sum_k A_{ik} x_k
Input: A and x
Map 1: Align on columns
Reduce 1: Output A_{ik} x_k keyed on row i
Reduce 2: Output sum(A_{ik} x_k), which is y
40. Matrix-vector product Ax = y (in pictures)
y_i = \sum_k A_{ik} x_k
Input, then Map 1: Align on columns
def joinmap(self, key, line):
    vals = line.split()
    if len(vals) == 2:
        # the vector
        yield (vals[0],               # row
               (float(vals[1]),))     # xi
    else:
        # the matrix
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield (vals[i],           # column
                   (row,              # i,Aij
                    float(vals[i+1])))
41. Matrix-vector product Ax = y (in pictures)
y_i = \sum_k A_{ik} x_k
Input, then Map 1: Align on columns
Reduce 1: Output A_{ik} x_k keyed on row i
def joinred(self, key, vals):
    vecval = 0.
    matvals = []
    for val in vals:
        if len(val) == 1:
            vecval += val[0]
        else:
            matvals.append(val)
    for val in matvals:
        yield (val[0], val[1]*vecval)
Note that you should use a secondary sort to avoid reading both in memory
42. Matrix-vector product Ax = y (in pictures)
y_i = \sum_k A_{ik} x_k
Input, then Map 1: Align on columns
Reduce 1: Output A_{ik} x_k keyed on row i
Reduce 2: Output sum(A_{ik} x_k)
def sumred(self, key, vals):
    yield (key, sum(vals))
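The three functions above can be chained on a laptop to see the whole pipeline run. This is a sketch under stated assumptions: `run_mapreduce` is a hypothetical in-memory stand-in for map, shuffle, and reduce (not part of mrjob or the tutorial code), the functions drop the `self` and input-key arguments, and the input lines are the sample rows of `smat_5_5.txt` and `vec_5.txt` shown earlier.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out

def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                       # vector entry "k x_k"
        yield vals[0], (float(vals[1]),)
    else:                                    # matrix row "i k A_ik k A_ik ..."
        for j in range(1, len(vals), 2):
            yield vals[j], (vals[0], float(vals[j + 1]))

def joinred(key, vals):
    xk = sum(v[0] for v in vals if len(v) == 1)   # the single x_k for column k
    for v in vals:
        if len(v) == 2:
            yield v[0], v[1] * xk                 # (row i, A_ik * x_k)

def identity(pair):
    yield pair

def sumred(key, vals):
    yield key, sum(vals)

matrix_lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
                "3 4 -1.45", "4 2 0.061"]
vector_lines = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]

products = run_mapreduce(matrix_lines + vector_lines, joinmap, joinred)
y = dict(run_mapreduce(products, identity, sumred))   # row id -> y_i
```

Unlike the slide's `joinred`, this version scans the values twice rather than buffering the matrix entries, which only works because the values are already in memory; on a cluster the secondary sort mentioned above serves the same purpose.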
43. AB = C
Matrix-matrix product
C_{ij} = \sum_k A_{ik} B_{kj}
Follow along! matrix-hadoop/codes/matmat.py
44. AB = C
Matrix-matrix product
C_{ij} = \sum_k A_{ik} B_{kj}
Follow along! matrix-hadoop/codes/matmat.py
A is stored by row
$ head samples/smat_10_5_A.txt
0 0 0.599 4 -1.53
1
2 2 0.260
3
4 0 0.267 1 0.839
B is stored by row
$ head samples/smat_5_5.txt
0 0 0.125 3 1.024 4 0.121
1 0 0.597
2 2 1.247
45. Matrix-matrix product AB = C (in pictures)
C_{ij} = \sum_k A_{ik} B_{kj}
Map 1: Align on columns
Reduce 1: Output A_{ik} B_{kj} keyed on (i,j)
Reduce 2: Output sum(A_{ik} B_{kj}), which is C
46. Matrix-matrix product AB = C (in code)
C_{ij} = \sum_k A_{ik} B_{kj}
Map 1: Align on columns
def joinmap(self, key, line):
    mtype = self.parsemat()
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[i], float(vals[i+1]))
               for i in range(1, len(vals), 2)]
    if mtype == 1:
        # matrix A, output by col
        for val in rowvals:
            yield (val[0], (row, val[1]))
    else:
        # matrix B, output the whole row keyed on its row id
        yield (row, (rowvals,))
47. Matrix-matrix product AB = C (in pictures)
C_{ij} = \sum_k A_{ik} B_{kj}
Map 1: Align on columns
Reduce 1: Output A_{ik} B_{kj} keyed on (i,j)
def joinred(self, key, vals):
    # load the data into memory
    brow = []
    acol = []
    for val in vals:
        if len(val) == 1:
            brow.extend(val[0])
        else:
            acol.append(val)

    for (bcol, bval) in brow:
        for (arow, aval) in acol:
            yield ((arow, bcol), aval*bval)
48. Matrix-matrix product AB = C (in pictures)
C_{ij} = \sum_k A_{ik} B_{kj}
Map 1: Align on columns
Reduce 1: Output A_{ik} B_{kj} keyed on (i,j)
Reduce 2: Output sum(A_{ik} B_{kj})
def sumred(self, key, vals):
    yield (key, sum(vals))
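The same two-stage pattern can be exercised locally for the matrix-matrix product. This sketch assumes a hypothetical in-memory `run_mapreduce` helper and tags each input line with "A" or "B" in place of `self.parsemat()`; the two small matrices are made up so the result is easy to check by hand.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Tiny in-memory stand-in for map -> shuffle -> reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out

def joinmap(record):
    name, line = record              # name is "A" or "B" (replaces parsemat)
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[i], float(vals[i + 1])) for i in range(1, len(vals), 2)]
    if name == "A":
        for col, a in rowvals:       # A entries are keyed on their column k
            yield col, (row, a)
    else:
        yield row, (rowvals,)        # B rows are keyed on their row k

def joinred(key, vals):
    brow, acol = [], []
    for val in vals:
        if len(val) == 1:
            brow.extend(val[0])
        else:
            acol.append(val)
    for bcol, bval in brow:
        for arow, aval in acol:
            yield (arow, bcol), aval * bval

def identity(pair):
    yield pair

def sumred(key, vals):
    yield key, sum(vals)

A_lines = ["0 0 2.0 1 1.0", "1 1 3.0"]   # hypothetical A = [[2, 1], [0, 3]]
B_lines = ["0 0 1.0 1 4.0", "1 0 5.0"]   # hypothetical B = [[1, 4], [5, 0]]
records = [("A", l) for l in A_lines] + [("B", l) for l in B_lines]

partial = run_mapreduce(records, joinmap, joinred)
C = dict(run_mapreduce(partial, identity, sumred))   # (i, j) -> C_ij
```

Only nonzero entries of C appear in the output, so C[1][1] = 0 is simply absent.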
49. Our social recommender
Follow along! matrix-hadoop/recsys/recsys.py
R is stored entry-wise
$ gunzip -c data/rating.txt.gz
Object ID    User ID      Rating
139431556    591156       5
139431556    1312460676   5
139431556    204358       4
139431556    368725       5
S is stored entry-wise
$ gunzip -c data/rating.txt.gz
My ID        Other ID     Trust
3287060356   232085       -1
3288305540   709420       1
3290337156   204418       -1
3294138244   269243       -1
50. Social recommender (in code)
Conceptually, the first step is the same as the matrix-matrix product. We reorganize the data by user-id to be able to map the trust relationships.
Map 1: Align on columns
def joinmap(self, key, line):
    parts = line.split('\t')
    if len(parts) == 8:  # ratings
        objid = parts[0].strip()
        uid = parts[1].strip()
        rat = int(parts[2])
        yield (uid, (objid, rat))
    elif len(parts) == 4:  # trust
        myid = parts[0].strip()
        otherid = parts[1].strip()
        value = int(parts[2])
        if value > 0:
            yield (otherid, (myid,))
51. Matrix-matrix product (in pictures)
Conceptually, the second step is the same as the matrix-matrix product too: we “map” the ratings from each trusted user back to the source.
Map 1: Align on columns
Reduce 1: Output A_{ik} B_{kj} keyed on (i,j)
def joinred(self, key, vals):
    tusers = []   # uids that trust key
    ratobjs = []  # objs rated by uid=key
    for val in vals:
        if len(val) == 1:
            tusers.append(val[0])
        else:
            ratobjs.append(val)

    for (objid, rat) in ratobjs:
        for uid in tusers:
            yield ((uid, objid), rat)
52. Matrix-matrix product AB = C (in pictures)
C_{ij} = \sum_k A_{ik} B_{kj}
Map 1: Align on columns
Reduce 1: Output A_{ik} B_{kj} keyed on (i,j)
Reduce 2: Output sum(A_{ik} B_{kj})
def avgred(self, key, vals):
    s = 0.
    n = 0
    for val in vals:
        s += val
        n += 1
    # the smoothed average of ratings
    yield key, (s+self.options.avg)/float(n+1)
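The smoothing in `avgred` is easiest to see on a few numbers. The sketch below assumes `self.options.avg` holds a prior average rating supplied by the user (the slides do not say where it comes from); the standalone function name `smoothed_average` and the sample values are hypothetical.

```python
def smoothed_average(ratings, prior):
    """(sum + prior) / (n + 1): the mean after adding one pseudo-rating."""
    s = 0.0
    n = 0
    for r in ratings:
        s += r
        n += 1
    return (s + prior) / float(n + 1)

one_rating = smoothed_average([5], 3.0)             # pulled halfway to the prior
four_ratings = smoothed_average([5, 5, 5, 5], 3.0)  # much closer to 5
no_ratings = smoothed_average([], 3.0)              # falls back to the prior
```

With few trusted ratings the estimate stays near the prior, and it moves toward the plain average as more ratings arrive.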
53. Better ways to store matrices in Hadoop
Block matrices minimize the number of intermediate keys and values used. I’d form them based on the first reduce.
No need for “integer” keys that fall between 1 and n!
54. Tall-and-skinny matrices are common in BigData
A : m x n, m ≫ n
Key is an arbitrary row-id
Value is the 1 x n array for a row
Each submatrix Ai is the input to a map task.
[Figure: A split by rows into submatrices A1, A2, A3, A4]
56. Error analysis of summation
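s = 0; for i=1 to n: s = s + x[i]
\mathrm{fl}(x + y) = (x + y)(1 + \delta), \quad |\delta| \le \mu
\left| \mathrm{fl}\left( \sum_i x_i \right) - \sum_i x_i \right| \le n \mu \sum_i |x_i|, \quad \mu \approx 10^{-16}
A simple summation formula has error that is not always small if n is a billion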
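A quick experiment shows the effect at a smaller n. The sketch below sums one million copies of 0.1 with the simple loop and compares against `math.fsum`, which returns the correctly rounded sum of the stored doubles; the choice of input and the use of 2^-53 for µ are assumptions made for the demonstration.

```python
import math

def naive_sum(xs):
    """The simple loop from the slide: s = s + x[i]."""
    s = 0.0
    for x in xs:
        s = s + x
    return s

n = 10**6
xs = [0.1] * n

naive = naive_sum(xs)
reference = math.fsum(xs)            # correctly rounded reference value
abs_error = abs(naive - reference)

mu = 2.0**-53                        # unit roundoff for doubles, about 1.1e-16
bound = n * mu * math.fsum(abs(x) for x in xs)
```

The explicit loop is deliberate: recent Python versions compensate inside the built-in `sum` for floats, which would hide the error being illustrated.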
57. If your application matters
then watch out for this issue.
Use quad-precision arithmetic
or compensated summation
instead.
58. Compensated Summation
“Kahan summation algorithm” on Wikipedia
s = 0.; c = 0.;
for i=1 to n:
    y = x[i] - c
    t = s + y
    c = (t - s) - y
    s = t
Mathematically, c is always zero.
On a computer, c can be non-zero
The parentheses matter!
\left| \mathrm{fl}(\mathrm{csum}(x)) - \sum_i x_i \right| \le (\mu + n \mu^2) \sum_i |x_i|, \quad \mu \approx 10^{-16}
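A direct Python transcription of the loop above, as a sketch: the function names and the test input of one million copies of 0.1 are assumptions, and `math.fsum` serves as the reference value.

```python
import math

def kahan_sum(xs):
    """Compensated summation: c carries the rounding error of each addition."""
    s = 0.0
    c = 0.0
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y   # the parentheses matter: (t - s) recovers what was added
        s = t
    return s

def naive_sum(xs):
    s = 0.0
    for x in xs:
        s = s + x
    return s

xs = [0.1] * 10**6
exact = math.fsum(xs)
kahan_error = abs(kahan_sum(xs) - exact)
naive_error = abs(naive_sum(xs) - exact)
```

In a MapReduce reducer the same idea applies to the running total inside `sumred`, at the cost of three extra floating-point operations per value.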
59. Collaborators, Friends, and
People who have taught me
MRTSQR! Sandia MapReduce!
Paul Constantine (Stanford)
Todd Plantenga
Austin Benson (Stanford)
Tammy Kolda
James Demmel (Berkeley)
Justin Basilico (now Netflix)
Simform! Others!
Jeremy Templeton (Sandia)
Margot Gerritsen (Stanford)
Joe Ruthruff (Sandia)
Yangyang Hou (Purdue)
Grants
Sandia CSAR
Joe Nichols (Stanford)
60. Questions?
Image from rockysprings, deviantart, CC share-alike