Matrix methods for Hadoop

Matrix Methods
with Hadoop
Slides bit.ly/10SIe1A
Code github.com/dgleich/matrix-hadoop-tutorial

DAVID F. GLEICH
ASSISTANT PROFESSOR
COMPUTER SCIENCE
PURDUE UNIVERSITY


A bit of philosophy …

Image from rockysprings, deviantart, CC share-alike


Matrix computations
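\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

Operations: \( Ax \)
Linear systems: \( Ax = b \)
Least squares: \( \min \| Ax - b \| \)
Eigenvalues: \( Ax = \lambda x \)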

Outcomes
Recognize relationships between matrix methods and
things you’ve already been doing.
Example: SQL queries as matrix computations

Understand how to use Hadoop to compute these
matrix methods at scale for BigData.
Example: Recommenders with social network info

Understand some of the issues that could arise.

Ideal outcomes

How to use techniques from
matrix computations in order
to solve your problems quickly!

1986


Taking the red pill …



Matrix computations
Physics
Databases
Statistics
Machine learning
Engineering
Information retrieval
Graphics
Computer vision
Bioinformatics
Social networks


matrix computations ≠ linear algebra

A SQL statement as a matrix computation

http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql

How do I find the average rating for each product?

    SELECT
      p.product_id,
      p.name,
      AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr
      ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC

This SQL statement is a matrix computation!

    SELECT ... AVG(pr.rating) ... GROUP BY p.product_id

product_ratings

    product  user  rating
    pid8     uid2  4
    pid9     uid9  1
    pid2     uid9  5
    pid9     uid5  5
    pid6     uid8  4
    pid1     uid2  4
    pid3     uid4  4
    pid5     uid9  2
    pid9     uid8  4
    pid9     uid9  1

Is a matrix, with one row per product (pid1, ..., pid9)!

But it’s a weird matrix

Missing entries!

[Figure: the product_ratings table drawn as a matrix with rows pid1, ..., pid9; most entries are missing.]


Average of ratings

[Figure: the product_ratings matrix (rows pid1, ..., pid9) is reduced to a vector of per-product averages by SELECT AVG(r) ... GROUP BY pid.]

Matrix → Vector

and not a linear operator

product_ratings is a matrix

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]

\[
\mathrm{avg}(A) = \begin{bmatrix}
\sum_j A_{1,j} \,/\, \sum_j [A_{1,j} \neq 0] \\
\sum_j A_{2,j} \,/\, \sum_j [A_{2,j} \neq 0] \\
\vdots \\
\sum_j A_{m,j} \,/\, \sum_j [A_{m,j} \neq 0]
\end{bmatrix}
\]

Here \( [A_{i,j} \neq 0] \) is 1 when the entry is present and 0 when it is missing.

matrix computations ≠ linear algebra

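… but there is a linear
operator hiding …

\[
P = \begin{bmatrix}
A_{1,1} / \sum_j [A_{1,j} \neq 0] & A_{1,2} / \sum_j [A_{1,j} \neq 0] & \cdots \\
A_{2,1} / \sum_j [A_{2,j} \neq 0] & A_{2,2} / \sum_j [A_{2,j} \neq 0] & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}
\]

Here \( [A_{i,j} \neq 0] \) is 1 when the entry is present and 0 when it is missing.

\[ \mathrm{avg}(A) = P e \]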

e is the vector of all ones
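
As a rough illustration of avg(A) = Pe, here is a minimal pure-Python sketch that builds P from the product_ratings entries and multiplies it by the all-ones vector. The function names (`row_counts`, `build_P`, `times_ones`) are made up for this example. It also assumes each (product, user) pair is stored once, so the duplicated (pid9, uid9, 1) row of the table is kept only once, which differs from what SQL's AVG would do with a repeated row.

```python
# Ratings keyed by (product, user). Assumption: one entry per pair, so the
# repeated (pid9, uid9, 1) row of the table appears only once here.
ratings = {
    ("pid8", "uid2"): 4, ("pid9", "uid9"): 1, ("pid2", "uid9"): 5,
    ("pid9", "uid5"): 5, ("pid6", "uid8"): 4, ("pid1", "uid2"): 4,
    ("pid3", "uid4"): 4, ("pid5", "uid9"): 2, ("pid9", "uid8"): 4,
}


def row_counts(A):
    """Number of stored entries in each row (product) of A."""
    counts = {}
    for (pid, _uid) in A:
        counts[pid] = counts.get(pid, 0) + 1
    return counts


def build_P(A):
    """P[i, j] = A[i, j] / (number of stored entries in row i)."""
    counts = row_counts(A)
    return {(pid, uid): r / counts[pid] for (pid, uid), r in A.items()}


def times_ones(P):
    """Compute P e. With e all ones this is the vector of row sums of P."""
    out = {}
    for (pid, _uid), value in P.items():
        out[pid] = out.get(pid, 0.0) + value
    return out


avg = times_ones(build_P(ratings))
for pid in sorted(avg):
    print(pid, avg[pid])
```

The nonlinearity lives entirely in forming P from A; once P is fixed, the averaging step is an ordinary matrix-vector product.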


Hadoop, MapReduce,
and Matrix Methods


MapReduce


The MapReduce Framework
Originated at Google for indexing web
pages and computing PageRank.

Express algorithms in
“data-local operations”.

Implement one type of
communication: shuffle.

Shuffle moves all data with
the same key to the same
reducer.

Data scalable
[Figure: input blocks 1 to 5 feed map tasks (M); a shuffle routes the map output to reduce tasks (R).]

Fault-tolerance by design
Input stored in triplicate
Map output persisted to disk before shuffle
Reduce input/output on disk
[Figure: map tasks (M) feeding reduce tasks (R).]

wordcount
is a matrix computation too

    map(document) :
      for word in document
        emit (word, 1)

    reduce(word, counts) :
      emit (word, sum(counts))

[Figure: documents D1, ..., D5 each emit pairs such as (matrix,1), (bigdata,1), (hadoop,1).]

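wordcount

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]
(rows: doc1, doc2, ..., docm)

\[ \text{word count} = \mathrm{colsum}(A) = A^T e \]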
e is the vector of all ones
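
To make the identity word count = colsum(A) = A^T e concrete, the sketch below builds a tiny document-by-word count matrix and checks that its column sums match a locally simulated map/reduce word count. The three sample documents and the function names are invented for illustration, and the grouping dictionary only stands in for Hadoop's shuffle.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; each string is one document.
docs = [
    "matrix bigdata hadoop",
    "matrix bigdata hadoop bigdata",
    "hadoop hadoop",
]

# Row i of A: A[i][word] = number of times word appears in document i.
A = [Counter(doc.split()) for doc in docs]


def colsum(A):
    """Column sums of A, i.e. A^T e with e the vector of all ones."""
    totals = defaultdict(int)
    for row in A:
        for word, count in row.items():
            totals[word] += count
    return dict(totals)


def wordcount_mapreduce(docs):
    """The map/reduce wordcount pseudocode, run locally."""
    groups = defaultdict(list)
    for doc in docs:                  # map phase
        for word in doc.split():
            groups[word].append(1)    # emit (word, 1); grouping mimics the shuffle
    # reduce phase: emit (word, sum(counts))
    return {word: sum(counts) for word, counts in groups.items()}


print(colsum(A))
print(wordcount_mapreduce(docs))
```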


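inverted index

\[
A = \begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\
A_{2,1} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m-1,n} \\
A_{m,1} & \cdots & A_{m,n-1} & A_{m,n}
\end{bmatrix}
\]
(rows: doc1, doc2, ..., docm)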

inverted index

\[
A^T = \begin{bmatrix}
A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\
A_{1,2} & A_{2,2} & \cdots & \vdots \\
\vdots & \ddots & \ddots & A_{m,n-1} \\
A_{1,n} & \cdots & A_{m-1,n} & A_{m,n}
\end{bmatrix}
\]
(rows: term1, term2, ..., termn)

A recommender system
with social info

product_ratings

    pid8 uid2 4
    pid9 uid9 1
    pid2 uid9 5
    pid9 uid5 5
    pid6 uid8 4
    pid1 uid2 4
    pid3 uid4 4
    pid5 uid9 2
    pid9 uid8 4
    pid9 uid9 1

friends_links

    uid6 uid1
    uid8 uid9
    uid7 uid7
    uid7 uid4
    uid6 uid2
    uid7 uid1
    uid3 uid1
    uid1 uid8
    uid7 uid3
    uid9 uid1

with social info

[Figure: product_ratings drawn as a matrix with rows pid1, pid2, ...; friends_links drawn as a matrix with rows uid1, uid2, ...]

with social info

product_ratings = R
friends_links = S

with social info

Recommend each item based
on the average rating of all
trusted users

with something that is
almost a matrix-matrix
product: “X = S R^T”

[Figure: R drawn as a matrix with rows pid1, pid2, ...; S drawn as a matrix with rows uid1, uid2, ...]

\[
X_{uid,pid} = \Big( \sum_{uid2} S_{uid,uid2} R_{uid2,pid} \Big) \cdot \Big( \sum_{uid2} [\, S_{uid,uid2} \text{ and } R_{uid2,pid} \neq 0 \,] \Big)^{-1}
\]

Here the bracket is 1 when both entries are present and 0 otherwise.
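
The formula for X can be tried on a handful of entries. The following is only a sketch under assumptions: the function name `recommend`, the toy trust links, and the toy ratings are all invented, R is indexed as (uid2, pid) to match the subscripts in the formula, and missing entries are simply absent from the dictionaries.

```python
from collections import defaultdict


def recommend(S, R):
    """X[uid, pid] = average of S[uid, uid2] * R[uid2, pid] over the users
    uid2 for which both entries are present and nonzero."""
    ratings_by_user = defaultdict(list)
    for (uid2, pid), rating in R.items():
        if rating != 0:
            ratings_by_user[uid2].append((pid, rating))

    sums = defaultdict(float)   # numerator: sum of S * R terms
    counts = defaultdict(int)   # denominator: number of contributing uid2
    for (uid, uid2), trust in S.items():
        if trust == 0:
            continue
        for pid, rating in ratings_by_user.get(uid2, ()):
            sums[(uid, pid)] += trust * rating
            counts[(uid, pid)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical data: u1 trusts u2 and u3, u2 trusts u3.
S = {("u1", "u2"): 1, ("u1", "u3"): 1, ("u2", "u3"): 1}
R = {("u2", "p1"): 4, ("u3", "p1"): 2, ("u3", "p2"): 5}

X = recommend(S, R)
print(X)
```

The numerator is an ordinary sparse matrix-matrix product; only the division by the count of contributing users makes it "almost" one.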


Tools I like

hadoop streaming
dumbo
mrjob
hadoopy
C++


Tools I don’t use but other
people seem to like …

pig
java
hbase
mahout
Eclipse
Cassandra

Mahout is the closest thing to a library
for matrix computations in Hadoop. If
you like Java, you should probably
start there.

I’m a low-level guy

hadoop streaming

the map function is a program
(key,value) pairs are sent via stdin
output (key,value) pairs go to stdout

the reduce function is a program
(key,value) pairs are sent via stdin
keys are grouped
output (key,value) pairs go to stdout
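
The stdin/stdout contract above can be sketched without a cluster. The example below is an illustration, not code from the tutorial repository: `mapper`, `reducer`, and `run_locally` are hypothetical names, the pairs are written as tab-separated text lines (the default text convention in Hadoop streaming, where the part before the first tab is the key), and a local `sorted()` call stands in for the shuffle that groups keys.

```python
from itertools import groupby


def mapper(lines):
    """Read raw text lines, write 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield "%s\t%d" % (word.lower(), 1)


def reducer(lines):
    """Read key-sorted 'word<TAB>count' lines, write 'word<TAB>total' lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Stand-in for Hadoop: map, sort by key (the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["Matrix hadoop", "hadoop bigdata"])
print(result)
```

In a real streaming job each function would be its own program, looping over `sys.stdin` and printing to stdout, with Hadoop doing the sort in between.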


mrjob (from Yelp)

a wrapper around hadoop streaming for
map and reduce functions in python

    from mrjob.job import MRJob

    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in line.split():
                yield (word.lower(), 1)

        def reducer(self, word, counts):
            yield (word, sum(counts))

    if __name__ == '__main__':
        MRWordFreqCount.run()

How can Hadoop streaming
possibly be fast?

[Embedded slide: David Gleich (Sandia), MapReduce 2011, 16/22]

Synthetic data test: 100,000,000-by-500 matrix (~500GB)
Codes implemented in MapReduce streaming
Matrix stored as TypedBytes lists of doubles
Python frameworks use Numpy+Atlas
Custom C++ TypedBytes reader/writer with Atlas
New non-streaming Java implementation too (see my next talk!)

500 GB matrix. Computing the R in a QR factorization.

              Iter 1       Iter 1          Iter 2          Overall
              QR (secs.)   Total (secs.)   Total (secs.)   Total (secs.)
    Dumbo     67725        960             217             1177
    Hadoopy   70909        612             118             730
    C++       15809        350             37              387
    Java                   436             66              502

C++ in streaming beats a native Java implementation.
All timing results from the Hadoop job tracker.

mrjob could be faster if it used typedbytes for intermediate storage; see
https://github.com/Yelp/mrjob/pull/447

Example available from github.com/dgleich/mrtsqr for verification.

Matrix-vector product
Ax = y
\[ y_i = \sum_k A_{ik} x_k \]

Follow along!
matrix-hadoop/codes/smatvec.py

[Figure: the matrix A and the vector x.]

Where do matrix-vector
products arise?
Google’s PageRank

Computing cosine-similarity between one
document and all other documents

Predictions from kernel methods

Computing averages (the example above)


Matrix-vector product
Ax = y
\[ y_i = \sum_k A_{ik} x_k \]

Follow along!
matrix-hadoop/codes/smatvec.py

A is stored by row

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247
    3 4 -1.45
    4 2 0.061

x is stored entry-wise

    $ head samples/vec_5.txt
    0 0.241
    1 -0.98
    2 0.237
    3 -0.32
    4 0.080

Matrix-vector product Ax = y
(in pictures)
\[ y_i = \sum_k A_{ik} x_k \]

[Figure: Input (A and x) → Map 1: align on columns → Reduce 1: output A_ik x_k keyed on row i → Reduce 2: output sum(A_ik x_k), which is y.]

Matrix-vector product Ax = y
(in pictures)
\[ y_i = \sum_k A_{ik} x_k \]

Input → Map 1: align on columns

    def joinmap(self, key, line):
        vals = line.split()
        if len(vals) == 2:
            # the vector
            yield (vals[0],                # row
                   (float(vals[1]),))      # xi
        else:
            # the matrix
            row = vals[0]
            for i in range(1, len(vals), 2):
                yield (vals[i],            # column
                       (row,               # i, Aij
                        float(vals[i+1])))

Matrix-vector product Ax = y
(in pictures)
\[ y_i = \sum_k A_{ik} x_k \]

Input → Map 1: align on columns → Reduce 1: output A_ik x_k keyed on row i

    def joinred(self, key, vals):
        vecval = 0.
        matvals = []
        for val in vals:
            if len(val) == 1:
                vecval += val[0]
            else:
                matvals.append(val)
        for val in matvals:
            yield (val[0], val[1]*vecval)

Note that you should use a
secondary sort to avoid
reading both in memory

Matrix-vector product Ax = y
(in pictures)
\[ y_i = \sum_k A_{ik} x_k \]

Input → Map 1: align on columns → Reduce 1: output A_ik x_k keyed on row i → Reduce 2: output sum(A_ik x_k)

    def sumred(self, key, vals):
        yield (key, sum(vals))
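
The two-stage matrix-vector product can be checked end to end without Hadoop. The sketch below is an assumption-laden local simulation: the three functions mirror the slide code but drop `self`, the `shuffle` and `matvec` helpers are hypothetical stand-ins for what the framework does between stages, and the input lines are the sample rows of A and entries of x shown earlier.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the Hadoop shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def joinmap(line):
    vals = line.split()
    if len(vals) == 2:                      # a vector entry: "k x_k"
        yield vals[0], (float(vals[1]),)
    else:                                   # a matrix row: "i k A_ik k A_ik ..."
        row = vals[0]
        for i in range(1, len(vals), 2):
            yield vals[i], (row, float(vals[i + 1]))


def joinred(key, vals):
    vecval = 0.0
    matvals = []
    for val in vals:
        if len(val) == 1:                   # the single x_k for this column
            vecval += val[0]
        else:                               # an (i, A_ik) pair
            matvals.append(val)
    for row, aval in matvals:
        yield row, aval * vecval            # A_ik * x_k keyed on row i


def sumred(key, vals):
    yield key, sum(vals)


def matvec(lines):
    stage1 = [kv for line in lines for kv in joinmap(line)]
    stage2 = [kv for k, vals in shuffle(stage1) for kv in joinred(k, vals)]
    return dict(kv for k, vals in shuffle(stage2) for kv in sumred(k, vals))


A_lines = ["0 0 0.125 3 1.024 4 0.121", "1 0 0.597", "2 2 1.247",
           "3 4 -1.45", "4 2 0.061"]
x_lines = ["0 0.241", "1 -0.98", "2 0.237", "3 -0.32", "4 0.080"]
y = matvec(A_lines + x_lines)
print(sorted(y.items()))
```

Unlike a secondary-sort implementation, this version holds every value for a key in memory, which is fine for a toy input only.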


Matrix-matrix product
AB = C
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Follow along!
matrix-hadoop/codes/matmat.py

[Figure: the matrices A and B.]

AB = C
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Follow along!
matrix-hadoop/codes/matmat.py

A is stored by row

    $ head samples/smat_10_5_A.txt
    0 0 0.599 4 -1.53
    1
    2 2 0.260
    3
    4 0 0.267 1 0.839

B is stored by row

    $ head samples/smat_5_5.txt
    0 0 0.125 3 1.024 4 0.121
    1 0 0.597
    2 2 1.247

AB = C
(in pictures)
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

[Figure: Input (A and B) → Map 1: align on columns → Reduce 1: output A_ik B_kj keyed on (i,j) → Reduce 2: output sum(A_ik B_kj), which is C.]

AB = C
(in code)
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Map 1: align on columns

    # body of the first map function; line is one row of A or B
    mtype = self.parsemat()
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[i], float(vals[i+1]))
               for i in range(1, len(vals), 2)]
    if mtype == 1:
        # matrix A, output by col
        for val in rowvals:
            yield (val[0], (row, val[1]))
    else:
        yield (row, (rowvals,))

AB = C
(in pictures)
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Map 1: align on columns → Reduce 1: output A_ik B_kj keyed on (i,j)

    def joinred(self, key, vals):
        # load the data into memory
        brow = []
        acol = []
        for val in vals:
            if len(val) == 1:
                brow.extend(val[0])
            else:
                acol.append(val)

        for (bcol, bval) in brow:
            for (arow, aval) in acol:
                yield ((arow, bcol), aval*bval)

AB = C
(in pictures)
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Reduce 2: output sum(A_ik B_kj), keyed on (i,j)

    def sumred(self, key, vals):
        yield (key, sum(vals))
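
As with the matrix-vector case, the matrix-matrix pipeline can be simulated locally on a tiny input. This is a sketch with several assumptions: the matrix type is passed in explicitly as `mtype` instead of coming from the `parsemat` helper used on the slide, `shuffle` and `matmat` are hypothetical stand-ins for the framework, and the 2-by-2 matrices are invented test data in the same row-wise format.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group (key, value) pairs by key, as the Hadoop shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()


def joinmap(mtype, line):
    vals = line.split()
    row = vals[0]
    rowvals = [(vals[i], float(vals[i + 1])) for i in range(1, len(vals), 2)]
    if mtype == 1:          # matrix A: emit each entry keyed by its column k
        for col, aval in rowvals:
            yield col, (row, aval)
    else:                   # matrix B: emit the whole row keyed by its row k
        yield row, (rowvals,)


def joinred(key, vals):
    brow, acol = [], []
    for val in vals:
        if len(val) == 1:   # the row of B for this k
            brow.extend(val[0])
        else:               # an (i, A_ik) pair from column k of A
            acol.append(val)
    for bcol, bval in brow:
        for arow, aval in acol:
            yield (arow, bcol), aval * bval


def sumred(key, vals):
    yield key, sum(vals)


def matmat(A_lines, B_lines):
    stage1 = [kv for line in A_lines for kv in joinmap(1, line)]
    stage1 += [kv for line in B_lines for kv in joinmap(2, line)]
    stage2 = [kv for k, vals in shuffle(stage1) for kv in joinred(k, vals)]
    return dict(kv for k, vals in shuffle(stage2) for kv in sumred(k, vals))


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored by row.
A_lines = ["0 0 1 1 2", "1 1 3"]
B_lines = ["0 0 4", "1 0 5 1 6"]
C = matmat(A_lines, B_lines)
print(sorted(C.items()))
```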


Our social recommender
Follow along!
matrix-hadoop/recsys/recsys.py

[Figure: the matrices S and R^T.]

R is stored entry-wise
Columns shown: Object ID, User ID, Rating

    $ gunzip -c data/rating.txt.gz
    139431556 591156 5
    139431556 1312460676 5
    139431556 204358 4
    139431556 368725 5

S is stored entry-wise
Columns shown: My ID, Other ID, Trust

    $ gunzip -c data/rating.txt.gz
    3287060356 232085 -1
    3288305540 709420 1
    3290337156 204418 -1
    3294138244 269243 -1

Social recommender
(in code)

Conceptually, the first step
is the same as the matrix-
matrix product.

We reorganize the data by
user-id to be able to map
the trust relationships

Map 1: align on columns

    parts = line.split('\t')
    if len(parts) == 8: # ratings
        objid = parts[0].strip()
        uid = parts[1].strip()
        rat = int(parts[2])
        yield (uid, (objid, rat))
    elif len(parts) == 4: # trust
        myid = parts[0].strip()
        otherid = parts[1].strip()
        value = int(parts[2])
        if value > 0:
            yield (otherid, (myid,))

(in pictures)

Conceptually, the second step
is the same as the matrix-
matrix product too: we “map”
the ratings from each trusted
user back to the source.

Map 1: align on columns → Reduce 1: output A_ik B_kj keyed on (i,j)

    def joinred(self, key, vals):
        tusers = [] # uids that trust key
        ratobjs = [] # objs rated by uid=key
        for val in vals:
            if len(val) == 1:
                tusers.append(val[0])
            else:
                ratobjs.append(val)

        for (objid, rat) in ratobjs:
            for uid in tusers:
                yield ((uid, objid), rat)

AB = C
(in pictures)
\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

Reduce 2: keyed on (i,j)

    def avgred(self, key, vals):
        s = 0.
        n = 0
        for val in vals:
            s += val
            n += 1
        # the smoothed average of ratings
        yield key, (s+self.options.avg)/float(n+1)

Better ways to store
matrices in Hadoop

Block matrices minimize the
number of intermediate keys
and values used. I’d form them
based on the first reduce.

No need for “integer” keys that
fall between 1 and n!

[Figure: A and B partitioned into blocks.]

Tall-and-skinny matrices are
common in BigData

A : m x n, m ≫ n

Key is an arbitrary row-id
Value is the 1 x n array
for a row

Each submatrix Ai is
the input to a map task.

[Figure: A split by rows into blocks A1, A2, A3, A4.]

Double-precision floating point
was designed for the era
where “big” was 1000-10000

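Error analysis of summation

    s = 0; for i=1 to n: s = s + x[i]

\[ \mathrm{fl}(x + y) = (x + y)(1 + \delta), \qquad |\delta| \le \mu \]

\[ \Big| \mathrm{fl}\Big( \sum_i x_i \Big) - \sum_i x_i \Big| \le n \mu \sum_i |x_i|, \qquad \mu \approx 10^{-16} \]

A simple summation formula has
error that is not always small if n is a billion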


If your application matters
then watch out for this issue.

Use quad-precision arithmetic
or compensated summation
instead.


Compensated Summation
“Kahan summation algorithm” on Wikipedia

    s = 0.; c = 0.;
    for i=1 to n:
      y = x[i] - c
      t = s + y
      c = (t - s) - y
      s = t

Mathematically, c is always zero.
On a computer, c can be non-zero.
The parentheses matter!

\[ \Big| \mathrm{fl}(\mathrm{csum}(x)) - \sum_i x_i \Big| \le (\mu + n \mu^2) \sum_i |x_i|, \qquad \mu \approx 10^{-16} \]
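
The compensated loop translates directly to Python. The sketch below is only an illustration: the names `naive_sum` and `kahan_sum` and the test vector of one million copies of 0.1 are choices made for this example, and `math.fsum` from the standard library is used as the reference value for the sum.

```python
import math


def naive_sum(x):
    """Plain left-to-right summation."""
    s = 0.0
    for xi in x:
        s = s + xi
    return s


def kahan_sum(x):
    """Compensated (Kahan) summation."""
    s = 0.0
    c = 0.0                  # running estimate of the lost low-order bits
    for xi in x:
        y = xi - c           # apply the correction from the previous step
        t = s + y            # low-order bits of y may be lost here
        c = (t - s) - y      # recover what was lost; the parentheses matter
        s = t
    return s


x = [0.1] * 10**6
reference = math.fsum(x)     # accurately rounded sum of the stored values
naive_error = abs(naive_sum(x) - reference)
kahan_error = abs(kahan_sum(x) - reference)
print("naive error:", naive_error)
print("kahan error:", kahan_error)
```

The same loop fits inside a reducer such as `sumred` when the values arriving for a key number in the millions or billions.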


Collaborators, Friends, and
People who have taught me

MRTSQR
Paul Constantine (Stanford)
Austin Benson (Stanford)
James Demmel (Berkeley)

Sandia MapReduce
Todd Plantenga
Tammy Kolda
Justin Basilico (now Netflix)

Simform
Jeremy Templeton (Sandia)
Joe Ruthruff (Sandia)
Joe Nichols (Stanford)

Others
Margot Gerritsen (Stanford)
Yangyang Hou (Purdue)

Grants
Sandia CSAR

Questions?
