Recommendation and graph
algorithms in Hadoop and SQL
Code 
github.com/dgleich/matrix-hadoop-tutorial

@dgleich
dgleich@pu...
A talk I gave at Ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.

Published in: Technology
Transcript of "Recommendation and graph algorithms in Hadoop and SQL"

  1. Recommendation and graph algorithms in Hadoop and SQL
     Code: github.com/dgleich/matrix-hadoop-tutorial
     @dgleich · dgleich@purdue.edu
     David F. Gleich, Assistant Professor, Computer Science, Purdue University
     David Gleich · Purdue · Ancestry.com
  2. Matrix computations
     $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & \cdots & A_{m,n} \end{bmatrix}$$
     Operations: $Ax$
     Linear systems: $Ax = b$
     Least squares: $\min \|Ax - b\|$
     Eigenvalues: $Ax = \lambda x$
  3. Outcomes
     Recognize relationships between matrix methods and things you’ve already been doing. Example: SQL queries as matrix computations.
     See how to work with big graphs as large edge lists in Hadoop and SQL. Example: connected components.
     Understand how to use Hadoop to compute these matrix methods at scale for BigData. Example: recommenders with social network info.
  4. matrix computations ≠ linear algebra
  5. World’s simplest recommendation system: suggest the average rating.
  6. A SQL statement as a matrix computation
     How do I find the average rating for each product?
     http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
  7. A SQL statement as a matrix computation
     How do I find the average rating for each product?
     http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql
         SELECT
           p.product_id,
           p.name,
           AVG(pr.rating) AS rating_average
         FROM products p
         INNER JOIN product_ratings pr
           ON pr.product_id = p.product_id
         GROUP BY p.product_id
         ORDER BY rating_average DESC
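The query on slide 7 can be tried locally with Python's built-in `sqlite3` module. This is a minimal sketch: the table and column names come from the slide, but the `user_id` column, the sample rows, and the in-memory database are assumptions made only for illustration.

```python
import sqlite3

# Hypothetical sample data; only the table names and the columns used by the
# query (product_id, name, rating) come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_ratings (product_id INTEGER, user_id INTEGER, rating INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget'), (3, 'gizmo');
    INSERT INTO product_ratings VALUES
        (1, 10, 4), (1, 11, 2), (2, 10, 5), (2, 12, 5), (2, 13, 2);
""")

# The query from the slide: one average per product, best first.
QUERY = """
    SELECT p.product_id, p.name, AVG(pr.rating) AS rating_average
    FROM products p
    INNER JOIN product_ratings pr ON pr.product_id = p.product_id
    GROUP BY p.product_id
    ORDER BY rating_average DESC
"""

average_ratings = conn.execute(QUERY).fetchall()
for product_id, name, rating_average in average_ratings:
    print(product_id, name, rating_average)
```

Because of the inner join, a product with no ratings (the hypothetical `gizmo`) does not appear in the result at all. That is the "missing entries" behaviour the following slides discuss.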
  8. This SQL statement is a matrix computation!
     (Image from rockysprings, deviantart, CC share-alike)
  9. SELECT ... AVG(pr.rating) ... GROUP BY p.product_id
     product_ratings is a matrix! Its rows are pid1, pid2, ..., pid9.
     product_ratings:
         pid8 uid2 4
         pid9 uid9 1
         pid2 uid9 5
         pid9 uid5 5
         pid6 uid8 4
         pid1 uid2 4
         pid3 uid4 4
         pid5 uid9 2
         pid9 uid8 4
         pid9 uid9 1
  10. But it’s a weird matrix
     product_ratings (the table from slide 9, rows pid1 to pid9) is a matrix with missing entries!
  11. But it’s a weird matrix
     Average of ratings: Matrix → SELECT AVG(r) ... GROUP BY pid → Vector
     [Figure: the product_ratings table from slide 9 drawn as a matrix with a few filled-in entries, reduced to a vector of per-product averages]
  12. But it’s a weird matrix, and not a linear operator
     product_ratings is a matrix:
     $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & \cdots & A_{m,n} \end{bmatrix}$$
     $$\operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$$
     Here a quoted condition counts as 1 when it holds and 0 otherwise, so each denominator is the number of stored ratings in that row.
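To make the "not a linear operator" claim concrete, here is a small sketch of avg(A) over a sparse matrix stored as (row, column, value) triples. The helper names `row_average` and `add` and the two tiny matrices are hypothetical, chosen only so the arithmetic is easy to check by hand.

```python
from collections import defaultdict

def row_average(triples):
    """Average the stored (nonzero) entries in each row of a sparse matrix.

    `triples` holds (row, col, value) entries. Missing entries are simply
    absent, so they never count toward the denominator.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for row, _col, value in triples:
        if value != 0:
            total[row] += value
            count[row] += 1
    return {row: total[row] / count[row] for row in total}

def add(a, b):
    """Entry-wise sum of two sparse matrices given as triples."""
    summed = defaultdict(float)
    for row, col, value in list(a) + list(b):
        summed[(row, col)] += value
    return [(row, col, value) for (row, col), value in summed.items()]

# Hypothetical ratings: A has two ratings for pid1, B has one.
A = [("pid1", "uid1", 4), ("pid1", "uid2", 2)]
B = [("pid1", "uid1", 2)]

avg_a = row_average(A)["pid1"]            # (4 + 2) / 2
avg_b = row_average(B)["pid1"]            # 2 / 1
avg_sum = row_average(add(A, B))["pid1"]  # (6 + 2) / 2
print(avg_a, avg_b, avg_sum)
```

A linear operator would give avg(A + B) = avg(A) + avg(B). Here the left side is 4.0 and the right side is 5.0, because the denominator depends on which entries are present.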
  13. matrix computations ≠ linear algebra
  14. Hadoop, MapReduce, and Matrix Methods
  15. MapReduce
     [Diagram: data blocks → Map → (key, value) pairs → Shuffle → each key with its list of values → Reduce → data blocks]
  16. The MapReduce Framework
     Originated at Google for indexing web pages and computing PageRank.
     Data scalable: express algorithms in “data-local operations”.
     Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.
     Fault-tolerance by design: input stored in triplicate; map output persisted to disk before shuffle; reduce input/output on disk.
     [Diagram: Maps (M) feed a Shuffle, which feeds Reduces (R)]
  17. wordcount is a matrix computation too
         map(document):
           for word in document:
             emit (word, 1)
         reduce(word, counts):
           emit (word, sum(counts))
     [Diagram: documents D1–D5 are mapped to pairs such as (matrix, 1), (hadoop, 1), (bigdata, 1), which are then summed per word]
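The map and reduce pseudocode on slide 17 can be run on a laptop with a toy, single-process stand-in for the framework. In this sketch `run_mapreduce` and the three sample documents are hypothetical; a real Hadoop job distributes the same three phases across machines.

```python
from collections import defaultdict

def mapper(document):
    # map(document): emit (word, 1) for every word.
    for word in document.split():
        yield word, 1

def reducer(word, counts):
    # reduce(word, counts): emit (word, sum(counts)).
    yield word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    """Toy single-process driver: map, shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)      # shuffle: same key, same group
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output

documents = ["matrix hadoop bigdata", "hadoop bigdata", "bigdata"]
word_counts = run_mapreduce(documents, mapper, reducer)
print(word_counts)
```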
  18. wordcount is a matrix computation too
     The rows of A are the documents doc1, doc2, ..., docm:
     $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & \cdots & A_{m,n} \end{bmatrix}$$
     word count $= \operatorname{colsum}(A) = A^T e$, where $e$ is the vector of all ones
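The identity colsum(A) = Aᵀe can be checked on a tiny document-term matrix. This is an illustrative sketch in plain Python; the 3 × 3 matrix, the vocabulary, and the helpers `transpose` and `matvec` are assumptions made for the example.

```python
# Hypothetical document-term matrix: rows are documents, columns are words.
vocabulary = ["matrix", "hadoop", "bigdata"]
A = [
    [1, 1, 1],  # doc1
    [0, 1, 1],  # doc2
    [0, 0, 1],  # doc3
]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

e = [1] * len(A)                      # vector of all ones, one per document
word_count = matvec(transpose(A), e)  # A^T e
column_sums = [sum(column) for column in zip(*A)]
print(dict(zip(vocabulary, word_count)))
```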
  19. inverted index is a matrix computation too
     Start from the matrix A whose rows are the documents doc1, doc2, ..., docm:
     $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & \cdots & A_{m,n} \end{bmatrix}$$
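  20. inverted index is a matrix computation too
     The inverted index is the transpose, whose rows are the terms term1, term2, ..., termn:
     $$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & \cdots & A_{m,n} \end{bmatrix}$$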
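Seen this way, building an inverted index is a sparse transpose. A minimal sketch, assuming the document-term matrix is stored as a dictionary of rows; the function name `transpose` and the sample counts are hypothetical.

```python
from collections import defaultdict

def transpose(doc_term):
    """Turn {doc: {term: count}} into {term: {doc: count}}."""
    term_doc = defaultdict(dict)
    for doc, row in doc_term.items():
        for term, count in row.items():
            term_doc[term][doc] = count
    return dict(term_doc)

# Hypothetical sparse document-term matrix A (only nonzero entries stored).
doc_term = {
    "doc1": {"matrix": 2, "hadoop": 1},
    "doc2": {"hadoop": 3},
}

inverted_index = transpose(doc_term)  # rows are now terms: A^T
print(inverted_index)
```

Looking up `inverted_index["hadoop"]` then returns every document containing that term, which is exactly one row of Aᵀ.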
  21. A recommender system with social info
     product_ratings:
         pid8 uid2 4
         pid9 uid9 1
         pid2 uid9 5
         pid9 uid5 5
         pid6 uid8 4
         pid1 uid2 4
         pid3 uid4 4
         pid5 uid9 2
         pid9 uid8 4
         pid9 uid9 1
     friends_links:
         uid6 uid1
         uid8 uid9
         uid7 uid7
         uid7 uid4
         uid6 uid2
         uid7 uid1
         uid3 uid1
         uid1 uid8
         uid7 uid3
         uid9 uid1
  22. A recommender system with social info
     The product_ratings and friends_links tables from slide 21, each viewed as a matrix: product_ratings has rows pid1, pid2, ... and friends_links has rows uid1, uid2, ...
  23. A recommender system with social info
     The same two tables, named as matrices: product_ratings is R and friends_links is S.
  24. A recommender system with social info
     Recommend each item based on the average rating of all trusted users:
     $$X_{uid,pid} = \frac{\sum_{uid_2} S_{uid,uid_2} \cdot R_{uid_2,pid}}{\sum_{uid_2} \text{“}S_{uid,uid_2} \text{ and } R_{uid_2,pid} \neq 0\text{”}}$$
     “X = S Rᵀ” with something that is almost a matrix-matrix product. R is drawn with rows pid1, pid2, ... and S with rows uid1, uid2, ...
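The formula on slide 24 can be sketched directly over the two edge-list tables. Two assumptions are made here and are not stated on the slide: a `friends_links` row `(a, b)` is read as "a trusts b", and every link has weight 1. The function name `recommend` and the sample rows are hypothetical.

```python
from collections import defaultdict

def recommend(friends_links, product_ratings):
    """X[uid, pid] = average rating of pid over the users that uid trusts.

    friends_links:   (uid, trusted_uid) pairs, read as unweighted entries of S.
    product_ratings: (pid, uid, rating) triples, the entries of R.
    """
    ratings_by_user = defaultdict(list)
    for pid, uid, rating in product_ratings:
        ratings_by_user[uid].append((pid, rating))

    total = defaultdict(float)  # numerator: sum of S * R over trusted users
    count = defaultdict(int)    # denominator: number of contributing users
    for uid, trusted in friends_links:
        for pid, rating in ratings_by_user.get(trusted, []):
            total[(uid, pid)] += rating
            count[(uid, pid)] += 1
    return {key: total[key] / count[key] for key in total}

# Hypothetical data in the same shape as the slide's tables.
product_ratings = [("pid1", "uid2", 4), ("pid1", "uid3", 2), ("pid2", "uid3", 5)]
friends_links = [("uid1", "uid2"), ("uid1", "uid3"), ("uid2", "uid3")]

X = recommend(friends_links, product_ratings)
print(X)
```

The numerator alone is an ordinary sparse matrix-matrix product. Dividing by the count of contributing users is what makes it only "almost" one.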
  25. Tools I like
     hadoop streaming, dumbo, mrjob, hadoopy, C++
  26. Tools I don’t use but other people seem to like …
     pig, java, hbase, mahout, Eclipse, Cassandra
     Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I’m a low-level guy.
  27. hadoop streaming
     The map function is a program: (key, value) pairs are sent via stdin, and output (key, value) pairs go to stdout.
     The reduce function is a program: (key, value) pairs are sent via stdin, keys are grouped, and output (key, value) pairs go to stdout.
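A streaming map or reduce step is therefore just a program that reads lines and prints lines. The sketch below is one hypothetical word-count script, not code from the talk's repository. It assumes the common streaming convention of tab-separated key and value, and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: one 'word<TAB>1' record per word on each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reduce_lines(lines):
    """Reducer: input is sorted by key, so equal keys arrive together."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Run as "script.py map" or "script.py reduce"; otherwise do nothing.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Saved under a hypothetical name such as `wordcount_streaming.py`, it can be tested without Hadoop by piping `cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce`, where `sort` stands in for the shuffle.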
  28. mrjob
     A wrapper around hadoop streaming for map and reduce functions in Python.
         from mrjob.job import MRJob

         class MRWordFreqCount(MRJob):
             def mapper(self, _, line):
                 for word in line.split():
                     yield (word.lower(), 1)

             def reducer(self, word, counts):
                 yield (word, sum(counts))

         if __name__ == '__main__':
             MRWordFreqCount.run()
  29. Connected components in SQL and Hadoop
  30. Connected components
     3 “components” in this graph. How can we find them algorithmically … on a huge network?
  31. Connected components
     Algorithm: Assign each node a random component id. For each node, take the minimum component id of itself and all neighbors.
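The algorithm on slide 31 fits in a few lines of Python. This sketch adds two assumptions the slide leaves implicit: the random ids are distinct (so two components can never share a label by accident), and the update is repeated until no label changes. The function name and the small example graph are hypothetical.

```python
import random

def connected_components(nodes, edges, seed=0):
    """Label propagation: every node repeatedly takes the minimum id
    among itself and its neighbors until nothing changes."""
    rng = random.Random(seed)
    # Assumption: ids are a random permutation, hence distinct.
    ids = rng.sample(range(len(nodes)), len(nodes))
    comp = dict(zip(nodes, ids))

    neighbors = {node: set() for node in nodes}
    for head, tail in edges:          # treat every edge as undirected
        neighbors[head].add(tail)
        neighbors[tail].add(head)

    changed = True
    while changed:
        changed = False
        updated = {}
        for node in nodes:
            best = min([comp[node]] + [comp[n] for n in neighbors[node]])
            updated[node] = best
            changed = changed or best != comp[node]
        comp = updated
    return comp

# Hypothetical graph with three components: {1, 2, 3}, {4, 5}, {6}.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
comp = connected_components(nodes, edges)
print(comp)
```

The number of passes grows with the longest shortest path inside a component, which is one reason the talk's description notes that better algorithms exist.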
  32. DEMO
  33. Computing Connected Components in SQL
     Graph: edges (id | head | tail)
     “Vector”: v (id | comp), initialized to a random component
         CREATE TABLE v2 AS (
           SELECT
             e.tail AS id,
             MIN(v.comp) AS comp
           FROM edges e
           INNER JOIN vector v
             ON e.head = v.id
           GROUP BY e.tail
         );
         DROP TABLE v;
         ALTER TABLE v2
           RENAME TO v;
         ... Repeat ...
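The "... Repeat ..." step needs a driver loop. Below is a minimal sketch using Python's built-in `sqlite3`, not the demo from the talk. It makes several labeled assumptions: the vector table is called `v` throughout, the `edges` table omits the `id` column, every undirected edge is stored in both directions, and each node gets a self-loop so that the query's minimum covers "itself and all neighbors". The stopping test is also an addition.

```python
import random
import sqlite3

def connected_components_sql(nodes, undirected_edges, seed=0):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (head INTEGER, tail INTEGER)")
    conn.execute("CREATE TABLE v (id INTEGER, comp INTEGER)")

    # Assumption: both directions plus a self-loop per node are stored.
    rows = [(n, n) for n in nodes]
    for a, b in undirected_edges:
        rows += [(a, b), (b, a)]
    conn.executemany("INSERT INTO edges VALUES (?, ?)", rows)

    # Random, distinct initial component ids.
    ids = random.Random(seed).sample(range(len(nodes)), len(nodes))
    conn.executemany("INSERT INTO v VALUES (?, ?)", list(zip(nodes, ids)))

    while True:
        # One pass of the slide's query (SQLite syntax, no outer parentheses).
        conn.execute("""
            CREATE TABLE v2 AS
            SELECT e.tail AS id, MIN(v.comp) AS comp
            FROM edges e
            INNER JOIN v ON e.head = v.id
            GROUP BY e.tail
        """)
        (changed,) = conn.execute("""
            SELECT COUNT(*) FROM v JOIN v2 ON v.id = v2.id
            WHERE v.comp <> v2.comp
        """).fetchone()
        conn.execute("DROP TABLE v")
        conn.execute("ALTER TABLE v2 RENAME TO v")
        if changed == 0:   # no label moved: the components are final
            break
    return dict(conn.execute("SELECT id, comp FROM v"))

sql_comp = connected_components_sql([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
print(sql_comp)
```

Without the self-loops, a node that never appears as a `tail` would drop out of `v2`, so the edge-list convention matters as much as the query itself.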
  34. Matrix-vector product and connected components in Hadoop
     See example!
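As a rough picture of what a matrix-vector product looks like in map, shuffle, and reduce form, here is a single-process sketch. It is not the example from the talk's repository, and it assumes the vector `x` is small enough for every mapper to hold in memory. The function name and the sample matrix are hypothetical.

```python
from collections import defaultdict

def matvec_mapreduce(entries, x):
    """Compute y = A x from sparse entries (i, j, a_ij) in MapReduce style."""
    # Map: each nonzero emits its partial product, keyed by row index i.
    partial = ((i, a_ij * x[j]) for i, j, a_ij in entries)

    # Shuffle: bring all partial products for the same row together.
    groups = defaultdict(list)
    for i, value in partial:
        groups[i].append(value)

    # Reduce: sum the partial products of each row.
    return {i: sum(values) for i, values in groups.items()}

# Hypothetical 2 x 2 sparse matrix [[1, 2], [0, 3]] and vector [10, 100].
entries = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
y = matvec_mapreduce(entries, [10, 100])
print(y)
```

Replacing the product with "pass the label along" and the sum with `min` gives one pass of the connected components iteration from slide 31, which is why the two problems share a slide.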