Recommendation and graph algorithms in Hadoop and SQL

A talk I gave at ancestry.com on Hadoop, SQL, recommendation, and graph algorithms. It's a tutorial overview; there are better algorithms than those I describe, but these are a simple starting point.

Published in: Technology

  1. Recommendation and graph algorithms in Hadoop and SQL. Code: github.com/dgleich/matrix-hadoop-tutorial. David F. Gleich, Assistant Professor, Computer Science, Purdue University. @dgleich, dgleich@purdue.edu. David Gleich · Purdue · Ancestry.com
  2. Matrix computations. $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$ Operations: $Ax$. Linear systems: $Ax = b$. Least squares: $\min \|Ax - b\|$. Eigenvalues: $Ax = \lambda x$.
  3. Outcomes. Recognize relationships between matrix methods and things you've already been doing (example: SQL queries as matrix computations). Understand how to use Hadoop to compute these matrix methods at scale for BigData (example: recommenders with social network info). See how to work with big graphs as large edge lists in Hadoop and SQL (example: connected components).
  4. Matrix computations ≠ linear algebra.
  5. World's simplest recommendation system: suggest the average rating.
  6. A SQL statement as a matrix computation. How do I find the average rating for each product? (http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql)
  7. A SQL statement as a matrix computation. How do I find the average rating for each product? (http://stackoverflow.com/questions/4217449/returning-average-rating-from-a-database-sql)
        SELECT
          p.product_id,
          p.name,
          AVG(pr.rating) AS rating_average
        FROM products p
        INNER JOIN product_ratings pr
          ON pr.product_id = p.product_id
        GROUP BY p.product_id
        ORDER BY rating_average DESC
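To try the query from slide 7, here is a minimal sketch using Python's built-in `sqlite3` module. The table layouts, the `user_id` column name, and the product names are assumptions made for illustration; the rating rows are the ones shown on slide 9.

```python
import sqlite3

# Hypothetical schema: the slides only show the column contents, not the DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE product_ratings (product_id TEXT, user_id TEXT, rating INTEGER);
""")

# Product names are made up; pid4 and pid7 have no ratings at all.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(f"pid{i}", f"product {i}") for i in range(1, 10)],
)

# Rating rows as listed on slide 9 (pid, uid, rating).
ratings = [
    ("pid8", "uid2", 4), ("pid9", "uid9", 1), ("pid2", "uid9", 5),
    ("pid9", "uid5", 5), ("pid6", "uid8", 4), ("pid1", "uid2", 4),
    ("pid3", "uid4", 4), ("pid5", "uid9", 2), ("pid9", "uid8", 4),
    ("pid9", "uid9", 1),
]
conn.executemany("INSERT INTO product_ratings VALUES (?, ?, ?)", ratings)

# The query from the slide, unchanged.
average_rows = conn.execute("""
SELECT p.product_id, p.name, AVG(pr.rating) AS rating_average
FROM products p
INNER JOIN product_ratings pr ON pr.product_id = p.product_id
GROUP BY p.product_id
ORDER BY rating_average DESC
""").fetchall()

averages = {pid: avg for pid, _name, avg in average_rows}
conn.close()
```

Because of the inner join, products with no ratings do not appear in the result at all, which is the "missing entries" behavior the following slides point out.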
  8. This SQL statement is a matrix computation! (Image from rockysprings, deviantart, CC share-alike.)
  9. SELECT ... AVG(pr.rating) ... GROUP BY p.product_id. The product_ratings table is a matrix with rows pid1 through pid9. Its rows (pid, uid, rating) are: (pid8, uid2, 4), (pid9, uid9, 1), (pid2, uid9, 5), (pid9, uid5, 5), (pid6, uid8, 4), (pid1, uid2, 4), (pid3, uid4, 4), (pid5, uid9, 2), (pid9, uid8, 4), (pid9, uid9, 1).
  10. But it's a weird matrix: product_ratings is a matrix with rows pid1 through pid9, but with missing entries. (Same product_ratings rows as on slide 9.)
  11. But it's a weird matrix: SELECT AVG(r) ... GROUP BY pid maps the product_ratings matrix (rows pid1 through pid9) to a vector holding the average of the ratings for each product. (Same product_ratings rows as on slide 9.)
  12. But it's a weird matrix and not a linear operator. With product_ratings as the matrix $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix} \quad\to\quad \operatorname{avg}(A) = \begin{bmatrix} \sum_j A_{1,j} \,/\, \sum_j \text{“}A_{1,j} \neq 0\text{”} \\ \sum_j A_{2,j} \,/\, \sum_j \text{“}A_{2,j} \neq 0\text{”} \\ \vdots \\ \sum_j A_{m,j} \,/\, \sum_j \text{“}A_{m,j} \neq 0\text{”} \end{bmatrix}$$ (Same product_ratings rows as on slide 9.)
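A small sketch of why avg(A) on slide 12 is not a linear operator. The two matrices below are made-up examples, and treating a zero as a missing rating is the convention taken from the slide's "≠ 0" counts.

```python
def row_avg(matrix):
    """Average the nonzero entries of each row; a zero marks a missing rating."""
    result = []
    for row in matrix:
        present = [x for x in row if x != 0]
        result.append(sum(present) / len(present) if present else 0.0)
    return result

# Two small hypothetical rating matrices (rows are products, columns are users).
A = [[4, 0, 2],
     [0, 5, 0]]
B = [[0, 2, 2],
     [1, 0, 0]]
A_plus_B = [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

avg_of_sum = row_avg(A_plus_B)
sum_of_avgs = [x + y for x, y in zip(row_avg(A), row_avg(B))]

# A linear operator would give the same answer both ways; this one does not.
is_linear_here = avg_of_sum == sum_of_avgs
```

The denominator depends on which entries are present, so avg(A + B) and avg(A) + avg(B) disagree whenever the two matrices have different sparsity patterns.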
  13. Matrix computations ≠ linear algebra.
  14. Hadoop, MapReduce, and Matrix Methods
  15. MapReduce. [Diagram: blocks of data feed Map tasks, which emit (key, value) pairs; the Shuffle groups the values for each key; Reduce tasks turn each (key, values) group into output data.]
  16. The MapReduce Framework. Originated at Google for indexing web pages and computing PageRank. Data scalable: express algorithms in "data-local operations" and implement one type of communication, the shuffle. The shuffle moves all data with the same key to the same reducer. Fault-tolerance by design: input is stored in triplicate, map output is persisted to disk before the shuffle, and reduce input/output is on disk. [Diagram: map tasks (M), a shuffle, and reduce tasks (R).]
  17. wordcount is a matrix computation too.
        map(document):
          for word in document:
            emit (word, 1)
        reduce(word, counts):
          emit (word, sum(counts))
      [Diagram: documents D1 through D5 are mapped to pairs such as (matrix, 1), (hadoop, 1), and (bigdata, 1), which are then summed per word.]
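The map and reduce pseudocode on slide 17 can be run locally with a tiny in-memory stand-in for the framework. This is only a sketch: `run_mapreduce` is a hypothetical helper, not a Hadoop API, and the three documents are made up.

```python
from collections import defaultdict

def map_wordcount(document):
    # map(document): emit (word, 1) for every word.
    for word in document.split():
        yield word, 1

def reduce_wordcount(word, counts):
    # reduce(word, counts): emit (word, sum(counts)).
    yield word, sum(counts)

def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical single-process stand-in for map, shuffle, and reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)  # shuffle: same key, same group
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

docs = ["matrix hadoop bigdata", "hadoop bigdata", "bigdata"]
word_counts = run_mapreduce(docs, map_wordcount, reduce_wordcount)
```

The only communication between the two user-supplied functions is the grouping by key, which is the role the shuffle plays in the real framework.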
  18. wordcount is a matrix computation too. Let the rows of $A$ be the documents doc1, doc2, …, docm: $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$ Then word count $= \operatorname{colsum}(A) = A^T e$, where $e$ is the vector of all ones.
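As a worked instance of colsum(A) = Aᵀe, the sketch below builds a small dense document-by-term matrix in plain Python and multiplies its transpose by the all-ones vector. The documents are made up, and a real corpus would use a sparse representation rather than nested lists.

```python
# Hypothetical documents; rows of A are documents, columns are vocabulary terms.
docs = ["matrix hadoop bigdata", "hadoop bigdata hadoop", "bigdata"]
vocab = sorted({word for doc in docs for word in doc.split()})

# A[i][j] = number of times term j occurs in document i.
A = [[doc.split().count(term) for term in vocab] for doc in docs]

# e is the vector of all ones, one entry per document.
e = [1] * len(docs)

# (A^T e)[j] = sum_i A[i][j] * e[i], which is the column sum of A.
colsum = [sum(A[i][j] * e[i] for i in range(len(docs))) for j in range(len(vocab))]
word_count = dict(zip(vocab, colsum))
```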
  19. inverted index is a matrix computation too. Start from the matrix whose rows are the documents doc1, doc2, …, docm: $$A = \begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,n} \\ A_{2,1} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m-1,n} \\ A_{m,1} & \cdots & A_{m,n-1} & A_{m,n} \end{bmatrix}$$
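  20. inverted index is a matrix computation too. The inverted index is the transpose, whose rows are the terms term1, term2, …, termn: $$A^T = \begin{bmatrix} A_{1,1} & A_{2,1} & \cdots & A_{m,1} \\ A_{1,2} & A_{2,2} & \cdots & \vdots \\ \vdots & \ddots & \ddots & A_{m,n-1} \\ A_{1,n} & \cdots & A_{m-1,n} & A_{m,n} \end{bmatrix}$$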
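Slides 19 and 20 say that an inverted index is the transpose of the document-by-term matrix. A minimal sketch with the matrix stored as (row, column, value) triples follows; the entries and helper names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical nonzero entries of A as (document, term, count) triples.
entries = [
    ("doc1", "matrix", 1), ("doc1", "hadoop", 2),
    ("doc2", "hadoop", 1), ("doc2", "bigdata", 3),
    ("doc3", "bigdata", 1),
]

def transpose(triples):
    # Transposing a sparse matrix just swaps the row and column of each entry.
    return [(col, row, value) for row, col, value in triples]

def group_by_row(triples):
    # Collect each row's nonzeros into a {column: value} dictionary.
    rows = defaultdict(dict)
    for row, col, value in triples:
        rows[row][col] = value
    return dict(rows)

forward_index = group_by_row(entries)              # document -> terms
inverted_index = group_by_row(transpose(entries))  # term -> documents
```

In MapReduce terms, the swap is the map step and the grouping by the new row key is the shuffle.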
  21. A recommender system with social info. Two tables: product_ratings, with rows (pid, uid, rating): (pid8, uid2, 4), (pid9, uid9, 1), (pid2, uid9, 5), (pid9, uid5, 5), (pid6, uid8, 4), (pid1, uid2, 4), (pid3, uid4, 4), (pid5, uid9, 2), (pid9, uid8, 4), (pid9, uid9, 1); and friends_links, with rows (uid, uid): (uid6, uid1), (uid8, uid9), (uid7, uid7), (uid7, uid4), (uid6, uid2), (uid7, uid1), (uid3, uid1), (uid1, uid8), (uid7, uid3), (uid9, uid1).
  22. A recommender system with social info. Both tables are matrices. product_ratings has rows pid1, pid2, … and friends_links has rows uid1, uid2, …; each is laid out as $$\begin{bmatrix} A_{1,1} & A_{2,1} & \cdots \\ A_{1,2} & A_{2,2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$ (Same product_ratings and friends_links rows as on slide 21.)
  23. A recommender system with social info. Call the product_ratings matrix R and the friends_links matrix S. (Same product_ratings and friends_links rows as on slide 21.)
  24. A recommender system with social info. Recommend each item based on the average rating of all trusted users. With R (rows pid1, pid2, …) and S (rows uid1, uid2, …), this is "$X = S R^T$" with something that is almost a matrix-matrix product: $$X_{\text{uid},\text{pid}} = \Big( \sum_{\text{uid2}} S_{\text{uid},\text{uid2}} \, R_{\text{uid2},\text{pid}} \Big) \cdot \Big( \sum_{\text{uid2}} \text{“}S_{\text{uid},\text{uid2}} \text{ and } R_{\text{uid2},\text{pid}} \neq 0\text{”} \Big)^{-1}$$
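The formula on slide 24 can be sketched directly on sparse data. The version below assumes every trust link has weight 1 and is directed (uid trusts uid2), and it uses a small made-up data set in the style of the slides; `social_average` is a hypothetical helper name.

```python
from collections import defaultdict

def social_average(trust_links, ratings):
    """X[uid, pid] = average rating of pid over the users that uid trusts.

    trust_links: list of (uid, uid2) pairs, meaning uid trusts uid2 (S = 1).
    ratings: dict mapping (uid2, pid) -> rating, the nonzeros of R.
    """
    ratings_by_user = defaultdict(list)
    for (uid2, pid), rating in ratings.items():
        ratings_by_user[uid2].append((pid, rating))

    totals = defaultdict(float)  # numerator: sum of S * R
    counts = defaultdict(int)    # denominator: count of nonzero terms
    for uid, uid2 in trust_links:
        for pid, rating in ratings_by_user.get(uid2, []):
            totals[(uid, pid)] += rating
            counts[(uid, pid)] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Illustrative data, not the full tables from the slides.
trust_links = [("uid1", "uid8"), ("uid1", "uid9"), ("uid6", "uid2")]
ratings = {
    ("uid8", "pid9"): 4, ("uid9", "pid9"): 1,
    ("uid9", "pid2"): 5, ("uid2", "pid8"): 4,
}
X = social_average(trust_links, ratings)
```

The numerator is an ordinary sparse matrix-matrix product; only the division by the count of contributing terms keeps it from being plain linear algebra.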
  25. Tools I like: hadoop streaming, dumbo, mrjob, hadoopy, C++.
  26. Tools I don't use but other people seem to like: pig, java, hbase, mahout, Eclipse, Cassandra. Mahout is the closest thing to a library for matrix computations in Hadoop. If you like Java, you should probably start there. I'm a low-level guy.
  27. hadoop streaming. The map function is a program: (key, value) pairs are sent via stdin, and output (key, value) pairs go to stdout. The reduce function is a program: (key, value) pairs are sent via stdin, keys are grouped, and output (key, value) pairs go to stdout.
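A sketch of what the two streaming programs from slide 27 might look like for wordcount, written as functions over lines of text so they can be exercised without a cluster. It assumes tab-separated key and value on each line, which is Hadoop streaming's default; `simulate` is a hypothetical local stand-in for the sort that happens between the two programs.

```python
from itertools import groupby

def mapper(lines):
    # In a real streaming job, `lines` would be sys.stdin and each
    # yielded string would be printed to stdout.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # The input arrives sorted by key, so equal keys are adjacent
    # and can be grouped in one pass.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"

def simulate(lines):
    # Local stand-in for: cat input | mapper | sort | reducer
    return list(reducer(sorted(mapper(lines))))

streaming_output = simulate(["Hadoop matrix", "matrix bigdata hadoop"])
```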
  28. mrjob: a wrapper around hadoop streaming for map and reduce functions in Python.
        from mrjob.job import MRJob

        class MRWordFreqCount(MRJob):
            def mapper(self, _, line):
                # Emit (word, 1) for every word on the input line.
                for word in line.split():
                    yield (word.lower(), 1)

            def reducer(self, word, counts):
                # Sum the counts collected for each word.
                yield (word, sum(counts))

        if __name__ == '__main__':
            MRWordFreqCount.run()
  29. Connected components in SQL and Hadoop
  30. Connected components. There are 3 "components" in this graph. How can we find them algorithmically on a huge network?
  31. Connected components. Algorithm: assign each node a random component id. For each node, take the minimum component id of itself and all neighbors. Repeat until no component id changes.
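The algorithm on slide 31 can be sketched in a few lines of single-machine Python. Two details here are assumptions rather than something the slide states: the random ids are drawn as a shuffled range so that they are distinct, and the edges are treated as undirected.

```python
import random

def connected_components(nodes, edges, seed=0):
    """Min-label propagation: returns a dict node -> component id."""
    rng = random.Random(seed)
    ids = list(range(len(nodes)))
    rng.shuffle(ids)  # random but distinct starting ids
    comp = dict(zip(nodes, ids))

    neighbors = {node: set() for node in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    changed = True
    while changed:
        # Every node takes the minimum id over itself and its neighbors.
        new_comp = {
            node: min([comp[node]] + [comp[other] for other in neighbors[node]])
            for node in nodes
        }
        changed = new_comp != comp
        comp = new_comp
    return comp

nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
labels = connected_components(nodes, edges)
```

Each pass only looks at a node and its direct neighbors, so the number of passes grows with the longest shortest path inside a component.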
  32. DEMO
  33. Computing connected components in SQL. Graph: edges (id | head | tail). "Vector": v (id | comp), initialized to random components.
        CREATE TABLE v2 AS (
          SELECT
            e.tail AS id,
            MIN(v.comp) AS comp
          FROM edges e
          INNER JOIN v
            ON e.head = v.id
          GROUP BY e.tail
        );
        DROP TABLE v;
        ALTER TABLE v2 RENAME TO v;
        -- ... repeat ...
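The repeat loop from slide 33 can be driven from Python with the built-in `sqlite3` module. This is a sketch under several assumptions: the edge table stores both directions of every edge plus a self-loop per node (so the minimum includes the node itself), the `id` column of `edges` is omitted, the starting ids are simply distinct integers rather than random ones, and the parentheses around the SELECT are dropped because SQLite does not accept them there.

```python
import sqlite3

def components_sql(edge_list):
    """Run the slide's SQL iteration until no component id changes."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    nodes = sorted({node for edge in edge_list for node in edge})
    rows = set()
    for a, b in edge_list:
        rows.update({(a, b), (b, a)})        # both directions
    rows.update((n, n) for n in nodes)       # self-loops
    cur.execute("CREATE TABLE edges (head TEXT, tail TEXT)")
    cur.executemany("INSERT INTO edges VALUES (?, ?)", sorted(rows))

    cur.execute("CREATE TABLE v (id TEXT, comp INTEGER)")
    cur.executemany("INSERT INTO v VALUES (?, ?)",
                    [(n, i) for i, n in enumerate(nodes)])

    while True:
        cur.execute("""
            CREATE TABLE v2 AS
            SELECT e.tail AS id, MIN(v.comp) AS comp
            FROM edges e INNER JOIN v ON e.head = v.id
            GROUP BY e.tail
        """)
        changed = cur.execute("""
            SELECT COUNT(*) FROM v JOIN v2 ON v.id = v2.id
            WHERE v.comp <> v2.comp
        """).fetchone()[0]
        cur.execute("DROP TABLE v")
        cur.execute("ALTER TABLE v2 RENAME TO v")
        if changed == 0:
            break

    result = dict(cur.execute("SELECT id, comp FROM v"))
    conn.close()
    return result

sql_labels = components_sql([("a", "b"), ("b", "c"), ("d", "e")])
```

The extra COUNT query is one possible stopping test; the slide itself only says to repeat.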
  34. Matrix-vector product and connected components in Hadoop. See example!
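The example referred to on the last slide is in the linked repository and is not reproduced here. As a rough illustration of how the two topics fit together, the sketch below writes a sparse matrix-vector product in map and reduce form and then reuses it for one connected-components step by swapping multiply and sum for "take the neighbor's id" and min. It assumes the vector x is small enough to be available to every map call, and `matvec_mapreduce` is a hypothetical name.

```python
from collections import defaultdict

def matvec_mapreduce(entries, x, combine=lambda a, xj: a * xj, reduce_fn=sum):
    """y[i] = reduce_fn over j of combine(A[i][j], x[j]) for nonzeros of A."""
    groups = defaultdict(list)
    for i, j, a in entries:
        # map: one nonzero (i, j, a_ij) emits key i, value combine(a_ij, x_j)
        groups[i].append(combine(a, x[j]))
    # reduce: fold the values that share a row key
    return {i: reduce_fn(values) for i, values in groups.items()}

# Ordinary product with A = [[2, 0], [1, 3]] and x = (1, 2).
A_entries = [(0, 0, 2), (1, 0, 1), (1, 1, 3)]
y = matvec_mapreduce(A_entries, {0: 1, 1: 2})

# One connected-components step: the adjacency entries include both edge
# directions and self-loops, and each node takes the minimum id it sees.
adjacency = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1), (2, 2, 1)]
component_ids = {0: 5, 1: 3, 2: 4}
next_ids = matvec_mapreduce(adjacency, component_ids,
                            combine=lambda a, xj: xj, reduce_fn=min)
```

When x does not fit in memory, the map step would first need a join between the matrix entries and the vector, which this sketch leaves out.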
