Algorithms on Hadoop at Last.fm

Mark Levy, HUGUK, 14 April 2011

Classical uses of Hadoop

Computing Charts
- 1 billion scrobbles per month
- Hadoop dfs keeps them safe
- cluster adds them up

Reporting Royalties
- copy streaming logs to dfs
- cluster adds them up

and so on...

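As a rough illustration of the "cluster adds them up" step (not Last.fm's actual job), a dumbo-style mapper and reducer that sum scrobbles per artist might look like the sketch below; the tab-separated log format and the field names are assumptions.

    class ScrobbleMapper:
        def map(self, key, line):
            # assumed log line: userID <tab> artistID <tab> trackID <tab> timestamp
            userID, artistID, trackID, timestamp = line.split("\t")
            yield artistID, 1              # one scrobble for this artist

    class ScrobbleReducer:
        def reduce(self, artistID, counts):
            # total scrobbles per artist; sorting these totals gives the chart
            yield artistID, sum(counts)
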
Algorithmic uses of Hadoop
- Topic Modelling
- Graph Recommendation
- Audio Analysis
- LSH indexing
and so on...

Topic Modelling
learning topics from documents

Topic Modelling
- learn topics from words in documents
- use trained model for:
  - inference
  - smoothing
- many applications
- words and documents might really be itemIDs and user profiles

Topic Modelling
inference: which topics is a document about?
- clustering
- labelling
- snippet generation
smoothing: which keywords not in the document are characteristic of its topics?
- recommendation
- ad targeting

Topic Modelling: example
IBM Pharos

Topic Modelling: LDA
- Latent Dirichlet Allocation
- graphical model

Topic Modelling: LDA
[plate diagram labelled: language prior, language model, topic label, observed word, topic probability]

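To make those labels concrete, here is a minimal sketch of the LDA generative story (my illustration, not code from the talk); K is the number of topics and V the vocabulary size, and the values of K, V, α, β and the document length are assumptions.

    import numpy as np

    K, V = 200, 50000                     # assumed topic count and vocabulary size
    alpha, beta = 0.1, 0.01               # assumed Dirichlet hyperparameters

    psi = np.random.dirichlet(beta * np.ones(V), size=K)   # language model per topic
    theta = np.random.dirichlet(alpha * np.ones(K))         # topic probabilities for one document
    words = []
    for _ in range(100):                                     # generate a 100-word document
        z = np.random.choice(K, p=theta)                     # topic label
        words.append(np.random.choice(V, p=psi[z]))          # observed word
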
Topic Modelling: LDA
- learn distributions θ, ψ, p(z|θ)
- use Gibbs Sampling (MCMC):
  - initialise all parameters to random values
  - loop till convergence:
    - consider one parameter at a time
    - compute a sampling distribution based on current values of all other parameters
    - sample a new value for the parameter

Topic Modelling: LDA
- Collapsed Gibbs Sampler (Griffiths & Steyvers, 2004)
- learn distributions p(z|w)
- sampling distribution for the topic of word w in document d:
  p(z | w, d) ∝ (C(w,z) + β) / (C(z) + Vβ) × (C(z,d) + α)

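A minimal single-machine sketch of that update (my notation, not code from the talk), assuming numpy count arrays C_wz (word-topic), C_z (topic totals) and C_zd (topic-document) that already exclude the current assignment of the token being resampled.

    import numpy as np

    def sample_topic(w, d, C_wz, C_z, C_zd, alpha, beta, V):
        # unnormalised p(z | w, d) over all topics, as on the slide above
        p = (C_wz[w, :] + beta) / (C_z + V * beta) * (C_zd[:, d] + alpha)
        return np.random.choice(len(p), p=p / p.sum())
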
Topic Modelling: LDA
- model is specified by word-topic matrix
- initialise randomly
- iterate:
  - sample a new topic for each word
  - update the matrix

Topic Modelling: AD-LDA
- Approximate Distributed LDA (Newman et al. 2007)
- copy word-topic matrix to each machine
- sample based on local copy
- accumulate updates from all machines at end of iteration

Topic Modelling: AD-LDA

    import random

    class InitializationMapper:
        def map(self, docID, text):
            # represent doc as word-topic pairs
            doc = {}
            for w in text:
                z = random.randrange(NUM_TOPICS)   # sample z at random
                doc[w] = z
            yield docID, doc

Topic Modelling: AD-LDA

    class GibbsSamplingMapper:
        def init(self):
            # load the current word-topic matrix (this machine's local copy)
            self.matrix = load_word_topic_matrix()

        def map(self, docID, doc):
            for w, z in doc.items():
                # compute p(z|w) from matrix and doc, sample new_z from it
                new_z = sample_new_topic(w, doc, self.matrix)
                doc[w] = new_z
            # save the new topic assignments
            yield docID, doc
            # emit counts so the word-topic matrix can be rebuilt
            for w, z in doc.items():
                yield (w, z), 1

Topic Modelling: AD-LDA

    class Reducer:
        def reduce(self, key, values):
            # key is either a docID or a (word, topic) pair
            for val in values:
                if not isinstance(key, tuple):
                    # save new topic assignments for this document
                    yield key, val
                else:
                    # update word-topic matrix
                    matrix[key] += val

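The slides don't show the driver, but the AD-LDA loop described above implies one: a hedged sketch follows, where initialise_randomly, distribute, run_gibbs_job and rebuild_matrix are placeholders for launching the jobs above and summing the reducer's (word, topic) counts, not real APIs.

    # Hypothetical driver for AD-LDA: one Hadoop job per Gibbs iteration.
    matrix = initialise_randomly()           # output of InitializationMapper
    for iteration in range(NUM_ITERATIONS):
        distribute(matrix)                   # copy word-topic matrix to each machine
        counts = run_gibbs_job()             # GibbsSamplingMapper + Reducer
        matrix = rebuild_matrix(counts)      # accumulate updates from all machines
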
Topic Modelling: Scalability
- CPU bound due to cost of sampling
- speedup by stratified sampling (sketch below):
  - z is "unlikely" for w in d if C(z,w) = C(z,d) = 0
  - treat "unlikely" topics separately, only sample "likely" topics
  - also makes word-topic matrix sparse
  - initial iterations slower, later faster

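A rough sketch of that split, assuming the same count arrays as in the earlier sketch. Here every "unlikely" topic is still scored individually, so this only illustrates the decomposition; the real speedup comes from handling the small αβ/(C(z)+Vβ) bucket without touching each unlikely topic.

    import numpy as np

    def sample_topic_stratified(w, d, C_wz, C_z, C_zd, alpha, beta, V):
        K = len(C_z)
        # "likely" topics: already paired with this word or this document
        likely = np.where((C_wz[w, :] > 0) | (C_zd[:, d] > 0))[0]
        unlikely = np.setdiff1d(np.arange(K), likely)
        # full weights for the (few) likely topics
        p_likely = ((C_wz[w, likely] + beta) / (C_z[likely] + V * beta)
                    * (C_zd[likely, d] + alpha))
        # an "unlikely" topic z has C(w,z) = C(z,d) = 0, so its weight
        # collapses to alpha * beta / (C(z) + V*beta)
        p_unlikely = alpha * beta / (C_z[unlikely] + V * beta)
        # pick a stratum first, then a topic within it
        if np.random.random() * (p_likely.sum() + p_unlikely.sum()) < p_likely.sum():
            return np.random.choice(likely, p=p_likely / p_likely.sum())
        return np.random.choice(unlikely, p=p_unlikely / p_unlikely.sum())
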
Topic Modelling: Scalability
- trained a model on Last.fm shouts
- 200 topics, 76M documents, 670M words
- 80 map tasks
- initially 25 minutes per training iteration
- falls to 5 minutes by iteration #50, thanks to stratified sampling

Topic Modelling: Scalability
- scalable to any number of documents, just add more machines to cluster
- limitations:
  - runtime linear in number of topics
  - word-topic matrix still large, need an efficient map data structure

Graph Recommendations
propagating labels on a graph

Graph Recommendations
- random walk on user-item graph
- many short routes from U to t ⇒ recommend!
[diagram: user U connected to track t through the user-item graph]

Graph Recommendations
- random walk is equivalent to Label Propagation (Baluja et al., 2008)
- belongs to family of algorithms that are easy to code in map-reduce

Label Propagation
- start with partially labelled graph
- user nodes are labelled with known items
- each label has an associated weight
- iterate:
  - propagate labels to adjacent nodes
  - accumulate and renormalise at each node
  - prune number of labels held at each node
- final labels include some unknown items

Label Propagation
- after convergence or some iterations:
  - labels at item nodes are similar items
  - new labels at user nodes are recommendations

Label Propagation
- user-track graph, edge weights = scrobbles
[diagram: users U, V, W, X connected to tracks a–f with scrobble-count edge weights]

Label Propagation
- user nodes are labelled with scrobbled tracks
[diagram: the same graph with label lists at the user nodes, e.g. U: (a,0.2),(b,0.4),(c,0.4); V: (b,0.5),(d,0.5); W: (b,0.2),(d,0.3),(e,0.5); X: (a,0.3),(d,0.3),(e,0.4)]

Label Propagation
- propagate, accumulate, normalise
[diagram: track d accumulates 1 × (b,0.5),(d,0.5) and 3 × (b,0.2),(d,0.3),(e,0.5), giving (b,0.37),(d,0.47),(e,0.17); next iteration e will propagate to user Y]

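Before the map-reduce version, here is a minimal single-machine sketch of one propagation step over an adjacency dict (my illustration, not code from the talk); the prune-and-normalise step mirrors the MAX_LABELS_PER_NODE cap used later, and iterating this function is the propagate/accumulate/normalise loop above.

    from collections import defaultdict

    def propagate_once(adjacency, labels, max_labels=150):
        # adjacency: {node: [(neighbour, edge_weight), ...]}
        # labels:    {node: {label: prob}}
        incoming = defaultdict(lambda: defaultdict(float))
        # propagate: each node sends its labels, scaled by edge weight
        for node, neighbours in adjacency.items():
            for neighbour, weight in neighbours:
                for label, prob in labels.get(node, {}).items():
                    incoming[neighbour][label] += prob * weight
        # accumulate, normalise, prune at each node
        new_labels = {}
        for node, acc in incoming.items():
            top = sorted(acc.items(), key=lambda kv: -kv[1])[:max_labels]
            total = sum(w for _, w in top)
            new_labels[node] = {label: w / total for label, w in top}
        return new_labels
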
Map-Reduce Graph Algorithms
- general approach assuming:
  - no global state
  - state at node recomputed from scratch on each iteration from incoming messages
- examples:
  - breadth-first search
  - page rank
  - label propagation

Map-Reduce Graph Algorithms
- serialize graph as adjacency lists
- initialize state at each node, write to disk
[diagram: node U with weighted edges to a, b, c, serialized as U,[(a,2),(b,4),(c,4)]]

Map-Reduce Graph Algorithms
- inputs: adjacency lists, state at each node
- output: updated state at each node
- map(nodeID, value):
  - join adjacency list and state
  - emit a message to each node in adjacency list
- reduce(nodeID, messages):
  - process messages at each node
  - update state

Label Propagation

    class PropagatingMapper:
        def map(self, nodeID, value):
            # value holds label-weight pairs and the adjacency list for the node
            labels, adj_list = value
            for node, weight in adj_list:
                # send a "stripe" of label-weight pairs to each neighbouring node
                msg = [(label, prob * weight) for label, prob in labels]
                yield node, msg

Label Propagation

    from collections import defaultdict

    class Reducer:
        def reduce(self, nodeID, msgs):
            # accumulate
            labels = defaultdict(float)
            for msg in msgs:
                for label, w in msg:
                    labels[label] += w
            # normalise, prune
            normalise(labels, MAX_LABELS_PER_NODE)
            yield nodeID, labels

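normalise is not defined on the slides; a plausible sketch of that helper follows, keeping only the heaviest labels and rescaling them to sum to one. It mutates the dict in place, matching how the reducer above calls it without using a return value.

    def normalise(labels, max_labels):
        # keep only the heaviest labels, then rescale so the weights sum to 1
        top = sorted(labels.items(), key=lambda kv: -kv[1])[:max_labels]
        total = sum(w for _, w in top)
        labels.clear()
        labels.update((label, w / total) for label, w in top)
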
Label Propagation: Refinements
- hubs link items with nothing in common

Label Propagation: Refinements
- want to favour short paths, discount hubs (users who like everything, items that everybody likes)
- mimic abandoning the random walk:
  - propagate a dummy label
  - base its weight on degree of source node
  - ignore dummy label in final output

M-R Graph Algorithms: Scalability
- ran Label Propagation on artist charts graph
- well... just a small part of it:
  - 1M artists, 1.8M users, 350M edges
- set MAX_LABELS_PER_NODE = 150
- soon takes 2+ hours per iteration

M-R Graph Algorithms: Scalability
- scalable to graphs with many nodes, just add more machines to cluster
- limitations:
  - "map-increase": after some iterations mappers propagate MAX_LABELS_PER_NODE updates along every edge
  - lots of disk for mapper output
  - reducers slow and/or OOM

Audio Analysis
abusing a Hadoop cluster for fun

Audio Analysis
- beat locations, bpm
- key estimation
- chord sequence estimation
- energy
- music/speech?
- ...

Audio Analysis
requirements:
- start with a big list of tracks
- pull audio from its own dfs
- run C++ analysis code on it
- write verbose output somewhere
- don't take too long
- don't make our dev machines unusable

Audio Analysis: dumbaudio
solution:
- dumbo + bash
- zip up binary and libs
- extract them on each machine
- run binary in map task with subprocess.Popen

Audio Analysis: dumbaudio

    import subprocess

    class AnalysisMapper:
        def init(self):
            # unpack the bundled analysis binary and its libs on this machine
            extract("analyzer.tar.bz2", "bin")

        def map(self, key, trackID):
            file = fetch_audio_file(trackID)
            p = subprocess.Popen(["bin/analyzer", file],
                                 stdout=subprocess.PIPE)
            (out, err) = p.communicate()
            yield trackID, out

Last.fm Developer Credits
- LDA: Olivier Gillet, Mark Levy
- Label Propagation: Mark Levy
- dumbaudio: Marcus Holland-Moritz
- with help from James Grant, Klaas Bosteels

Thanks for listening!
mark@last.fm  @gamboviol
