Graphical Models for the Internet (Part1)



  1. 1. Graphical Models for the Internet. Amr Ahmed & Alexander Smola, Yahoo! Research & Australian National University, Santa Clara, CA. amahmed@cs.cmu.edu, alex@smola.org
  2. 2. Thanks: Mohamed Aly, Joey Gonzalez, Yucheng Low, Qirong Ho, Shravan Narayanamurthy, Vanja Josifovski, Choon Hui Teo, Eric Xing, James Petterson, Sergiy Matyusevich, Jake Eisenstein, Shuang Hong Yang, Vishy Vishwanathan, Markus Weimer, Alexandros Karatzoglou
  3. 3. 1. Data on the Internet
  4. 4. Size calibration
• Tiny (2 cores): 512MB, 50MFlops, 1000 samples
• Small (4 cores): 4GB, 10GFlops, 100k samples
• Medium (16 cores): 32GB, 100GFlops, 1M samples
• Large (256 cores): 512GB, 1TFlops, 100M samples
• Massive: ... need to work hard to get it to work
  5. 5. Data
• Webpages (content, graph)
• Clicks (ad, page, social)
• Users (OpenID, FB Connect)
• e-mails (Hotmail, Y!Mail, Gmail)
• Photos, Movies (Flickr, YouTube, Vimeo ...)
• Cookies / tracking info (see Ghostery)
• Installed apps (Android Market etc.)
• Location (Latitude, Loopt, Foursquare)
• User generated content (Wikipedia & co)
• Ads (display, text, DoubleClick, Yahoo)
• Comments (Disqus, Facebook)
• Reviews (Yelp, Y!Local)
• Third party features (e.g. Experian)
• Social connections (LinkedIn, Facebook)
• Purchase decisions (Netflix, Amazon)
• Instant Messages (YIM, Skype, Gtalk)
• Search terms (Google, Bing)
• Timestamp (everything)
• News articles (BBC, NYTimes, Y!News)
• Blog posts (Tumblr, Wordpress)
• Microblogs (Twitter, Jaiku, Meme)
>10B useful webpages
  6. 6. The Web for $100k/month
• 10 billion pages (this is a small subset, maybe 10%)
• 10k/page = 100TB ($10k for disks or EBS for 1 month)
• 1000 machines, 10ms/page = 1 day; afford 1-10 MIP/page ($20k on EC2 at $0.68/h)
• 10 Gbit link ($10k/month via ISP or EC2); 1 day for raw data; 300ms/page roundtrip
• 1000 servers for 1 month ($70k on EC2 at $0.085/h)
  7. 7. Data - Identity & Graph: 100M-1B vertices
  8. 8. Data - User generated content: >1B images, 40h of video per minute
  9. 9. Data - User generated content: crawl it (>1B images, 40h of video per minute)
  10. 10. Data - Messages: >1B texts
  11. 11. Data - Messages: >1B texts, impossible without NDA
  12. 12. Data - User Tracking: >1B ‘identities’ (alex.smola.org)
  13. 13. Data - User Tracking
  14. 14. Personalization
• 100-1000M users: spam filtering, personalized targeting & collaborative filtering, news recommendation, advertising
• Large parameter space (25 parameters = 100GB)
• Distributed storage (need it on every server)
• Distributed optimization
• Model synchronization
• Time dependence
• Graph structure
  15. 15. Labels
• Labels (typical case): ads, click feedback, emails, tags
• No labels (implicit): graphs, document collections, email/IM/discussions, query stream
• Editorial data is very expensive! Do not use!
  16. 16. Challenges• Scale • Billions of instances (documents, clicks, users, ads) • Rich data structure (ontology, categories, tags) • Model does not fit into single machine• Modeling • Plenty of unlabeled data, temporal structure, side information • User-understandable structure • Solve problem. Don’t simply apply clustering/LDA/PCA/ICA • We only cover building blocks• Inference • 10k-100k clusters/discrete objects, 1M-100M unique tokens • Communication
  17. 17. Roadmap• Tools • Load distribution, balancing and synchronization • Clustering, Topic Models• Models • Dynamic non-parametric models • Sequential latent variable models• Inference Algorithms • Distributed batch • Sequential Monte Carlo• Applications • User profiling • News content analysis & recommendation
  18. 18. 2. Basic Tools
  19. 19. Systems. Algorithms run on MANY REAL and FAULTY boxes, not Turing machines. So we need to deal with it.
  20. 20. Commodity Hardware
• High Performance Computing: very reliable, custom built, expensive
• Consumer hardware: cheap, efficient, easy to replicate, not very reliable - deal with it!
  21. 21. Slide from talk of Jeff Deanhttp://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//people/jeff/stanford-295-talk.pdf
  22. 22. Map Reduce
• 1000s of (faulty) machines
• Lots of jobs are mostly embarrassingly parallel (except for a sorting/transpose phase)
• Functional programming origins
  • Map(key,value) processes each (key,value) pair and outputs a new (key,value) pair
  • Reduce(key,value) reduces all instances with the same key to an aggregate
• Example - extremely naive wordcount
  • Map(docID, document): for each document emit many (wordID, count) pairs
  • Reduce(wordID, count): sum over all counts for a given wordID and emit (wordID, aggregate)
(figure from Ramakrishnan, Sakrejda, Canon, DoE 2011)
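The naive wordcount above can be sketched in plain Python. This is a toy stand-in for the framework, not Hadoop itself: `map_fn`, `reduce_fn` and `run_mapreduce` are illustrative names, and the shuffle phase is simulated with a dictionary.

```python
from collections import defaultdict

def map_fn(doc_id, document):
    # Map(docID, document): emit one (wordID, count) pair per word occurrence
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce(wordID, counts): sum over all counts for the given wordID
    yield word, sum(counts)

def run_mapreduce(corpus):
    # The framework's shuffle/sort phase: group intermediate pairs by key
    groups = defaultdict(list)
    for doc_id, document in corpus.items():
        for key, value in map_fn(doc_id, document):
            groups[key].append(value)
    # Apply the reducer once per key
    return dict(pair for key, values in groups.items()
                for pair in reduce_fn(key, values))

counts = run_mapreduce({"d1": "to be or not to be", "d2": "to do"})
# counts["to"] == 3, counts["be"] == 2
```

In a real deployment the framework moves the map tasks to the data and restarts failed workers, exactly as the fault-tolerance bullets on the next slides describe.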
  23. 23. Map Reduce
  24. 24. Map Reduce
• easy fault tolerance (simply restart workers)
• disk based inter-process communication
• moves computation to data
map(key,value), reduce(key,value). (Ghemawat & Dean, 2003)
  25. 25. Map Reduce• Map must be stateless in blocks• Reduce must be commutative in data• Fault tolerance • Start jobs where the data is (nodes run the filesystem, too) • Restart machines if maps fail (have replicas) • Restart reducers based on intermediate data• Good fit for many algorithms• Good if only a small number of MapReduce iterations needed• Need to request machines at each iteration (time consuming)• State lost in between maps• Communication only via file I/O• Need to wait for last reducer
  26. 26. Map Reduce: unsuitable for algorithms with many iterations (state lost in between maps, communication only via file I/O, need to wait for the last reducer).
  27. 27. Many alternatives• Dryad/LINQ Microsoft - directed acyclic graphs• S4 Yahoo - streaming directed acyclic graphs• Pregel Google - bulk synchronous processing• YARN Use Hadoop scheduler directly• Mesos, Hadoop workalikes & patches
  28. 28. Clustering
  29. 29. Clustering
Density estimation: p(x, \theta) = p(\theta) \prod_{i=1}^{n} p(x_i \mid \theta)
Clustering: p(x, y, \theta) = p(\pi) \prod_{k=1}^{K} p(\theta_k) \prod_{i=1}^{n} p(y_i \mid \pi)\, p(x_i \mid \theta, y_i)
  30. 30. Clustering (models as above): the goal is to find θ.
  31. 31. Clustering: density estimation is log-concave in θ; the clustering objective is general nonlinear.
  32. 32. Clustering
• Optimization problem: maximize \sum_{y} p(x, y, \theta), i.e. maximize \log p(\pi) + \sum_{k=1}^{K} \log p(\theta_k) + \sum_{i=1}^{n} \log \sum_{y_i \in Y} p(y_i \mid \pi)\, p(x_i \mid \theta, y_i)
• Options
  • Direct nonconvex optimization (e.g. BFGS)
  • Sampling (draw from the joint distribution) for memory efficiency
  • Variational approximation (concave lower bounds aka EM algorithm)
  33. 33. Clustering
• Integrate out y: nonconvex optimization problem; EM algorithm
• Integrate out θ: the y_i become coupled; sampling; collapsed representation
  p(y \mid x) \propto p(\{x\} \mid \{x_i : y_i = y\} \cup X_{fake})\, p(y \mid Y \cup Y_{fake})
  34. 34. Clustering
  35. 35. Clustering
  36. 36. Gibbs sampling
• Sampling: draw an instance x from distribution p(x)
• Gibbs sampling
  • In most cases direct sampling is not possible
  • Draw one set of variables at a time
Example joint distribution over two binary variables: p = (0.45, 0.05; 0.05, 0.45)
  37. 37. Gibbs sampling
  38. 38. Gibbs sampling
  39. 39. Gibbs sampling
  40. 40. Gibbs sampling
  41. 41. Gibbs sampling
• Sampling: draw an instance x from distribution p(x)
• Gibbs sampling
  • In most cases direct sampling is not possible
  • Draw one set of variables at a time, e.g. on the 2x2 distribution (0.45, 0.05; 0.05, 0.45):
    (b,g) - draw p(., g)
    (g,g) - draw p(g, .)
    (g,g) - draw p(., g)
    (b,g) - draw p(b, .)
    (b,b) - ...
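The walk above can be simulated directly. A minimal sketch, assuming the 2x2 joint distribution (0.45, 0.05; 0.05, 0.45) from the slides and encoding the two colors as 0/1:

```python
import random

# Joint distribution over two binary variables, as on the slide:
# p(0,0)=0.45, p(0,1)=0.05, p(1,0)=0.05, p(1,1)=0.45
p = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def gibbs(n_steps, seed=0):
    rng = random.Random(seed)
    x, y = 0, 0
    samples = []
    for _ in range(n_steps):
        # draw x given y ("draw p(., g)" on the slide)
        w0, w1 = p[(0, y)], p[(1, y)]
        x = 0 if rng.random() < w0 / (w0 + w1) else 1
        # draw y given x ("draw p(g, .)")
        w0, w1 = p[(x, 0)], p[(x, 1)]
        y = 0 if rng.random() < w0 / (w0 + w1) else 1
        samples.append((x, y))
    return samples

samples = gibbs(20000)
# the empirical frequency of (0,0) should approach p(0,0) = 0.45
freq = samples.count((0, 0)) / len(samples)
```

Note the slow mixing: each conditional keeps the current value with probability 0.9, so the chain lingers on a diagonal state for many steps before switching, which is why the estimate needs thousands of samples.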
  42. 42. Gibbs sampling for clustering
  43. 43. Gibbs sampling for clustering: random initialization
  44. 44. Gibbs sampling for clustering: sample cluster labels
  45. 45. Gibbs sampling for clustering: resample cluster model
  46. 46. Gibbs sampling for clustering: resample cluster labels
  47. 47. Gibbs sampling for clustering: resample cluster model
  48. 48. Gibbs sampling for clustering: resample cluster labels
  49. 49. Gibbs sampling for clustering: resample cluster model (e.g. Mahout Dirichlet Process Clustering)
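The loop on these slides (random initialization, then alternately sampling cluster labels and resampling the cluster model) can be sketched for a toy one-dimensional Gaussian mixture. The unit-variance model and all names here are illustrative choices, not the Mahout implementation:

```python
import math
import random

def gibbs_cluster(xs, k=2, iters=50, seed=0):
    """Gibbs sampling for clustering: random initialization, then
    alternate (1) sample cluster labels given the cluster model and
    (2) resample the cluster model given the labels.
    Toy model: 1-d clusters with unit variance."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in xs]          # random initialization
    means = [rng.uniform(min(xs), max(xs)) for _ in range(k)]
    for _ in range(iters):
        # sample cluster labels given the current cluster model
        for i, x in enumerate(xs):
            w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            labels[i] = rng.choices(range(k), weights=w)[0]
        # resample the cluster model given the labels
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            if members:
                post_mean = sum(members) / len(members)
                post_var = 1.0 / len(members)        # flat prior on the mean
            else:
                post_mean, post_var = 0.0, 100.0     # empty cluster: broad prior
            means[c] = rng.gauss(post_mean, math.sqrt(post_var))
    return labels, means

xs = [0.1, -0.2, 0.3, 9.8, 10.1, 10.3]
labels, means = gibbs_cluster(xs)
# the two well-separated groups end up in different clusters
```

With well-separated data the chain locks into the correct partition within a few sweeps; with overlapping clusters the labels keep flipping, which is exactly the posterior uncertainty the sampler is supposed to represent.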
  50. 50. Topic models
  51. 51. Grouping objects
  52. 52. Grouping objects: Singapore
  53. 53. Grouping objects
  54. 54. Grouping objects
  55. 55. Grouping objects: airline, university, restaurant
  56. 56. Grouping objects: Australia, USA, Singapore
  57. 57. Topic Models: each object is a mixture of groups (Australia, Singapore, USA, university, airline, food)
  58. 58. Clustering & Topic Models. Clustering: group objects by prototypes.
  59. 59. Clustering & Topic Models. Clustering: group objects by prototypes. Topics: decompose objects into prototypes.
  60. 60. Clustering & Topic Models: both share the graphical model prior α → probability vector θ → label y → instance x. In clustering, θ is the cluster probability and y a per-object cluster label; in Latent Dirichlet Allocation, θ is a per-document topic probability and y a per-instance topic label.
  61. 61. Clustering & Topic Models: matrix view. Documents x = (cluster/topic membership) x (distributions). Clustering: (0, 1) membership matrix; topic model: stochastic matrix; LSI: arbitrary matrices.
  62. 62. Topics in text. Latent Dirichlet Allocation; Blei, Ng, Jordan, JMLR 2003
  63. 63. Joint Probability Distribution
p(\theta, z, \psi, x \mid \alpha, \beta) = \prod_{k=1}^{K} p(\psi_k \mid \beta) \prod_{i=1}^{m} p(\theta_i \mid \alpha) \prod_{i=1}^{m} \prod_{j=1}^{m_i} p(z_{ij} \mid \theta_i)\, p(x_{ij} \mid z_{ij}, \psi)
(α: topic prior; θ_i: topic probability of document i; z_ij: topic label; x_ij: instance; β: language prior; ψ_k: word distribution of topic k)
  64. 64. Joint Probability Distribution (as above): sample Ψ independently, sample θ independently, sample z independently.
  65. 65. Joint Probability Distribution: sampling Ψ, θ and z independently is slow.
  66. 66. Collapsed Sampler
p(z, x \mid \alpha, \beta) = \prod_{i=1}^{m} p(z_i \mid \alpha) \prod_{k=1}^{K} p(\{x_{ij} \mid z_{ij} = k\} \mid \beta)
  67. 67. Collapsed Sampler: sample z sequentially.
  68. 68. Collapsed Sampler: sampling z sequentially is fast.
  69. 69. Collapsed Sampler (Griffiths & Steyvers, 2005)
p(z, x \mid \alpha, \beta) = \prod_{i=1}^{m} p(z_i \mid \alpha) \prod_{k=1}^{K} p(\{x_{ij} \mid z_{ij} = k\} \mid \beta)
p(z_{ij} = t \mid \text{rest}) \propto \frac{n^{-ij}(t, d) + \alpha_t}{n^{-i}(d) + \sum_t \alpha_t} \cdot \frac{n^{-ij}(t, w) + \beta_t}{n^{-i}(t) + \sum_t \beta_t}
  70. 70. Collapsed Sampler: the collapsed conditional above makes sequential sampling fast.
  71. 71. Sequential Algorithm (Gibbs sampler)• For 1000 iterations do • For each document do • For each word in the document do • Resample topic for the word • Update local (document, topic) table • Update CPU local (word, topic) table • Update global (word, topic) table
  72. 72. Sequential Algorithm (Gibbs sampler), as above: updating the global (word, topic) table after every word kills parallelism.
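The sequential collapsed sampler above, with the Griffiths & Steyvers update, fits in a short sketch. This is a single-machine, single-thread toy (variable names illustrative), precisely the version whose global table update does not parallelize:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=300, seed=0):
    """Collapsed Gibbs sampler for LDA on a list of token lists."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})
    n_dt = defaultdict(int)   # local (document, topic) table
    n_tw = defaultdict(int)   # global (word, topic) table
    n_t = defaultdict(int)    # topic totals
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t = z[d][j]
            n_dt[d, t] += 1; n_tw[w, t] += 1; n_t[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                t = z[d][j]                       # remove current assignment
                n_dt[d, t] -= 1; n_tw[w, t] -= 1; n_t[t] -= 1
                # resample: (n(t,d)+alpha) * (n(t,w)+beta) / (n(t)+V*beta)
                weights = [(n_dt[d, k] + alpha) * (n_tw[w, k] + beta)
                           / (n_t[k] + V * beta) for k in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][j] = t                       # add it back with the new topic
                n_dt[d, t] += 1; n_tw[w, t] += 1; n_t[t] += 1
    return z, n_tw

docs = [["apple", "fruit", "apple"], ["cpu", "chip", "cpu"],
        ["fruit", "apple", "fruit"], ["chip", "cpu", "chip"]]
z, n_tw = lda_gibbs(docs, K=2)
```

On this tiny corpus the sampler typically separates the fruit words from the hardware words into the two topics; the point of the following sections is how to run this loop when `n_tw` no longer fits on one machine.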
  73. 73. 3. Design Principles
  74. 74. Scaling Problems
  75. 75. 3 Problems (clustering picture: data points, cluster IDs, cluster means/variances, cluster weights)
  76. 76. 3 Problems: data, local state, global state
  77. 77. 3 Problems: the data is huge, the local state is only local, the global state is too big for a single machine
  78. 78. 3 Problems. Vanilla LDA: local state and global state. User profiling: data, local state and global state.
  79. 79. 3 Problems
  80. 80. 3 Problems: local state is too large (does not fit into memory); global state is too large (does not fit into memory); network load & barriers.
  81. 81. 3 Problems
  82. 82. 3 Problems
  83. 83. 3 Problems and fixes
• local state is too large (does not fit into memory): stream local data from disk
• network load & barriers: asynchronous synchronization
• global state is too large (does not fit into memory): partial view per machine
  84. 84. Global state synchronization
  85. 85. Lock and Barrier• Changes in μ affect distribution in z• Changes in z affect distribution in μ (in particular for a collapsed sampler)• Lock z, then recompute μ• Lock all but single zi (for collapsed sampler)
  86. 86. Variable replication (background sync, 1 copy per machine)
• No locks between machines to access z
• Synchronization mechanism needed for global μ
• In LDA this is the local copy of the (topic, word) counts
  87. 87. Distribution: global replica per cluster / rack
  88. 88. Distribution
  89. 89. Message Passing
• Start with common state
• Child stores old and new state
• Parent keeps global state
• Transmit differences asynchronously
• Inverse element for difference
• Abelian group for commutativity (sum, log-sum, cyclic group, exponential families)
Local to global: x_global ← x_global + (x − x_old); global to local: x ← x + (x_global − x_old), then x_old ← x_global.
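A minimal sketch of this message-passing scheme for count vectors, where the abelian group is integer addition and subtraction gives the inverse element. Class and method names are hypothetical, not from the actual system:

```python
class ReplicatedCounts:
    """Each machine keeps a local copy x and the state x_old as of the
    last sync; only the difference x - x_old is sent to the global copy."""
    def __init__(self, size, global_state):
        self.global_state = global_state   # parent / global replica (shared)
        self.x = [0] * size                # local working copy
        self.x_old = [0] * size            # state as of the last sync

    def local_update(self, i, delta):
        self.x[i] += delta                 # sampler updates run locally, lock free

    def sync(self):
        for i in range(len(self.x)):
            # local to global: x_global += (x - x_old)
            self.global_state[i] += self.x[i] - self.x_old[i]
            # global to local: pull in every other machine's updates
            self.x[i] = self.global_state[i]
            self.x_old[i] = self.x[i]

shared = [0, 0]
a = ReplicatedCounts(2, shared)
b = ReplicatedCounts(2, shared)
a.local_update(0, 3)
b.local_update(0, 4)
b.local_update(1, 1)
a.sync(); b.sync(); a.sync()
# shared == [7, 1]; both replicas now agree with it
```

Because addition commutes, the machines can sync in any order and asynchronously; no update is lost and no barrier is needed.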
  90. 90. Distribution
• Dedicated server for variables: insufficient bandwidth (hotspots), insufficient memory
• Select server via consistent hashing (random trees a la Karger et al. 1999 if needed): m(x) = argmin_{m ∈ M} h(x, m)
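The argmin-hash rule can be sketched as follows. Every client computes the same owner for a key without any directory service, and removing a machine only moves the keys that machine owned (md5 is an arbitrary stable hash choice here):

```python
import hashlib

def h(x, m):
    # stable joint hash of key and machine name
    return int(hashlib.md5(f"{x}:{m}".encode()).hexdigest(), 16)

def server_for(x, machines):
    # m(x) = argmin over m in M of h(x, m)
    return min(machines, key=lambda m: h(x, m))

machines = ["server-%d" % i for i in range(8)]
keys = ["key-%d" % i for i in range(100)]
before = {k: server_for(k, machines) for k in keys}
# take server-7 out of the pool
after = {k: server_for(k, machines[:-1]) for k in keys}
# only the keys that server-7 owned get reassigned
moved = [k for k in keys if before[k] != after[k]]
```

This gives the O(1/k) storage and fast failover properties claimed on the next slide: each machine owns roughly 1/k of the keys, and a machine's failure disturbs only its own share.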
  91. 91. Distribution & fault tolerance• Storage is O(1/k) per machine• Communication is O(1) per machine• Fast snapshots O(1/k) per machine (stop sync and dump state per vertex)• O(k) open connections per machine• O(1/k) throughput per machine m(x) = argmin h(x, m) m2M
  92. 92. Distribution & fault tolerance
  93. 93. Distribution & fault tolerance
  94. 94. Synchronization Protocol
  95. 95. Synchronization
• Data rate between machines is O(1/k)
• Machines operate asynchronously (barrier free)
• Solution
  • Schedule message pairs
  • Communicate with r random machines simultaneously (figure: r = 1, local vs. global state)
  96. 96. Synchronization
  97. 97. Synchronization
  98. 98. Synchronization• Data rate between machines is O(1/k)• Machines operate asynchronously (barrier free)• Solution • Schedule message pairs • Communicate with r random machines simultaneously • Use Luby-Rackoff PRPG for load balancing• Efficiency guarantee 4 simultaneous connections are sufficient
  99. 99. Architecture
  100. 100. Sequential Algorithm (Gibbs sampler)• For 1000 iterations do • For each document do • For each word in the document do • Resample topic for the word • Update local (document, topic) table • Update CPU local (word, topic) table • Update global (word, topic) table
  101. 101. Sequential Algorithm (Gibbs sampler), as above: updating the global (word, topic) table kills parallelism.
  102. 102. Distributed asynchronous sampler• For 1000 iterations do (independently per computer) • For each thread/core do • For each document do • For each word in the document do • Resample topic for the word • Update local (document, topic) table • Generate computer local (word, topic) message • In parallel update local (word, topic) table • In parallel update global (word, topic) table
  103. 103. Distributed asynchronous sampler, as above: concurrent use of cpu, hdd and net; minimal view of the global state; continuous sync; barrier free.
  104. 104. Multicore Architecture (Intel Threading Building Blocks)
Pipeline: tokens from file → samplers → count & topics combiner → updater → output to file; diagnostics & optimization; all samplers share a joint state table.
• Decouple multithreaded sampling and updating: (almost) avoids stalling for locks in the sampler
• Joint state table: much less memory required; samplers synchronized (10 docs vs. millions delay)
• Hyperparameter update via stochastic gradient descent
• No need to keep documents in memory (streaming)
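The decoupling of samplers from the updater can be illustrated with threads and a message queue. This is a toy stand-in for the TBB pipeline: the "topic" computed per token is a dummy, and only the updater thread ever writes the joint state table, so samplers never block on table locks:

```python
import threading
import queue
from collections import defaultdict

def run_pipeline(docs, n_samplers=4):
    """Sampler threads emit (word, topic, delta) messages; a single
    updater thread owns the joint (word, topic) count table."""
    messages = queue.Queue()
    state = defaultdict(int)               # joint (word, topic) count table

    def sampler(shard):
        for doc in shard:
            for word in doc:
                topic = ord(word[0]) % 2   # stand-in for resampling a topic
                messages.put((word, topic, 1))

    def updater():
        while True:
            msg = messages.get()
            if msg is None:                # sentinel: all samplers finished
                return
            word, topic, delta = msg
            state[word, topic] += delta    # only this thread writes the table

    upd = threading.Thread(target=updater)
    upd.start()
    workers = [threading.Thread(target=sampler, args=(docs[i::n_samplers],))
               for i in range(n_samplers)]
    for t in workers: t.start()
    for t in workers: t.join()
    messages.put(None)
    upd.join()
    return dict(state)

state = run_pipeline([["a", "b"], ["a"], ["c", "a"]])
```

The single-writer design is the point: samplers stay busy regardless of how slowly the table is updated, at the cost of the table lagging slightly behind the samplers, which the slides note is acceptable (a delay of 10 documents rather than millions).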
  105. 105. Cluster Architecture sampler sampler sampler sampler ice• Distributed (key,value) storage via ICE• Background asynchronous synchronization • single word at a time to avoid deadlocks • no need to have joint dictionary • uses disk, network, cpu simultaneously
  106. 106. Cluster Architecture: as above, with an ICE (key,value) storage node on each machine.
  107. 107. Making it work• Startup • Naive: randomly initialize topics on each node (read from disk if already assigned - hotstart) • Forward sampling for startup much faster • Aggregate changes on the fly• Failover • State constantly being written to disk (worst case we lose 1 iteration out of 1000) • Restart via standard startup routine• Achilles heel: need to restart from checkpoint if even a single machine dies.
  108. 108. Easily extensible• Better language model (topical n-grams) can process millions of users (vs 1000s)• Conditioning on side information (upstream) estimate topic based on authorship, source, joint user model ...• Conditioning on dictionaries (downstream) integrate topics between different languages• Time dependent sampler for user model approximate inference per episode
  109. 109. Comparison of implementations (Google LDA, Mallet, Irvine'08, Irvine'09, Yahoo LDA):
• Multicore: no (Google), yes (Mallet), yes (Irvine'08), yes (Irvine'09), yes (Yahoo LDA)
• Cluster: MPI, no, MPI, point 2 point, ICE
• State table: separate dictionary / separate / separate / split sparse / joint sparse
• Schedule: asynchronous approximate messages (Google); synchronous exact (Mallet, Irvine'08, Irvine'09); asynchronous exact (Yahoo LDA)
  110. 110. Speed (2010 numbers)• 1M documents per day on 1 computer (1000 topics per doc, 1000 words per doc)• 350k documents per day per node (context switches & memcached & stray reducers)• 8 Million docs (Pubmed) (sampler does not burn in well - too short doc) • Irvine: 128 machines, 10 hours • Yahoo: 1 machine, 11 days • Yahoo: 20 machines, 9 hours• 20 Million docs (Yahoo! News Articles) • Yahoo: 100 machines, 12 hours
  111. 111. Fast sampler• 8 Million documents, 1000 topics, {100,200,400} machines, LDA• Red (symmetric latency bound message passing)• Blue (asynchronous bandwidth bound message passing & message scheduling) • 10x faster synchronization time • 10x faster snapshots • Scheduling improves 10% already on 150 machines
  112. 112. Roadmap• Tools • Load distribution, balancing and synchronization • Clustering, Topic Models• Models • Dynamic non-parametric models • Sequential latent variable models• Inference Algorithms • Distributed batch • Sequential Monte Carlo• Applications • User profiling • News content analysis & recommendation
