GoTo Amsterdam 2013 Skinned

230
-1

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
230
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

GoTo Amsterdam 2013 Skinned

  1. 1. 1©MapR Technologies - Confidential Old and New Building Blocks Come Together For Big Data
  2. 2. 2©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node
  3. 3. 3©MapR Technologies - Confidential Embarrassment of Riches  d3.js allows really pretty pictures  node.js allows simple (not just web) servers  Storm does real-time  Hadoop does big data  d3 allows very cool visualizations
  4. 4. 4©MapR Technologies - Confidential D3 demo
  5. 5. 5©MapR Technologies - Confidential node demo
  6. 6. 6©MapR Technologies - Confidential Hadoop demo
  7. 7. 7©MapR Technologies - Confidential But …  Web camp – everything is a service with a URL or a DOM  Big data camp – non-traditional file systems  Everybody else – files and databases  They don’t like to talk to each other
  8. 8. 8©MapR Technologies - Confidential Why Not Tiered Architectures?  Tiered architectures – translations between services and cultures – standard corporate answer  Feels like molasses
  9. 9. 9©MapR Technologies - Confidential The Vision  Integrate – multiple computing paradigms – many computing communities  How? – common storage, queuing and data platforms
  10. 10. 10©MapR Technologies - Confidential For Example, …  Incoming documents with text – store in file-based queues – index in real-time using Storm and Solr – add initial engagement class, “don’t-know”  Search for documents using original text – add random noise, small for well understood docs, large for “don’t-know” docs  Record engagement
  11. 11. 11©MapR Technologies - Confidential Add Analysis  Process engagement logs – item-item cooccurrence – user-item histories  Update search index – indicator items – decrease uncertainty on well understood docs  Update user profile – item history
  12. 12. 12©MapR Technologies - Confidential Search Again  Now searches use recent views + text – recent views query indicator fields – text queries normal text data – add noise as appropriate
  13. 13. 13©MapR Technologies - Confidential And Draw a Picture  Searches and clicks can be logged – real-time metrics – real-time trending topics  What’s hot, what’s not  Popular searches  Document clusters  Word clouds
  14. 14. 14©MapR Technologies - Confidential In Pictures
  15. 15. 15©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources
  16. 16. 16©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine
  17. 17. 17©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis
  18. 18. 18©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis RenderingAdmin queries
  19. 19. 19©MapR Technologies - Confidential Which Technology? Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis Admin queries Rendering Storm/node Solr MapR D3/node Other
  20. 20. 20©MapR Technologies - Confidential Yeah, But …  This isn’t as easy as it looks  Take the real-time / long-time part
  21. 21. 21©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  22. 22. 22©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  23. 23. 23©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr indexing Cooccurrence (Mahout) Item meta- data Index shards Complete history
  24. 24. 24©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr search Web tier Item meta- data Index shards User history
  25. 25. 25©MapR Technologies - Confidential Users Catcher Storm Topic Queue Web-server http Web Data MapR
  26. 26. 26©MapR Technologies - Confidential Closer Look – Catcher Protocol Data Sources Catcher Cluster Catcher Cluster Data Sources The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  27. 27. 27©MapR Technologies - Confidential Closer Look – Catcher Queues Catcher Cluster Catcher Cluster The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. Topic File Topic File
  28. 28. 28©MapR Technologies - Confidential Closer Look – ProtoSpout The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position stored in shared file system. Topic File Topic File ProtoSpout
  29. 29. 29©MapR Technologies - Confidential Yeah, But …  What was that about adding noise in scoring?  Why would I do that??  Is there a simple answer?
  30. 30. 30©MapR Technologies - Confidential Thompson Sampling  Select each shell according to the probability that it is the best  Probability that it is the best can be computed using posterior  But I promised a simple answer P(i is best) = I E[ri |q]= max j E[rj |q] é ëê ù ûúò P(q | D) dq
  31. 31. 31©MapR Technologies - Confidential Thompson Sampling – Take 2  Sample θ  Pick i to maximize reward  Record result from using i q ~P(q | D) i = argmax j E[r |q]
  32. 32. 32©MapR Technologies - Confidential Nearly Forgotten until Recently  Citations for Thompson sampling
  33. 33. 33©MapR Technologies - Confidential Bayesian Bandit for the Search  Compute distributions based on data so far  Sample scores s1, s2 … – based on actual score – plus per doc noise from these distributions  Rank docs by si  Lemma 1: The probability of showing doc i at first position will match the probability it is the best  Lemma 2: This is as good as it gets
  34. 34. 34©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  35. 35. 35©MapR Technologies - Confidential Yeah, But …  Isn’t recommendations complicated?  How can I implement this?
  36. 36. 36©MapR Technologies - Confidential Recommendation Basics  History: User Thing 1 3 2 4 3 4 2 3 3 2 1 1 2 1
  37. 37. 37©MapR Technologies - Confidential Recommendation Basics  History as matrix:  (t1, t3) cooccur 2 times,  (t1, t4) once,  (t2, t4) once,  (t3, t4) once t1 t2 t3 t4 u1 1 0 1 0 u2 1 0 1 1 u3 0 1 0 1
  38. 38. 38©MapR Technologies - Confidential A Quick Simplification  Users who do h  Also do r Ah AT Ah( ) AT A( )h User-centric recommendations Item-centric recommendations
  39. 39. 39©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  40. 40. 40©MapR Technologies - Confidential Problems with Raw Cooccurrence  Very popular items co-occur with everything – Welcome document – Elevator music  That isn’t interesting – We want anomalous cooccurrence
  41. 41. 41©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2 t3 not t3 t1 2 1 not t1 1 1
  42. 42. 42©MapR Technologies - Confidential Spot the Anomaly  Root LLR is roughly like standard deviations A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 2 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 0.44 0.98 2.26 7.15
  43. 43. 43©MapR Technologies - Confidential Root LLR Details  In R entropy = function(k) { -sum(k*log((k==0)+(k/sum(k)))) } rootLLr = function(k) { sign = … sign * sqrt( (entropy(rowSums(k))+entropy(colSums(k)) - entropy(k))/2) }  Like sqrt(mutual information * N/2) See http://bit.ly/16DvLVK
  44. 44. 44©MapR Technologies - Confidential Threshold by Score  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  45. 45. 45©MapR Technologies - Confidential Threshold by Score  Significant cooccurrence => Indicators t1 t2 t3 t4 t1 1 0 0 1 t2 0 1 0 1 t3 0 0 1 1 t4 1 0 0 1
  46. 46. 46©MapR Technologies - Confidential Yeah, But …  Why go to all this trouble?  Does it really help?
  47. 47. 47©MapR Technologies - Confidential Real-life example
  48. 48. 48©MapR Technologies - Confidential The Real Life Issues  Exploration  Diversity  Speed  Not the last percent
  49. 49. 49©MapR Technologies - Confidential The Second Page
  50. 50. 50©MapR Technologies - Confidential Make it Worse to Make It Better  Add noise to rank 1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 17 11 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 62 38 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48  Results are worse today  But better tomorrow
  51. 51. 51©MapR Technologies - Confidential Anti-Flood  200 of the same result is no better than 2  The recommender list is a portfolio of results – If probability of success is highly correlated, then probability of at least one success is much lower  Suppressing items similar to higher ranking items helps
  52. 52. 52©MapR Technologies - Confidential The Punchline  Hybrid systems really can work today  Middle tiers aren’t as interesting as they used to be – No need for Flume … queue directly in big data system – No need for external queues, tail the data directly with Storm – No need for query systems for presentation data … read it directly with node  Absolutely require common frameworks and standard interfaces  You can do this today!
  53. 53. 53©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×