GoTo Amsterdam 2013 Skinned
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Be the first to comment
    Be the first to like this
No Downloads

Views

Total Views
404
On Slideshare
404
From Embeds
0
Number of Embeds
0

Actions

Shares
Downloads
2
Comments
0
Likes
0

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. 1©MapR Technologies - Confidential Old and New Building Blocks Come Together For Big Data
  • 2. 2©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node
  • 3. 3©MapR Technologies - Confidential Embarrassment of Riches  d3.js allows really pretty pictures  node.js allows simple (not just web) servers  Storm does real-time  Hadoop does big data  d3 allows very cool visualizations
  • 4. 4©MapR Technologies - Confidential D3 demo
  • 5. 5©MapR Technologies - Confidential node demo
  • 6. 6©MapR Technologies - Confidential Hadoop demo
  • 7. 7©MapR Technologies - Confidential But …  Web camp – everything is a service with a URL or a DOM  Big data camp – non-traditional file systems  Everybody else – files and databases  They don’t like to talk to each other
  • 8. 8©MapR Technologies - Confidential Why Not Tiered Architectures?  Tiered architectures – translations between services and cultures – standard corporate answer  Feels like molasses
  • 9. 9©MapR Technologies - Confidential The Vision  Integrate – multiple computing paradigms – many computing communities  How? – common storage, queuing and data platforms
  • 10. 10©MapR Technologies - Confidential For Example, …  Incoming documents with text – store in file-based queues – index in real-time using Storm and Solr – add initial engagement class, “don’t-know”  Search for documents using original text – add random noise, small for well understood docs, large for “don’t-know” docs  Record engagement
  • 11. 11©MapR Technologies - Confidential Add Analysis  Process engagement logs – item-item cooccurrence – user-item histories  Update search index – indicator items – decrease uncertainty on well understood docs  Update user profile – item history
  • 12. 12©MapR Technologies - Confidential Search Again  Now searches use recent views + text – recent views query indicator fields – text queries normal text data – add noise as appropriate
  • 13. 13©MapR Technologies - Confidential And Draw a Picture  Searches and clicks can be logged – real-time metrics – real-time trending topics  What’s hot, what’s not  Popular searches  Document clusters  Word clouds
  • 14. 14©MapR Technologies - Confidential In Pictures
  • 15. 15©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources
  • 16. 16©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine
  • 17. 17©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis
  • 18. 18©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis RenderingAdmin queries
  • 19. 19©MapR Technologies - Confidential Which Technology? Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis Admin queries Rendering Storm/node Solr MapR D3/node Other
  • 20. 20©MapR Technologies - Confidential Yeah, But …  This isn’t as easy as it looks  Take the real-time / long-time part
  • 21. 21©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  • 22. 22©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  • 23. 23©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr indexing Cooccurrence (Mahout) Item meta- data Index shards Complete history
  • 24. 24©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr search Web tier Item meta- data Index shards User history
  • 25. 25©MapR Technologies - Confidential Users Catcher Storm Topic Queue Web-server http Web Data MapR
  • 26. 26©MapR Technologies - Confidential Closer Look – Catcher Protocol Data Sources Catcher Cluster Catcher Cluster Data Sources The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  • 27. 27©MapR Technologies - Confidential Closer Look – Catcher Queues Catcher Cluster Catcher Cluster The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. Topic File Topic File
  • 28. 28©MapR Technologies - Confidential Closer Look – ProtoSpout The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position stored in shared file system. Topic File Topic File ProtoSpout
  • 29. 29©MapR Technologies - Confidential Yeah, But …  What was that about adding noise in scoring?  Why would I do that??  Is there a simple answer?
  • 30. 30©MapR Technologies - Confidential Thompson Sampling  Select each shell according to the probability that it is the best  Probability that it is the best can be computed using posterior  But I promised a simple answer P(i is best) = I E[ri |q]= max j E[rj |q] é ëê ù ûúò P(q | D) dq
  • 31. 31©MapR Technologies - Confidential Thompson Sampling – Take 2  Sample θ  Pick i to maximize reward  Record result from using i q ~P(q | D) i = argmax j E[r |q]
  • 32. 32©MapR Technologies - Confidential Nearly Forgotten until Recently  Citations for Thompson sampling
  • 33. 33©MapR Technologies - Confidential Bayesian Bandit for the Search  Compute distributions based on data so far  Sample scores s1, s2 … – based on actual score – plus per doc noise from these distributions  Rank docs by si  Lemma 1: The probability of showing doc i at first position will match the probability it is the best  Lemma 2: This is as good as it gets
  • 34. 34©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  • 35. 35©MapR Technologies - Confidential Yeah, But …  Isn’t recommendations complicated?  How can I implement this?
  • 36. 36©MapR Technologies - Confidential Recommendation Basics  History: User Thing 1 3 2 4 3 4 2 3 3 2 1 1 2 1
  • 37. 37©MapR Technologies - Confidential Recommendation Basics  History as matrix:  (t1, t3) cooccur 2 times,  (t1, t4) once,  (t2, t4) once,  (t3, t4) once t1 t2 t3 t4 u1 1 0 1 0 u2 1 0 1 1 u3 0 1 0 1
  • 38. 38©MapR Technologies - Confidential A Quick Simplification  Users who do h  Also do r Ah AT Ah( ) AT A( )h User-centric recommendations Item-centric recommendations
  • 39. 39©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  • 40. 40©MapR Technologies - Confidential Problems with Raw Cooccurrence  Very popular items co-occur with everything – Welcome document – Elevator music  That isn’t interesting – We want anomalous cooccurrence
  • 41. 41©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2 t3 not t3 t1 2 1 not t1 1 1
  • 42. 42©MapR Technologies - Confidential Spot the Anomaly  Root LLR is roughly like standard deviations A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 2 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 0.44 0.98 2.26 7.15
  • 43. 43©MapR Technologies - Confidential Root LLR Details  In R entropy = function(k) { -sum(k*log((k==0)+(k/sum(k)))) } rootLLr = function(k) { sign = … sign * sqrt( (entropy(rowSums(k))+entropy(colSums(k)) - entropy(k))/2) }  Like sqrt(mutual information * N/2) See http://bit.ly/16DvLVK
  • 44. 44©MapR Technologies - Confidential Threshold by Score  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  • 45. 45©MapR Technologies - Confidential Threshold by Score  Significant cooccurrence => Indicators t1 t2 t3 t4 t1 1 0 0 1 t2 0 1 0 1 t3 0 0 1 1 t4 1 0 0 1
  • 46. 46©MapR Technologies - Confidential Yeah, But …  Why go to all this trouble?  Does it really help?
  • 47. 47©MapR Technologies - Confidential Real-life example
  • 48. 48©MapR Technologies - Confidential The Real Life Issues  Exploration  Diversity  Speed  Not the last percent
  • 49. 49©MapR Technologies - Confidential The Second Page
  • 50. 50©MapR Technologies - Confidential Make it Worse to Make It Better  Add noise to rank 1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 17 11 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 62 38 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48  Results are worse today  But better tomorrow
  • 51. 51©MapR Technologies - Confidential Anti-Flood  200 of the same result is no better than 2  The recommender list is a portfolio of results – If probability of success is highly correlated, then probability of at least one success is much lower  Suppressing items similar to higher ranking items helps
  • 52. 52©MapR Technologies - Confidential The Punchline  Hybrid systems really can work today  Middle tiers aren’t as interesting as they used to be – No need for Flume … queue directly in big data system – No need for external queues, tail the data directly with Storm – No need for query systems for presentation data … read it directly with node  Absolutely require common frameworks and standard interfaces  You can do this today!
  • 53. 53©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node