Goto amsterdam-2013-skinned

829 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
829
On SlideShare
0
From Embeds
0
Number of Embeds
4
Actions
Shares
0
Downloads
33
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Goto amsterdam-2013-skinned

  1. 1. 1©MapR Technologies - ConfidentialOld and New Building BlocksCome Together For Big Data
  2. 2. 2©MapR Technologies - Confidential Contact:tdunning@maprtech.com@ted_dunning Slides and suchhttp://slideshare.net/tdunning Hash tags: #mapr #goto #d3 #node
  3. 3. 3©MapR Technologies - ConfidentialEmbarrassment of Riches d3.js allows really pretty pictures node.js allows simple (not just web) servers Storm does real-time Hadoop does big data d3 allows very cool visualizations
  4. 4. 4©MapR Technologies - ConfidentialD3 demo
  5. 5. 5©MapR Technologies - Confidentialnode demo
  6. 6. 6©MapR Technologies - ConfidentialHadoopdemo
  7. 7. 7©MapR Technologies - ConfidentialBut … Web camp– everything is a service with a URL or a DOM Big data camp– non-traditional file systems Everybody else– files and databases They don’t like to talk to each other
  8. 8. 8©MapR Technologies - ConfidentialWhy Not Tiered Architectures? Tiered architectures– translations between services and cultures– standard corporate answer Feels like molasses
  9. 9. 9©MapR Technologies - ConfidentialThe Vision Integrate– multiple computing paradigms– many computing communities How?– common storage, queuing and data platforms
  10. 10. 10©MapR Technologies - ConfidentialFor Example, … Incoming documents with text– store in file-based queues– index in real-time using Storm and Solr– add initial engagement class, “don’t-know” Search for documents using original text– add random noise, small for well understood docs, large for “don’t-know”docs Record engagement
  11. 11. 11©MapR Technologies - ConfidentialAdd Analysis Process engagement logs– item-item cooccurrence– user-item histories Update search index– indicator items– decrease uncertainty on well understood docs Update user profile– item history
  12. 12. 12©MapR Technologies - ConfidentialSearch Again Now searches use recent views + text– recent views query indicator fields– text queries normal text data– add noise as appropriate
  13. 13. 13©MapR Technologies - ConfidentialAnd Draw a Picture Searches and clicks can be logged– real-time metrics– real-time trending topics What’s hot, what’s not Popular searches Document clusters Word clouds
  14. 14. 14©MapR Technologies - ConfidentialIn Pictures
  15. 15. 15©MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsources
  16. 16. 16©MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine
  17. 17. 17©MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysis
  18. 18. 18©MapR Technologies - ConfidentialIn PicturesDocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysisUsageanalysisRenderingAdminqueries
  19. 19. 19©MapR Technologies - ConfidentialWhich Technology?DocqueueSearchindexReal-timeindexingDocsourcesUserqueriesSearchengine LogsRecommendationanalysisUsageanalysisAdminqueriesRenderingStorm/nodeSolrMapRD3/nodeOther
  20. 20. 20©MapR Technologies - ConfidentialYeah, But … This isn’t as easy as it looks Take the real-time / long-time part
  21. 21. 21©MapR Technologies - ConfidentialtnowHadoop is Not Very Real-timeUnprocessedDataFullyprocessedLatest fullperiodHadoop jobtakes thislong for thisdata
  22. 22. 22©MapR Technologies - ConfidentialtnowHadoop worksgreat back hereStormworkshereReal-time and Long-time togetherBlendedviewBlendedviewBlendedView
  23. 23. 23©MapR Technologies - ConfidentialSolRIndexerSolRIndexerSolrindexingCooccurrence(Mahout)Item meta-dataIndexshardsCompletehistory
  24. 24. 24©MapR Technologies - ConfidentialSolRIndexerSolRIndexerSolrsearchWeb tierItem meta-dataIndexshardsUserhistory
  25. 25. 25©MapR Technologies - ConfidentialUsersCatcher StormTopicQueueWeb-serverhttpWebDataMapR
  26. 26. 26©MapR Technologies - ConfidentialCloser Look – Catcher ProtocolDataSourcesCatcherClusterCatcherClusterDataSourcesThe data sources and catcherscommunicate with a very simpleprotocol.Hello() => list of catchersLog(topic,message) =>(OK|FAIL, redirect-to-catcher)
  27. 27. 27©MapR Technologies - ConfidentialCloser Look – Catcher QueuesCatcherClusterCatcherClusterThe catchers forward log requeststo the correct catcher and returnthat host in the reply to allow theclient to avoid the extra hop.Each topic file is appended byexactly one catcher.Topic files are kept in shared filestorage.TopicFileTopicFile
  28. 28. 28©MapR Technologies - ConfidentialCloser Look – ProtoSpoutThe ProtoSpout tails the topic files,parses log records into tuples andinjects them into the Stormtopology.Last fully acked position stored inshared file system.TopicFileTopicFileProtoSpout
  29. 29. 29©MapR Technologies - ConfidentialYeah, But … What was that about adding noise in scoring? Why would I do that?? Is there a simple answer?
  30. 30. 30©MapR Technologies - ConfidentialThompson Sampling Select each shell according to the probability that it is the best Probability that it is the best can be computed using posterior But I promised a simple answerP(i is best) = I E[ri |q]= maxjE[rj |q]éëêùûúò P(q | D) dq
  31. 31. 31©MapR Technologies - ConfidentialThompson Sampling – Take 2 Sample θ Pick i to maximize reward Record result from using iq ~P(q | D)i = argmaxjE[r |q]
  32. 32. 32©MapR Technologies - ConfidentialNearly Forgotten until Recently Citations for Thompson sampling
  33. 33. 33©MapR Technologies - ConfidentialBayesian Bandit for the Search Compute distributions based on data so far Sample scores s1, s2 …– based on actual score– plus per doc noise from these distributions Rank docs by si Lemma 1: The probability of showing doc i at first position willmatch the probability it is the best Lemma 2: This is as good as it gets
  34. 34. 34©MapR Technologies - ConfidentialAnd it works!11000 100 200 300 400 500 600 700 800 900 10000.1200.010.020.030.040.050.060.070.080.090.10.11nregretε- greedy, ε = 0.05Bayesian Bandit with Gamma- Normal
  35. 35. 35©MapR Technologies - ConfidentialYeah, But … Isn’t recommendations complicated? How can I implement this?
  36. 36. 36©MapR Technologies - ConfidentialRecommendation Basics History:User Thing1 32 43 42 33 21 12 1
  37. 37. 37©MapR Technologies - ConfidentialRecommendation Basics History as matrix: (t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) oncet1 t2 t3 t4u1 1 0 1 0u2 1 0 1 1u3 0 1 0 1
  38. 38. 38©MapR Technologies - ConfidentialA Quick Simplification Users who do h Also do rAhATAh( )ATA( )hUser-centric recommendationsItem-centric recommendations
  39. 39. 39©MapR Technologies - ConfidentialRecommendation Basics Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2
  40. 40. 40©MapR Technologies - ConfidentialProblems with Raw Cooccurrence Very popular items co-occur with everything– Welcome document– Elevator music That isn’t interesting– We want anomalous cooccurrence
  41. 41. 41©MapR Technologies - ConfidentialRecommendation Basics Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2t3 not t3t1 2 1not t1 1 1
  42. 42. 42©MapR Technologies - ConfidentialSpot the Anomaly Root LLR is roughly like standard deviationsA not AB 13 1000not B 1000 100,000A not AB 1 0not B 0 2A not AB 1 0not B 0 10,000A not AB 10 0not B 0 100,0000.44 0.982.26 7.15
  43. 43. 43©MapR Technologies - ConfidentialRoot LLR Details In Rentropy = function(k) {-sum(k*log((k==0)+(k/sum(k))))}rootLLr = function(k) {sign = …sign * sqrt((entropy(rowSums(k))+entropy(colSums(k))- entropy(k))/2)} Like sqrt(mutual information * N/2)See http://bit.ly/16DvLVK
  44. 44. 44©MapR Technologies - ConfidentialThreshold by Score Coocurrencet1 t2 t3 t4t1 2 0 2 1t2 0 1 0 1t3 2 0 1 1t4 1 1 1 2
  45. 45. 45©MapR Technologies - ConfidentialThreshold by Score Significant cooccurrence => Indicatorst1 t2 t3 t4t1 1 0 0 1t2 0 1 0 1t3 0 0 1 1t4 1 0 0 1
  46. 46. 46©MapR Technologies - ConfidentialYeah, But … Why go to all this trouble? Does it really help?
  47. 47. 47©MapR Technologies - ConfidentialReal-life example
  48. 48. 48©MapR Technologies - ConfidentialThe Real Life Issues Exploration Diversity Speed Not the last percent
  49. 49. 49©MapR Technologies - ConfidentialThe Second Page
  50. 50. 50©MapR Technologies - ConfidentialMake it Worse to Make It Better Add noise to rank1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 1711 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 6238 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48 Results are worse today But better tomorrow
  51. 51. 51©MapR Technologies - ConfidentialAnti-Flood 200 of the same result is no better than 2 The recommender list is a portfolio of results– If probability of success is highly correlated, then probability of at least onesuccess is much lower Suppressing items similar to higher ranking items helps
  52. 52. 52©MapR Technologies - ConfidentialThe Punchline Hybrid systems really can work today Middle tiers aren’t as interesting as they used to be– No need for Flume … queue directly in big data system– No need for external queues, tail the data directly with Storm– No need for query systems for presentation data … read it directly withnode Absolutely require common frameworks and standard interfaces You can do this today!
  53. 53. 53©MapR Technologies - Confidential Contact:tdunning@maprtech.com@ted_dunning Slides and suchhttp://slideshare.net/tdunning Hash tags: #mapr #goto #d3 #node

×