Your SlideShare is downloading. ×
0
1©MapR Technologies - Confidential
Old and New Building Blocks
Come Together For Big Data
2©MapR Technologies - Confidential
 Contact:
tdunning@maprtech.com
@ted_dunning
 Slides and such
http://slideshare.net/t...
3©MapR Technologies - Confidential
Embarrassment of Riches
 d3.js allows really pretty pictures
 node.js allows simple (...
4©MapR Technologies - Confidential
D3 demo
5©MapR Technologies - Confidential
node demo
6©MapR Technologies - Confidential
Hadoop
demo
7©MapR Technologies - Confidential
But …
 Web camp
– everything is a service with a URL or a DOM
 Big data camp
– non-tr...
8©MapR Technologies - Confidential
Why Not Tiered Architectures?
 Tiered architectures
– translations between services an...
9©MapR Technologies - Confidential
The Vision
 Integrate
– multiple computing paradigms
– many computing communities
 Ho...
10©MapR Technologies - Confidential
For Example, …
 Incoming documents with text
– store in file-based queues
– index in ...
11©MapR Technologies - Confidential
Add Analysis
 Process engagement logs
– item-item cooccurrence
– user-item histories
...
12©MapR Technologies - Confidential
Search Again
 Now searches use recent views + text
– recent views query indicator fie...
13©MapR Technologies - Confidential
And Draw a Picture
 Searches and clicks can be logged
– real-time metrics
– real-time...
14©MapR Technologies - Confidential
In Pictures
15©MapR Technologies - Confidential
In Pictures
Doc
queue
Search
index
Real-time
indexing
Doc
sources
16©MapR Technologies - Confidential
In Pictures
Doc
queue
Search
index
Real-time
indexing
Doc
sources
User
queries
Search
...
17©MapR Technologies - Confidential
In Pictures
Doc
queue
Search
index
Real-time
indexing
Doc
sources
User
queries
Search
...
18©MapR Technologies - Confidential
In Pictures
Doc
queue
Search
index
Real-time
indexing
Doc
sources
User
queries
Search
...
19©MapR Technologies - Confidential
Which Technology?
Doc
queue
Search
index
Real-time
indexing
Doc
sources
User
queries
S...
20©MapR Technologies - Confidential
Yeah, But …
 This isn’t as easy as it looks
 Take the real-time / long-time part
21©MapR Technologies - Confidential
t
now
Hadoop is Not Very Real-time
Unprocessed
Data
Fully
processed
Latest full
period...
22©MapR Technologies - Confidential
t
now
Hadoop works
great back here
Storm
works
here
Real-time and Long-time together
B...
23©MapR Technologies - Confidential
SolR
Indexer
SolR
Indexer
Solr
indexing
Cooccurrence
(Mahout)
Item meta-
data
Index
sh...
24©MapR Technologies - Confidential
SolR
Indexer
SolR
Indexer
Solr
search
Web tier
Item meta-
data
Index
shards
User
histo...
25©MapR Technologies - Confidential
Users
Catcher Storm
Topic
Queue
Web-server
http
Web
Data
MapR
26©MapR Technologies - Confidential
Closer Look – Catcher Protocol
Data
Sources
Catcher
Cluster
Catcher
Cluster
Data
Sourc...
27©MapR Technologies - Confidential
Closer Look – Catcher Queues
Catcher
Cluster
Catcher
Cluster
The catchers forward log ...
28©MapR Technologies - Confidential
Closer Look – ProtoSpout
The ProtoSpout tails the topic files,
parses log records into...
29©MapR Technologies - Confidential
Yeah, But …
 What was that about adding noise in scoring?
 Why would I do that??
 I...
30©MapR Technologies - Confidential
Thompson Sampling
 Select each shell according to the probability that it is the best...
31©MapR Technologies - Confidential
Thompson Sampling – Take 2
 Sample θ
 Pick i to maximize reward
 Record result from...
32©MapR Technologies - Confidential
Nearly Forgotten until Recently
 Citations for Thompson sampling
33©MapR Technologies - Confidential
Bayesian Bandit for the Search
 Compute distributions based on data so far
 Sample s...
34©MapR Technologies - Confidential
And it works!
11000 100 200 300 400 500 600 700 800 900 1000
0.12
0
0.01
0.02
0.03
0.0...
35©MapR Technologies - Confidential
Yeah, But …
 Isn’t recommendations complicated?
 How can I implement this?
36©MapR Technologies - Confidential
Recommendation Basics
 History:
User Thing
1 3
2 4
3 4
2 3
3 2
1 1
2 1
37©MapR Technologies - Confidential
Recommendation Basics
 History as matrix:
 (t1, t3) cooccur 2 times,
 (t1, t4) once...
38©MapR Technologies - Confidential
A Quick Simplification
 Users who do h
 Also do r
Ah
AT
Ah( )
AT
A( )h
User-centric ...
39©MapR Technologies - Confidential
Recommendation Basics
 Coocurrence
t1 t2 t3 t4
t1 2 0 2 1
t2 0 1 0 1
t3 2 0 1 1
t4 1 ...
40©MapR Technologies - Confidential
Problems with Raw Cooccurrence
 Very popular items co-occur with everything
– Welcome...
41©MapR Technologies - Confidential
Recommendation Basics
 Coocurrence
t1 t2 t3 t4
t1 2 0 2 1
t2 0 1 0 1
t3 2 0 1 1
t4 1 ...
42©MapR Technologies - Confidential
Spot the Anomaly
 Root LLR is roughly like standard deviations
A not A
B 13 1000
not ...
43©MapR Technologies - Confidential
Root LLR Details
 In R
entropy = function(k) {
-sum(k*log((k==0)+(k/sum(k))))
}
rootL...
44©MapR Technologies - Confidential
Threshold by Score
 Coocurrence
t1 t2 t3 t4
t1 2 0 2 1
t2 0 1 0 1
t3 2 0 1 1
t4 1 1 1...
45©MapR Technologies - Confidential
Threshold by Score
 Significant cooccurrence => Indicators
t1 t2 t3 t4
t1 1 0 0 1
t2 ...
46©MapR Technologies - Confidential
Yeah, But …
 Why go to all this trouble?
 Does it really help?
47©MapR Technologies - Confidential
Real-life example
48©MapR Technologies - Confidential
The Real Life Issues
 Exploration
 Diversity
 Speed
 Not the last percent
49©MapR Technologies - Confidential
The Second Page
50©MapR Technologies - Confidential
Make it Worse to Make It Better
 Add noise to rank
1 2 8 7 6 3 5 4 10 13 21 18 12 9 1...
51©MapR Technologies - Confidential
Anti-Flood
 200 of the same result is no better than 2
 The recommender list is a po...
52©MapR Technologies - Confidential
The Punchline
 Hybrid systems really can work today
 Middle tiers aren’t as interest...
53©MapR Technologies - Confidential
 Contact:
tdunning@maprtech.com
@ted_dunning
 Slides and such
http://slideshare.net/...
Upcoming SlideShare
Loading in...5
×

GoTo Amsterdam 2013 Skinned

204

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
204
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "GoTo Amsterdam 2013 Skinned"

  1. 1. 1©MapR Technologies - Confidential Old and New Building Blocks Come Together For Big Data
  2. 2. 2©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node
  3. 3. 3©MapR Technologies - Confidential Embarrassment of Riches  d3.js allows really pretty pictures  node.js allows simple (not just web) servers  Storm does real-time  Hadoop does big data  d3 allows very cool visualizations
  4. 4. 4©MapR Technologies - Confidential D3 demo
  5. 5. 5©MapR Technologies - Confidential node demo
  6. 6. 6©MapR Technologies - Confidential Hadoop demo
  7. 7. 7©MapR Technologies - Confidential But …  Web camp – everything is a service with a URL or a DOM  Big data camp – non-traditional file systems  Everybody else – files and databases  They don’t like to talk to each other
  8. 8. 8©MapR Technologies - Confidential Why Not Tiered Architectures?  Tiered architectures – translations between services and cultures – standard corporate answer  Feels like molasses
  9. 9. 9©MapR Technologies - Confidential The Vision  Integrate – multiple computing paradigms – many computing communities  How? – common storage, queuing and data platforms
  10. 10. 10©MapR Technologies - Confidential For Example, …  Incoming documents with text – store in file-based queues – index in real-time using Storm and Solr – add initial engagement class, “don’t-know”  Search for documents using original text – add random noise, small for well understood docs, large for “don’t-know” docs  Record engagement
  11. 11. 11©MapR Technologies - Confidential Add Analysis  Process engagement logs – item-item cooccurrence – user-item histories  Update search index – indicator items – decrease uncertainty on well understood docs  Update user profile – item history
  12. 12. 12©MapR Technologies - Confidential Search Again  Now searches use recent views + text – recent views query indicator fields – text queries normal text data – add noise as appropriate
  13. 13. 13©MapR Technologies - Confidential And Draw a Picture  Searches and clicks can be logged – real-time metrics – real-time trending topics  What’s hot, what’s not  Popular searches  Document clusters  Word clouds
  14. 14. 14©MapR Technologies - Confidential In Pictures
  15. 15. 15©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources
  16. 16. 16©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine
  17. 17. 17©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis
  18. 18. 18©MapR Technologies - Confidential In Pictures Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis RenderingAdmin queries
  19. 19. 19©MapR Technologies - Confidential Which Technology? Doc queue Search index Real-time indexing Doc sources User queries Search engine Logs Recommendation analysis Usage analysis Admin queries Rendering Storm/node Solr MapR D3/node Other
  20. 20. 20©MapR Technologies - Confidential Yeah, But …  This isn’t as easy as it looks  Take the real-time / long-time part
  21. 21. 21©MapR Technologies - Confidential t now Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
  22. 22. 22©MapR Technologies - Confidential t now Hadoop works great back here Storm works here Real-time and Long-time together Blended view Blended view Blended View
  23. 23. 23©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr indexing Cooccurrence (Mahout) Item meta- data Index shards Complete history
  24. 24. 24©MapR Technologies - Confidential SolR Indexer SolR Indexer Solr search Web tier Item meta- data Index shards User history
  25. 25. 25©MapR Technologies - Confidential Users Catcher Storm Topic Queue Web-server http Web Data MapR
  26. 26. 26©MapR Technologies - Confidential Closer Look – Catcher Protocol Data Sources Catcher Cluster Catcher Cluster Data Sources The data sources and catchers communicate with a very simple protocol. Hello() => list of catchers Log(topic,message) => (OK|FAIL, redirect-to-catcher)
  27. 27. 27©MapR Technologies - Confidential Closer Look – Catcher Queues Catcher Cluster Catcher Cluster The catchers forward log requests to the correct catcher and return that host in the reply to allow the client to avoid the extra hop. Each topic file is appended by exactly one catcher. Topic files are kept in shared file storage. Topic File Topic File
  28. 28. 28©MapR Technologies - Confidential Closer Look – ProtoSpout The ProtoSpout tails the topic files, parses log records into tuples and injects them into the Storm topology. Last fully acked position stored in shared file system. Topic File Topic File ProtoSpout
  29. 29. 29©MapR Technologies - Confidential Yeah, But …  What was that about adding noise in scoring?  Why would I do that??  Is there a simple answer?
  30. 30. 30©MapR Technologies - Confidential Thompson Sampling  Select each shell according to the probability that it is the best  Probability that it is the best can be computed using posterior  But I promised a simple answer P(i is best) = I E[ri |q]= max j E[rj |q] é ëê ù ûúò P(q | D) dq
  31. 31. 31©MapR Technologies - Confidential Thompson Sampling – Take 2  Sample θ  Pick i to maximize reward  Record result from using i q ~P(q | D) i = argmax j E[r |q]
  32. 32. 32©MapR Technologies - Confidential Nearly Forgotten until Recently  Citations for Thompson sampling
  33. 33. 33©MapR Technologies - Confidential Bayesian Bandit for the Search  Compute distributions based on data so far  Sample scores s1, s2 … – based on actual score – plus per doc noise from these distributions  Rank docs by si  Lemma 1: The probability of showing doc i at first position will match the probability it is the best  Lemma 2: This is as good as it gets
  34. 34. 34©MapR Technologies - Confidential And it works! 11000 100 200 300 400 500 600 700 800 900 1000 0.12 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 n regret ε- greedy, ε = 0.05 Bayesian Bandit with Gamma- Normal
  35. 35. 35©MapR Technologies - Confidential Yeah, But …  Isn’t recommendations complicated?  How can I implement this?
  36. 36. 36©MapR Technologies - Confidential Recommendation Basics  History: User Thing 1 3 2 4 3 4 2 3 3 2 1 1 2 1
  37. 37. 37©MapR Technologies - Confidential Recommendation Basics  History as matrix:  (t1, t3) cooccur 2 times,  (t1, t4) once,  (t2, t4) once,  (t3, t4) once t1 t2 t3 t4 u1 1 0 1 0 u2 1 0 1 1 u3 0 1 0 1
  38. 38. 38©MapR Technologies - Confidential A Quick Simplification  Users who do h  Also do r Ah AT Ah( ) AT A( )h User-centric recommendations Item-centric recommendations
  39. 39. 39©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  40. 40. 40©MapR Technologies - Confidential Problems with Raw Cooccurrence  Very popular items co-occur with everything – Welcome document – Elevator music  That isn’t interesting – We want anomalous cooccurrence
  41. 41. 41©MapR Technologies - Confidential Recommendation Basics  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2 t3 not t3 t1 2 1 not t1 1 1
  42. 42. 42©MapR Technologies - Confidential Spot the Anomaly  Root LLR is roughly like standard deviations A not A B 13 1000 not B 1000 100,000 A not A B 1 0 not B 0 2 A not A B 1 0 not B 0 10,000 A not A B 10 0 not B 0 100,000 0.44 0.98 2.26 7.15
  43. 43. 43©MapR Technologies - Confidential Root LLR Details  In R entropy = function(k) { -sum(k*log((k==0)+(k/sum(k)))) } rootLLr = function(k) { sign = … sign * sqrt( (entropy(rowSums(k))+entropy(colSums(k)) - entropy(k))/2) }  Like sqrt(mutual information * N/2) See http://bit.ly/16DvLVK
  44. 44. 44©MapR Technologies - Confidential Threshold by Score  Coocurrence t1 t2 t3 t4 t1 2 0 2 1 t2 0 1 0 1 t3 2 0 1 1 t4 1 1 1 2
  45. 45. 45©MapR Technologies - Confidential Threshold by Score  Significant cooccurrence => Indicators t1 t2 t3 t4 t1 1 0 0 1 t2 0 1 0 1 t3 0 0 1 1 t4 1 0 0 1
  46. 46. 46©MapR Technologies - Confidential Yeah, But …  Why go to all this trouble?  Does it really help?
  47. 47. 47©MapR Technologies - Confidential Real-life example
  48. 48. 48©MapR Technologies - Confidential The Real Life Issues  Exploration  Diversity  Speed  Not the last percent
  49. 49. 49©MapR Technologies - Confidential The Second Page
  50. 50. 50©MapR Technologies - Confidential Make it Worse to Make It Better  Add noise to rank 1 2 8 7 6 3 5 4 10 13 21 18 12 9 14 24 34 28 32 17 11 27 40 30 41 49 16 15 35 23 19 22 26 31 20 43 25 29 33 62 38 60 74 53 36 37 39 70 45 44 46 71 42 69 47 63 52 57 51 48  Results are worse today  But better tomorrow
  51. 51. 51©MapR Technologies - Confidential Anti-Flood  200 of the same result is no better than 2  The recommender list is a portfolio of results – If probability of success is highly correlated, then probability of at least one success is much lower  Suppressing items similar to higher ranking items helps
  52. 52. 52©MapR Technologies - Confidential The Punchline  Hybrid systems really can work today  Middle tiers aren’t as interesting as they used to be – No need for Flume … queue directly in big data system – No need for external queues, tail the data directly with Storm – No need for query systems for presentation data … read it directly with node  Absolutely require common frameworks and standard interfaces  You can do this today!
  53. 53. 53©MapR Technologies - Confidential  Contact: tdunning@maprtech.com @ted_dunning  Slides and such http://slideshare.net/tdunning  Hash tags: #mapr #gotoams #d3 #node
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×