LA HUG 2012 02-07
Hadoop User Group talk in L.A. (2012)

  • Speaker note: with no information, the relative expected payoff would be -0.25 (see the worked arithmetic after these notes). The graph shows 25th, 50th and 75th percentile results for sampled experiments with uniformly random payoff probabilities; convergence toward the optimum happens at close to the optimal sqrt(n) rate. Note the log scale on the number of trials.
  • Speaker note: this shows how quickly the system converges to picking the better bandit when the two payoff probabilities differ only slightly. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. The graph was produced by averaging several thousand runs with the same probabilities.
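The -0.25 figure in the first note comes from a short calculation for two bandits whose payoff probabilities are drawn uniformly at random (my reconstruction of the arithmetic; it is not spelled out on the slide):

```latex
% p_1, p_2 \sim U(0,1) i.i.d.; always playing the better arm versus playing blindly:
\[
\mathbb{E}\bigl[\max(p_1,p_2)\bigr] = \tfrac{2}{3}, \qquad
\mathbb{E}\Bigl[\tfrac{p_1+p_2}{2}\Bigr] = \tfrac{1}{2}, \qquad
\frac{1/2}{2/3} - 1 = -\tfrac{1}{4} = -0.25 .
\]
```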

LA HUG 2012 02-07: Presentation Transcript

  • Beating up on Bayesian Bandits
  • Mahout • Scalable Data Mining for Everybody
  • What is Mahout? • Recommendations (people who x this also x that) • Clustering (segment data into groups) • Classification (learn decision making from examples) • Stuff (LDA, SVD, frequent item-set, math)
  • Classification in Detail • Naive Bayes Family – Hadoop based training • Decision Forests – Hadoop based training • Logistic Regression (aka SGD) – fast on-line (sequential) training – Now with MORE topping!
  • An Example
  • And Another • Spam: From: Dr. Paul Acquah. Dear Sir, Re: Proposal for over-invoice Contract Benevolence. Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor. ... • Not spam: Date: Thu, May 20, 2010 at 10:51 AM. From: George <george@fumble-tech.com>. Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  • Feature Encoding
  • Hashed Encoding
  • Feature Collisions
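The three slides above name the encoding scheme without showing it. A minimal sketch of hashed ("hashing trick") feature encoding, in plain Python rather than Mahout's Java encoders, with made-up feature names; a collision is simply two feature names landing in the same slot:

```python
import hashlib

NUM_FEATURES = 1 << 10  # fixed vector size; collisions get likelier as this shrinks

def bucket(name: str) -> int:
    """Map a feature name to a stable index in [0, NUM_FEATURES)."""
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(h, 16) % NUM_FEATURES

def encode(features: dict[str, float]) -> list[float]:
    """Hash each named feature into a fixed-length vector, summing on collision."""
    v = [0.0] * NUM_FEATURES
    for name, value in features.items():
        v[bucket(name)] += value   # '+=' means colliding features share a slot
    return v

# e.g. token features like the ones in the spam example a few slides back
x = encode({"word=invoice": 1.0, "word=confidential": 1.0, "domain=fumble-tech.com": 1.0})
```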
  • How it Works • We are given “features” – Often binary values in a vector • Algorithm learns weights – Weighted sum of feature * weight is the key • Each weight is a single real value
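To make "weighted sum of feature * weight" concrete, here is a minimal scoring and update sketch in the same style as the encoder above (plain Python, not Mahout's actual SGD classes; `NUM_FEATURES` matches the previous sketch):

```python
import math

NUM_FEATURES = 1 << 10            # must match the encoder above
weights = [0.0] * NUM_FEATURES    # the model: one real-valued weight per hashed slot

def score(x: list[float]) -> float:
    """Weighted sum of feature * weight, squashed to a probability by the logistic link."""
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-s))

def sgd_step(x: list[float], y: int, rate: float = 0.1) -> None:
    """One on-line (sequential) update; roughly the step behind SGD logistic regression,
    minus the regularization and learning-rate annealing a real trainer would add."""
    err = y - score(x)                       # y is the 0/1 label
    for i, xi in enumerate(x):
        if xi != 0.0:
            weights[i] += rate * err * xi
```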
  • A Quick Diversion • You see a coin – What is the probability of heads? – Could it be larger or smaller than that? • I flip the coin and while it is in the air ask again • I catch the coin and ask again • I look at the coin (and you don’t) and ask again • Why does the answer change? – And did it ever have a single value?
  • A First Conclusion • Probability as expressed by humans is subjective and depends on information and experience
  • A Second Conclusion • A single number is a bad way to express uncertain knowledge • A distribution of values might be better
  • I Dunno
  • 5 and 5
  • 2 and 10
  • The Cynic Among Us
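My reading of the "I Dunno", "5 and 5", and "2 and 10" slides (an assumption; the plots themselves are not in the transcript) is that they show distributions over the coin's heads probability: flat when you know nothing, and Beta distributions whose two parameters are the numbers in the titles. A small sketch:

```python
from scipy.stats import beta

# Assumed mapping from slide titles to Beta(a, b) beliefs about P(heads).
beliefs = {
    "I Dunno": beta(1, 1),    # uniform: every probability equally plausible
    "5 and 5": beta(5, 5),    # evenly split evidence: centered on 0.5, still wide
    "2 and 10": beta(2, 10),  # lopsided evidence: mass shifted toward small p
}

for title, dist in beliefs.items():
    lo, hi = dist.interval(0.9)   # 90% credible interval
    print(f"{title:8s} mean={dist.mean():.2f}  90% interval=({lo:.2f}, {hi:.2f})")
```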
  • A Second Diversion
  • Two-armed Bandit
  • Which One to Play? • One may be better than the other • The better machine pays off at some rate • Playing the other will pay off at a lesser rate – Playing the lesser machine has “opportunity cost” • But how do we know which is which? – Explore versus Exploit!
  • Algorithmic Costs • Option 1 – Explicitly code the explore/exploit trade-off • Option 2 – Bayesian Bandit
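For contrast with Option 2, here is a minimal sketch of what "explicitly code the explore/exploit trade-off" might look like; epsilon-greedy is my choice of illustration, the slide doesn't name a specific scheme:

```python
import random

def epsilon_greedy(wins: list[int], pulls: list[int], epsilon: float = 0.1) -> int:
    """Pick a bandit: explore at random with probability epsilon, otherwise exploit."""
    if 0 in pulls or random.random() < epsilon:
        return random.randrange(len(pulls))               # explore (and cover untried arms)
    rates = [w / n for w, n in zip(wins, pulls)]
    return max(range(len(rates)), key=rates.__getitem__)  # exploit the best observed rate
```

The awkward part is picking and scheduling epsilon by hand, which is exactly the knob the Bayesian bandit removes.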
  • Bayesian Bandit • Compute distributions based on data • Sample p1 and p2 from these distributions • Put a coin in bandit 1 if p1 > p2 • Else, put the coin in bandit 2
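A minimal sketch of those four bullets, using a Beta distribution for each bandit's payoff probability (the Beta/binomial choice fits yes/no payoffs; the slide doesn't pin the distribution family down):

```python
import random

class BayesianBandit:
    """Thompson sampling: one Beta(wins, losses) belief per bandit."""

    def __init__(self, n_bandits: int = 2):
        # Beta(1, 1) priors, i.e. "I dunno" for every bandit
        self.wins = [1] * n_bandits
        self.losses = [1] * n_bandits

    def choose(self) -> int:
        # Sample p_i from each bandit's current distribution and
        # put the coin in the bandit whose sample is largest.
        samples = [random.betavariate(w, l) for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, bandit: int, paid_off: bool) -> None:
        # Fold the observed payoff back into that bandit's distribution.
        if paid_off:
            self.wins[bandit] += 1
        else:
            self.losses[bandit] += 1

# Toy run against two hidden payoff rates that differ only slightly
true_p = [0.12, 0.10]
bandit = BayesianBandit()
for _ in range(1000):
    k = bandit.choose()
    bandit.update(k, random.random() < true_p[k])
print(bandit.wins, bandit.losses)   # most of the pulls drift toward the better bandit
```

Because the sampled values tighten around the true rates as data accumulates, exploration fades out on its own; that is the unification the next slide points to.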
  • The Basic Idea • We can encode a distribution by sampling • Sampling allows unification of exploration and exploitation • Can be extended to more general response models
  • Deployment with Storm/MapR: [architecture diagram showing Impression Logs, Click Logs, a Targeting Engine, a Conversion Detector, a Model Selector, three Online Models with their Training processes, and a Conversion Dashboard, connected over RPC] All state managed transactionally in the MapR file system.
  • Service Architecture: [the same diagram layered over the platform: MapR lockless storage services, MapR pluggable service management, Storm, and Hadoop]
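One plausible (hypothetical; the slides only name the boxes) way the Model Selector could use the bandit above: treat each live Online Model as an arm and feed conversions back from the Conversion Detector. A sketch reusing the `BayesianBandit` class from the earlier block; the real system runs these as Storm/Hadoop services over MapR rather than in-process functions.

```python
# Hypothetical glue code; `BayesianBandit` is the class sketched a few slides back,
# and the dict-shaped `impression` records are made up for illustration.
models = ["model-a", "model-b", "model-c"]          # the three Online Models
selector = BayesianBandit(n_bandits=len(models))

def handle_impression(impression: dict) -> dict:
    """Model Selector: pick which online model serves this impression."""
    k = selector.choose()
    impression["model"] = models[k]
    return impression                                # handed on to the Targeting Engine

def handle_conversion(impression: dict, converted: bool) -> None:
    """Conversion Detector feedback: fold the outcome back into the bandit."""
    k = models.index(impression["model"])
    selector.update(k, converted)                    # in the real system this state would
                                                     # live in the MapR file system
```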
  • Find Out More • Me: tdunning@mapr.com, ted.dunning@gmail.com, tdunning@apache.org • MapR: http://www.mapr.com • Mahout: http://mahout.apache.org • Code: https://github.com/tdunning