Boston HUG (Hadoop User Group)

A talk about why and how machine learning works with Hadoop, including recent developments for real-time operation.


Speaker notes:
  • The value of analytics always increases with more data, but the rate of increase drops dramatically after an initial quick increase.
  • In classical analytics, the cost of doing analytics increases sharply as data grows.
  • The result is a net value that has a sharp optimum in the area where value is increasing rapidly and cost is not yet increasing so rapidly.
  • New techniques such as Hadoop result in linear scaling of cost. This is a change in shape and it causes a qualitative change in the way that costs trade off against value to give net value. As technology improves, the slope of this cost line is also changing rapidly over time.
  • This next sequence shows how the net value changes with different slope linear cost models.
  • Notice how the best net value has jumped up significantly.
  • And as the line approaches horizontal, the highest net value occurs at dramatically larger data scale.
  • No information would give a relative expected payoff of -0.25. This graph shows the 25th, 50th, and 75th percentile results for sampled experiments with uniform random probabilities. Convergence to the optimum is close to the best achievable rate, which goes as sqrt(n). Note the log scale on the number of trials.
  • Here is how the system converges in terms of how likely it is to pick the better bandit with probabilities that are only slightly different. After 1000 trials, the system is already giving 75% of the bandwidth to the better option. This graph was produced by averaging several thousand runs with the same probabilities.

    1. Machine Learning with Hadoop
    2. Agenda
       • Why Big Data? Why now?
       • What can you do with big data?
       • How does it work?
    3. Slow Motion Explosion
    4. Why Now?
       • But Moore’s law has applied for a long time
       • Why is Hadoop/Big Data exploding now?
       • Why not 10 years ago?
       • Why not 20?
    5. Size Matters, but …
       • If it were just availability of data then existing big companies would adopt big data technology first
    6. Size Matters, but …
       • If it were just availability of data then existing big companies would adopt big data technology first
       They didn’t
    7. Or Maybe Cost
       • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
    8. Or Maybe Cost
       • If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
       They didn’t
    9. Backwards adoption
       • Under almost any threshold argument startups would not adopt big data technology first
    10. Backwards adoption
       • Under almost any threshold argument startups would not adopt big data technology first
       They did
    11. Everywhere at Once?
       • Something very strange is happening
         – Big data is being applied at many different scales
         – At many value scales
         – By large companies and small
    12. Everywhere at Once?
       • Something very strange is happening
         – Big data is being applied at many different scales
         – At many value scales
         – By large companies and small
       Why?
    13. Analytics Scaling Laws
       • Analytics scaling is all about the 80-20 rule
         – Big gains for little initial effort
         – Rapidly diminishing returns
       • The key to net value is how costs scale
         – Old school – exponential scaling
         – Big data – linear scaling, low constant
       • Cost/performance has changed radically
         – IF you can use many commodity boxes
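    The scaling argument on slides 13–22 is easy to check numerically. The sketch below is not from the deck; the specific value and cost functions are illustrative assumptions chosen only to have the shapes the slides describe: diminishing-returns value, exponentially scaling classical cost, and low-slope linear big-data cost.

        import math

        def value(scale):
            # Diminishing returns: big gains for little initial effort (the 80-20 rule).
            return 1.0 - math.exp(-scale / 300.0)

        def classical_cost(scale):
            # Old school: cost grows exponentially with scale.
            return 0.01 * math.exp(scale / 250.0)

        def bigdata_cost(scale, slope=0.0004):
            # Big data: linear scaling with a low constant; improving technology
            # keeps flattening this slope.
            return slope * scale

        for scale in (100, 250, 500, 1000, 2000):
            print(scale,
                  round(value(scale) - classical_cost(scale), 3),   # classical net value
                  round(value(scale) - bigdata_cost(scale), 3))     # big-data net value

    With these made-up constants, the classical net value peaks early and then collapses, while the linear-cost net value stays positive out to much larger scale; flattening the slope pushes the optimum out further still, which is the tipping point the following slides illustrate.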
    14. [Quadrant chart of reactions to findings: “You’re kidding, people do that?”, “We didn’t know that!”, “We should have known that”, “We knew that”]
    15. [Chart: value vs. scale, with effort levels labeled “Anybody with eyes”, “Intern with a spreadsheet”, “In-house analytics”, “Industry-wide data consortium”, “NSA, non-proliferation”]
    16. [Chart: value vs. scale] Net value optimum has a sharp peak well before maximum effort
    17. But scaling laws are changing both slope and shape
    18. [Chart: value vs. scale] More than just a little
    19. [Chart: value vs. scale] They are changing a LOT!
    20. [Chart: value vs. scale]
    21. [Chart: value vs. scale]
    22. [Chart: value vs. scale] A tipping point is reached and things change radically … Initially, linear cost scaling actually makes things worse
    23. Pre-requisites for Tipping
       • To reach the tipping point,
       • Algorithms must scale out horizontally
         – On commodity hardware
         – That can and will fail
       • Data practice must change
         – Denormalized is the new black
         – Flexible data dictionaries are the rule
         – Structured data becomes rare
    24. So that is why and why now
    25. So that is why, and why now. What can you do with it? And how?
    26. Agenda
       • Mahout outline
         – Recommendations
         – Clustering
         – Classification
       • Hybrid Parallel/Sequential Systems
       • Real-time learning
    27. Agenda
       • Mahout outline
         – Recommendations
         – Clustering
         – Classification
           • Supervised on-line learning
           • Feature hashing
       • Hybrid Parallel/Sequential Systems
       • Real-time learning
    28. Classification in Detail
       • Naive Bayes Family
         – Hadoop-based training
       • Decision Forests
         – Hadoop-based training
       • Logistic Regression (aka SGD)
         – Fast on-line (sequential) training
    29. Classification in Detail
       • Naive Bayes Family
         – Hadoop-based training
       • Decision Forests
         – Hadoop-based training
       • Logistic Regression (aka SGD)
         – Fast on-line (sequential) training
    30. Classification in Detail
       • Naive Bayes Family
         – Hadoop-based training
       • Decision Forests
         – Hadoop-based training
       • Logistic Regression (aka SGD)
         – Fast on-line (sequential) training
         – Now with MORE topping!
    31. How it Works
       • We are given “features”
         – Often binary values in a vector
       • Algorithm learns weights
         – Weighted sum of feature * weight is the key
       • Each weight is a single real value
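    Mahout’s SGD classifier itself is Java (org.apache.mahout.classifier.sgd.OnlineLogisticRegression), but the mechanism on this slide fits in a few lines. This is a minimal sketch of the idea in Python, not Mahout code:

        import math

        def score(weights, active_features):
            # Weighted sum over the binary features that are "on",
            # squashed into a probability by the logistic function.
            raw = sum(weights[i] for i in active_features)
            return 1.0 / (1.0 + math.exp(-raw))

        def sgd_step(weights, active_features, label, learning_rate=0.1):
            # One on-line update: each weight is a single real value that is
            # nudged to reduce the error on this one example.
            error = label - score(weights, active_features)
            for i in active_features:
                weights[i] += learning_rate * error

        weights = [0.0] * 100
        sgd_step(weights, active_features=[3, 17, 42], label=1)  # e.g. one spam example

    Because each step touches only the active features, training is fast and strictly sequential, which is why SGD is the on-line member of the family on slides 28–30.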
    32. An Example
    33. Features
       Two example messages:

       From: Dr. Paul May Acquah
       Date: Thu, May 20, 2010 at 10:51 AM
       Dear Sir,
       Re: Proposal for over-invoice Contract Benevolence
       Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign companys bank account for our favor.
       ...

       From: George <george@fumble-tech.com>
       Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
    34. But …
       • Text and words aren’t suitable features
       • We need a numerical vector
       • So we use binary vectors with lots of slots
    35. Feature Encoding
    36. Hashed Encoding
    37. Feature Collisions
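    A minimal sketch of hashed encoding and why collisions happen. This is not Mahout’s FeatureVectorEncoder; the slot count is deliberately tiny to force collisions:

        NUM_SLOTS = 16  # real encoders use e.g. 2**20 slots

        def stable_hash(token):
            # Python's built-in hash() is randomized per process, so a real
            # encoder would use a stable hash such as MurmurHash; this toy
            # version is just a deterministic stand-in.
            h = 0
            for ch in token:
                h = (h * 31 + ord(ch)) & 0xFFFFFFFF
            return h

        def encode(tokens, num_slots=NUM_SLOTS):
            # Binary vector with lots of slots: each token turns on the slot
            # its hash selects; distinct tokens can collide on one slot.
            vector = [0] * num_slots
            for token in tokens:
                vector[stable_hash(token) % num_slots] = 1
            return vector

        print(encode("confidential business deal for our mutual benefit".split()))

    With a realistically large number of slots, collisions are rare and the learner simply adjusts the shared weight, which is why the feature collisions of slide 37 are tolerable in practice.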
    38. Training Data
    39. Training Data
    40. Training Data
       [Diagram: raw data is parsed into tokens and encoded into vectors for the training algorithm; joining, combining, and transforming yield training examples with target values]
    41. Full Scale Training
       [Diagram: input goes through map-reduce feature extraction, join, and down-sampling; the resulting data feeds sequential SGD learning; side-data now arrives via NFS]
    42. Hybrid Model Development
       [Diagram: on the big-data cluster, logs are grouped by user into user sessions and transaction patterns are counted to produce training data on a shared filesystem; on the legacy modeling side, that training data is merged with account info and fed to PROC LOGISTIC to build the model]
    43. Enter the Pig Vector
       • Pig UDFs for
         – Vector encoding:
           define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');
         – Model training:
           vectors = foreach docs generate newsgroup, encodeVector(*) as v;
           grouped = group vectors all;
           model = foreach grouped generate 1 as key, train(vectors) as model;
    44. Real-time Developments
       • Storm + Hadoop + MapR
         – Real-time with Storm
         – Long-term with Hadoop
         – State checkpoints with MapR
       • Add the Bayesian Bandit for on-line learning
    45. Aggregate Splicing
       [Diagram: Storm handles the present; Hadoop handles the past]
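    A toy illustration of the splicing idea, with all names and numbers hypothetical: Hadoop periodically rebuilds complete aggregates for closed time windows, Storm keeps aggregates for the still-open window, and a query adds the two.

        # From the last Hadoop batch run: the past.
        batch_counts = {"key-a": 1040, "key-b": 312}
        # Maintained by Storm since that run: the present.
        realtime_counts = {"key-a": 17, "key-c": 4}

        def spliced_count(key):
            # Splice the batch aggregate with the real-time delta.
            return batch_counts.get(key, 0) + realtime_counts.get(key, 0)

        print(spliced_count("key-a"))  # 1057 = past + present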
    46. Mobile Network Monitor
       [Diagram: transaction data flows from geo-dispersed ingest servers through batch aggregation into HBase, which serves a retro-analysis interface and a real-time dashboard and alerts]
    47. A Quick Diversion
       • You see a coin
         – What is the probability of heads?
         – Could it be larger or smaller than that?
       • I flip the coin and while it is in the air ask again
       • I catch the coin and ask again
       • I look at the coin (and you don’t) and ask again
       • Why does the answer change?
         – And did it ever have a single value?
    48. A First Conclusion
       • Probability as expressed by humans is subjective and depends on information and experience
    49. A Second Conclusion
       • A single number is a bad way to express uncertain knowledge
       • A distribution of values might be better
    50. I Dunno
    51. 5 and 5
    52. 2 and 10
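    The three slide titles above presumably refer to coin-flip counts: no data yet, then 5 heads and 5 tails, then 2 heads and 10 tails. Assuming the slides show Beta distributions (the standard way to express such beliefs), a small sketch:

        import random

        def sample_belief(heads, tails):
            # Beta(heads+1, tails+1) is the belief over the heads probability,
            # starting from a uniform "I dunno" prior of Beta(1, 1).
            return random.betavariate(heads + 1, tails + 1)

        print(sample_belief(0, 0))   # I dunno: anywhere in [0, 1]
        print(sample_belief(5, 5))   # centered on 0.5, still fairly wide
        print(sample_belief(2, 10))  # concentrated near 0.2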
    53. Bayesian Bandit
       • Compute distributions based on data
       • Sample p1 and p2 from these distributions
       • Put a coin in bandit 1 if p1 > p2
       • Else, put the coin in bandit 2
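    The four bullets above are Thompson sampling. A compact sketch under assumed payoff probabilities (0.12 vs. 0.11, echoing the speaker note about bandits that differ only slightly):

        import random

        true_p = [0.12, 0.11]  # unknown to the learner; assumed for the simulation
        wins = [0, 0]
        losses = [0, 0]

        for trial in range(1000):
            # Sample a plausible payoff probability for each bandit from its
            # Beta distribution, then play the bandit whose sample is larger.
            samples = [random.betavariate(wins[i] + 1, losses[i] + 1) for i in (0, 1)]
            pick = 0 if samples[0] > samples[1] else 1
            if random.random() < true_p[pick]:
                wins[pick] += 1
            else:
                losses[pick] += 1

        print(wins, losses)  # most coins end up in the slightly better bandit

    Early on, the Beta distributions are wide, so both bandits get played (exploration); as counts accumulate, the distributions narrow and the better bandit wins most samples (exploitation).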
    54. The Basic Idea
       • We can encode a distribution by sampling
       • Sampling allows unification of exploration and exploitation
       • Can be extended to more general response models
    55. Deployment with Storm/MapR
       [Diagram: a targeting engine talks over RPC to a model selector and online models; impression logs, click logs, and a conversion detector feed online model training; all state is managed transactionally in the MapR file system; a conversion dashboard reads from it]
    56. Service Architecture
       [Diagram: the same targeting and online-training pipeline, with Storm run under MapR pluggable service management, model training on Hadoop, and MapR lockless storage services underneath]
    57. Find Out More
       • Me: tdunning@mapr.com, ted.dunning@gmail.com, tdunning@apache.org
       • MapR: http://www.mapr.com
       • Mahout: http://mahout.apache.org
       • Code: https://github.com/tdunning
