
Machine Learning in Big Data

Published in: Technology

  1. Machine Learning in Big Data - Look forward or be left behind. V. William Porto, Hadoop Summit Dublin 2016
  2. Overview of RedPoint Global • Launched in 2006 • Founded and staffed by industry veterans • Headquarters: Wellesley, Massachusetts • Offices in US, UK, Australia, Philippines • Global customer base • Serves most major industries. © RedPoint Global Inc. 2016 Confidential
  3. Overview of RedPoint Global • MAGIC QUADRANT: Data Quality • MAGIC QUADRANT: Integrated Marketing Management • MAGIC QUADRANT: Multichannel Campaign Management • MAGIC QUADRANT: Digital Marketing Hubs • FORRESTER WAVE™: Cross-channel Campaign Management • FORRESTER WAVE™: Data Quality Solutions
  4. With apologies to Gary Larson: Hadoop
  5. Machine Learning – why bother? “If you have always done it that way, it is probably wrong” - Charles Kettering
  6. Machine Learning – keeping ahead of the curve • Three basic tenets for success in today’s world • Prediction - you need to learn and use what you’ve learned • Optimization - the world is a dynamic place • Automation - because people don’t scale well
  7. Machine Learning – what really is it all about? • Learning vs. instruction • Humans learn instinctively – computers not so much • Intelligent Systems • Memory • Prediction (modeling) • Assessment • Feedback • Adaptation
  8. Data Modeling – what, why, how • Regression – what happened in the past • Prediction – what will happen in the future • “Prediction is very difficult – especially if it’s about the future” - Niels Bohr
  9. Data Modeling – what, why, how: the wide world of data modeling • Supervised models • you have historical data and known correlated outputs (truth) • Unsupervised models • historical data, but may not have (or trust) associated outputs
  10. Decision Trees. Major assumption: the world is discrete • fast, easy to understand, no linearity assumptions • ‘human time’ required, unbalanced and/or large trees
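To make the "discrete world" assumption concrete, here is a minimal sketch of the smallest possible decision tree, a one-split "stump" on a single numeric feature. The function name `fit_stump` and the toy values (days since last purchase, retained yes/no) are invented for illustration and are not taken from the talk.

```python
def fit_stump(xs, labels):
    """Find the single threshold rule with the fewest training errors.

    The rule is: predict `label_above` when x > threshold, otherwise the
    opposite label. Returns (threshold, errors, label_above), or None when
    there are fewer than two distinct feature values to split between.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds sit midway between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    for threshold in candidates:
        for label_above in (0, 1):
            errors = sum(
                (label_above if x > threshold else 1 - label_above) != y
                for x, y in zip(xs, labels)
            )
            if best is None or errors < best[1]:
                best = (threshold, errors, label_above)
    return best


# Hypothetical toy data: days since last purchase, and 1 = retained.
days = [5, 10, 20, 40, 60, 80]
retained = [1, 1, 1, 0, 0, 0]
stump = fit_stump(days, retained)
print(stump)
```

The resulting rule reads directly as a sentence ("customers idle for more than 30 days are predicted to leave"), which is the "easy to understand" property the slide mentions; a full tree simply repeats this split recursively.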
  11. Standard Linear Models. Assumption: the world is linear • the real world really isn’t linear • not all errors are equal • easy to get misleading results • Which line is best?
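One way to see why "which line is best?" has no single answer is to fit the same points twice with different error weights. The sketch below uses weighted least squares as a stand-in; the talk does not say which remedy the speaker had in mind, and the data points and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys, weights=None):
    """Weighted least-squares fit of y = slope * x + intercept.

    With weights=None every error counts equally (ordinary least squares).
    """
    if weights is None:
        weights = [1.0] * len(xs)
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: three points on y = x plus one suspicious reading.
xs = [0, 1, 2, 3]
ys = [0, 1, 2, 10]

equal_errors = fit_line(xs, ys)                   # the outlier drags the line upward
distrust_last = fit_line(xs, ys, [1, 1, 1, 0])    # the outlier's error is ignored
print(equal_errors, distrust_last)
```

Both lines are "best" under their own definition of error; the choice of weights is a modeling decision, not something the algorithm can make for you.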
  12. Generalized ‘Non-Linear’ Models. Assumptions • underlying functional mapping is known • all errors are equal • data is ‘well-conditioned’ • ‘standard’ error distribution. Model forms • Polynomials • Exponentials (e.g., Gaussian, Poisson) • Piece-wise linear
  13. Non-Linear Models. Assumption: data is representative • ‘universal’ modeling tools • fast execution • no linearity assumptions • lots of parameters, many techniques • difficult to explain (Figure: Artificial Neural Network)
  14. User Story: Predict Retention / Attrition. Historical Behavioral Data
      Columns: Customer Rating; Retention; Customer Name; Loyalty Member; Days Since Last Purchase; Immediate Relatives; Household Children; Customer ID; Latest Purchase Price; Latest Purchase Item ID; Region Code; Customer Capture Method; Customer Contact Code; Domicile
      1 1 Allen, Geraldine yes 29 0 2 24160 211.39 B5 MW 2 6 St Louis, MO
      1 1 Anderson, Harry no 48 0 3 19952 26.55 E12 NE 3 New York, NY
      1 1 Andrews, Cynthia yes 63 1 0 13502 77.95 D7 NE 10 6 Hudson, NY
      1 0 Andrews, Thomas Jr no 39 0 0 112050 0 A36 SW Los Angeles, CA
      1 1 Appleton, Mary yes 53 2 3 11769 51.49 C101 NE D Bayside, Queens, NY
      1 0 Ashbury, Jeffrey no 47 1 0 PC 17757 29.99 C62 C64 NE 124 New York, NY
      1 1 Aston, Mrs. yes 18 1 0 PC 17757 29.99 C62 C64 NE 4 New York, NY
      1 1 Barber, Ellen yes 26 0 2 19877 78.85 S 6
      1 1 Barkley, Henry no 80 0 0 27042 30 A23 NE B Yorktown, PA
      1 0 Baumann, David no 0 0 PC 17318 25.99 NE New York, NY
      1 1 Bazzeno, Alice yes 32 0 1 11813 76.95 D15 C 8 34
      1 0 Beattie, Mr. Samuel no 36 0 0 13050 75.29 C6 C A 11 Winnipeg, MN
      1 1 Beckworth, June yes 47 1 1 11751 52.49 D35 NE 5 New York, NY
      1 1 Behr, John no 26 0 0 111369 30 C148 NE 5 New York, NY
      1 1 Biden, Roseanne yes 42 0 0 PC 17757 127.99 C 4
      1 1 Bird, Ellen yes 29 0 0 PC 17483 18.95 C97 S 8
      1 0 Birnbaum, Jason no 25 0 0 13905 26 C 148 San Francisco, CA
  15. User Story: Predict Customer Retention / Attrition. Machine Learning Processing Chain - Training
  16. User Story: Predict Retention / Attrition. Machine Learning Processing Chain - Prediction • Reward predicted ‘retainees’ with targeted product offerings • Give potential attrition customers special incentives to stay with the business
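  17. User Story: Accurate vs. Useful Prediction. Sparse data + Least-Squares (Linear) Classifier • Task: predict chance of purchasing a sundry item • Result: ‘best’ model always predicts “none” • Analysis: LS algorithm assumes all errors are equal
      Bread | Cake & Pie | Chocolate | Coffee | Cookie | Diesel | Juice & Smoothies | Lubricants | Milk | Other Bakery | Premium | Sandwich | Snack | Tea | Total Transaction | Total Revenue
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3000
      0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 2000
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1800
      0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 4800
      0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 100
      0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1828
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 16460
      0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1000
      0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1500
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 4600
      0 | 0 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 19381.5
      0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1860
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 3000
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 9838.82
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 22 | 11000
      0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 18225
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 500
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 800
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 7990
      0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 3820
      0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 43230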
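The "accurate vs. useful" point can be reproduced without fitting anything: on sparse data, a model that always answers "none" scores well on accuracy and is still worthless. The sketch below is only an illustration of that outcome, not the least-squares classifier from the talk; the 1-in-20 purchase rate and the helper names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual purchases (label 1) that the model caught."""
    caught = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(caught) / len(caught) if caught else 0.0


# Hypothetical sparse outcome: 1 purchase in 20 transactions.
y_true = [0] * 19 + [1]
always_none = [0] * 20   # the 'best' model under an all-errors-equal criterion

acc = accuracy(y_true, always_none)
rec = recall(y_true, always_none)
print(acc, rec)
```

The model is 95% accurate and finds none of the purchases, which is why the slide's analysis points at the equal-error assumption: a missed purchase has to cost more than a false alarm before a useful model can win.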
  18. Clustering/Segmentation – group think • Collaborative Filtering • Relationship Matrix
  19. Personalization – not really !=
  20. Clustering/Segmentation: Similarity?
      Customer | Browser | Gender | Age Sector | Income Sector | Married | Children | Homeowner | Recent Baby Clothes Purchase
      George   | IE9     | M      | 0          | A             | N       | 0        | 1         | N
      Carol    | Chrome  | F      | 1          | B             | Y       | 1        | 0         | Y
      Mary     | IE9     | F      | 0          | A             | N       | 1        | 0         | Y
      Dist(George,Carol) = 8, Dist(George,Mary) = 4, Dist(Carol,Mary) = 4
      Can you afford to target (George,Mary) the same way as (Carol,Mary)?
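The three distances quoted on the slide are consistent with simply counting the attributes on which two customers differ (a Hamming distance). The slide does not name the metric, so treat that reading as an assumption; the sketch below copies the three rows from the slide and recomputes the distances under it.

```python
# Attribute tuples in the slide's column order:
# Browser, Gender, Age Sector, Income Sector, Married, Children,
# Homeowner, Recent Baby Clothes Purchase.
CUSTOMERS = {
    "George": ("IE9", "M", 0, "A", "N", 0, 1, "N"),
    "Carol": ("Chrome", "F", 1, "B", "Y", 1, 0, "Y"),
    "Mary": ("IE9", "F", 0, "A", "N", 1, 0, "Y"),
}


def hamming_distance(a, b):
    """Count the positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


for first, second in [("George", "Carol"), ("George", "Mary"), ("Carol", "Mary")]:
    d = hamming_distance(CUSTOMERS[first], CUSTOMERS[second])
    print(f"Dist({first},{second}) = {d}")
```

Mary is equally far from George and from Carol, yet she differs from each on a completely different set of attributes. A single distance number hides that, which is the slide's warning about targeting both pairs the same way.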
  21. Clustering/Segmentation. Basic question – which one describes the data the best? How many clusters are there? (Figure panels: Raw data; Two Clusters; Four Clusters; Six Clusters)
  22. Clustering/Segmentation with Statistics • relatively simple • data distribution assumptions • initialization dependencies (Figure panels: Raw Data; Ellipsoidal Clustering; K-Means Clustering)
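The "initialization dependencies" bullet is easy to demonstrate with a bare-bones k-means (Lloyd's algorithm) in one dimension. This is a minimal sketch on made-up points, not the implementation behind the slide's plots; `kmeans_1d` and the data are hypothetical.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D data, starting from the given centers.

    Returns the final centers in ascending order.
    """
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Hypothetical data: three tight pairs, but we ask for only two clusters.
points = [0, 1, 10, 11, 20, 21]

from_low = kmeans_1d(points, centers=[0, 1])
from_high = kmeans_1d(points, centers=[20, 21])
print(from_low, from_high)
```

The same data and the same algorithm give two different segmentations depending only on where the centers start: one run merges the middle pair with the upper pair, the other merges it with the lower pair.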
  23. Clustering/Segmentation – data driven • let the data speak for itself • multiple data projection ‘views’ • important boundary relationships (“swing voters”) (Figure: Customer Demographics)
  24. User Story: Clustering / Segmentation • ML Clustering – Training • ML Clustering – Processing New Data
  25. Model Selection – how to choose? • Basic Model Type (prediction or segmentation) • inputs + correlated outputs • inputs only? • Basic Questions: • what to use for my problem? • parameters? • is this the best choice? • could I do better, and how?
  26. Optimization – Evolving better solutions • Simulated Evolution • fast, efficient search • always have a solution • arbitrary ‘evaluation’ functions • can start with existing solution(s) • Variation – alter model type, parameters • Assessment – how well does the model work? • Selection – survival of the fittest
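The variation / assessment / selection cycle can be sketched in a few lines. The version below evolves a single real-valued parameter, whereas the talk evolves model types and parameters; the function names, the population sizes, and the toy scoring function are all assumptions made for illustration.

```python
import random


def evolve(fitness, parents, generations=50, offspring_per_parent=4, step=0.5, seed=0):
    """Minimal simulated-evolution loop over real-valued candidates.

    Parents compete with their offspring, so the best candidate found so far
    is never lost ("always have a solution").
    """
    rng = random.Random(seed)
    population = list(parents)
    for _ in range(generations):
        # Variation: perturb each parent with Gaussian noise.
        children = [
            p + rng.gauss(0.0, step)
            for p in population
            for _ in range(offspring_per_parent)
        ]
        # Assessment and selection: rank everyone, keep the fittest.
        pool = population + children
        pool.sort(key=fitness, reverse=True)
        population = pool[: len(parents)]
    return population[0]


def score(x):
    """Toy evaluation function with a kink at its optimum (x = 3)."""
    return -abs(x - 3.0)


# Start from two existing (poor) solutions.
best = evolve(score, parents=[-10.0, 10.0])
print(best, score(best))
```

Nothing in the loop looks at gradients or the shape of `score`; it only needs a number that says which candidate is better, which is what makes arbitrary evaluation functions possible.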
  27. Evolutionary Optimization – Evaluation Function • can use any measurable data • no continuity assumptions • no differentiability assumptions • no symmetry assumptions (Payoff matrix, Prediction vs. Reality (Truth), over Sunshine / Hurricane: 20, -1000, 5, 50)
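The sunshine/hurricane payoffs on the slide (20, -1000, 5, 50) make a good example of an asymmetric evaluation function. The transcript does not preserve which cell is which, so the layout below is an assumption: predicting sunshine when a hurricane arrives is taken to be the -1000 case. The ten-day weather sequence is also made up.

```python
# ASSUMED layout, keyed as (prediction, reality). The slide's flattened
# text does not say which axis is which; this is one plausible reading.
PAYOFF = {
    ("sunshine", "sunshine"): 20,
    ("sunshine", "hurricane"): -1000,
    ("hurricane", "sunshine"): 5,
    ("hurricane", "hurricane"): 50,
}


def evaluate(predictions, reality):
    """Total payoff of a sequence of predictions against what happened."""
    return sum(PAYOFF[(p, r)] for p, r in zip(predictions, reality))


# Hypothetical ten days: nine sunny, one hurricane.
reality = ["sunshine"] * 9 + ["hurricane"]
optimist = ["sunshine"] * 10    # right 9 times out of 10
cautious = ["hurricane"] * 10   # right 1 time out of 10

print(evaluate(optimist, reality), evaluate(cautious, reality))
```

Under this payoff table the forecaster who is right 90% of the time scores -820, while the one who is right 10% of the time scores 95. An evaluation function like this has no symmetry and no gradient, which is why the slide pairs it with evolutionary search.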
  28. User Story: Optimizing Classification Models. Task: Predict Retention/Attrition • 17 potential input features (customer demographics) • 2 outputs (retention/attrition) • 1300 training samples (50–50, A/B split) • 1300 test samples (naïve test data) (Chart: Model Performance Optimization, performance by generation. Classification Accuracy: 62.0, 70.2, 72.3, 73.4, 75.2. Test Set Error (RMS): 34.8, 28.8, 24.5, 22.1, 20.9)
  29. Use Case – Fully Adaptive Feedback (Next Best Offer) • DB: historical user behavior (stimulus/response) • Train / update model • Non-adaptive (fixed) mode: randomized A/B/C offer selection • Adaptive ML mode: ML prediction offer selection • Operation (trigger) • Ad / offer (stimulus) • Feedback cycle
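The difference between the fixed A/B/C mode and the adaptive mode can be sketched as a small feedback loop. The talk does not say how its adaptive selection works; the sketch below swaps in an epsilon-greedy rule (mostly show the offer with the best observed acceptance rate, occasionally try a random one). The class name, the `explore` parameter, and the simulated acceptance rates are all hypothetical.

```python
import random


class OfferSelector:
    """Epsilon-greedy offer selection driven by observed responses."""

    def __init__(self, offers, explore=0.1, seed=0):
        self.rng = random.Random(seed)
        self.explore = explore
        self.shown = {offer: 0 for offer in offers}
        self.accepted = {offer: 0 for offer in offers}

    def rate(self, offer):
        """Observed acceptance rate so far (0.0 if never shown)."""
        shown = self.shown[offer]
        return self.accepted[offer] / shown if shown else 0.0

    def choose(self):
        offers = list(self.shown)
        untried = [o for o in offers if self.shown[o] == 0]
        if untried:
            return untried[0]               # show every offer at least once
        if self.rng.random() < self.explore:
            return self.rng.choice(offers)  # randomized, like the fixed mode
        return max(offers, key=self.rate)   # adaptive: best rate so far

    def feedback(self, offer, accepted):
        """Close the loop: record the stimulus and the response."""
        self.shown[offer] += 1
        self.accepted[offer] += int(accepted)


# Simulated customers with made-up true acceptance rates per offer.
true_rates = {"A": 0.05, "B": 0.20, "C": 0.10}
world = random.Random(1)
selector = OfferSelector(true_rates)
for _ in range(1000):
    offer = selector.choose()
    selector.feedback(offer, world.random() < true_rates[offer])
print(selector.shown)
```

Setting `explore=1.0` reproduces the non-adaptive randomized mode once each offer has been shown, and lowering it shifts traffic toward whatever the feedback says is working.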
  30. Five Keys to Successful Machine Learning • Let the data speak for itself – don’t force fit your models • Remember, not all errors are equal – use this to your advantage • True learning requires continual adaptation! • Automate the process with feedback – remove the “man-in-the-loop” • Trust the optimization process – it really works!
  31. Q&A. Contact Info • Visit: www.redpoint.net • Bill Porto, Sr. Engineering Analyst, RedPoint Global Inc. • vwporto@redpoint.net • Want more information about this topic? Fill out your card or go to redpoint.net/hadoopeurope
