Introduction to Mahout with HDInsight

In this session, we will introduce Mahout, a machine learning library that has multiple algorithms implemented on top of Hadoop and HDInsight. We will start by introducing the foundational concepts needed to understand clustering, classification and collaborative filtering before demonstrating what it takes to get started with Mahout. In addition to learning how to get Mahout set up, you will learn what it takes to process and prepare data, how to execute an “embarrassingly parallel” batch recommendation job, and how to integrate the results back into your existing ecosystem.

Published in: Technology, Education

  • Data Shift – changes in data collection or UI can create artificial shifting
  • Introduction to Mahout with HDInsight

    1. MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
       Introduction to Mahout with HDInsight (Hadoop)
       Chris Price, Senior BI Consultant, @BluewaterSQL
    2. Intro
       • Chris Price, Senior BI Consultant with Pragmatic Works
       • Author, Regular Speaker, Data Geek & Super Dad!
       • @BluewaterSQL, http://bluewatersql.wordpress.com/, cprice@pragmaticworks.com
    3. Survey
       • Who's currently using Machine Learning?
       • Google, Facebook, LinkedIn, Twitter, Amazon, Wal-Mart
    4. Outline
       • Mahout Introduction
       • The Algorithms
       • Hands On: A recommendation engine
    5. Riding the Elephant
       • Born out of the Apache Lucene project
       • Top-level Apache project
       • A scalable machine learning library
       • Fast, Efficient & Pragmatic
       • Many of the algorithms can be run on Hadoop
    6. Algorithms
       • Collaborative Filtering: Item/User Recommenders
       • Clustering: Grouping movies by type
       • Classification: Categorizing documents
       • Frequent Itemset: Market basket analysis
    7. Collaborative Filtering
       • Find the subset of users who have similar tastes/preferences to the target user and use this subset for recommendations
       • Types: User-Based, Item-Based
       • Examples: Amazon
    8. Clustering
       • Group similar objects
       • Examples: News Aggregator, Customer Grouping
    9. Clustering
       • Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
       • Similarity Distance: Euclidean, Squared Euclidean, Cosine, Tanimoto, Manhattan
       • ** Also weighted measures
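The distance measures named on the clustering slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical (they are not Mahout class names), inputs are equal-length numeric vectors, and the cosine and Tanimoto versions assume non-zero vectors.

```python
import math

def squared_euclidean(a, b):
    # Sum of squared per-dimension differences.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: square root of the squared distance.
    return math.sqrt(squared_euclidean(a, b))

def manhattan(a, b):
    # Sum of absolute per-dimension differences ("city block" distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b / (|a|^2 + |b|^2 - a.b)); assumes the vectors are not both zero.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Euclidean and Manhattan distances are sensitive to vector magnitude, while cosine distance only looks at direction, which is one reason the choice of measure changes the clusters that come out.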
    10. Clustering
    11. Classification
       • Using a pre-determined set of groups, predict the type of a new object based on its features
       • Classifiable Data:
         • Continuous – quantitative value (e.g. stock price)
         • Categorical – small known set (e.g. colors)
         • Word-Like – large unknown set
         • Text-Like – many word-like values that are unordered
       • Examples: Spam Identification, Photo Facial Recognition
    12. Frequent Itemset
       • Examples: Product Placement, Market Basket Analysis, Query Recommendations
    13. Mahout on HDInsight
       • Installation:
         • Download: http://www.apache.org/dyn/closer.cgi/mahout/
         • Unpack
         • Add to Path (Environment Variable)
    14. Recommendation Engine
       • Define the Business Objective: Metrics, Context
       • Identify Data: Sources, Normalization, Data Shift
       • Which Algorithm?
       • Integration?
    15. Business Objective
       • Diagram labels: Navigational Inefficiency; Cross-Sell; Up-Sell; Increase # of Orders; Increase Items per Order; Increase Average Item Price; Website; Increase Revenue
    16. Handling Context
       • ???
       • January, 20 degrees & snowing…
    17. Data Acquisition
       • Sources of Data for Recommendation:
         • Explicit: Ratings, Feedback, Demographics, Psychographics (Personality/Lifestyle/Attitude), Ephemeral Need (need of the moment)
         • Implicit: Purchase History, Click/Browse History
         • Product/Item: Taxonomy, Attributes, Descriptions
    18. Data Preparation
       • Preparation:
         • Remove Outliers (Z-Score)
         • Remove frequent buyers (Skew)
         • Normalize Data (Unity-Based)
       • Beware of Data Shift
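Two of the preparation steps on this slide, z-score outlier removal and unity-based (min-max) normalization, can be sketched as follows. The function names and the default cut-off of 3 standard deviations are assumptions made for illustration; the slides do not state a threshold.

```python
import statistics

def remove_outliers(values, max_z=3.0):
    # Drop values whose z-score (distance from the mean in standard
    # deviations) exceeds max_z. The threshold is an assumed default.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)  # all values identical: nothing to remove
    return [v for v in values if abs(v - mean) / stdev <= max_z]

def unity_normalize(values):
    # Unity-based normalization: rescale values linearly into [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: constant input
    return [(v - lo) / (hi - lo) for v in values]
```

With small samples a single extreme value inflates the standard deviation it is measured against, so a cut-off of 3 may never trigger; a lower threshold or a robust statistic may be needed in that case.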
    19. Algorithms
       • Collaborative Filtering (Mahout): User-Based, Item-Based
       • Content-Based (Mahout Clustering)
       • Data Mining (SSAS): Association, Clustering
    20. CF Recommendations
       • Neighborhood Formation
       • Similarity Metrics: Pearson Correlation, Euclidean Distance, Spearman Correlation, Cosine, Tanimoto Coefficient, Log-Likelihood
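As one concrete instance of the similarity metrics listed above, here is a minimal sketch of Pearson correlation between two users, computed only over the items both have rated. The dictionary-based data layout and the choice to return 0.0 when the correlation is undefined are assumptions of this sketch, not something taken from the slides.

```python
import math

def pearson(prefs_u, prefs_v):
    # prefs_u / prefs_v map item id -> preference value for one user.
    common = sorted(set(prefs_u) & set(prefs_v))
    n = len(common)
    if n < 2:
        return 0.0  # assumed convention: too little overlap to correlate
    xs = [prefs_u[i] for i in common]
    ys = [prefs_v[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same has no trend
    return cov / math.sqrt(var_x * var_y)
```

The result ranges from -1 (opposite tastes) to 1 (same rating trend). Because only the trend matters, a user who rates everything one point lower than another still correlates perfectly with them.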
    21. CF Pseudo-Code
          for each item i that u has no preference for
            for each user v that has a preference for i
              compute similarity s between u and v
              calculate running average of v's preference for i, weighted by s
          return top ranked (weighted average) i
       • Restrict to Neighborhood
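The pseudo-code on this slide can be turned into a small runnable sketch. Everything here is illustrative rather than Mahout's implementation: the function names, the dictionary layout, the distance-based similarity, and the neighborhood size k are assumptions. The neighborhood restriction is applied by keeping only the k most similar users.

```python
import math

def similarity(a, b):
    # Assumed metric: 1 / (1 + Euclidean distance) over co-rated items.
    common = set(a) & set(b)
    if not common:
        return 0.0
    dist = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

def recommend(user, prefs, k=2, top_n=3):
    # prefs maps user id -> {item id: preference value}.
    sims = {v: similarity(prefs[user], p) for v, p in prefs.items() if v != user}
    # Restrict to the neighborhood: the k users most similar to `user`.
    neighborhood = sorted(sims, key=lambda v: (-sims[v], v))[:k]
    totals, weights = {}, {}
    for v in neighborhood:
        s = sims[v]
        if s <= 0:
            continue
        for item, pref in prefs[v].items():
            if item in prefs[user]:
                continue  # only score items the user has no preference for
            totals[item] = totals.get(item, 0.0) + s * pref
            weights[item] = weights.get(item, 0.0) + s
    # Similarity-weighted average preference per candidate item.
    ranked = [(item, totals[item] / weights[item]) for item in totals]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]
```

Computing all similarities once up front, instead of inside the item loop as in the pseudo-code, gives the same weighted averages with less repeated work.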
    22. Testing
       • Smell Test
       • Built-In (Requires Java Coding):
         • Root Mean Squared Error (RMSE)
         • Average Absolute Difference
       • RandomUtils.useTestSeed()
       • evaluator.evaluate(builder, null, model, 0.7, 1.0)
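The two built-in evaluation scores named on this slide compare predicted preferences against held-out actual preferences. A minimal sketch of both formulas follows; the function names are hypothetical and the inputs are assumed to be equal-length, non-empty lists.

```python
import math

def rmse(predicted, actual):
    # Root mean squared error: penalizes large misses more heavily.
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def average_absolute_difference(predicted, actual):
    # Mean of |predicted - actual|: every miss counts proportionally.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

For both scores lower is better, and 0.0 means every held-out preference was predicted exactly.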
    23. Recommendation Engine Steps
       • 1 – Generate List of ItemIDs
       • 2 – Create Preference Vector
       • 3 – Count Unique Users
       • 4 – Transpose Preference Vectors
       • 5 – Row Similarity
         • Compute Weights
         • Compute Similarities
         • Similarity Matrix
       • 6 – Pre-Partial Multiply, Similarity Matrix
       • 7 – Pre-Partial Multiply, Preferences
       • 8 – Partial Multiply (Steps 6 & 7)
       • 9 – Filter Items
       • 10 – Aggregate & Recommend
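The core idea behind the steps above (build an item-to-item matrix, multiply it by a user's preference vector, filter out items the user already has, then rank) can be sketched in memory. This is a deliberate simplification: raw co-occurrence counts stand in for the row-similarity step, nothing is distributed, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(prefs):
    # Item-to-item matrix: how many users have a preference for both items.
    matrix = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for i, j in permutations(items, 2):
            matrix[i][j] += 1
    return matrix

def recommend(user, prefs, top_n=3):
    matrix = cooccurrence(prefs)
    scores = defaultdict(float)
    # Matrix-vector multiply: each preferred item spreads its preference
    # value to the items it co-occurs with.
    for item, pref in prefs[user].items():
        for other, count in matrix[item].items():
            scores[other] += count * pref
    # Filter items the user already has a preference for, then rank.
    ranked = [(i, s) for i, s in scores.items() if i not in prefs[user]]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]

prefs = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"B": 4.0, "C": 3.0, "D": 5.0},
}
print(recommend("u1", prefs))  # [('C', 11.0), ('D', 3.0)]
```

The pre-partial and partial multiply steps in the slide exist because this multiplication is split by item across many machines; the arithmetic being distributed is the same as in the inner loop here.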
    24. Batch Integration
       • ETL Data to HDFS: SSIS, Map/Reduce
       • Process with Mahout
       • ETL Results: Map/Reduce, Hive/Sqoop
    25. Hands-On Demo
    26. Resources
       • Mahout in Action: Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
       • Hadoop: The Definitive Guide: Tom White
    27. Thank you!
       • @BluewaterSQL, http://bluewatersql.wordpress.com/, cprice@pragmaticworks.com
