
Introduction to Mahout with HDInsight


In this session, we will introduce Mahout, a machine learning library with multiple algorithms implemented on top of Hadoop and HDInsight. We will start with the foundational concepts needed to understand clustering, classification and collaborative filtering before demonstrating what it takes to get started with Mahout. In addition to learning how to set up Mahout, you will learn what it takes to process and prepare data, how to execute an "embarrassingly parallel" batch recommendation job, and how to integrate the results back into your existing ecosystem.

Published in: Technology, Education
  • Slide note on Data Shift: changes in data collection or in the UI can create artificial shifts in the data
  • Introduction to Mahout with HDInsight

    1. MAKING BUSINESS INTELLIGENT www.pragmaticworks.com. Introduction to Mahout with HDInsight (Hadoop). Chris Price, Senior BI Consultant, @BluewaterSQL
    2. Intro: Chris Price, Senior BI Consultant with Pragmatic Works; Author; Regular Speaker; Data Geek & Super Dad! @BluewaterSQL, http://bluewatersql.wordpress.com/, cprice@pragmaticworks.com
    3. Survey: Who's currently using Machine Learning? Google; Facebook; LinkedIn; Twitter; Amazon; Wal-Mart
    4. Outline: Mahout Introduction; The Algorithms; Hands On (a recommendation engine)
    5. Riding the Elephant: Born out of the Apache Lucene project; Top-level Apache project; A scalable machine learning library; Fast, Efficient & Pragmatic; Many of the algorithms can be run on Hadoop
    6. Algorithms: Collaborative Filtering (item/user recommenders); Clustering (grouping movies by type); Classification (categorizing documents); Frequent Itemset (market basket analysis)
    7. Collaborative Filtering: Find the subset of users whose tastes/preferences are similar to the target user's and use this subset for recommendations. Types: User-Based, Item-Based. Examples: Amazon
    8. Clustering: Group similar objects. Examples: News Aggregator, Customer Grouping
    9. Clustering: Algorithms (K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet); Similarity Distance (Euclidean, Squared Euclidean, Cosine, Tanimoto, Manhattan; also weighted measures)
    10. Clustering
    11. Classification: Using a pre-determined set of groups, predict the type of a new object based on its features. Classifiable Data: Continuous (quantitative value, e.g. stock price); Categorical (small known set, e.g. colors); Word-Like (large unknown set); Text-Like (many word-like values that are unordered). Examples: Spam Identification, Photo Facial Recognition
    12. Frequent Itemset: Examples: Product Placement, Market Basket Analysis, Query Recommendations
    13. Mahout on HDInsight: Installation: Download (http://www.apache.org/dyn/closer.cgi/mahout/); Unpack; Add to Path (Environment Variable)
    14. Recommendation Engine: Define the Business Objective (Metrics, Context); Identify Data (Sources, Normalization, Data Shift); Which Algorithm?; Integration?
    15. Business Objective: Navigational Inefficiency; Cross-Sell; Up-Sell; Increase # of Orders; Increase Items per Order; Increase Average Item Price; Website; Increase Revenue
    16. Handling Context: ??? January, 20 degrees & Snowing…
    17. Data Acquisition: Sources of Data for Recommendation: Explicit (Ratings; Feedback; Demographics; Psychographics, i.e. personality/lifestyle/attitude; Ephemeral Need, i.e. a need of the moment); Implicit (Purchase History; Click/Browse History); Product/Item (Taxonomy; Attributes; Descriptions)
    18. Data Preparation: Remove Outliers (Z-Score); Remove frequent buyers (Skew); Normalize Data (Unity-Based); Beware of Data Shift
    19. Algorithms: Collaborative Filtering with Mahout (User-Based, Item-Based); Content-Based (Mahout Clustering); Data Mining with SSAS (Association, Clustering)
    20. CF Recommendations: Neighborhood Formation; Similarity Metrics (Pearson Correlation, Euclidean Distance, Spearman Correlation, Cosine, Tanimoto Coefficient, Log-Likelihood)
    21. CF Pseudo-Code (Restrict to Neighborhood):
          for each item i for which u has no preference
            for each user v that has a preference for i
              compute similarity s between u and v
              add v's preference for i, weighted by s, to a running average
          return the top-ranked items i (by weighted average)
    22. Testing: Smell Test; Built-In, requires Java coding (Root Mean Squared Error (RMSE), Average Absolute Difference). Calls shown: RandomUtils.useTestSeed() and Evaluator.evaluate(builder,null,0.7,1.0)
    23. Recommendation Engine Steps: 1 – Generate List of ItemIDs; 2 – Create Preference Vector; 3 – Count Unique Users; 4 – Transpose Preference Vectors; 5 – Row Similarity (Compute Weights, Compute Similarities, Similarity Matrix); 6 – Pre-Partial Multiply, Similarity Matrix; 7 – Pre-Partial Multiply, Preferences; 8 – Partial Multiply (Steps 6 & 7); 9 – Filter Items; 10 – Aggregate & Recommend
    24. Batch Integration: ETL Data to HDFS (SSIS, Map/Reduce); Process with Mahout; ETL Results (Map/Reduce, Hive/Sqoop)
    25. Hands-On Demo
    26. Resources: Mahout in Action (Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman); Hadoop: The Definitive Guide (Tom White)
    27. Thank you! @BluewaterSQL, http://bluewatersql.wordpress.com/, cprice@pragmaticworks.com
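
The Data Preparation slide (18) names two steps that are easy to show in a few lines: dropping outliers by z-score and unity-based (min-max) normalization. The following is a minimal plain-Python sketch, not Mahout code. The function names, the sample order counts, and the cutoff of 3.0 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def remove_outliers(values, z_max=3.0):
    """Keep values whose z-score magnitude is at most z_max (assumed cutoff)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing can be an outlier.
        return list(values)
    return [v for v in values if abs((v - mu) / sigma) <= z_max]


def unity_normalize(values):
    """Rescale values to the range [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-customer order counts; 250 plays the "frequent buyer".
order_counts = [2, 3, 4, 3, 2, 4, 3, 2, 3, 4, 3, 250]

cleaned = remove_outliers(order_counts)
scaled = unity_normalize(cleaned)

print(cleaned)
print(scaled)
```

Removing the extreme value first matters: if 250 stayed in, min-max scaling would squeeze every other customer into a narrow band near zero.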
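
The CF pseudo-code on slide 21 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Mahout's implementation: cosine similarity is picked from the metrics listed on slide 20, the "neighborhood" is simply every other user with a positive similarity, and the users, items, and ratings are made up.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two {item: rating} dicts; missing items count as 0."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(prefs, u, top_n=3):
    """Return up to top_n (item, estimated preference) pairs for user u."""
    u_prefs = prefs[u]
    all_items = {i for v_prefs in prefs.values() for i in v_prefs}
    scores = {}
    # for each item i for which u has no preference
    for i in sorted(all_items - u_prefs.keys()):
        weighted_sum = 0.0
        weight_total = 0.0
        # for each user v that has a preference for i
        for v, v_prefs in prefs.items():
            if v == u or i not in v_prefs:
                continue
            s = cosine_similarity(u_prefs, v_prefs)
            if s <= 0:
                continue  # outside the (assumed) neighborhood
            weighted_sum += s * v_prefs[i]
            weight_total += s
        if weight_total > 0:
            scores[i] = weighted_sum / weight_total
    # return the top-ranked items by weighted average
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings on a 1-5 scale.
prefs = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 1},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 5},
}

recommendations = recommend(prefs, "alice")
print(recommendations)
```

Because bob's ratings line up with alice's more closely than carol's do, his opinions carry more weight, so item C ranks ahead of item D for alice.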
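
The two built-in metrics named on the Testing slide (22), Root Mean Squared Error and Average Absolute Difference, are straightforward to compute by hand. The sketch below is plain Python rather than Mahout's Java evaluator. It assumes the 0.7 in the slide's evaluate call is the training fraction, and it uses a fixed seed in the spirit of RandomUtils.useTestSeed(). The function names and the sample ratings are hypothetical.

```python
import random
from math import sqrt


def split_preferences(preferences, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed (repeatable runs) and split into train/test."""
    rng = random.Random(seed)
    shuffled = list(preferences)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def average_absolute_difference(actual, predicted):
    """Mean of |actual - predicted| over all held-out preferences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical (user, item, rating) triples.
ratings = [(u, i, float((u * i) % 5 + 1)) for u in range(1, 3) for i in range(1, 6)]
train, test = split_preferences(ratings)

# Hypothetical held-out ratings and the values a recommender estimated for them.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

print(len(train), len(test))
print(rmse(actual, predicted), average_absolute_difference(actual, predicted))
```

Lower is better for both scores. RMSE is never smaller than the average absolute difference on the same data, and a wide gap between the two suggests a few large misses.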