Introduction to Mahout with HDInsight

In this session, we will introduce Mahout, a machine learning library that has multiple algorithms implemented on top of Hadoop and HDInsight. We will start with the foundational concepts needed to understand clustering, classification and collaborative filtering, then demonstrate what it takes to get started with Mahout. In addition to learning how to set Mahout up, you will learn what it takes to process and prepare data, how to execute an "embarrassingly parallel" batch recommendation job, and how to integrate the results back into your existing ecosystem.

  • Data Shift – changes in data collection or UI can create artificial shifting

Introduction to Mahout with HDInsight: Presentation Transcript

  • MAKING BUSINESS INTELLIGENT www.pragmaticworks.com
    Introduction to Mahout with HDInsight (Hadoop)
    Chris Price, Senior BI Consultant, @BluewaterSQL
  • Intro
    • Chris Price
    • Senior BI Consultant with Pragmatic Works
    • Author
    • Regular Speaker
    • Data Geek & Super Dad!
    • @BluewaterSQL
    • http://bluewatersql.wordpress.com/
    • cprice@pragmaticworks.com
  • Survey
    • Who's currently using Machine Learning?
      • Google, Facebook, LinkedIn, Twitter, Amazon, Wal-Mart
  • Outline
    • Mahout Introduction
    • The Algorithms
    • Hands On: A recommendation engine
  • Riding the Elephant
    • Born out of the Apache Lucene project
    • Top-level Apache project
    • A scalable machine learning library
    • Fast, Efficient & Pragmatic
    • Many of the algorithms can be run on Hadoop
  • Algorithms
    • Collaborative Filtering: Item/User Recommenders
    • Clustering: Grouping movies by type
    • Classification: Categorizing documents
    • Frequent Itemset: Market basket analysis
  • Collaborative Filtering
    • Find the subset of users who have similar tastes/preferences to the target user and use this subset for recommendations
    • Types: User-Based, Item-Based
    • Examples: Amazon
  • Clustering
    • Group similar objects
    • Examples: News Aggregator, Customer Grouping
  • Clustering
    • Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
    • Similarity Distance: Euclidean, Squared Euclidean, Cosine, Tanimoto, Manhattan
    • ** Also weighted measures
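The distance measures named on this slide can be sketched in a few lines of plain Python. This is an illustrative sketch only, not Mahout's implementation: the function names are made up for this example, vectors are assumed to be equal-length lists of numbers, and the cosine and Tanimoto versions assume non-zero vectors.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same ordering as Euclidean, but skips the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle between a and b); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b / (|a|^2 + |b|^2 - a.b)); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Euclidean and Manhattan are sensitive to vector length, while the cosine measure depends only on direction, which is one reason the choice of measure can change which objects end up in the same cluster.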
  • Clustering
  • Classification
    • Using a pre-determined set of groups, predict the type of a new object based on its features
    • Classifiable Data
      • Continuous: quantitative value (e.g., stock price)
      • Categorical: small known set (e.g., colors)
      • Word-Like: large unknown set
      • Text-Like: many word-like values that are unordered
    • Examples: Spam Identification, Photo Facial Recognition
  • Frequent Itemset
    • Examples: Product Placement, Market Basket Analysis, Query Recommendations
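As a rough illustration of market basket analysis, the sketch below counts how often pairs of items appear in the same basket and keeps the pairs that meet a minimum support count. It is a brute-force pair count written for this example, not the algorithm Mahout uses; the function name, the sample baskets, and the threshold are all assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    # Count each unordered pair of distinct items once per basket.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    # Keep only pairs seen in at least min_support baskets.
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical sample data.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
pairs = frequent_pairs(baskets, min_support=3)
```

Counting every pair grows quadratically with basket size, which is why larger itemsets are normally mined with a dedicated frequent-pattern algorithm instead.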
  • Mahout on HDInsight
    • Installation
      • Download: http://www.apache.org/dyn/closer.cgi/mahout/
      • Unpack
      • Add to Path (Environment Variable)
  • Recommendation Engine
    • Define the Business Objective
      • Metrics
      • Context
    • Identify Data
      • Sources
      • Normalization
      • Data Shift
    • Which Algorithm?
    • Integration?
  • Business Objective
    • Website: Increase Revenue
    • Increase # of Orders, Increase Items per Order, Increase Average Item Price
    • Navigational Inefficiency, Cross-Sell, Up-Sell
  • Handling Context
    • ???
    • January, 20 degrees & snowing…
  • Data Acquisition
    • Sources of Data for Recommendation
      • Explicit
        • Ratings
        • Feedback
        • Demographics
        • Psychographics (Personality/Lifestyle/Attitude)
        • Ephemeral Need (Need for a moment)
      • Implicit
        • Purchase History
        • Click/Browse History
      • Product/Item
        • Taxonomy
        • Attributes
        • Descriptions
  • Data Preparation
    • Remove Outliers (Z-Score)
    • Remove frequent buyers (Skew)
    • Normalize Data (Unity-Based)
    • Beware of Data Shift
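Two of the preparation steps above, z-score outlier removal and unity-based (min-max) normalization, can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the default threshold are illustrative, the z-score uses the population standard deviation, and the input is a plain list of numbers.

```python
import statistics

def remove_outliers(values, max_z=3.0):
    # Drop values whose z-score magnitude exceeds max_z.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mean) / stdev <= max_z]

def unity_normalize(values):
    # Rescale values linearly into the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread, map everything to 0
    return [(v - lo) / (hi - lo) for v in values]
```

Note that in a small sample a single extreme value inflates the standard deviation, so its own z-score stays modest; a threshold of 3 may never trigger, and a lower cutoff may be needed.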
  • Algorithms
    • Collaborative Filtering (Mahout)
      • User-Based
      • Item-Based
    • Content-Based (Mahout Clustering)
    • Data Mining (SSAS)
      • Association
      • Clustering
  • CF Recommendations
    • Neighborhood Formation
    • Similarity Metrics
      • Pearson Correlation
      • Euclidean Distance
      • Spearman Correlation
      • Cosine
      • Tanimoto Coefficient
      • Log-Likelihood
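To make two of these similarity metrics concrete, here is a small sketch of the Pearson correlation (over the items both users have rated) and the Tanimoto coefficient (over the sets of items each user has touched). The data layout, a dict of item to rating per user, and the choice to return 0.0 when the correlation is undefined are assumptions made for this example, not Mahout's behavior.

```python
import math

def pearson(prefs_u, prefs_v):
    # Correlate ratings only on items both users have rated.
    common = sorted(set(prefs_u) & set(prefs_v))
    n = len(common)
    if n < 2:
        return 0.0  # assumption: treat "undefined" as no similarity
    xs = [prefs_u[i] for i in common]
    ys = [prefs_v[i] for i in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def tanimoto(items_u, items_v):
    # Size of the intersection over size of the union; ignores rating values.
    union = len(items_u | items_v)
    return len(items_u & items_v) / union if union else 0.0
```

Pearson needs actual rating values, whereas Tanimoto only needs to know which items a user interacted with, so it suits purchase or click data that has no ratings.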
  • CF Pseudo-Code
    for each item i that u has no preference for
      for each user v that has a preference for i (restricted to u's neighborhood)
        compute similarity s between u and v
        calculate running average of v's preference for i, weighted by s
    return top-ranked items i (by weighted average)
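The pseudo-code above can be turned into a small runnable sketch. Everything here is illustrative rather than Mahout's API: the function names, the Euclidean-based similarity, the nested-dict data layout, and the sample ratings are all assumptions.

```python
import math

def euclidean_similarity(a, b):
    # Similarity in (0, 1] from the distance over co-rated items.
    common = set(a) & set(b)
    if not common:
        return 0.0
    d = math.sqrt(sum((a[i] - b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

def recommend(user, prefs, similarity, neighborhood_size=2, top_n=3):
    # Neighborhood: the most similar other users with positive similarity.
    scored = [(similarity(prefs[user], prefs[v]), v) for v in prefs if v != user]
    neighbors = sorted((sv for sv in scored if sv[0] > 0), reverse=True)
    neighbors = neighbors[:neighborhood_size]

    totals, weights = {}, {}
    for s, v in neighbors:
        for item, rating in prefs[v].items():
            if item in prefs[user]:
                continue  # only score items the user has no preference for
            totals[item] = totals.get(item, 0.0) + s * rating
            weights[item] = weights.get(item, 0.0) + s

    # Similarity-weighted average preference per candidate item.
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:top_n]]

# Hypothetical sample ratings.
prefs = {
    "alice": {"A": 5, "B": 3},
    "bob": {"A": 5, "B": 3, "C": 4},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 5},
}
recs = recommend("alice", prefs, euclidean_similarity, neighborhood_size=1)
```

With a neighborhood of one, only bob's ratings count. Widening the neighborhood to two lets carol's item D outrank C even though carol is a weak match, because a weighted average ignores how many neighbors back an item; this is one reason neighborhood size needs tuning.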
  • Testing
    • Smell Test
    • Built-In (Requires Java Coding)
      • Root Mean Squared Error (RMSE)
      • Average Absolute Difference
    • RandomUtils.useTestSeed()
    • evaluator.evaluate(builder, null, model, 0.7, 1.0)
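The two built-in metrics compare predicted preferences against held-out actual preferences. The sketch below shows the arithmetic only; the function names and sample numbers are made up for this example, and it assumes two equal-length, non-empty lists.

```python
import math

def rmse(predicted, actual):
    # Root mean squared error: penalizes large misses more heavily.
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def average_absolute_difference(predicted, actual):
    # Mean of |predicted - actual|: every unit of error counts equally.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical held-out ratings and the recommender's estimates for them.
predicted = [3.0, 4.0, 5.0]
actual = [3.0, 5.0, 3.0]
```

For both metrics lower is better, and a score of 0 would mean the estimates matched the held-out preferences exactly.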
  • Recommendation Engine Steps
    • 1: Generate List of ItemIDs
    • 2: Create Preference Vector
    • 3: Count Unique Users
    • 4: Transpose Preference Vectors
    • 5: Row Similarity
      • Compute Weights
      • Compute Similarities
      • Similarity Matrix
    • 6: Pre-Partial Multiply, Similarity Matrix
    • 7: Pre-Partial Multiply, Preferences
    • 8: Partial Multiply (Steps 6 & 7)
    • 9: Filter Items
    • 10: Aggregate & Recommend
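The overall shape of these steps (build an item-to-item matrix, multiply it by a user's preference vector, filter out items the user already has, then rank) can be sketched in memory. This is a heavily simplified, single-machine sketch and not the distributed job itself: it uses raw co-occurrence counts in place of a proper similarity measure, and the function names and sample data are assumptions.

```python
from collections import defaultdict

def cooccurrence_matrix(prefs):
    # matrix[i][j] = number of users who have a preference for both i and j.
    matrix = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for i in items:
            for j in items:
                if i != j:
                    matrix[i][j] += 1
    return matrix

def recommend_items(user, prefs, top_n=2):
    matrix = cooccurrence_matrix(prefs)
    # Multiply the item-item matrix by the user's preference vector.
    scores = defaultdict(float)
    for item, pref in prefs[user].items():
        for other, count in matrix[item].items():
            scores[other] += count * pref
    # Filter items the user already has, then rank by score (ties by item ID).
    candidates = [(s, i) for i, s in scores.items() if i not in prefs[user]]
    ranked = sorted(candidates, key=lambda t: (-t[0], t[1]))
    return [i for _, i in ranked[:top_n]]

# Hypothetical preferences: user ID -> {item ID: preference value}.
prefs = {
    1: {"A": 5.0, "B": 3.0},
    2: {"A": 4.0, "B": 2.0, "C": 5.0},
    3: {"B": 4.0, "C": 3.0, "D": 5.0},
}
top = recommend_items(1, prefs)
```

The matrix-times-vector product is what makes the job easy to parallelize: each row of the item matrix can be multiplied independently and the partial results summed afterwards.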
  • Batch Integration
    • ETL Data to HDFS
      • SSIS
      • Map/Reduce
    • Process with Mahout
    • ETL Results
      • Map/Reduce
      • Hive/Sqoop
  • Hands-On Demo
  • Resources
    • Mahout in Action: Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman
    • Hadoop: The Definitive Guide: Tom White
  • Thank you!
    • @BluewaterSQL
    • http://bluewatersql.wordpress.com/
    • cprice@pragmaticworks.com