Running with Elephants: Predictive Analytics with HDInsight

Amazon and Twitter do it, Wal-Mart and Facebook too… what about you? Big Data predictive analytics is pervasive, and with HDInsight it has never been more approachable. In this session you become part of the demo as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. In this action-packed session, real-world and practical solutions for moving data into and out of HDFS (with Sqoop), using Mongo or HBase as a source/destination, and of course handling Mahout processing in distributed mode will all be covered.

Presentation Transcript

    • Running with Elephants: Predictive Analytics with Mahout & HDInsight
    • Introduction • Chris Price, Senior BI Consultant with Pragmatic Works • Author, regular speaker, data geek & super dad! • @BluewaterSQL • http://bluewatersql.wordpress.com/ • cprice@pragmaticworks.com
    • You are the demo… SQL Brewhaus • http://sqlbrewhaus.azurewebsites.net • Create an account… • Rate some beers… • Don't worry, your info will only be sold to the HIGHEST bidder
    • Agenda • Business Case for Recommendations • How a Recommendation Engine Works • Recommendation Implementation & Integration • Evaluating Recommendations • Challenges of Implementing Recommendations
    • Making the Business Case • Objective: Increase Revenue • By increasing # of Orders, Items per Order, and Average Item Price • Tactics: Up-Sell, Cross-Sell, and removing Website Navigational Inefficiency
    • Business Case Example [chart: increased revenue]
    • Recommendation Engines • Take observation data and use data mining/machine learning algorithms to predict outcomes • Assumptions: • People with similar interests have common preferences • A sufficiently large number of preferences is available
    • Recommendation Options • Collaborative Filtering (Mahout) • User-Based • Item-Based • Content-Based (Mahout Clustering) • Data Mining (SSAS) • Association • Clustering
    • Technology • Mahout: • A scalable machine learning library • Fast, efficient & pragmatic • Many of the algorithms can be run on Hadoop • HDInsight: • Hadoop on Windows • HDInsight on Windows Azure (seamlessly scale in the cloud) • HortonWorks Data Platform/HDP (on-premise solution)
    • Generating Recommendations 1. Sources of Data 2. Clean & Prepare Data 3. Generate Recommendations • Build User/Item matrix • Calculate User Similarity • Form Neighborhoods • Generate Recommendations
    • Sources of Data • Explicit • Ratings • Feedback • Demographics • Psychographics (Personality/Lifestyle/Attitude) • Ephemeral Need (a need of the moment) • Implicit • Purchase History • Click/Browse History • Product/Item • Taxonomy • Attributes • Descriptions (Our focus for today: ratings)
    • Data Preparation • Clean-Up: • Remove Outliers (Z-Score) • Remove frequent buyers (Skew) • Normalize Data (Unity-Based) • Format Data into CSV input file: <User ID>, <Item ID>, <Rating>
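    To make that prep step concrete, here is a minimal sketch of unity-based (min-max) normalization feeding Mahout's expected <UserID>,<ItemID>,<Rating> CSV layout. The rating bounds, IDs, and file name are illustrative assumptions, not from the deck.

        import java.io.FileWriter;
        import java.io.IOException;
        import java.io.PrintWriter;

        public class PrepRatings {
            public static void main(String[] args) throws IOException {
                double min = 1.0, max = 5.0;                           // observed rating bounds (assumed)
                long[][] prefs = { {1, 101}, {1, 102}, {2, 101} };     // (userID, itemID) pairs
                double[] ratings = { 4.0, 2.0, 5.0 };
                try (PrintWriter out = new PrintWriter(new FileWriter("ratings.csv"))) {
                    for (int i = 0; i < ratings.length; i++) {
                        // unity-based normalization: scale each rating into [0, 1]
                        double norm = (ratings[i] - min) / (max - min);
                        out.printf("%d,%d,%.3f%n", prefs[i][0], prefs[i][1], norm);
                    }
                }
            }
        }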
    • How it Works • Build a User/Item matrix [diagram: users as rows, items as columns, with a 1 wherever a user has rated an item]
    • Neighborhood Formation [diagram: users U1–U7 grouped into neighborhoods]
    • Neighborhood Formation • Requires some experimentation • Similarity Metrics • Pearson Correlation • Euclidean Distance • Spearman Correlation • Cosine • Tanimoto Coefficient • Log-Likelihood
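    As one example of these metrics, a hand-rolled Pearson correlation over two users' co-rated items might look like the sketch below. It is illustrative only; in practice Mahout's PearsonCorrelationSimilarity does this for you.

        public final class Pearson {
            // Pearson correlation over the items both users have rated
            static double similarity(double[] a, double[] b) {
                int n = a.length;
                double sumA = 0, sumB = 0, sumA2 = 0, sumB2 = 0, sumAB = 0;
                for (int i = 0; i < n; i++) {
                    sumA += a[i]; sumB += b[i];
                    sumA2 += a[i] * a[i]; sumB2 += b[i] * b[i];
                    sumAB += a[i] * b[i];
                }
                double num = sumAB - (sumA * sumB / n);
                double den = Math.sqrt((sumA2 - sumA * sumA / n) * (sumB2 - sumB * sumB / n));
                return den == 0 ? 0 : num / den; // 1 = identical taste, -1 = opposite
            }

            public static void main(String[] args) {
                double[] u = {4.0, 3.0, 5.0}; // co-rated items for users u and v
                double[] v = {4.5, 2.5, 5.0};
                System.out.println(similarity(u, v)); // ~0.94
            }
        }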
    • How it Works • Find users similar to U5 • Use a similarity metric to pick the k nearest neighbors (kNN) • U1 & U7 are identified as most similar to U5 [User/Item matrix diagram highlighting the rows for U1, U5 & U7]
    • How it Works • Generate recommendations: • Find items U5 has not reviewed (I1 and I6) • Predict a rating for each by taking the similarity-weighted sum of the neighbors' ratings [User/Item matrix diagram with predicted values, e.g. 0.5 and 0.7, filling the previously empty cells]
    • Pseudo-Code Implementation
      for each item i that user u has no preference for:
        for each user v (restricted to u's neighborhood) with a preference for i:
          compute similarity s between u and v
          keep a running average of v's preference for i, weighted by s
      return the top-ranked items i by weighted average
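    A minimal plain-Java rendering of that pseudo-code, assuming ratings held in nested maps and a precomputed neighbor-to-similarity map. Names are illustrative; this is not the Mahout implementation.

        import java.util.HashMap;
        import java.util.Map;
        import java.util.TreeMap;

        public class WeightedAverageRecs {
            // Predict u's rating for every item u hasn't rated yet.
            // ratings: userID -> (itemID -> rating); simToU: neighbor v -> s(u, v)
            static Map<Long, Double> recommend(Map<Long, Map<Long, Double>> ratings,
                                               long u, Map<Long, Double> simToU) {
                Map<Long, Double> weighted = new HashMap<>(); // item -> sum(s * pref)
                Map<Long, Double> simSums = new HashMap<>();  // item -> sum(|s|)
                Map<Long, Double> mine = ratings.get(u);
                for (Map.Entry<Long, Double> nb : simToU.entrySet()) {      // each neighbor v
                    double s = nb.getValue();                               // similarity s(u, v)
                    for (Map.Entry<Long, Double> pref : ratings.get(nb.getKey()).entrySet()) {
                        long item = pref.getKey();
                        if (mine.containsKey(item)) continue;               // u already rated it
                        weighted.merge(item, s * pref.getValue(), Double::sum);
                        simSums.merge(item, Math.abs(s), Double::sum);
                    }
                }
                Map<Long, Double> scores = new TreeMap<>();
                for (long item : weighted.keySet()) {
                    scores.put(item, weighted.get(item) / simSums.get(item)); // weighted average
                }
                return scores; // sort by value and take the top N to recommend
            }
        }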
    • Mahout Implementation • Real-Time Recommendations • Write Java Code and host in JVM Instance • Limited scalability • Requires Training Data • Integration typically handled through web services • Batch-Based Recommendations • Uses MapReduce jobs on Hadoop • Offline, Slow, yet scalable • Out-of-the-box recommender jobs
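    The real-time path above boils down to a handful of Mahout Taste calls; a sketch, assuming a ratings.csv in the CSV format shown earlier, with the user ID, neighborhood size, and result count as placeholders (a web service would wrap the recommend() call):

        import java.io.File;
        import java.util.List;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.recommender.Recommender;
        import org.apache.mahout.cf.taste.similarity.UserSimilarity;

        public class RealTimeRecs {
            public static void main(String[] args) throws Exception {
                DataModel model = new FileDataModel(new File("ratings.csv")); // training data in memory
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
                Recommender recommender = new GenericUserBasedRecommender(model, hood, similarity);
                List<RecommendedItem> recs = recommender.recommend(7, 5); // top 5 for user 7
                for (RecommendedItem item : recs) {
                    System.out.println(item.getItemID() + " -> " + item.getValue());
                }
            }
        }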
    • Mahout MapReduce Implementation 1 – Generate List of ItemIDs 2 – Create Preference Vectors 3 – Count Unique Users 4 – Transpose Preference Vectors 5 – Row Similarity • Compute Weights • Compute Similarities • Similarity Matrix 6 – Pre-Partial Multiply, Similarity Matrix 7 – Pre-Partial Multiply, Preferences 8 – Partial Multiply (Steps 6 & 7) 9 – Filter Items 10 – Aggregate & Recommend
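    These ten steps are packaged as Mahout's out-of-the-box RecommenderJob; one way to launch it is through Hadoop's ToolRunner, as sketched below. The HDFS paths are placeholders, and the similarity is one of Mahout's SIMILARITY_* options.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.ToolRunner;
        import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;

        public class RunBatchRecs {
            public static void main(String[] args) throws Exception {
                ToolRunner.run(new Configuration(), new RecommenderJob(), new String[] {
                    "--input", "/data/ratings",          // HDFS dir of UserID,ItemID,Rating files
                    "--output", "/data/recommendations", // HDFS output dir
                    "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
                    "--numRecommendations", "10"
                });
            }
        }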
    • Integrating Mahout • Real-Time • Requires Java coding • Web service • Process: • Load training data (memory pressure) • Generate recommendations • Batch • ETL from source • Generate input file (UserID, ItemID, Rating) • Load to HDFS • Process with Mahout/Hadoop • ETL output from HDFS/Hadoop • Output format: UserID [ItemID:Estimated Rating, …] • e.g. 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
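    Pulling batch results back out means parsing lines like the example above; a small sketch, assuming the user ID and the bracketed list are tab-separated:

        public class RecLineParser {
            public static void main(String[] args) {
                String line = "7\t[1:4.5,2:4.5,3:4.5]";          // one RecommenderJob output line
                String[] parts = line.split("\t");
                long userId = Long.parseLong(parts[0]);
                String body = parts[1].substring(1, parts[1].length() - 1); // strip [ ]
                for (String pair : body.split(",")) {
                    String[] kv = pair.split(":");               // ItemID : estimated rating
                    System.out.printf("user %d: item %s, estimated rating %s%n",
                                      userId, kv[0], kv[1]);
                }
            }
        }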
    • Handling Recommendations Storing Recommendations: • Hive • Data warehouse system for Hadoop • Hive ODBC Driver • MongoDB • Leading NoSQL database • JSON-like storage with flexible schema • C#/.Net MongoDB Driver • HBase • Open-source distributed, column-oriented database modeled after Google's BigTable • Use Pig/MapReduce to process output files and load an HBase table • Java API for easy reading • Source System (SQL Server, etc.)
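    As a sketch of the HBase option, the era-appropriate 0.9x Java client API can load one recommendation per column, keyed by user. The table name, column family, and values are made-up examples.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;

        public class StoreRecs {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "recommendations");
                Put put = new Put(Bytes.toBytes(7L));             // row key = user ID
                put.add(Bytes.toBytes("rec"),                     // column family
                        Bytes.toBytes("1"),                       // qualifier = item ID
                        Bytes.toBytes(4.5));                      // value = estimated rating
                table.put(put);
                table.close();
            }
        }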
    • Evaluating the Recommendations • How good are your recommendations? • How do you evaluate the recommendation engine? • Two options, both of which split the data into training & test sets: • Average Difference • Root-Mean-Square • How it works:

                              I1    I2    I3
        Estimated Review      3.5   4.0   1.5
        Actual Review         4.0   2.0   2.0
        Absolute Difference   0.5   2.0   0.5

        Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0
        Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
    • Evaluating the Recommendations
        // Imports from org.apache.mahout.cf.taste.* omitted on the slide
        DataModel model = new FileDataModel(new File("ratings.csv"));
        RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
        RecommenderBuilder bldr = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                // Use the Pearson correlation to calculate similarity
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                // Form neighborhoods of the 10 nearest users
                UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
                return new GenericUserBasedRecommender(model, hood, similarity);
            }
        };
        // Use 70% of the data to train the model and 30% to test
        // (null = default DataModelBuilder; 1.0 = evaluate with all users)
        double score = eval.evaluate(bldr, null, model, 0.7, 1.0);
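    To score with root-mean-square instead of average absolute difference, only the evaluator changes; Mahout ships an RMSRecommenderEvaluator in the same package as the evaluator above:

        // Swap the evaluator to score by root-mean-square error instead
        RecommenderEvaluator eval = new RMSRecommenderEvaluator();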
    • Challenges 1. Context 2. Cold Start 3. Data Sparsity 4. Popularity Bias 5. Curse of Dimensionality
    • Context Challenges • It's January, 20 degrees & snowing… what should we recommend?
    • Other Challenges • Cold Start • Occurs when a new item or new user is introduced • Can be handled by: • Substituting an average item/user profile • Using another recommendation technique (Content-Based) • Data Sparsity • Too many items relative to ratings makes finding intersections between users difficult • Popularity Bias • Results skew towards popular items; people with "unique" taste are left out • Curse of Dimensionality • More items/users lead to more noise and greater error
    • Resources • Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning & Ellen Friedman • Hadoop: The Definitive Guide, by Tom White
    • Thank You! @BluewaterSQL http://bluewatersql.wordpress.com/ cprice@pragmaticworks.com QUESTIONS???