How to do Predictive Analytics with Limited Data

http://www.datameer.com It is frustrating: even with petabytes of data on a Hadoop cluster, one still encounters situations where there’s a lack of key data for a wide variety of big data analytic use cases. You might have billions of clicks on your web site, but only a few users choose to rate a product. There might be millions of text documents on your cluster, but it is too expensive to have someone categorize more than a tiny fraction of them. In principle, this is where predictive modeling could help. For instance, one could learn a model to predict user ratings so you can better serve product recommendations based on those expected ratings. Or, one could create a model to automatically categorize text documents, saving countless hours and dollars. The main problem is that there is only a limited amount of training material (i.e. user ratings, categorized documents) and it is thus hard to generate good models.

As it turns out, recent research on machine learning techniques has found a way to deal effectively with such situations, using a family of techniques called semi-supervised learning. These techniques are often able to leverage the vast amount of related but unlabeled data to generate accurate models. In this talk, we will give an overview of the most common techniques, including co-training and regularization. We first explain the principles and underlying assumptions of semi-supervised learning and then show how to implement such methods with Hadoop.


Presentation Transcript

  • How to do Predictive Analytics with Limited Data. Ulrich Rueckert. © 2013 Datameer, Inc. All rights reserved.
  • Agenda
    - Introduction and Motivation
    - Semi-Supervised Learning: Generative Models, Large Margin Approaches, Similarity-Based Methods
    - Conclusion
  • Predictive Modeling: Traditional Supervised Learning
    - Promotion on a bookseller's web page; customers can rate books.
    - Will a new customer like this book?
    - Training set: observations on previous customers. Test set: new customers.
    - What happens if only a few customers rate a book?
    [Slide figure: a training data table and a test data table with the attributes Age and Income and the target label LikesBook; a model learned from the training data predicts the missing LikesBook labels of the test customers.]
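To make the supervised baseline on this slide concrete, here is a minimal sketch of the bookseller example. The numbers are made up (the slide's table is only partially legible), and the use of scikit-learn's decision tree is an illustrative choice, not something from the talk:

```python
# Minimal supervised baseline for the bookseller example (illustrative data only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Training data: [age, income]; target: did the customer like the book?
X_train = np.array([[24, 60000], [65, 80000], [22, 67000], [39, 41000]])
y_train = np.array([1, 0, 1, 0])  # 1 = LikesBook "+", 0 = "-"

# Test data: new customers whose ratings are unknown.
X_test = np.array([[33, 47000], [47, 38000]])

model = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(model.predict(X_test))  # predicted LikesBook labels
```

With only a handful of rated customers, this is exactly the setting where the model has too little training material, which is what the rest of the talk addresses.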
  • Supervised Learning: Classification
    - For now, each data instance is a point in a 2D coordinate system; color denotes the target label.
    - The model is given as a decision boundary.
    - What is the correct model? In theory there is no way to tell: all learning systems have underlying assumptions.
    - Smoothness assumption: similar instances have similar labels.
  • Semi-Supervised Learning
    - Can we make use of the unlabeled data? In theory, no ... but we can make assumptions.
    - Popular assumptions: the clustering assumption, the low density assumption, the manifold assumption.
  • The Clustering Assumption
    Clustering
    - Partition instances into groups (clusters) of similar instances.
    - Many different algorithms: k-Means, EM, DBSCAN, etc.; available e.g. in Mahout.
    Clustering assumption
    - The two classification targets are distinct clusters.
    - Simple semi-supervised learning: cluster, then perform a majority vote within each cluster (see the sketch below).
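As an illustration of "cluster, then majority vote", here is a minimal single-machine sketch. It assumes scikit-learn's KMeans as a stand-in for the Hadoop-scale clustering (the slides mention Mahout); the function name, the fallback rule, and the parameters are mine:

```python
# Sketch of "cluster, then majority vote" under the clustering assumption.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_vote(X_labeled, y_labeled, X_unlabeled, n_clusters=2):
    # Cluster labeled and unlabeled instances together.
    X_all = np.vstack([X_labeled, X_unlabeled])
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    n = len(X_labeled)
    y_pred = np.empty(len(X_unlabeled), dtype=y_labeled.dtype)
    for c in range(n_clusters):
        labels_in_c = y_labeled[clusters[:n] == c]
        # Majority vote among the labeled members of this cluster;
        # fall back to the global majority if the cluster has no labeled points.
        vote_pool = labels_in_c if len(labels_in_c) > 0 else y_labeled
        values, counts = np.unique(vote_pool, return_counts=True)
        y_pred[clusters[n:] == c] = values[np.argmax(counts)]
    return y_pred
```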
  • Generative Models
    Mixture of Gaussians
    - Assumption: the data in each cluster is generated by a normal distribution.
    - Find the most probable location and shape of the clusters given the data.
    Expectation-Maximization
    - Two-step optimization procedure; each step is one MapReduce job.
    - Keeps estimates of the cluster assignment probabilities for each instance.
    - Might converge to a local optimum. (A single-machine sketch follows below.)
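What follows is a rough single-machine sketch of semi-supervised EM on a two-component Gaussian mixture with spherical covariances. The two-cluster setup, the spherical simplification, and all names are my assumptions for brevity, not the talk's implementation; on Hadoop, each E-step/M-step pass would be one MapReduce job, as the slide notes:

```python
# Semi-supervised EM sketch: labeled points have fixed responsibilities,
# unlabeled points get soft cluster assignment probabilities.
import numpy as np

def semi_supervised_em(X_l, y_l, X_u, n_iter=50):
    X = np.vstack([X_l, X_u])
    n_l, d = len(X_l), X.shape[1]
    # Responsibilities: 0/1 for labeled points (y_l in {0, 1}), soft for unlabeled ones.
    R = np.full((len(X), 2), 0.5)
    R[:n_l] = np.eye(2)[y_l]
    for _ in range(n_iter):
        # M-step: re-estimate mixture weights, means, and spherical variances.
        weights = R.sum(axis=0) / len(X)
        means = (R.T @ X) / R.sum(axis=0)[:, None]
        var = np.array([
            (R[:, k] * ((X - means[k]) ** 2).sum(axis=1)).sum() / (R[:, k].sum() * d)
            for k in range(2)
        ])
        var = np.maximum(var, 1e-9)  # guard against degenerate clusters
        # E-step: recompute responsibilities for the unlabeled points only.
        log_p = np.stack([
            -0.5 * ((X - means[k]) ** 2).sum(axis=1) / var[k]
            - 0.5 * d * np.log(2 * np.pi * var[k]) + np.log(weights[k])
            for k in range(2)
        ], axis=1)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        R[n_l:] = (p / p.sum(axis=1, keepdims=True))[n_l:]
    return R[n_l:].argmax(axis=1), means
```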
  • Beyond Mixtures of Gaussians
    Expectation-Maximization
    - Can be adjusted to all kinds of mixture models, e.g. using Naive Bayes as the mixture model for text classification.
    Self-Training
    - Learn a model on the labeled instances only.
    - Apply the model to the unlabeled instances.
    - Learn a new model on all instances.
    - Repeat until convergence (sketched below).
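A minimal self-training loop might look like the following. The base learner, the confidence threshold, and the stopping rule are illustrative choices, not details prescribed by the talk:

```python
# Self-training sketch: any supervised learner can be plugged in as the base model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, threshold=0.9, max_rounds=10):
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    model = LogisticRegression()
    for _ in range(max_rounds):
        model.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = model.predict_proba(X_u)
        pred = model.predict(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing left to pseudo-label with high confidence
        # Move confidently predicted instances into the labeled pool and retrain.
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pred[confident]])
        X_u = X_u[~confident]
    return model
```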
  • The Low Density Assumption
    Assumption
    - The area between the two classes has low density.
    - Does not assume any specific form of cluster.
    Support Vector Machine
    - The decision boundary is linear and maximizes the margin to the closest instances.
    - Can be learned in one MapReduce step (stochastic gradient descent).
  • The Low Density Assumption
    Semi-Supervised Support Vector Machine
    - Minimize the distance to labeled and unlabeled instances.
    - A parameter fine-tunes the influence of the unlabeled instances.
    - Additional constraint: keep the class balance correct.
    Implementation
    - A simple extension of the SVM, but a non-convex optimization problem.
    Objective (regularizer, margins of the labeled instances, margins of the unlabeled instances):

    $$\min_{w,\,b}\; \|w\|^2 \;+\; C \sum_{i=1}^{n} \ell\big(y_i (w^T x_i + b)\big) \;+\; C^{*} \sum_{i=n+1}^{n+m} \ell\big(|w^T x_i + b|\big)$$
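Transcribing that objective directly, and assuming the hinge loss ℓ(t) = max(0, 1 - t) for ℓ (the slide does not spell out the loss), gives a small evaluation function; C and C* are the tuning parameters the slide mentions:

```python
# Direct transcription of the semi-supervised SVM objective above (illustrative only).
import numpy as np

def hinge(t):
    return np.maximum(0.0, 1.0 - t)

def s3vm_objective(w, b, X_l, y_l, X_u, C=1.0, C_star=0.1):
    margins_l = y_l * (X_l @ w + b)     # margins of labeled instances (y in {-1, +1})
    margins_u = np.abs(X_u @ w + b)     # "margins" of unlabeled instances
    return w @ w + C * hinge(margins_l).sum() + C_star * hinge(margins_u).sum()
```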
  • Semi-Supervised SVM
    Stochastic Gradient Descent
    - One run over the data in random order; each misclassified or unlabeled instance moves the classifier a bit.
    - Steps get smaller over time.
    Implementation on Hadoop
    - Mapper: send the data to the reducers in random order.
    - Reducer: update the linear classifier for unlabeled or misclassified instances.
    - Many random runs to find the best one. (A single-pass sketch follows below.)
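A hedged single-pass SGD sketch follows; on Hadoop this loop would run in the reducer over the mapper-shuffled instances. The Pegasos-style step size and the way unlabeled points are pushed away from the boundary are my reading of "moves the classifier a bit", not code from the talk:

```python
# One randomized SGD pass for the semi-supervised SVM (single-machine sketch).
import numpy as np

def s3vm_sgd_pass(X, y, lam=0.01, C_star=0.1, seed=0):
    """y[i] is +1/-1 for labeled instances and 0 (unknown) for unlabeled ones."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for t, i in enumerate(rng.permutation(len(X)), start=1):
        eta = 1.0 / (lam * t)                 # steps get smaller over time
        x_i, y_i = X[i], y[i]
        if y_i != 0:                          # labeled: hinge loss on y * f(x)
            if y_i * (w @ x_i + b) < 1:
                w += eta * y_i * x_i
                b += eta * y_i
        else:                                 # unlabeled: hinge loss on |f(x)|
            f = w @ x_i + b
            if abs(f) < 1:
                s = 1.0 if f >= 0 else -1.0   # push away from the boundary
                w += eta * C_star * s * x_i
                b += eta * C_star * s
        w *= (1 - eta * lam)                  # shrink towards zero (regularizer)
    return w, b
```

Running this pass many times with different seeds and keeping the classifier with the best objective value mirrors the "many random runs to find the best one" step.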
  • The Manifold Assumption
    The assumption
    - The training data is (roughly) contained in a low-dimensional manifold.
    - One can perform learning in a more meaningful low-dimensional space; this avoids the curse of dimensionality.
    Similarity graphs
    - Idea: compute similarity scores between instances.
    - Create a network where the nearest neighbors are connected.
  • Label Propagation
    Main idea
    - Propagate label information to neighboring instances, then repeat until convergence.
    - Similar to PageRank.
    Theory
    - Known to converge under weak conditions.
    - Equivalent to a matrix inversion. (See the sketch below.)
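One way to read "similar to PageRank" is the iterative sketch below, which operates on a precomputed similarity graph. The symmetric normalization, the damping factor, and the clamping of known labels are standard label propagation choices, not details given on the slide:

```python
# Iterative label propagation sketch on a similarity graph W (symmetric, zero diagonal).
import numpy as np

def label_propagation(W, Y0, n_labeled, alpha=0.99, n_iter=100):
    """W: (n, n) similarity matrix; Y0: (n, k) one-hot labels, zero rows for unlabeled nodes."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)     # symmetrically normalized edge weights
    Y = Y0.astype(float).copy()
    for _ in range(n_iter):
        Y = alpha * (S @ Y) + (1 - alpha) * Y0  # spread labels along edges, PageRank-style
        Y[:n_labeled] = Y0[:n_labeled]          # clamp the known labels
    return Y.argmax(axis=1)
```

The fixed point of this iteration can also be written in closed form via a matrix inversion, which is the equivalence the slide refers to.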
  • Nearest Neighbor Join
    Block Nested Loop Join
    - 1st MR job: partition the data into blocks and compute nearest neighbors between pairs of blocks.
    - 2nd MR job: filter out the overall nearest neighbors.
    Smarter approaches
    - Use spatial information to avoid unnecessary comparisons: R-trees, space-filling curves, locality-sensitive hashing.
    - Example implementations are available online. (A toy version of the block join follows below.)
    [Slide figures: the data is split into blocks 1-4, each pair of blocks is sent to a reducer, and the per-block candidate neighbors are filtered into the final nearest neighbor result.]
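Here is a toy, single-machine imitation of the two MapReduce jobs, using brute-force distances inside each block pair; this is exactly the cost that the smarter spatial approaches above avoid. The block count and k are illustrative:

```python
# Block nested loop k-NN join sketch: job 1 produces per-block-pair candidates,
# job 2 keeps only the overall k nearest neighbors for each instance.
import numpy as np
from itertools import product

def block_knn_join(X, n_blocks=4, k=5):
    ids = np.arange(len(X))
    blocks = np.array_split(ids, n_blocks)
    # "Job 1": for every ordered pair of blocks, compute candidate neighbors.
    candidates = {i: [] for i in ids}
    for A, B in product(blocks, repeat=2):
        dist = np.linalg.norm(X[A][:, None, :] - X[B][None, :, :], axis=2)
        for row, i in enumerate(A):
            for col, j in enumerate(B):
                if i != j:
                    candidates[i].append((dist[row, col], j))
    # "Job 2": filter each instance's candidates down to the overall k nearest.
    return {i: [j for _, j in sorted(cand)[:k]] for i, cand in candidates.items()}
```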
  • Label Propagation on Hadoop
    Iterative procedure
    - Mapper: emit neighbor-label pairs.
    - Reducer: collect the incoming labels per node and combine them into a new label.
    - Repeat until convergence (sketched below).
    Improvements
    - Cluster the data and perform within-cluster propagation locally.
    - Sometimes it is more efficient to perform a matrix inversion instead.
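Written as plain Python functions in MapReduce style (a real Hadoop job would use Java or Hadoop Streaming), one propagation iteration might look like the following; the record layout is an assumption made for the sketch:

```python
# MapReduce-style sketch of one label propagation iteration.

def propagate_map(node, record):
    """record = (label_distribution, [(neighbor_id, edge_weight), ...])"""
    label_dist, neighbors = record
    for neighbor, weight in neighbors:
        # Emit this node's current label estimate to each neighbor, scaled by edge weight.
        yield neighbor, {label: weight * p for label, p in label_dist.items()}

def propagate_reduce(node, incoming):
    """incoming = list of weighted label distributions sent by the node's neighbors."""
    combined = {}
    for dist in incoming:
        for label, p in dist.items():
            combined[label] = combined.get(label, 0.0) + p
    total = sum(combined.values()) or 1.0
    # New label estimate: normalized sum of the neighbors' votes.
    return node, {label: p / total for label, p in combined.items()}
```

Driving these two functions in a loop, and clamping the originally labeled nodes to their known labels, repeats until the label distributions stop changing.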
  • Conclusion
    Semi-Supervised Learning
    - Only a few training instances have labels; unlabeled instances can still provide a valuable signal.
    Different assumptions lead to different approaches
    - Cluster assumption: generative models.
    - Low density assumption: semi-supervised support vector machines.
    - Manifold assumption: label propagation.
    Code available: https://github.com/Datameer-Inc/lesel
  • @Datameer