As it turns out, recent research in machine learning has found a way to deal effectively with situations where only little labeled training data is available: semi-supervised learning. These techniques are often able to leverage the vast amounts of related but unlabeled data to generate accurate models. In this talk, we give an overview of the most common techniques, including co-training and regularization-based approaches. We first explain the principles and underlying assumptions of semi-supervised learning and then show how to implement such methods with Hadoop.


- 1. How to do Predictive Analytics with Limited Data. Ulrich Rueckert. © 2013 Datameer, Inc. All rights reserved.
- 2. Agenda: Introduction and Motivation; Semi-Supervised Learning (Generative Models, Large Margin Approaches, Similarity-Based Methods); Conclusion.
- 3-5. Predictive Modeling: Traditional Supervised Learning. Promotion on a bookseller's web page; customers can rate books. Will a new customer like this book? Training set: observations on previous customers. Test set: new customers. What happens if only a few customers rate a book? (Slide figure: training and test data tables with Age and Income attributes and the LikesBook target label, plus a model producing predictions for the test set.)
- 6-10. Supervised Learning: Classification. For now, each data instance is a point in a 2D coordinate system; color denotes the target label, and the model is given as a decision boundary. What is the correct model? In theory there is no way to tell; all learning systems have underlying assumptions. Smoothness assumption: similar instances have similar labels.
- 11-12. Semi-Supervised Learning. Can we make use of the unlabeled data? In theory, no ... but we can make assumptions. Popular assumptions: the clustering assumption, the low density assumption, and the manifold assumption.
- 13-16. The Clustering Assumption. Clustering partitions instances into groups (clusters) of similar instances; many different algorithms exist (k-Means, EM, DBSCAN, etc.), available e.g. in Mahout. Clustering assumption: the two classification targets are distinct clusters. Simple semi-supervised learning: cluster, then perform a majority vote (see the sketch below).
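To make the cluster-then-vote idea concrete, here is a minimal single-machine sketch in Python. It uses scikit-learn's k-means rather than Mahout, and the function and parameter names are illustrative assumptions, not the talk's implementation.

```python
# Minimal cluster-then-label sketch: cluster all instances, then give every
# cluster the majority label of its labeled members. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X, y, n_clusters=2, random_state=0):
    """X: (n, d) features; y: class labels, with -1 marking unlabeled instances."""
    clusters = KMeans(n_clusters=n_clusters, random_state=random_state).fit_predict(X)
    y_pred = np.empty(len(X), dtype=y.dtype)
    for c in range(n_clusters):
        members = clusters == c
        labeled = members & (y != -1)
        if labeled.any():
            # majority vote among the labeled members of this cluster
            values, counts = np.unique(y[labeled], return_counts=True)
            y_pred[members] = values[np.argmax(counts)]
        else:
            y_pred[members] = -1  # no labeled instance fell into this cluster
    return y_pred
```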
- 17-22. Generative Models: Mixture of Gaussians. Assumption: the data in each cluster is generated by a normal distribution; find the most probable location and shape of the clusters given the data. Expectation-Maximization: a two-step optimization procedure that keeps estimates of the cluster assignment probabilities for each instance; each step is one MapReduce job; it might converge to a local optimum (a minimal sketch follows).
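Below is a hedged single-machine sketch of the semi-supervised EM procedure for a two-component Gaussian mixture: labeled instances keep their class responsibilities clamped to 0/1, unlabeled instances get soft assignments in the E-step. This is an illustration of the principle, not the MapReduce implementation discussed in the talk, and all names are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm(X, y, n_iter=50, reg=1e-6):
    """X: (n, d) features; y: 0/1 class labels, -1 for unlabeled instances."""
    n, d = X.shape
    resp = np.full((n, 2), 0.5)                      # cluster assignment probabilities
    for c in (0, 1):                                 # clamp labeled instances
        resp[y == c] = np.eye(2)[c]
    for _ in range(n_iter):
        # M-step: weighted mixing weights, means and covariances
        weights = resp.sum(axis=0) / n
        means = [(resp[:, c, None] * X).sum(axis=0) / resp[:, c].sum() for c in (0, 1)]
        covs = []
        for c in (0, 1):
            diff = X - means[c]
            cov = (resp[:, c, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0)
            covs.append(cov / resp[:, c].sum() + reg * np.eye(d))
        # E-step: recompute responsibilities, but only for the unlabeled instances
        dens = np.column_stack([weights[c] * multivariate_normal(means[c], covs[c]).pdf(X)
                                for c in (0, 1)])
        unlabeled = y == -1
        resp[unlabeled] = dens[unlabeled] / dens[unlabeled].sum(axis=1, keepdims=True)
    return resp  # resp[:, 1] is the estimated probability of class 1
```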
- 23. Beyond Mixtures of Gaussians. Expectation-Maximization can be adjusted to all kinds of mixture models, e.g. using Naive Bayes as the mixture model for text classification. Self-Training: learn a model on the labeled instances only, apply it to the unlabeled instances, learn a new model on all instances, and repeat until convergence (a minimal sketch follows).
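A hedged sketch of the self-training loop described on this slide: train on the labeled data, pseudo-label the confident unlabeled instances, retrain, repeat. The base classifier, the confidence threshold, and all names are illustrative assumptions rather than the talk's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X, y, threshold=0.9, max_rounds=10):
    """X: (n, d) features; y: class labels, -1 for unlabeled instances."""
    y = y.copy()
    model = None
    for _ in range(max_rounds):
        labeled = y != -1
        model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
        proba = model.predict_proba(X[~labeled])
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break                                    # nothing new to pseudo-label
        idx = np.where(~labeled)[0][confident]
        y[idx] = model.classes_[proba[confident].argmax(axis=1)]
    return model, y
```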
- 24-26. The Low Density Assumption. Assumption: the area between the two classes has low density; this does not assume any specific form of cluster. Support Vector Machine: the decision boundary is linear and maximizes the margin to the closest instances; it can be learned in one MapReduce step (stochastic gradient descent).
- 27-29. The Low Density Assumption: Semi-Supervised Support Vector Machine. Minimize the distance to labeled and unlabeled instances, with a parameter to fine-tune the influence of the unlabeled instances and an additional constraint to keep the class balance correct. Implementation: a simple extension of the SVM, but a non-convex optimization problem.
- 30. The Low Density Assumption: Semi-Supervised Support Vector Machine. Minimize the distance to labeled and unlabeled instances; a parameter fine-tunes the influence of the unlabeled instances; an additional constraint keeps the class balance correct. The objective combines a regularizer with the margins of the labeled instances and the margins of the unlabeled instances: $\min_{w}\ \|w\|^2 + C \sum_{i=1}^{n} \ell\big(y_i (w^\top x_i + b)\big) + C^{*} \sum_{i=n+1}^{n+m} \ell\big(|w^\top x_i + b|\big)$, where the first sum runs over the $n$ labeled instances and the second over the $m$ unlabeled ones. Implementation: a simple extension of the SVM, but a non-convex optimization problem (a small code sketch of this objective follows).
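As a check on the formula, here is a minimal sketch that evaluates the slide-30 objective with a hinge loss on the labeled points and the corresponding "hat" loss on the unlabeled points. The choice of hinge loss and all parameter names are assumptions for illustration.

```python
import numpy as np

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.5):
    """y_lab must be +1/-1. Returns the value of the semi-supervised SVM objective."""
    margins_lab = y_lab * (X_lab @ w + b)            # margins of labeled instances
    margins_unl = np.abs(X_unl @ w + b)              # margins of unlabeled instances
    hinge = lambda m: np.maximum(0.0, 1.0 - m)       # loss l(m) = max(0, 1 - m)
    return (w @ w                                    # regularizer ||w||^2
            + C * hinge(margins_lab).sum()
            + C_star * hinge(margins_unl).sum())
```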
- 31-35. Semi-Supervised SVM: Stochastic Gradient Descent. One run over the data in random order; each misclassified or unlabeled instance moves the classifier a bit, and the steps get smaller over time. Implementation on Hadoop: the mapper sends the data to the reducer in random order; the reducer updates the linear classifier for unlabeled or misclassified instances; many random runs are performed to find the best one (a streaming-style sketch follows).
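To make the mapper/reducer split concrete, here is a hedged Hadoop Streaming-style sketch in Python. The talk's own implementation is the Java code linked on the conclusion slide, so the record layout, the learning-rate schedule, and the exact update rule shown here are illustrative assumptions; a driver would launch many such runs with different random orders and keep the best classifier.

```python
# Mapper: shuffle records by emitting a random key. Reducer: one SGD pass over
# the randomly ordered data; label 0 marks an unlabeled instance.
import random
import sys
import numpy as np

def mapper(lines=sys.stdin):
    """Emit 'random_key \t label \t x1,x2,...' so the shuffle randomizes the order."""
    for line in lines:
        print(f"{random.random():.12f}\t{line.strip()}")

def reducer(lines=sys.stdin, dim=2, C=1.0, C_star=0.5):
    w, b, t = np.zeros(dim), 0.0, 0
    for line in lines:
        _, label, feats = line.rstrip("\n").split("\t")
        y = float(label)
        x = np.array([float(v) for v in feats.split(",")])
        t += 1
        eta = 1.0 / (1.0 + 0.01 * t)                 # steps get smaller over time
        margin = w @ x + b
        if y != 0 and y * margin < 1:                # labeled and misclassified / in margin
            w += eta * (C * y * x - w); b += eta * C * y
        elif y == 0 and abs(margin) < 1:             # unlabeled and inside the margin
            s = 1.0 if margin >= 0 else -1.0         # push away from the boundary
            w += eta * (C_star * s * x - w); b += eta * C_star * s
        else:
            w -= eta * w                             # only the regularizer pulls
    print(f"model\t{b}\t{','.join(map(str, w))}")
```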
- 36-39. The Manifold Assumption. Assumption: the training data is (roughly) contained in a low-dimensional manifold, so one can perform learning in a more meaningful low-dimensional space and avoid the curse of dimensionality. Similarity graphs: compute similarity scores between instances and create a network in which the nearest neighbors are connected (see the sketch below).
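A hedged sketch of the similarity graph construction: connect each instance to its k nearest neighbors and weight the edges with a Gaussian similarity. This single-machine version is O(n^2); the kernel width, k, and all names are illustrative assumptions (the scalable variant is the nearest neighbor join on slides 44-51).

```python
import numpy as np

def knn_similarity_graph(X, k=5, sigma=1.0):
    """Return a symmetric (n, n) weight matrix W of the k-nearest-neighbor graph."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(d2[i])[1:k + 1]                 # skip the point itself
        W[i, neighbors] = np.exp(-d2[i, neighbors] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                                  # symmetrize the graph
```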
- 40-43. Label Propagation. Main idea: propagate label information to neighboring instances, similar to PageRank, and repeat until convergence. Theory: known to converge under weak conditions and equivalent to a matrix inversion (a minimal sketch follows).
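Here is a minimal single-machine sketch of iterative label propagation on the similarity graph: each node averages its neighbors' label scores, the known labels are clamped, and the process repeats until the scores stop changing. Variable names and the convergence criterion are assumptions; the Hadoop version follows on slides 52-56.

```python
import numpy as np

def label_propagation(W, y, n_iter=200, tol=1e-6):
    """W: (n, n) similarity matrix; y: 0/1 labels with -1 for unlabeled instances."""
    n = len(y)
    labeled = y != -1
    F = np.zeros((n, 2))
    F[labeled] = np.eye(2)[y[labeled]]               # one-hot scores for known labels
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)   # row-normalized transition matrix
    for _ in range(n_iter):
        F_new = P @ F                                # each node averages its neighbors
        F_new[labeled] = np.eye(2)[y[labeled]]       # clamp the labeled nodes
        if np.abs(F_new - F).max() < tol:
            F = F_new
            break
        F = F_new
    return F.argmax(axis=1)                          # predicted class per instance
```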
- 44-51. Nearest Neighbor Join. Block nested loop join: a first MapReduce job partitions the data into blocks and computes nearest neighbors between blocks; a second MapReduce job filters out the overall nearest neighbors. (Slide figure: the data is split into numbered blocks, pairs of blocks are distributed to reducers, and the candidate nearest neighbors are merged into the final result.) Smarter approaches use spatial information to avoid unnecessary comparisons: R-trees, space filling curves, locality sensitive hashing; example implementations are available online (a simplified sketch follows).
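The following is a hedged single-machine simulation of the block-nested-loop k-NN join: phase 1 computes candidate neighbors for every pair of blocks (each pair standing in for one reduce task), and phase 2 filters the candidates down to the overall k nearest per point. The block count, k, and all names are illustrative assumptions.

```python
import numpy as np
from itertools import product

def knn_join(X, n_blocks=4, k=5):
    """Return {point_index: [(distance, neighbor_index), ...]} with the k nearest."""
    blocks = np.array_split(np.arange(len(X)), n_blocks)
    candidates = {i: [] for i in range(len(X))}
    # Phase 1: every pair of blocks yields local nearest-neighbor candidates.
    for bi, bj in product(range(n_blocks), repeat=2):
        for i in blocks[bi]:
            dists = np.linalg.norm(X[blocks[bj]] - X[i], axis=1)
            for j, d in zip(blocks[bj], dists):
                if j != i:
                    candidates[i].append((d, j))
    # Phase 2: keep only the overall k nearest neighbors per point.
    return {i: sorted(cand)[:k] for i, cand in candidates.items()}
```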
- 52-56. Label Propagation on Hadoop. Iterative procedure: the mapper emits neighbor-label pairs, the reducer collects the incoming labels per node and combines them into a new label, and the procedure repeats until convergence. Improvements: cluster the data and perform the within-cluster propagation locally; sometimes it is more efficient to perform the matrix inversion instead (a streaming-style sketch follows).
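A hedged Hadoop Streaming-style sketch of one label-propagation iteration in Python: the mapper sends each node's current score to its neighbors, the reducer averages the incoming scores per node. The record layout ("node \t score \t neighbor1,neighbor2,...") and all names are assumptions; a driver script would rejoin the adjacency lists, clamp the labeled nodes, and rerun the job until convergence.

```python
import sys
from itertools import groupby

def mapper(lines=sys.stdin):
    for line in lines:
        node, score, neighbors = line.rstrip("\n").split("\t")
        print(f"{node}\tSELF\t{score}")              # keep the node's own state
        for nb in neighbors.split(","):
            print(f"{nb}\tLABEL\t{score}")           # propagate the score to neighbors

def reducer(lines=sys.stdin):
    # Hadoop sorts by key before the reducer; sorted() here mimics that locally.
    records = (line.rstrip("\n").split("\t") for line in sorted(lines))
    for node, group in groupby(records, key=lambda r: r[0]):
        incoming, own = [], "0"
        for _, kind, value in group:
            if kind == "SELF":
                own = value
            else:
                incoming.append(float(value))
        new_score = sum(incoming) / len(incoming) if incoming else float(own)
        print(f"{node}\t{new_score}")
```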
- 57. Conclusion. Semi-supervised learning: only a few training instances have labels, but the unlabeled instances can still provide a valuable signal. Different assumptions lead to different approaches: the cluster assumption leads to generative models, the low density assumption to semi-supervised support vector machines, and the manifold assumption to label propagation. Code available: https://github.com/Datameer-Inc/lesel
- 58. @Datameer
