How to do Predictive Analytics with Limited Data
Ulrich Rueckert

© 2013 Datameer, Inc. All rights reserved.
Agenda

Introduction and Motivation

Semi-Supervised Learning
• Generative Models
• Large Margin Approaches
• Similarity Based Methods

Conclusion
Predictive Modeling

Traditional Supervised Learning
• Promotion on a bookseller’s web page
• Customers can rate books.
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only a few customers rate a book?

[Figure: the training data is a table of attributes (Age, Income) plus the target label (LikesBook); a model learned on it predicts LikesBook for the test data, e.g. Age 22 / Income 67K → +, Age 39 / Income 41K → −. In the limited-data variant of the slide, most customers’ LikesBook entries are “?”.]
Supervised Learning

Classification
• For now: each data instance is a point in a 2D coordinate system
• Color denotes the target label
• Model is given as a decision boundary

What’s the correct model?
• In theory: no way to tell
• All learning systems have underlying assumptions
• Smoothness assumption: similar instances have similar labels
Semi-Supervised Learning

Can we make use of the unlabeled data?
• In theory: no
• ... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, DBSCAN, etc.
• Available e.g. in Mahout

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then give each cluster the majority label of its labeled instances (see the sketch below)
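A minimal sketch of the cluster-then-vote idea, assuming NumPy and scikit-learn are available; `X`, `y`, and the `unlabeled` sentinel are illustrative names, not code from the talk:

```python
# Cluster all instances, then label each cluster by majority vote over its
# few labeled members. y holds +1/-1 for rated customers and the sentinel
# value for everyone else.
import numpy as np
from sklearn.cluster import KMeans

def cluster_majority_vote(X, y, n_clusters=2, unlabeled=-999):
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    y_pred = np.empty(len(X))
    for c in range(n_clusters):
        members = clusters == c
        labels = y[members & (y != unlabeled)]
        # Majority vote over the labeled members; default to +1 if the
        # cluster has no labeled instance or the vote is tied.
        vote = np.sign(labels.sum()) if len(labels) else 0.0
        y_pred[members] = vote if vote != 0 else 1.0
    return y_pred
```

With only a handful of ratings per book, each cluster typically still contains at least one labeled customer, which is enough to label the whole cluster.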
Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find the most probable location and shape of the clusters given the data

Expectation-Maximization
• Two-step optimization procedure
• Keeps estimates of cluster-assignment probabilities for each instance
• Each step is one MapReduce job
• Might converge to a local optimum (see the sketch below)
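A rough sketch of the generative-model route, assuming scikit-learn's `GaussianMixture` (its EM can converge to a local optimum, hence the restarts); mapping each component to a class by majority vote over its labeled members is an illustrative simplification, not the talk's exact procedure:

```python
# Fit a mixture of Gaussians on all data with EM, then turn the fitted
# components into classes using the few labeled instances.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_semi_supervised(X, y, n_components=2, unlabeled=-999):
    gmm = GaussianMixture(n_components=n_components, n_init=5).fit(X)
    comp = gmm.predict(X)                      # hard component assignment
    mapping = {}
    for c in range(n_components):
        labels = y[(comp == c) & (y != unlabeled)]
        vote = float(np.sign(labels.sum())) if len(labels) else 1.0
        mapping[c] = vote if vote != 0 else 1.0
    return np.array([mapping[c] for c in comp])
```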
Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as the mixture model for text classification

Self-Training
• Learn a model on the labeled instances only
• Apply the model to the unlabeled instances
• Learn a new model on all instances
• Repeat until convergence (see the sketch below)
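A compact sketch of the self-training loop described above, assuming a scikit-learn-style classifier; `GaussianNB` merely stands in for whichever base model is used:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train(X_lab, y_lab, X_unl, base_model=None, max_iter=20):
    model = base_model if base_model is not None else GaussianNB()
    pseudo = None
    for _ in range(max_iter):
        if pseudo is None:
            model.fit(X_lab, y_lab)                       # labeled data only
        else:
            model.fit(np.vstack([X_lab, X_unl]),
                      np.concatenate([y_lab, pseudo]))    # all instances
        new_pseudo = model.predict(X_unl)                 # relabel unlabeled data
        if pseudo is not None and np.array_equal(new_pseudo, pseudo):
            break                                         # labels stable: converged
        pseudo = new_pseudo
    return model
```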
The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes the margin to the closest instances
• Can be learned in one MapReduce pass with Stochastic Gradient Descent (see the sketch below)
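A single-machine sketch of the SGD pass the slide alludes to, using Pegasos-style hinge-loss updates; in the MapReduce setting this is roughly the work one reducer performs, and all parameter names are illustrative:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):      # data in random order
            t += 1
            eta = 1.0 / (lam * t)              # step size shrinks over time
            margin = y[i] * (X[i] @ w + b)
            w *= 1.0 - eta * lam               # regularizer shrinkage
            if margin < 1:                     # misclassified or inside the margin
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```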
The Low Density Assumption

Semi-Supervised Support Vector Machine
• Maximize the margin to both labeled and unlabeled instances
• A parameter fine-tunes the influence of the unlabeled instances
• Additional constraint: keep the class balance correct

$$\min_{w}\; \underbrace{\|w\|^2}_{\text{regularizer}} \;+\; C\,\underbrace{\sum_{i=1}^{n} \ell\!\left(y_i\,(w^{T} x_i + b)\right)}_{\text{margins of labeled instances}} \;+\; C^{\ast}\,\underbrace{\sum_{i=n+1}^{n+m} \ell\!\left(\lvert w^{T} x_i + b\rvert\right)}_{\text{margins of unlabeled instances}}$$

Implementation
• Simple extension of the SVM
• But a non-convex optimization problem (a sketch of the objective follows)
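The objective above written out as a plain-NumPy sketch; `C` and `C_star` correspond to the two trade-off parameters in the formula, and the hinge loss stands in for the generic loss `ℓ`:

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def s3vm_objective(w, b, X_lab, y_lab, X_unl, C=1.0, C_star=0.1):
    margins_lab = y_lab * (X_lab @ w + b)      # labeled margins y(w·x + b)
    margins_unl = np.abs(X_unl @ w + b)        # unlabeled margins |w·x + b|
    return (w @ w
            + C * hinge(margins_lab).sum()
            + C_star * hinge(margins_unl).sum())
```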
Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves the classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send the data to the reducers in random order
• Reducer: update the linear classifier for unlabeled or misclassified instances
• Many random runs to find the best one (see the sketch below)
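A sketch of one randomized SGD pass for the semi-supervised SVM: labeled rows get the usual hinge update, while unlabeled rows are treated as belonging to whichever side of the boundary they currently fall on and are down-weighted by `C_star`. This mirrors, but does not reproduce, the reducer-side update described above:

```python
import numpy as np

def s3vm_sgd_pass(X, y, lam=0.01, C_star=0.1, unlabeled=-999, seed=0):
    rng = np.random.default_rng(seed)
    w, b, t = np.zeros(X.shape[1]), 0.0, 0
    for i in rng.permutation(len(X)):          # one run in random order
        t += 1
        eta = 1.0 / (lam * t)                  # steps get smaller over time
        w *= 1.0 - eta * lam
        score = w @ X[i] + b
        labeled = y[i] != unlabeled
        yi = y[i] if labeled else (np.sign(score) or 1.0)
        weight = 1.0 if labeled else C_star
        if yi * score < 1:                     # misclassified or inside the margin
            w += eta * weight * yi * X[i]
            b += eta * weight * yi
    return w, b
```

Because the overall problem is non-convex, the slide's advice to run many random passes and keep the best result applies directly to this sketch.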
The Manifold Assumption

The Assumption
• The training data is (roughly) contained in a low-dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids the curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create a network where the nearest neighbors are connected (see the sketch below)
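A sketch of the similarity-graph construction, assuming scikit-learn's `NearestNeighbors`; the RBF edge weighting is one common choice, not the only one:

```python
# Connect each instance to its k nearest neighbors, weighting edges by an
# RBF kernel on the distance, and return a symmetric weight matrix W.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=5, gamma=1.0):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)               # first neighbor is the point itself
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            W[i, j] = W[j, i] = np.exp(-gamma * d ** 2)
    return W
```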
Label Propagation

Main Idea
• Propagate label information to neighboring instances
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to a matrix inversion (see the sketch below)
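A sketch of the iterative propagation on such a similarity graph `W` (for example the one built above): labeled nodes stay clamped to their labels, unlabeled nodes repeatedly take the weighted average of their neighbors' labels until nothing changes:

```python
import numpy as np

def label_propagation(W, y, unlabeled=-999, max_iter=100, tol=1e-4):
    f = np.where(y == unlabeled, 0.0, y).astype(float)   # soft labels in [-1, 1]
    labeled = y != unlabeled
    deg = W.sum(axis=1) + 1e-12                # node degrees for averaging
    for _ in range(max_iter):
        f_new = (W @ f) / deg                  # average over the neighbors
        f_new[labeled] = y[labeled]            # clamp the known labels
        if np.abs(f_new - f).max() < tol:
            break                              # converged
        f = f_new
    return np.sign(f)                          # 0 only for unreachable nodes
```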
Nearest Neighbor Join

Block Nested Loop Join
• 1st MR job: partition the data into blocks, compute nearest neighbors between blocks
• 2nd MR job: filter out the overall nearest neighbors

Smarter Approaches
• Use spatial information to avoid unnecessary comparisons
• R-trees, space-filling curves, locality-sensitive hashing
• Example implementations available online (a sketch of the block nested loop join follows)

[Figure: the data is split into blocks 1-4; each pair of blocks is sent to a reducer, which computes the nearest neighbors within that pair; a second pass merges the per-pair candidates into the overall nearest-neighbor result.]
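An in-memory sketch of the block nested loop join: the double loop over block pairs is the work the first MR job distributes across reducers, and the running minimum plays the role of the second, filtering job. Brute-force distances, illustrative names:

```python
import numpy as np
from itertools import product

def block_nn_join(X, n_blocks=4):
    blocks = np.array_split(np.arange(len(X)), n_blocks)
    best = {}                                  # point id -> (distance, neighbor id)
    for A, B in product(blocks, repeat=2):     # 1st pass: every pair of blocks
        for i in A:
            d = np.linalg.norm(X[B] - X[i], axis=1)
            d[B == i] = np.inf                 # ignore the point itself
            dmin = d.min()
            if dmin < best.get(i, (np.inf, None))[0]:
                best[i] = (dmin, B[np.argmin(d)])   # 2nd pass: keep the overall best
    return {i: j for i, (_, j) in best.items()}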
Label Propagation on Hadoop

Iterative Procedure
• Mapper: emit neighbor-label pairs
• Reducer: collect the incoming labels per node and combine them into a new label
• Repeat until convergence

Improvement
• Cluster the data and perform within-cluster propagation locally
• Sometimes it’s more efficient to perform the matrix inversion instead (see the sketch below)
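A sketch of one propagation iteration split into a map and a reduce function, with plain Python structures standing in for the Hadoop key/value streams; `graph` (node to weighted neighbors), `labels` (current soft label per node), and `clamped` (the known labels) are assumed data structures, not the talk's actual code:

```python
from collections import defaultdict

def lp_map(graph, labels):
    # Emit (neighbor, (edge weight, weighted label)) for every edge.
    for node, neighbors in graph.items():
        for nbr, weight in neighbors.items():
            yield nbr, (weight, weight * labels[node])

def lp_reduce(pairs, clamped):
    # Combine the incoming contributions per node into a new label;
    # nodes with known labels stay clamped to them.
    total, acc = defaultdict(float), defaultdict(float)
    for node, (w, contrib) in pairs:
        total[node] += w
        acc[node] += contrib
    return {n: clamped.get(n, acc[n] / total[n]) for n in acc}
```

One iteration is then `lp_reduce(lp_map(graph, labels), clamped)`, repeated until the labels stop changing.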
Conclusion

Semi-Supervised Learning
• Only a few training instances have labels
• Unlabeled instances can still provide a valuable signal

Different assumptions lead to different approaches
• Clustering assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Code available: https://github.com/Datameer-Inc/lesel
@Datameer