How to do Predictive Analytics with Limited Data
Ulrich Rueckert

© 2013 Datameer, Inc. All rights reserved.
Agenda

Introduction and Motivation
Semi-Supervised Learning
• Generative Models
• Large Margin Approaches
• Similarity Based Methods
Conclusion
Predictive Modeling

Traditional Supervised Learning
• Promotion on bookseller’s web page
• Customers can rate books.
• Will a new customer like this book?
• Training set: observations on previous customers
• Test set: new customers

What happens if only few customers rate a book?

[Slide figure: a Training Data table of customer attributes (Age, Income) with the target label LikesBook is used to fit a Model; the Model then fills in the unknown LikesBook labels (?) of the Test Data, e.g. predicting + for (Age 22, Income 67K) and - for (Age 39, Income 41K).]
Supervised Learning

Classification
• For now: each data instance is a point in a 2D coordinate system
• Color denotes target label
• Model is given as decision boundary

What’s the correct model?
• In theory: no way to tell
• All learning systems have underlying assumptions
• Smoothness assumption: similar instances have similar labels
Semi-Supervised Learning

Can we make use of the unlabeled data?
• In theory: no
• ... but we can make assumptions

Popular Assumptions
• Clustering assumption
• Low density assumption
• Manifold assumption
The Clustering Assumption

Clustering
• Partition instances into groups (clusters) of similar instances
• Many different algorithms: k-Means, EM, DBSCAN, etc.
• Available e.g. on Mahout

Clustering Assumption
• The two classification targets are distinct clusters
• Simple semi-supervised learning: cluster, then perform majority vote (sketched below)
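As a concrete illustration of the cluster-then-vote recipe above, here is a minimal Python sketch (not the talk's Mahout-based setup): a tiny k-means stands in for the clustering step, labels are +1/-1 with 0 marking unlabeled instances, and every instance in a cluster receives the majority label of the cluster's labeled members. Names and the toy data are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; stands in for the Mahout clustering mentioned above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        assign = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def cluster_then_vote(X, y, k=2):
    """y holds +1/-1 for labeled instances and 0 for unlabeled ones."""
    assign = kmeans(X, k)
    y_pred = y.copy()
    for j in range(k):
        members = (assign == j)
        labeled = y[members][y[members] != 0]
        if len(labeled) > 0:
            vote = 1 if (labeled == 1).sum() >= (labeled == -1).sum() else -1
            y_pred[members & (y == 0)] = vote   # majority vote fills the unlabeled members
        # clusters without any labeled member keep the 0 (unknown) marker
    return y_pred

# toy data: two blobs, one labeled point per blob
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
y = np.zeros(40, dtype=int); y[0] = 1; y[-1] = -1
print(cluster_then_vote(X, y))
```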
Generative Models

Mixture of Gaussians
• Assumption: the data in each cluster is generated by a normal distribution
• Find most probable location and shape of clusters given data

Expectation-Maximization
• Two step optimization procedure (sketched below)
  • Each step is one MapReduce job
• Keeps estimates of cluster assignment probabilities for each instance
• Might converge to local optimum
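A compact sketch of the E-step/M-step pair described above, for a two-component Gaussian mixture with a shared spherical variance: labeled instances keep hard 0/1 responsibilities, unlabeled ones get soft assignments. On Hadoop each of the two steps would be one MapReduce job; here both run in one process, and all names and parameters are illustrative.

```python
import numpy as np

def semi_supervised_em(X, y, iters=50):
    """Two spherical Gaussians; y is +1 (component 1) / -1 (component 0) for labeled
    rows and 0 for unlabeled rows. Returns component means and responsibilities."""
    n, d = X.shape
    # responsibilities r[i] = P(component 1 | x_i); fixed by the label where known
    r = np.where(y == 1, 1.0, np.where(y == -1, 0.0, 0.5))
    for _ in range(iters):
        # M-step: weighted means, shared variance and mixing weight
        w = np.array([1.0 - r, r])                      # (2, n) component weights
        mu = (w @ X) / w.sum(axis=1, keepdims=True)     # (2, d) component means
        var = sum(w[k] @ ((X - mu[k]) ** 2).sum(axis=1) for k in (0, 1)) / (n * d)
        pi1 = r.mean()
        # E-step: recompute responsibilities for the unlabeled instances only
        def dens(mu_k):
            return np.exp(-((X - mu_k) ** 2).sum(axis=1) / (2 * var))
        p1 = pi1 * dens(mu[1])
        p0 = (1 - pi1) * dens(mu[0])
        r = np.where(y == 0, p1 / (p1 + p0 + 1e-12), r)  # labeled points stay clamped
    return mu, r

# toy data: two blobs, one labeled point per blob
X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
y = np.zeros(60, dtype=int); y[0] = -1; y[-1] = 1
mu, r = semi_supervised_em(X, y)
print(np.round(mu, 1))
```

As the slide notes, this alternation only finds a local optimum; the result can depend on the initial responsibilities.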
Beyond Mixtures of Gaussians

Expectation-Maximization
• Can be adjusted to all kinds of mixture models
• E.g. use Naive Bayes as mixture model for text classification

Self-Training (sketched below)
• Learn model on labeled instances only
• Apply model to unlabeled instances
• Learn new model on all instances
• Repeat until convergence
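A sketch of the self-training loop listed above. The base learner here is a stand-in nearest-centroid classifier (an assumption; the slide leaves the choice of model open), and any classifier exposing fit/predict would slot in the same way.

```python
import numpy as np

class NearestCentroid:
    """Tiny stand-in base learner; any classifier with fit/predict would do."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[np.argmin(d, axis=1)]

def self_train(X_lab, y_lab, X_unlab, rounds=10):
    model = NearestCentroid().fit(X_lab, y_lab)        # 1. learn on labeled data only
    y_unlab = model.predict(X_unlab)                   # 2. label the unlabeled data
    for _ in range(rounds):                            # 3./4. retrain on all data, repeat
        model.fit(np.vstack([X_lab, X_unlab]), np.concatenate([y_lab, y_unlab]))
        y_new = model.predict(X_unlab)
        if np.array_equal(y_new, y_unlab):             # stop once predictions are stable
            break
        y_unlab = y_new
    return model, y_unlab

# toy usage: two labeled points, twenty unlabeled ones
X_lab = np.array([[0., 0.], [5., 5.]]); y_lab = np.array([0, 1])
X_unlab = np.random.randn(20, 2) + np.random.choice([0, 5], size=(20, 1))
print(self_train(X_lab, y_lab, X_unlab)[1])
```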
The Low Density Assumption

Assumption
• The area between the two classes has low density
• Does not assume any specific form of cluster

Support Vector Machine
• Decision boundary is linear
• Maximizes margin to closest instances
• Can be learned in one Map-Reduce step (Stochastic Gradient Descent; sketched below)
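For reference, a bare-bones sketch of learning the linear SVM by stochastic gradient descent on the regularized hinge loss, roughly what the single Map-Reduce pass above would run per reducer. The Pegasos-style step-size schedule and the parameter names are assumed details, not the exact implementation behind the slides.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=1, seed=0):
    """A few random passes of SGD on the regularized hinge loss.
    y must be +1/-1; returns weight vector w and bias b."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):           # data visited in random order
            t += 1
            eta = 1.0 / (lam * t)              # step size shrinks over time
            w *= (1 - eta * lam)               # shrink = gradient of the regularizer
            if y[i] * (X[i] @ w + b) < 1:      # inside the margin: hinge is active
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```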
The Low Density Assumption

Semi-Supervised Support Vector Machine
• Minimize distance to labeled and unlabeled instances
• Parameter to fine-tune influence of unlabeled instances
• Additional constraint: keep class balance correct

Implementation
• Simple extension of SVM
• But non-convex optimization problem
The optimization problem combines the usual regularizer and labeled-margin term with a third term over the unlabeled margins (the sketch below evaluates this objective):

$$\min_{w} \;\; \|w\|^2 \;+\; C \sum_{i=1}^{n} \ell\big(y_i (w^T x_i + b)\big) \;+\; C^{*} \sum_{i=n+1}^{n+m} \ell\big(|w^T x_i + b|\big)$$

Here $\ell$ is the margin loss, the first sum runs over the margins of the labeled instances, and the second sum over the margins of the unlabeled instances, weighted by the fine-tuning parameter $C^{*}$.
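To make the objective concrete, a small sketch that evaluates it: hinge loss on the labeled margins, the same hinge applied to the absolute margins of the unlabeled instances, plus the regularizer. C and C_star are the two trade-off parameters from the formula; their default values are illustrative.

```python
import numpy as np

def hinge(m):
    """Standard hinge loss l(m) = max(0, 1 - m)."""
    return np.maximum(0.0, 1.0 - m)

def s3vm_objective(w, b, X_lab, y_lab, X_unlab, C=1.0, C_star=0.1):
    margins_lab = y_lab * (X_lab @ w + b)          # y_i (w^T x_i + b)
    margins_unlab = np.abs(X_unlab @ w + b)        # |w^T x_i + b| for unlabeled points
    return (w @ w                                  # ||w||^2 regularizer
            + C * hinge(margins_lab).sum()         # labeled margin violations
            + C_star * hinge(margins_unlab).sum()) # unlabeled points near the boundary
```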
Semi-Supervised SVM

Stochastic Gradient Descent
• One run over the data in random order
• Each misclassified or unlabeled instance moves classifier a bit
• Steps get smaller over time

Implementation on Hadoop
• Mapper: send data to reducer in random order
• Reducer: update linear classifier for unlabeled or misclassified instances (sketched below)
• Many random runs to find best one
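A sketch of the update loop the reducer could apply while streaming through the shuffled instances, assuming the objective from the previous slides: labeled instances get the usual hinge update, unlabeled instances are pushed away from the boundary using the sign of the current prediction. Variable names and the step-size schedule are illustrative, not the exact implementation.

```python
import numpy as np

def s3vm_sgd_pass(stream, d, lam=0.01, C_star=0.1):
    """One pass over (x, y) pairs in random order; y is +1/-1, or 0 for unlabeled.
    In the Hadoop version the mapper shuffles and this loop is the reducer."""
    w, b, t = np.zeros(d), 0.0, 0
    for x, y in stream:
        t += 1
        eta = 1.0 / (lam * t)                  # steps get smaller over time
        w *= (1 - eta * lam)
        if y != 0:                             # labeled: standard hinge update
            if y * (x @ w + b) < 1:
                w += eta * y * x
                b += eta * y
        else:                                  # unlabeled: use the current prediction's sign
            s = 1.0 if (x @ w + b) >= 0 else -1.0
            if abs(x @ w + b) < 1:             # still inside the margin: push it out
                w += eta * C_star * s * x
                b += eta * C_star * s
    return w, b
```

Because the problem is non-convex, several such randomly ordered passes are run and the weight vector with the lowest objective value is kept, matching the "many random runs" bullet above.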
The Manifold Assumption

The Assumption
• Training data is (roughly) contained in a low dimensional manifold
• One can perform learning in a more meaningful low-dimensional space
• Avoids curse of dimensionality

Similarity Graphs
• Idea: compute similarity scores between instances
• Create network where the nearest neighbors are connected (sketched below)
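A small sketch of the similarity-graph construction: Gaussian similarity scores kept only on k-nearest-neighbor edges. The kernel width sigma and the value of k are illustrative parameters; computing the full pairwise distance matrix is exactly the cost the nearest neighbor join later in the deck tries to tame.

```python
import numpy as np

def knn_similarity_graph(X, k=5, sigma=1.0):
    """Return a symmetric (n, n) weight matrix W with Gaussian similarities
    on the k-nearest-neighbor edges and zeros elsewhere."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                 # skip self at position 0
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                             # make the graph undirected
```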
Label Propagation

Main Idea
• Propagate label information to neighboring instances (sketched below)
• Then repeat until convergence
• Similar to PageRank

Theory
• Known to converge under weak conditions
• Equivalent to matrix inversion
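A sketch of the propagation iteration itself: every node repeatedly takes the weighted average of its neighbors' scores while labeled nodes stay clamped to their known values, and iteration stops once the scores stop changing, in line with the convergence statement above. With W from the similarity-graph sketch earlier, the sign of the returned score is the predicted label.

```python
import numpy as np

def label_propagation(W, y, iters=1000, tol=1e-6):
    """W: (n, n) similarity graph; y: +1/-1 for labeled nodes, 0 for unlabeled.
    Returns a score per node whose sign is the predicted label."""
    deg = W.sum(axis=1, keepdims=True)
    P = W / np.maximum(deg, 1e-12)          # row-normalized transition matrix
    f = y.astype(float)
    labeled = (y != 0)
    for _ in range(iters):
        f_new = P @ f                       # propagate scores to the neighbors
        f_new[labeled] = y[labeled]         # clamp the known labels
        if np.abs(f_new - f).max() < tol:   # stop once nothing moves
            f = f_new
            break
        f = f_new
    return f
```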
Nearest Neighbor Join

Block Nested Loop Join
• 1st MR job: partition data into blocks, compute nearest neighbors between blocks
• 2nd MR job: filter out the overall nearest neighbors (both jobs sketched below)

Smarter Approaches
• Use spatial information to avoid unnecessary comparisons
• R trees, space filling curves, locality sensitive hashing
• Example implementations available online

[Slide figure: the data is partitioned into blocks 1-4; one reducer handles each pair of blocks, and the per-pair nearest neighbors are then filtered down to the overall result.]
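A plain-Python sketch of the two jobs described above, with the MapReduce shuffle machinery left out: the first phase pairs every block with every other block and collects per-pair candidate neighbors (one reducer per block pair on Hadoop), and the second phase filters the candidates down to the overall k nearest neighbors. Function and variable names are illustrative.

```python
import numpy as np
from collections import defaultdict

def block_nn_join(X, n_blocks=4, k=3):
    """Return, for each point index, the indices of its k nearest neighbors."""
    ids = np.arange(len(X))
    blocks = np.array_split(ids, n_blocks)             # partition the data into blocks

    candidates = defaultdict(list)                      # 1st job: one "reducer" per block pair
    for bi in blocks:
        for bj in blocks:
            for i in bi:                                # candidate neighbors of i within block bj
                d = ((X[bj] - X[i]) ** 2).sum(axis=1)
                for j, dist in zip(bj, d):
                    if j != i:
                        candidates[i].append((dist, j))

    result = {}                                         # 2nd job: filter to the global k nearest
    for i, cand in candidates.items():
        result[i] = [j for _, j in sorted(cand)[:k]]
    return result
```

This baseline still compares every pair of blocks; the "smarter approaches" bullet is about pruning those comparisons with spatial indexes.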
Label Propagation on Hadoop

Iterative Procedure
• Mapper: Emit neighbor-label pairs
• Reducer: Collect incoming labels per node and combine them into new label (sketched below)
• Repeat until convergence

Improvement
• Cluster data and perform within-cluster propagation locally
• Sometimes it’s more efficient to perform matrix inversion instead
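A sketch of one propagation round written as the mapper/reducer pair described above, using plain Python generators instead of an actual Hadoop job. Here graph maps a node to its list of (neighbor, weight) edges, labels holds the current score per node, and clamped holds the known labels; all of these names are assumptions for illustration.

```python
from collections import defaultdict

def lp_mapper(graph, labels):
    """Emit (neighbor, weighted-label message) pairs, plus a 'self' record per node."""
    for node, score in labels.items():
        for nbr, w in graph[node]:
            yield nbr, ("msg", w * score, w)
        yield node, ("self", score, 1.0)

def lp_reducer(pairs, clamped):
    """Group incoming messages per node and combine them into the node's new score."""
    grouped = defaultdict(list)
    for node, value in pairs:
        grouped[node].append(value)
    new_labels = {}
    for node, values in grouped.items():
        if node in clamped:                      # labeled nodes keep their known label
            new_labels[node] = clamped[node]
            continue
        num = sum(v for kind, v, w in values if kind == "msg")
        den = sum(w for kind, v, w in values if kind == "msg")
        if den > 0:
            new_labels[node] = num / den         # weighted average of the neighbors' labels
        else:
            new_labels[node] = next(v for kind, v, w in values if kind == "self")
    return new_labels

# one propagation round; the job is rerun until the scores stop changing:
# labels = lp_reducer(lp_mapper(graph, labels), clamped)
```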
Conclusion

Semi-Supervised Learning
• Only few training instances have labels
• Unlabeled instances can still provide valuable signal

Different assumptions lead to different approaches
• Cluster assumption: generative models
• Low density assumption: semi-supervised support vector machines
• Manifold assumption: label propagation

Code available: https://github.com/Datameer-Inc/lesel

© 2013 Datameer, Inc. All rights reserved.
@Datameer
