Streaming & Parallel Decision Tree in Flink
anwar.rizal – @anrizal
Outline
Motivation
Architecture
Decision Trees
Implementation
Conclusion
Motivation
We need a classification system for streaming data:
• the data used for learning come as a stream
• so do the data to be classified
[Figure: a calendar grid of observed prices, mostly $90, ranging from $75 to $200, with some dates sold out]
[Figure: the same price grid, now annotated with predictions: prices predicted to increase within zero to two days, this week, or next week]
[Figure: airline routes FRA–NYC, FRA–LON, FRA–MEX]
[Figure: the same routes, each flagged "Need attention": revenue decrease; passenger decrease; revenue decrease and cost increase]
Motivation: MUST
The classifier is kept fresh:
• no need for a separate batch learning/evaluation step
• feedback is taken into account regularly, in real time
The classifier can be introspected:
• transparent model structure (e.g., the tree itself and the information gain at each split point)
• known expected performance (accuracy, precision, recall, AUC)
Seamless support for the machine-learning workflow:
• data preprocessing: up/down-sampling, imputations, …
• feature selection
• model evaluation, cross-validation, …
Motivation: NICE TO HAVE
The classifier is immediately available:
• it can already predict during learning
• when the learning phase terminates, it starts another cycle of learning
The classifier has a meta-learning capability:
• it maintains several models with different parameters
• it is possible to learn about the learning capability of the models
[Figure: the learning cycle – Learning → Learning & Classifying → End of Learning → Classifying → New Cycle of Learning]
[Figure: labeled points flow into the Stream Learner, which emits a Classifier; the Classifying Application applies the classifier to unlabeled points to produce predicted points]
Decision Trees
From origins to recent developments:
• "Understand data by asking a sequence of questions": Classification and Regression Trees (CART), Breiman et al., 1984
• "Pool decision trees to improve generalization": Random Forests, Breiman, 1999
• "Let's play: pose estimation for Xbox's Kinect": Shotton et al., 2011
Streaming Decision Trees
• "A classifier for streaming data with a bound": Hoeffding Tree (VFDT), Domingos & Hulten, 2000
• "Use of approximate histograms for decision trees": Streaming Parallel Decision Tree, Ben-Haim & Tom-Tov, 2010
Train a decision tree – get the intuition!
[Figure: scatter plot of bookings, advance purchase on the x-axis vs. reservation class (FIRST / BUSINESS / ECONOMY) on the y-axis, supervised with Business/Leisure labels; regions are nicknamed "busy procrastinators", "tourists", and "foreseeing businessmen", with outliers such as "Brad Pitt" (last-minute FIRST) and "save money for the company"]
Classifying – get the intuition!
[Figure: the same plot; a new point is classified as Business or Leisure according to the region it falls in, together with a confidence measure]
Decision tree – node optimization
[Figure: the same scatter plot, with a candidate split point separating the classes]
A candidate split point s of a node holding the point set S is scored by its information gain, based on the class entropy:

  H(S) = -Σ_c p_c log2(p_c)
  IG(S, s) = H(S) - (|S_L| / |S|) H(S_L) - (|S_R| / |S|) H(S_R)

where S_L and S_R are the points falling on each side of s; the chosen split is the one that maximizes IG.
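As a minimal sketch of this scoring (illustrative code, not from the talk): entropy-based information gain for a binary split, taking per-class point counts on each side of the candidate as input.

  def entropy(counts: Seq[Double]): Double = {
    // H = -sum_c p_c log2(p_c) over the classes present
    val total = counts.sum
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  def informationGain(countsLeft: Seq[Double], countsRight: Seq[Double]): Double = {
    val (nL, nR) = (countsLeft.sum, countsRight.sum)
    val n = nL + nR
    val parent = countsLeft.zip(countsRight).map { case (l, r) => l + r }
    entropy(parent) - (nL / n) * entropy(countsLeft) - (nR / n) * entropy(countsRight)
  }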
Streaming Decision Trees
The batch version of decision trees requires a view of the full learning data set. In streaming:
• each point can be seen only once
• processing must be fast; we cannot afford much disk access
Streaming Decision Tree – get the intuition!
Instead of using every point, the points are compressed; the real position of each point is then only approximated.
Streaming decision tree – get the intuition!
[Figure: the same scatter plot under Business/Leisure supervision, but each region ("busy procrastinators", "tourists", "foreseeing businessmen") now carries only a count of points nearby instead of the exact points]
Streaming Decision Tree – the question
"How to find split points for a decision tree?"
[Figure: histogram of feature 1 for label 1; counts from 0 to 8 over bins centered at 2, 5, 7.5, 9, 11]
Compressing Data
An approximate histogram is built for each label/feature pair.
[Figure: per-label histograms of feature 1 (label 1 with bins centered at 2, 5, 7.5, 9, 11; label n with bins at 4, 6, 8.5, 10, 13; one histogram per label down to label 0) and the merged Total histogram with bins at 1, 3.5, 7, 11, 14]
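A minimal sketch of such an approximate histogram, in the spirit of Ben-Haim & Tom-Tov (the class and its capacity are illustrative, not the talk's actual data structure): each histogram keeps at most maxBins (centroid, count) bins; adding a point inserts a singleton bin, and the two closest bins are merged whenever the capacity is exceeded. Merging two histograms works the same way on the union of their bins.

  case class Histogram(bins: Vector[(Double, Long)], maxBins: Int) {

    // add one point as a singleton bin, then shrink back to capacity
    def add(x: Double): Histogram =
      copy(bins = shrink(insertSorted(bins, (x, 1L))))

    // merge with another histogram: combine all bins, then shrink
    def merge(other: Histogram): Histogram =
      copy(bins = shrink(other.bins.foldLeft(bins)(insertSorted)))

    private def insertSorted(bs: Vector[(Double, Long)],
                             b: (Double, Long)): Vector[(Double, Long)] = {
      val i = bs.indexWhere(_._1 >= b._1)
      if (i < 0) bs :+ b else (bs.take(i) :+ b) ++ bs.drop(i)
    }

    // while over capacity, replace the two closest neighbouring bins
    // by a single bin at their weighted average
    private def shrink(bs: Vector[(Double, Long)]): Vector[(Double, Long)] =
      if (bs.size <= maxBins) bs
      else {
        val i = (0 until bs.size - 1).minBy(j => bs(j + 1)._1 - bs(j)._1)
        val ((c1, n1), (c2, n2)) = (bs(i), bs(i + 1))
        val merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
        shrink((bs.take(i) :+ merged) ++ bs.drop(i + 2))
      }
  }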
Prepare Split Candidates (1/2)
For each feature, all histograms of that feature are merged. Split candidates are then chosen such that each interval between two consecutive candidates contains the same number of points (in the figure, each colored area is equally large).
[Figure: merged Total histogram with bins at 1, 3.5, 7, 11, 14 and equal-mass split candidates u1, u2, …]
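A hedged sketch of extracting such equal-mass candidates from the merged histogram (a simplification that puts each bin's whole mass at its centroid; the Ben-Haim & Tom-Tov uniform procedure instead interpolates between bins):

  // return k-1 candidates u1..u(k-1) so that roughly total/k points
  // fall between consecutive candidates
  def splitCandidates(bins: Vector[(Double, Long)], k: Int): Vector[Double] = {
    val total = bins.map(_._2).sum.toDouble
    // cumulative mass up to and including each bin
    val cumulative = bins.scanLeft(0.0) { case (acc, (_, n)) => acc + n }.tail
    (1 until k).toVector.map { j =>
      val target = j * total / k
      bins(cumulative.indexWhere(_ >= target))._1
    }
  }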
Prepare Split Candidates (2/2)
Find the split point that maximizes the information gain, using:
• the split candidates
• the histogram per feature/label
[Figure: merged Total histogram with candidates u1, u2, …]
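A sketch of that maximization under the same centroid simplification, reusing the informationGain function above (perLabel, holding this feature's histogram bins for each label, is an illustrative name):

  // mass of one label's histogram falling left of candidate u
  def countLeftOf(bins: Vector[(Double, Long)], u: Double): Double =
    bins.takeWhile(_._1 <= u).map(_._2.toDouble).sum

  // pick the candidate with maximal information gain
  def bestSplit(perLabel: Map[String, Vector[(Double, Long)]],
                candidates: Vector[Double]): (Double, Double) =
    candidates.map { u =>
      val left  = perLabel.values.toSeq.map(countLeftOf(_, u))
      val right = perLabel.values.toSeq.map { bins =>
        bins.map(_._2.toDouble).sum - countLeftOf(bins, u)
      }
      (u, informationGain(left, right))
    }.maxBy(_._2)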
Determine Split
[Figure: the scatter plot again, under Business/Leisure supervision; the split recovered from the histograms is close to, but not exactly at, the position the batch algorithm would find]
The intuition is not exactly precise.

Subsequent Split – get the intuition!
• The histograms can no longer be used for a further split
• And of course, we have already lost the original data
So a different data set is used for each iteration.*
* If there are not enough data, the same data can be re-injected instead; Kafka is very good for this.
Implementation
Stream Learner
We use two Kafka streams:
• one for the labeled data stream
• one for the tree developed so far (this topic is also used by the classifying applications), because we need to annotate each message with the tree so far
Code Outline

import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

val kafkaDataStream: DataStream[Point] = … // Kafka source: labeled points
val kafkaTreeStream: DataStream[Node] = … // Kafka source: the tree so far

// annotate each message with the latest tree
val annotatedDataStream: DataStream[AnnotatedPoint] =
  (kafkaDataStream connect kafkaTreeStream)
    .flatMap(new AnnotateMessageCoFlatMap(…))

// create a singleton histogram per feature/node, pre-aggregated per time window
val histograms = annotatedDataStream
  .map { p => toSingletonHistograms(p) }
  .timeWindowAll(Time.of(1, TimeUnit.MINUTES))
  .reduce { (h1, h2) => mergeHistogram(h1, h2) }

// merge histograms per node
val mergedHistograms = histograms
  .keyBy(_.id)
  .reduce { (h1, h2) => mergeHistogram(h1, h2) }

// split the nodes that have enough points, publish the new tree
val newTree = mergedHistograms
  .filter(hs => haveEnoughPoints(hs) && toSplit(hs))
  .map { hs =>
    val splitPoint = maxInformationGain(calculateSplitCandidates(hs))
    … // grow the tree at this node and emit it to the tree topic
  }
(val histogram → accumulate; var histogram → re-accumulate from 0)
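The AnnotateMessageCoFlatMap itself is not shown in the slides; a minimal sketch of what such a function could look like (the Point, Node, and AnnotatedPoint types are the talk's; the body is an assumption): it remembers the latest tree seen on the tree stream and attaches it to every incoming labeled point. A production version would keep the tree in checkpointed Flink state rather than a plain var.

  import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
  import org.apache.flink.util.Collector

  class AnnotateMessageCoFlatMap(initialTree: Node)
      extends CoFlatMapFunction[Point, Node, AnnotatedPoint] {

    // latest tree seen so far on the tree stream (not fault-tolerant)
    private var latestTree: Node = initialTree

    // stream 1: annotate each labeled point with the latest tree
    override def flatMap1(point: Point, out: Collector[AnnotatedPoint]): Unit =
      out.collect(AnnotatedPoint(point, latestTree))

    // stream 2: remember the newest tree; emit nothing
    override def flatMap2(tree: Node, out: Collector[AnnotatedPoint]): Unit =
      latestTree = tree
  }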
Conclusions
Summary
• Streaming algorithms based on approximate histograms were explained
• Streaming decision tree algorithms give the classifier interesting properties: freshness and continuous learning
• Flink, together with Kafka, allows a clean implementation of the algorithm
Next Steps
Random Forests:
• trees with randomly selected features at each level
• trees with different spans of data (trees with more but older data might behave worse than trees with less but fresher data: forgetting capabilities)
• providing information on which types of trees behave better at a given period of time (meta-learning)
Thanks!
Credit to: Yiqing Yan (Eurecom) & Tianshu Yang (Telecom Bretagne), Amadeus Interns
Anwar Rizal – Streaming & Parallel Decision Tree in Flink