Heitor Murilo Gomes and Albert Bifet
Télécom ParisTech
Streaming Random Forest Learning in Spark and StreamDM
#SAISEco7
Agenda
ML for streaming data
Streaming Random Forest
StreamDM and implementations
Future
About me
• Heitor Murilo Gomes
• PhD in Computer Science
• ML Researcher at Télécom ParisTech
• Contribute to StreamDM and MOA
• heitor.gomes@telecom-paristech.fr
• Website: www.heitorgomes.com
• LinkedIn: www.linkedin.com/in/hmgomes/
Data streams
You can’t store all the data
Data streams
You don’t want to store all the data
ML for Streaming data
Machine Learning for...
Batch data vs. Streaming data
Batch data
Streaming data
Batch vs. Streaming
Batch data: the output is a trained model
Streaming data: the output is a trainable model
Definitions / Assumptions
Independent and identically distributed (iid)
(x_t, y_t) does not influence (x_{t+1}, y_{t+1})
Immediate vs. delayed labeling
immediate: y_t is available before x_{t+1} arrives
delayed: y_t may be available at some point in the future
Stationary and non-stationary
Stationary data
The data distribution is unknown, but it doesn’t change
Non-stationary (evolving) data
The data distribution is unknown and it may change
Concept Drift
Concept drift
The data distribution may change over time
[diagram: the distribution changes between time t and time t+Δ]
Concept drift
The data distribution may change over time
[diagram: a model that was accurate at time t is not accurate anymore after the change at time t+Δ]
Concept drift
The data distribution may change over time
[diagram: after the change, the model is updated and is accurate again at time t+Δ]
Concept drift
There are many ways to categorize concept drifts…
… Abrupt, gradual, incremental …
A survey on concept drift adaptation. Gama, Žliobaitė, Bifet, Pechenizkiy,
Bouchachia (2014). Computing Surveys (CSUR), ACM.
... in general, we care about those that disturb our model
Addressing concept drifts
Focus on supervised learning*
Some strategies:
Reactive: ensemble learner
Active: drift detector + learner
Addressing concept drifts
Why use an ensemble to cope with concept drift?
Flexibility. Relatively easy to forget and learn new
concepts by removing/resetting base models and
training new ones.
A survey on ensemble learning for data stream classification. Gomes,
Barddal, Enembreck, Bifet (2017). Computing Surveys (CSUR), ACM.
Are there other challenges?
Yes, unfortunately (or fortunately)
Concept evolution
Feature evolution
Continual preprocessing
Verification latency (delayed labeling)
…
Our current approach
Ensemble: StreamingRandomForest
Base learner: StreamingDecisionTree
Implementation compatible with MLlib
Another implementation in the StreamDM framework
HoeffdingTree / StreamingDecisionTree
Incremental decision tree learning algorithm
Intuition: a small sample is often enough to choose the best attribute to split on
Also known as Very Fast Decision Tree (VFDT)
Mining High-Speed Data Streams. Domingos, Hulten (2000). KDD, ACM.
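The split decision behind the Hoeffding tree can be sketched as follows. This is a minimal Python illustration of the Hoeffding bound test, not the actual StreamingDecisionTree code; the function names and default δ are assumptions:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    quantity with the given range is within epsilon of the mean observed
    over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_best_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7) -> bool:
    """Split when the best attribute beats the runner-up by more than the
    bound, so that (with high probability) more data would not change the
    winner."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)
```

As more instances arrive, the bound shrinks, so even close races between attributes are eventually resolved without storing the data.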
StreamingRandomForest
Streaming version of the original Random Forest by Breiman
Random forests. Breiman (2001). Machine learning, Springer.
Main differences:
Bootstrap aggregation and the base learner
Overview:
1. Online bagging
2. Random subset of features
Streaming Random Forest
1. Online bagging
“Standard” bootstrap aggregation is infeasible
Issue: Can’t resample from the dataset as it is not stored
Solution: Approximate it using a Poisson distribution with λ=1
Online bagging and boosting. Oza (2005).
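Oza's online bagging can be sketched like this (illustrative Python, not the actual Spark/StreamDM implementation; the class and method names are assumptions, and `trees` stands for any incremental learners with a `learn(x, y)` method):

```python
import math
import random

class OnlineBaggingForest:
    """Online bagging (Oza): each tree sees each instance k times,
    with k ~ Poisson(lam). lam=1 approximates bootstrap resampling;
    leveraging bagging uses a larger lam (e.g. 6)."""

    def __init__(self, trees, lam=1.0, seed=42):
        self.trees = trees
        self.lam = lam
        self.rng = random.Random(seed)

    def poisson(self):
        # Knuth's method: count uniform draws until their product
        # falls below e^-lam.
        threshold = math.exp(-self.lam)
        k, p = 0, 1.0
        while True:
            p *= self.rng.random()
            if p <= threshold:
                return k
            k += 1

    def partial_fit(self, x, y):
        # Each tree trains on the instance a random number of times,
        # so the trees effectively see different resampled streams.
        for tree in self.trees:
            for _ in range(self.poisson()):
                tree.learn(x, y)
```

Because E[Poisson(1)] = 1, each tree still processes about one copy of each instance on average, just unevenly distributed, which mimics bootstrap sampling without storing the stream.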
Streaming Random Forest
1. Online bagging
Practical effect: train trees with different subsets of instances.
[diagram: for each (x_t, y_t) in the stream, every tree in the forest draws a weight k ~ Poisson(λ=1) and trains on the instance k times]
Streaming Random Forest
1. Online bagging
Leveraging [online] bagging
Increase λ (usually to 6), so that instances are used more often for training
Leveraging bagging for evolving data streams. Bifet, Holmes, Pfahringer
(2010). ECML-PKDD, Springer.
Streaming Random Forest
2. Randomly select subsets of features for splits
Use a subset of randomly selected features for each split
[diagram: different tree nodes draw different subsets, e.g. {f1, f3, f6}, {f3, f5, f8}, {f4, f1, f2}]
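Per-split feature selection can be sketched in a few lines (illustrative Python; the ceil(sqrt(M)) subset size follows Breiman's common default and is an assumption about this implementation):

```python
import math
import random

def features_for_split(num_features, rng):
    """Return ceil(sqrt(M)) distinct feature indices chosen uniformly at
    random; a tree node evaluates only these as split candidates."""
    subset_size = math.ceil(math.sqrt(num_features))
    return rng.sample(range(num_features), subset_size)
```

Since each node draws its own subset, trees grown on the same stream end up using different features, which decorrelates their errors and is what makes the forest more accurate than a single tree.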
Implementation
MLlib streaming algorithm:
StreamingLogisticRegressionWithSGD
StreamingDecisionTree
- Implementation of the HoeffdingTree algorithm
StreamingRandomForest
- Implementation of the Random Forest algorithm
Similar API
val streamLR: StreamingLogisticRegressionWithSGD =
  new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(elecConfig.numFeatures))
…
streamLR.trainOn(examples)

val streamDT: StreamingDecisionTree =
  new StreamingDecisionTree(instanceSchema)
    .setModel()
    .setMaxDepth(Option(10))
…
streamDT.trainOn(examples)
StreamingDT implementation
Follows the same coding standard as
StreamingLogisticRegressionWithSGD
Handles both nominal and numeric attributes
- Numeric using Gaussian estimator
Training
- Statistics are computed in the workers
- Splits are decided and performed in the driver
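The Gaussian estimator for numeric attributes can be maintained incrementally, for example with Welford's online algorithm. This is a sketch of the idea, not the actual StreamDM code; the class and method names are assumptions:

```python
import math

class GaussianEstimator:
    """Incrementally tracks mean and variance of a numeric attribute
    (Welford's online algorithm), one observation at a time, so no
    raw values need to be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def observe(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def variance(self):
        # Sample variance; 0.0 until at least two observations arrive.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, value):
        """Gaussian density at `value`, used to score candidate splits."""
        var = self.variance()
        if var == 0.0:
            return 1.0 if value == self.mean else 0.0
        return math.exp(-(value - self.mean) ** 2 / (2.0 * var)) \
            / math.sqrt(2.0 * math.pi * var)
```

Because the estimator is just three numbers per (attribute, class) pair, the per-leaf statistics computed in the workers stay small and are cheap to merge in the driver.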
StreamingRF implementation
Similar to StreamingDecisionTree
Training
- Delegates training to the underlying tree models
- Uses online bagging
Some results
Electricity dataset
# ~50k instances
# 8 features
# 2 class labels
Some results
Covertype dataset
# ~600k instances
# 54 features
# 7 class labels
Streaming Random Forest
MaxDepth = 20
NumTrees = 10 and 100
StreamDM
Started in Huawei Noah’s Ark Lab
Collaboration between Huawei Shenzhen and Télécom ParisTech
Open source
Built on top of Spark Streaming*
Experimental Structured Streaming version
Extensible to include new tasks/algorithms
Website: http://huawei-noah.github.io/streamDM/
GitHub: https://github.com/huawei-noah/streamDM
GitHub (structured streaming): https://github.com/hmgomes/streamDM/tree/structured-streaming
StreamDM
Stream readers/writers
Classes for reading data in and outputting results
Tasks
Setting up the learning cycle (e.g. train/predict/evaluate)
Methods
Supervised and unsupervised learning algorithms. Hoeffding Tree, CluStream, Random
Forest*, Bagging, …
Base/other classes
Instance and Example representation, Feature specification, parameter handling, …
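The learning cycle that a task sets up typically follows the test-then-train (prequential) pattern common in stream mining. Here is a generic sketch, not the StreamDM API; `predict`, `learn`, and the toy model are illustrative names:

```python
def prequential_loop(stream, model):
    """Test-then-train: predict on each instance before its label is
    used for training, so every example serves both as test data and
    as training data. Returns the overall accuracy."""
    correct = total = 0
    for x, y in stream:
        if model.predict(x) == y:   # test first, before seeing the label
            correct += 1
        total += 1
        model.learn(x, y)           # then train on the labeled instance
    return correct / total if total else 0.0
```

This is why streaming evaluation needs no held-out test set: the stream itself provides a continuous, unbiased sequence of unseen examples.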
DEMO
Wrap-up / Future
Machine learning for streaming data
- Non-stationary (and concept drift)
Streaming Decision Trees and Random Forest implementations
Future:
- StreamDM Structured streaming version
- Improve performance (both decision tree and random forest)
- More methods (anomaly detection, multi-output, …)
Further reading / links
Machine Learning for Data Streams
Albert Bifet, Ricard Gavaldà, Geoffrey Holmes, Bernhard Pfahringer
Contact
heitor.gomes@telecom-paristech.fr
Repository (main)
https://github.com/huawei-noah/streamDM
Repository (structured streaming)
https://github.com/hmgomes/streamDM/tree/structured-streaming
