SlideShare a Scribd company logo
1 of 36
Download to read offline
Heitor Murilo Gomes and Albert Bifet
Télécom ParisTech
Streaming Random
Forest Learning in Spark
and StreamDM
#SAISEco7
Agenda
ML for streaming data
Streaming Random Forest
StreamDM and implementations
Future
2#SAISEco7
About me
• Heitor Murilo Gomes
• PhD in Computer Science
• ML Researcher at Télécom ParisTech
• Contribute to StreamDM and MOA
• heitor.gomes@telecom-paristech.fr
• Website: www.heitorgomes.com
• Linkedin: www.linkedin.com/in/hmgomes/
3#SAISEco7
Data streams
You can’t store all the data
4#SAISEco7
Data streams
You don’t want to store all the data
5#SAISEco7
ML for Streaming data
Machine Learning for...
Batch data X Streaming data
6#SAISEco7
Batch data
7#SAISEco7
Streaming data
8#SAISEco7
Batch x Streaming
9#SAISEco7
Batch data
Streaming data
The output is a trainable model
The output is a trained model
Definitions / Assumptions
Independent and identically distributed (iid)
(xt, yt) does not influence (xt+1, yt+1)
Immediate vs. delayed labeling
immediate: yt is available before xt+1 arrives
delayed: yt may be available at some point in the future
10#SAISEco7
Stationary and non-stationary
Stationary data
The data distribution is unknown, but it doesn’t change
Non-stationary (evolving) data
The data distribution is unknown and it may change
Concept Drift
11#SAISEco7
Concept drift
The data distribution may change overtime
12#SAISEco7
change
Time t Time t+Δ
Concept drift
The data distribution may change overtime
13#SAISEco7
change
Time t Time t+Δ
An accurate model Not anymore...
Concept drift
The data distribution may change overtime
14#SAISEco7
change
Time t Time t+Δ
An accurate model An updated model!
Concept drift
15#SAISEco7
There are many ways to categorize concept drifts…
… Abrupt, gradual, incremental …
A survey on concept drift adaptation. Gama, Žliobaitė, Bifet, Pechenizkiy,
Bouchachia (2014). Computing Surveys (CSUR), ACM.
... in general, we care about those that disturb our model
Addressing concept drifts
16#SAISEco7
Focus on supervised learning*
Some strategies:
Reactive: ensemble learner
Active: drift detector + learner
Addressing concept drifts
17#SAISEco7
Why use an ensemble to cope with concept drift?
Flexibility. Relatively easy to forget and learn new
concepts by removing/resetting base models and
training new ones.
A survey on ensemble learning for data stream classification. Gomes,
Barddal, Enembreck, Bifet (2017). Computing Surveys (CSUR), ACM.
Are there other challenges?
Yes, unfortunately (or fortunately)
Concept evolution
Feature evolution
Continual preprocessing
Verification latency (delayed labeling)
…
18#SAISEco7
Our current approach
Ensemble: StreamingRandomForest
Base learner: StreamingDecisionTree
Implementation compatible with mllib
Another implementation in the StreamDM framework
19#SAISEco7
HoeffdingTree / StreamingDecisionTree
20#SAISEco7
Incremental decision tree learning algorithm
Intuition: Splitting after observing a small amount of data
Also known as Very Fast Decision Tree (VFDT)
Mining High-Speed Data Streams. Domingos, Hulten (2000). KDD, ACM.
StreamingRandomForest
21#SAISEco7
Streaming version of the original Random Forest by Breiman
Random forests. Breiman (2001). Machine learning, Springer.
Main differences:
Bootstrap aggregation and the base learner
Overview:
1. Online bagging
2. Random subset of features
Streaming Random Forest
22#SAISEco7
1. Online bagging
“Standard” bootstrap aggregation is inviable
Issue: Can’t resample from the dataset as it is not stored
Solution: Approximate it using a Poisson distribution with λ=1
Online bagging and boosting. Oza (2005).
Streaming Random Forest
23#SAISEco7
1. Online bagging
Practical effect: train trees with
different subsets of instances.
… (xt,yt) …stream
forest
k “weight” train(λ=1)
Streaming Random Forest
24#SAISEco7
1. Online bagging
Leveraging [online] bagging
Augment λ (usually 6), such that instances are used more often
Leveraging bagging for evolving data streams. Bifet, Holmes, Pfahringer
(2010). ECML-PKDD, Springer.
Streaming Random Forest
25#SAISEco7
2. Randomly select subsets of features for splits
Use a subset of
randomly selected
features for each
split
{f1, f3, f6} {f3, f5, f8}
{f4, f1, f2}
Implementation
26#SAISEco7
Mllib streaming algorithm:
StreamingLogisticRegressionWithSGD
StreamingDecisionTree
- Implementation of the HoeffdingTree algorithm
StreamingRandomForest
- Implementation of the Random Forest algorithm
Similar API
val streamLR: StreamingLogisticRegressionWithSGD =
new StreamingLogisticRegressionWithSGD()
.setInitialWeights(Vectors.zeros(elecConfig.numFeatures))
…
streamLR.trainOn(examples)
val streamDT: StreamingDecisionTree =
new StreamingDecisionTree(instanceSchema).
setModel().setMaxDepth(Option(10))
…
streamDT.trainOn(examples)
27
StreamingDT implementation
28#SAISEco7
Similar coding standard as
StreamingLogisticRegressionWithSGD
Handles both nominal and numeric attributes
- Numeric using Gaussian estimator
Training
- Statistics are computed in the workers
- Splits are decided and performed in the driver
StreamingRF implementation
29#SAISEco7
Similar to StreamingDecisionTree
Training
- Delegates training to the underlying tree models
- Uses online bagging
Some results
Electricity dataset
# ~50k instances
# 8 features
# 2 class labels
30
Some results
Covertype dataset
# ~600k instances
# 54 features
# 7 class labels
Streaming Random forest
MaxDepth = 20
NumTrees = 10 and 100
31
StreamDM
32#SAISEco7
Started in Huawei Noah’s Ark Lab
Collaboration between Huawei Shenzhen and Télécom ParisTech
Open source
Built on top of Spark Streaming*
Experimental Structured Streaming version
Extensible to include new tasks/algorithms
Website: http://huawei-noah.github.io/streamDM/
GitHub: https://github.com/huawei-noah/streamDM
GitHub (structured streaming): https://github.com/hmgomes/streamDM/tree/structured-
streaming
StreamDM
33#SAISEco7
Stream readers/writers
Classes for reading data in and outputting results
Tasks
Setting up the learning cycle (e.g. train/predict/evaluate)
Methods
Supervised and unsupervised learning algorithms. Hoeffding Tree, CluStream, Random
Forest*, Bagging, …
Base/other classes
Instance and Example representation, Feature specification, parameter handling, …
34#SAISEco7
DEMO
Wrap-up / Future
35#SAISEco7
Machine learning for streaming data
- Non-stationary (and concept drift)
Streaming Decision Trees and Random Forest implementations
Future:
- StreamDM Structured streaming version
- Improve performance (both decision tree and random forest)
- More methods (anomaly detection, multi-output, …)
Further reading / links
Machine Learning for Data Streams
Albert Bifet, Ricard Gavaldà, Geoffrey Holmes,Bernhard Pfahringer
Contact
heitor.gomes@telecom-paristech.fr
Repository (main)
https://github.com/huawei-noah/streamDM
Repository (structured streaming)
https://github.com/hmgomes/streamDM/tree/structured-streaming
36#SAISEco7

More Related Content

What's hot

Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentationAyanaRukasar
 
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksModel-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksYoonho Lee
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing Luay AL-Assadi
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Data warehouse,data mining & Big Data
Data warehouse,data mining & Big DataData warehouse,data mining & Big Data
Data warehouse,data mining & Big DataRavinder Kamboj
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Houw Liong The
 
Image Restoration for 3D Computer Vision
Image Restoration for 3D Computer VisionImage Restoration for 3D Computer Vision
Image Restoration for 3D Computer VisionPetteriTeikariPhD
 
Techniques du data mining
Techniques du data miningTechniques du data mining
Techniques du data miningDonia Hammami
 
Introduction au Deep Learning
Introduction au Deep Learning Introduction au Deep Learning
Introduction au Deep Learning Niji
 
Machine Learning et Intelligence Artificielle
Machine Learning et Intelligence ArtificielleMachine Learning et Intelligence Artificielle
Machine Learning et Intelligence ArtificielleSoft Computing
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1Amal Abid
 
Data mining - Introduction générale
Data mining - Introduction généraleData mining - Introduction générale
Data mining - Introduction généraleMohamed Heny SELMI
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoderJun Lang
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Introduction au Data Mining et Méthodes Statistiques
Introduction au Data Mining et Méthodes StatistiquesIntroduction au Data Mining et Méthodes Statistiques
Introduction au Data Mining et Méthodes StatistiquesGiorgio Pauletto
 

What's hot (20)

Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
 
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksModel-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
 
Data warehouse,data mining & Big Data
Data warehouse,data mining & Big DataData warehouse,data mining & Big Data
Data warehouse,data mining & Big Data
 
Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques Chapter 08 Data Mining Techniques
Chapter 08 Data Mining Techniques
 
Image Restoration for 3D Computer Vision
Image Restoration for 3D Computer VisionImage Restoration for 3D Computer Vision
Image Restoration for 3D Computer Vision
 
Techniques du data mining
Techniques du data miningTechniques du data mining
Techniques du data mining
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Introduction au Deep Learning
Introduction au Deep Learning Introduction au Deep Learning
Introduction au Deep Learning
 
Machine Learning et Intelligence Artificielle
Machine Learning et Intelligence ArtificielleMachine Learning et Intelligence Artificielle
Machine Learning et Intelligence Artificielle
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1
 
Data mining - Introduction générale
Data mining - Introduction généraleData mining - Introduction générale
Data mining - Introduction générale
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
Cours Big Data Part I
Cours Big Data Part ICours Big Data Part I
Cours Big Data Part I
 
Simple Introduction to AutoEncoder
Simple Introduction to AutoEncoderSimple Introduction to AutoEncoder
Simple Introduction to AutoEncoder
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Decision tree and random forest
Decision tree and random forestDecision tree and random forest
Decision tree and random forest
 
Supervised learning
  Supervised learning  Supervised learning
Supervised learning
 
Introduction au Data Mining et Méthodes Statistiques
Introduction au Data Mining et Méthodes StatistiquesIntroduction au Data Mining et Méthodes Statistiques
Introduction au Data Mining et Méthodes Statistiques
 

Similar to Streaming Random Forest Learning in Spark and StreamDM with Heitor Murilogomes and Albert Bifet

TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform Seldon
 
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...Andrey Sadovykh
 
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 Codemotion
 
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018Codemotion
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceJim Dowling
 
Performance modeling and simulation for accumulo applications
Performance modeling and simulation for accumulo applicationsPerformance modeling and simulation for accumulo applications
Performance modeling and simulation for accumulo applicationsAccumulo Summit
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Karel Dumon
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectMatthew Gerring
 
Model driven engineering for big data management systems
Model driven engineering for big data management systemsModel driven engineering for big data management systems
Model driven engineering for big data management systemsMarcos Almeida
 
Webinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the LearningWebinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the LearningMesosphere Inc.
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...LEGATO project
 
Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep LearningJorge Cardoso
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptxArthur240715
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data ExpoBigDataExpo
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyCloudify Community
 
Constraint Programming - An Alternative Approach to Heuristics in Scheduling
Constraint Programming - An Alternative Approach to Heuristics in SchedulingConstraint Programming - An Alternative Approach to Heuristics in Scheduling
Constraint Programming - An Alternative Approach to Heuristics in SchedulingEray Cakici
 

Similar to Streaming Random Forest Learning in Spark and StreamDM with Heitor Murilogomes and Albert Bifet (20)

TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...
JUNIPER: Towards Modeling Approach Enabling Efficient Platform for Heterogene...
 
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
 
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
 
End-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in FinanceEnd-to-End Platform Support for Distributed Deep Learning in Finance
End-to-End Platform Support for Distributed Deep Learning in Finance
 
Performance modeling and simulation for accumulo applications
Performance modeling and simulation for accumulo applicationsPerformance modeling and simulation for accumulo applications
Performance modeling and simulation for accumulo applications
 
Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)Nexxworks bootcamp ML6 (27/09/2017)
Nexxworks bootcamp ML6 (27/09/2017)
 
Eclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science ProjectEclipse Con Europe 2014 How to use DAWN Science Project
Eclipse Con Europe 2014 How to use DAWN Science Project
 
Model driven engineering for big data management systems
Model driven engineering for big data management systemsModel driven engineering for big data management systems
Model driven engineering for big data management systems
 
MATLAB and HDF-EOS
MATLAB and HDF-EOSMATLAB and HDF-EOS
MATLAB and HDF-EOS
 
Ssbse10.ppt
Ssbse10.pptSsbse10.ppt
Ssbse10.ppt
 
Webinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the LearningWebinar: Deep Learning Pipelines Beyond the Learning
Webinar: Deep Learning Pipelines Beyond the Learning
 
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
SAMOS 2018: LEGaTO: first steps towards energy-efficient toolset for heteroge...
 
Mastering AIOps with Deep Learning
Mastering AIOps with Deep LearningMastering AIOps with Deep Learning
Mastering AIOps with Deep Learning
 
230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx230208 MLOps Getting from Good to Great.pptx
230208 MLOps Getting from Good to Great.pptx
 
Google Big Data Expo
Google Big Data ExpoGoogle Big Data Expo
Google Big Data Expo
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
 
Constraint Programming - An Alternative Approach to Heuristics in Scheduling
Constraint Programming - An Alternative Approach to Heuristics in SchedulingConstraint Programming - An Alternative Approach to Heuristics in Scheduling
Constraint Programming - An Alternative Approach to Heuristics in Scheduling
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsJoseMangaJr1
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramMoniSankarHazra
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 

Recently uploaded (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Bommasandra Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Streaming Random Forest Learning in Spark and StreamDM with Heitor Murilogomes and Albert Bifet

  • 1. Heitor Murilo Gomes and Albert Bifet Télécom ParisTech Streaming Random Forest Learning in Spark and StreamDM #SAISEco7
  • 2. Agenda ML for streaming data Streaming Random Forest StreamDM and implementations Future 2#SAISEco7
  • 3. About me • Heitor Murilo Gomes • PhD in Computer Science • ML Researcher at Télécom ParisTech • Contribute to StreamDM and MOA • heitor.gomes@telecom-paristech.fr • Website: www.heitorgomes.com • Linkedin: www.linkedin.com/in/hmgomes/ 3#SAISEco7
  • 4. Data streams You can’t store all the data 4#SAISEco7
  • 5. Data streams You don’t want to store all the data 5#SAISEco7
  • 6. ML for Streaming data Machine Learning for... Batch data X Streaming data 6#SAISEco7
  • 9. Batch x Streaming 9#SAISEco7 Batch data Streaming data The output is a trainable model The output is a trained model
  • 10. Definitions / Assumptions Independent and identically distributed (iid) (xt, yt) does not influence (xt+1, yt+1) Immediate vs. delayed labeling immediate: yt is available before xt+1 arrives delayed: yt may be available at some point in the future 10#SAISEco7
  • 11. Stationary and non-stationary Stationary data The data distribution is unknown, but it doesn’t change Non-stationary (evolving) data The data distribution is unknown and it may change Concept Drift 11#SAISEco7
  • 12. Concept drift The data distribution may change overtime 12#SAISEco7 change Time t Time t+Δ
  • 13. Concept drift The data distribution may change overtime 13#SAISEco7 change Time t Time t+Δ An accurate model Not anymore...
  • 14. Concept drift The data distribution may change overtime 14#SAISEco7 change Time t Time t+Δ An accurate model An updated model!
  • 15. Concept drift 15#SAISEco7 There are many ways to categorize concept drifts… … Abrupt, gradual, incremental … A survey on concept drift adaptation. Gama, Žliobaitė, Bifet, Pechenizkiy, Bouchachia (2014). Computing Surveys (CSUR), ACM. ... in general, we care about those that disturb our model
  • 16. Addressing concept drifts 16#SAISEco7 Focus on supervised learning* Some strategies: Reactive: ensemble learner Active: drift detector + learner
  • 17. Addressing concept drifts 17#SAISEco7 Why use an ensemble to cope with concept drift? Flexibility. Relatively easy to forget and learn new concepts by removing/resetting base models and training new ones. A survey on ensemble learning for data stream classification. Gomes, Barddal, Enembreck, Bifet (2017). Computing Surveys (CSUR), ACM.
  • 18. Are there other challenges? Yes, unfortunately (or fortunately) Concept evolution Feature evolution Continual preprocessing Verification latency (delayed labeling) … 18#SAISEco7
  • 19. Our current approach Ensemble: StreamingRandomForest Base learner: StreamingDecisionTree Implementation compatible with mllib Another implementation in the StreamDM framework 19#SAISEco7
  • 20. HoeffdingTree / StreamingDecisionTree 20#SAISEco7 Incremental decision tree learning algorithm Intuition: Splitting after observing a small amount of data Also known as Very Fast Decision Tree (VFDT) Mining High-Speed Data Streams. Domingos, Hulten (2000). KDD, ACM.
  • 21. StreamingRandomForest 21#SAISEco7 Streaming version of the original Random Forest by Breiman Random forests. Breiman (2001). Machine learning, Springer. Main differences: Bootstrap aggregation and the base learner Overview: 1. Online bagging 2. Random subset of features
  • 22. Streaming Random Forest 22#SAISEco7 1. Online bagging “Standard” bootstrap aggregation is inviable Issue: Can’t resample from the dataset as it is not stored Solution: Approximate it using a Poisson distribution with λ=1 Online bagging and boosting. Oza (2005).
  • 23. Streaming Random Forest 23#SAISEco7 1. Online bagging Practical effect: train trees with different subsets of instances. … (xt,yt) …stream forest k “weight” train(λ=1)
  • 24. Streaming Random Forest 24#SAISEco7 1. Online bagging Leveraging [online] bagging Augment λ (usually 6), such that instances are used more often Leveraging bagging for evolving data streams. Bifet, Holmes, Pfahringer (2010). ECML-PKDD, Springer.
  • 25. Streaming Random Forest 25#SAISEco7 2. Randomly select subsets of features for splits Use a subset of randomly selected features for each split {f1, f3, f6} {f3, f5, f8} {f4, f1, f2}
  • 26. Implementation 26#SAISEco7 Mllib streaming algorithm: StreamingLogisticRegressionWithSGD StreamingDecisionTree - Implementation of the HoeffdingTree algorithm StreamingRandomForest - Implementation of the Random Forest algorithm
  • 27. Similar API val streamLR: StreamingLogisticRegressionWithSGD = new StreamingLogisticRegressionWithSGD() .setInitialWeights(Vectors.zeros(elecConfig.numFeatures)) … streamLR.trainOn(examples) val streamDT: StreamingDecisionTree = new StreamingDecisionTree(instanceSchema). setModel().setMaxDepth(Option(10)) … streamDT.trainOn(examples) 27
  • 28. StreamingDT implementation 28#SAISEco7 Similar coding standard as StreamingLogisticRegressionWithSGD Handles both nominal and numeric attributes - Numeric using Gaussian estimator Training - Statistics are computed in the workers - Splits are decided and performed in the driver
  • 29. StreamingRF implementation 29#SAISEco7 Similar to StreamingDecisionTree Training - Delegates training to the underlying tree models - Uses online bagging
  • 30. Some results Electricity dataset # ~50k instances # 8 features # 2 class labels 30
  • 31. Some results Covertype dataset # ~600k instances # 54 features # 7 class labels Streaming Random forest MaxDepth = 20 NumTrees = 10 and 100 31
  • 32. StreamDM 32#SAISEco7 Started in Huawei Noah’s Ark Lab Collaboration between Huawei Shenzhen and Télécom ParisTech Open source Built on top of Spark Streaming* Experimental Structured Streaming version Extensible to include new tasks/algorithms Website: http://huawei-noah.github.io/streamDM/ GitHub: https://github.com/huawei-noah/streamDM GitHub (structured streaming): https://github.com/hmgomes/streamDM/tree/structured- streaming
  • 33. StreamDM 33#SAISEco7 Stream readers/writers Classes for reading data in and outputting results Tasks Setting up the learning cycle (e.g. train/predict/evaluate) Methods Supervised and unsupervised learning algorithms. Hoeffding Tree, CluStream, Random Forest*, Bagging, … Base/other classes Instance and Example representation, Feature specification, parameter handling, …
  • 35. Wrap-up / Future 35#SAISEco7 Machine learning for streaming data - Non-stationary (and concept drift) Streaming Decision Trees and Random Forest implementations Future: - StreamDM Structured streaming version - Improve performance (both decision tree and random forest) - More methods (anomaly detection, multi-output, …)
  • 36. Further reading / links Machine Learning for Data Streams Albert Bifet, Ricard Gavaldà, Geoffrey Holmes,Bernhard Pfahringer Contact heitor.gomes@telecom-paristech.fr Repository (main) https://github.com/huawei-noah/streamDM Repository (structured streaming) https://github.com/hmgomes/streamDM/tree/structured-streaming 36#SAISEco7