Detecting concept drifts on data streams using Robust Random Cut Forest
RRCF is an unsupervised algorithm for detecting anomalous data points within a stream of data.
Agenda
• Why do we need streaming outlier detection?
• Difficulties in detecting anomalies in data streams
• Chronology of RRCF
• Isolation Forest
• RRCF
• Implementation Details
Best outlier detection algorithm?
???
iForest, SVM, LOF, EGM
Chronology recap
• 2008 – Isolation Forest published
• 2013 – Survey on outlier detection
• 2016 – RRCF published in JMLR
• 2016 – RRCF available on Amazon Kinesis
• 2018 – RRCF available on Hydroserving
Isolation Forest
Isolation Tree
• Contains all points
• Every leaf contains one distinct point
• Each node splits the bounding box of its points into two parts
Isolation Forest
1. Create a bounding box around the set of points you want to cut
2. Pick a random dimension
3. Pick a random coordinate along this dimension, inside the bounding box
4. Divide the set of points according to this cut
5. Repeat steps 1-4 on the two resulting sets until every point is isolated
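The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the speakers' implementation: it assumes the points are distinct tuples of floats, and the names `build_isolation_tree` and `depth_of` as well as the dict-based node layout are hypothetical.

```python
import random

def build_isolation_tree(points, rng):
    """Cut a list of distinct points (tuples of floats) until each is isolated."""
    if len(points) == 1:
        return {"point": points[0]}  # leaf holding one isolated point
    dims = range(len(points[0]))
    # Step 1: bounding box around the current set of points.
    lower = [min(p[d] for p in points) for d in dims]
    upper = [max(p[d] for p in points) for d in dims]
    # Only dimensions with non-zero extent can separate the points.
    usable = [d for d in dims if upper[d] > lower[d]]
    while True:
        dim = rng.choice(usable)                       # Step 2: random dimension
        cut = rng.uniform(lower[dim], upper[dim])      # Step 3: random coordinate
        left = [p for p in points if p[dim] <= cut]    # Step 4: divide the set
        right = [p for p in points if p[dim] > cut]
        if left and right:  # retry if the cut landed exactly on the upper edge
            break
    # Step 5: repeat on both resulting sets.
    return {"dim": dim, "cut": cut,
            "left": build_isolation_tree(left, rng),
            "right": build_isolation_tree(right, rng)}

def depth_of(tree, point, depth=0):
    """Number of cuts needed to isolate `point` in this tree."""
    if "point" in tree:
        return depth
    child = tree["left"] if point[tree["dim"]] <= tree["cut"] else tree["right"]
    return depth_of(child, point, depth + 1)

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0)]
tree = build_isolation_tree(points, random.Random(0))
for p in points:
    print(p, depth_of(tree, p))
```

Averaged over many trees, a point far from the rest, such as `(8.0, 8.0)` here, tends to be isolated after fewer cuts than points inside a dense cluster, which is the signal Isolation Forest uses.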
[Figure: the cuts drawn step by step on a 2-D point set. The first cut, Y = 5.45, becomes the root test Y ≤ 5.45 of the isolation tree; further node tests are X ≤ 3.22 and Y ≤ 7.61. The labels X=8.0 and Y=8.0 also appear.]
Streaming Algorithms Issues
1. Data is transient
2. Timestamps are needed
3. Stream size is infinite
4. Arrival rate is important
5. Concept drift
6. Uncertainty
Survey on Outlier Detection in Data Stream ‘16. By Thakkar, Vala, Prajapati
RRCF
RRCF - Tree
Displacement
The displacement of a point is the average number of points that move in the tree when this point is deleted.
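A small sketch may make this definition concrete. It rests on one assumption about the tree mechanics: deleting a leaf also removes its parent node, so the sibling subtree moves up one level and the points that "move" are exactly the points in that sibling subtree. Trees are written as nested 2-tuples with string labels as leaves; `leaf_count` and `displacement` are hypothetical helper names.

```python
def leaf_count(tree):
    """Number of points under a node; any non-tuple value is a leaf label."""
    if not isinstance(tree, tuple):
        return 1
    return leaf_count(tree[0]) + leaf_count(tree[1])

def displacement(tree, target):
    """Points that move up one level when `target` is deleted from this tree.

    Returns None if `target` is not a leaf of a two-child node in the tree.
    """
    if not isinstance(tree, tuple):
        return None
    left, right = tree
    if left == target:
        return leaf_count(right)  # the sibling subtree is displaced
    if right == target:
        return leaf_count(left)
    found = displacement(left, target)
    return found if found is not None else displacement(right, target)

# Two hand-built trees over the same points a, b, c (a tiny "forest").
forest = [(("a", "b"), "c"), ("a", ("b", "c"))]
scores = [displacement(t, "c") for t in forest]
print(scores, sum(scores) / len(scores))  # [2, 1] 1.5
```

In the first tree, deleting `c` moves both `a` and `b`; in the second it moves only `b`, so the average over the two trees is 1.5.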
Collusive Displacement (CoDisplacement)
The collusive displacement of a point is the average of the maximal displacement that can be achieved by deleting this point together with its neighbors, divided by the number of points deleted:
$\max_{x \in C} \frac{1}{|C|}\,\mathrm{Disp}(C)$
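For a single tree, the maximum can be sketched by walking from the point up to the root. The sketch assumes that the candidate sets C are the subtrees containing the point, and that deleting a whole subtree displaces exactly the points of its sibling subtree. Trees are nested 2-tuples with string labels as leaves; `leaf_count` and `codisp` are hypothetical names, and the score for a forest would be the average of this value over its trees.

```python
def leaf_count(tree):
    """Number of points under a node; any non-tuple value is a leaf label."""
    if not isinstance(tree, tuple):
        return 1
    return leaf_count(tree[0]) + leaf_count(tree[1])

def codisp(tree, target):
    """Max over subtrees C containing `target` of |sibling(C)| / |C|.

    Returns None if `target` is not in the tree.
    """
    if not isinstance(tree, tuple):
        return 0.0 if tree == target else None
    for child, sibling in ((tree[0], tree[1]), (tree[1], tree[0])):
        below = codisp(child, target)  # best ratio found deeper in the tree
        if below is not None:
            # Deleting every point in `child` displaces every point in `sibling`.
            ratio = leaf_count(sibling) / leaf_count(child)
            return max(below, ratio)
    return None

tree = (("x", "y"), ("a", ("b", ("c", "d"))))
print(codisp(tree, "x"))  # 2.0: deleting {x, y} displaces 4 points, 4 / 2
print(codisp(tree, "a"))  # 3.0: deleting {a} alone displaces 3 points
```

The point `x` has a plain displacement of only 1 because its near-duplicate `y` masks it; considering the pair together raises the score to 2.0, which is the motivation for the collusive variant.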
Reservoir sampling
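The slide gives only the name, so the following is a minimal sketch of the classic uniform variant (often called Algorithm R), not necessarily the variant used in the talk; in particular it does not implement any time decay. The function name `reservoir_sample` is hypothetical.

```python
import random

def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item  # replace a stored item with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sample)
```

Memory stays bounded at `k` items regardless of how long the stream runs, which is what makes the technique attractive when the stream size is unbounded.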
Shingles
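The slide does not define the term; in stream processing a shingle usually means a sliding window of consecutive values treated as one multi-dimensional point. A minimal sketch under that assumption, with the hypothetical name `shingles`:

```python
from collections import deque

def shingles(stream, size):
    """Yield each run of `size` consecutive stream values as one tuple."""
    window = deque(maxlen=size)  # the oldest value drops out automatically
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield tuple(window)

print(list(shingles([1, 2, 3, 4, 5], 3)))  # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
```

Each yielded tuple can then be handled as a point with `size` dimensions, so a change in the shape of the signal shows up as an unusual point.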
Issues after paper analysis
1. What is a time-decay reservoir, and how is it implemented?
2. How should the forest be initialized?
3. How can scoring and insertion be implemented efficiently?
Tree & Node structure
• Internal node: dimension, coordinate, bounding box, # of children
• Leaf: point, bounding box, # of children
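One possible way to express these fields in code is sketched below. The class and helper names (`Leaf`, `Branch`, `make_leaf`, `make_branch`, `merge_boxes`) are hypothetical, and "# of children" is read here as the number of points stored under a node, which is an interpretation of the slide rather than something it states.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)

@dataclass
class Leaf:
    point: Point
    bbox: Box
    n: int = 1  # assumed meaning of "# of children": points stored here

@dataclass
class Branch:
    dimension: int     # axis of the cut
    coordinate: float  # position of the cut along that axis
    bbox: Box
    n: int             # points stored under this node
    left: Union["Branch", Leaf]
    right: Union["Branch", Leaf]

def merge_boxes(a: Box, b: Box) -> Box:
    """Smallest box containing both input boxes."""
    lower = tuple(min(x, y) for x, y in zip(a[0], b[0]))
    upper = tuple(max(x, y) for x, y in zip(a[1], b[1]))
    return lower, upper

def make_leaf(point: Point) -> Leaf:
    return Leaf(point=point, bbox=(point, point))

def make_branch(dimension, coordinate, left, right) -> Branch:
    """Build an internal node whose box and count are derived from its children."""
    return Branch(dimension, coordinate,
                  merge_boxes(left.bbox, right.bbox),
                  left.n + right.n, left, right)

root = make_branch(0, 2.0, make_leaf((1.0, 5.0)), make_leaf((3.0, 4.0)))
print(root.bbox, root.n)  # ((1.0, 4.0), (3.0, 5.0)) 2
```

Caching the bounding box and the point count on every node means that subtree sizes, which displacement-style scores depend on, and bounding boxes, which cuts are drawn from, are available without re-scanning the subtree.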
Incremental Insertion

[Data Science Meetup] Bulat Lutfullin & Yuri Gavrilin: Detecting concept drifts on data streams using Robust Random Cut Forest

  • 1. Detecting concept drifts on data streams using Robust Random Cut Forest
  • 2. RRCF is an unsupervised algorithm for detecting anomalous data points within a stream of data.
  • 3. Agenda • Why do we need streaming outlier detection? • Difficulties on detecting anomalies in data streams • Chronology of RRCF • Isolation Forest • RRCF • Implementation Details
  • 6.
  • 7.
  • 8.
  • 9. Chronology recap • 2008 – Isolation Forest published • 2013 – Survey on outlier detection • 2016 – RRCF published in JMLR • 2016 – RRCF available on Amazon Kinesis • 2018 – RCCF available on Hydroserving
  • 11. Isolation Tree • Contains all points • Every leaf contains one distinct point • Each node separates bounding box of it’s points in two halves
  • 12. Isolation Forest 1. Create bounding box around given set of points in which you would like to make a cut 2. Pick random dimension 3. Calculate random coordinate in this dimension 4. Divide given set of points according to calculated cut 5. Repeat 1-4 on this two sets until every point is isolated
  • 15. Isolation Forest (same five steps as slide 12). Figure labels: Y = 5.45
  • 16. Isolation Forest (same five steps as slide 12). Figure labels: Y ≤ 5.45
  • 17. Isolation Forest (same five steps as slide 12). Figure labels: Y ≤ 5.45, X = 8.0, Y = 8.0, X ≤ 3.22, Y ≤ 7.61
  • 18. Streaming Algorithms Issues 1. Data is transient 2. Timestamps are needed 3. Stream size is infinite 4. Arrival rate is important 5. Concept drift 6. Uncertainty (Source: Thakkar, Vala, Prajapati, "Survey on Outlier Detection in Data Stream", 2016)
  • 19. RRCF
  • 21.
  • 22. Displacement: the displacement of a point is the average number of points that move in the tree when this point is deleted. (Figure labels: a, b, c)
  • 24. CoDisplacement: the collusive displacement of a point is the average of the maximal displacement that can be achieved by deleting this point together with its neighbors, divided by the number of points deleted: $\max_{x \in C} \frac{1}{|C|} \mathrm{Disp}(C)$. (Figure labels: a, b, c)
  • 27. Issues after paper analysis 1. What is a time decay reservoir and how is it implemented? 2. How to initialize the forest? 3. How to implement scoring and insertion efficiently?
  • 28. Tree & Node structure. Node: dimension, coordinate, bounding box, # of children. Leaf: point, bounding box, # of children
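
The five Isolation Forest steps listed in the slides above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the speakers' implementation: the names build_isolation_tree, leaf_depths and mean_depths, the dictionary-based tree, and the sample points are all made up for the example. It only shows that a point far from the rest tends to be isolated after fewer cuts.

```python
import random


def build_isolation_tree(points, rng, depth=0):
    """Isolate every point with random axis-parallel cuts (steps 1-5)."""
    # Step 1: bounding box around the current set of points.
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    splittable = [d for d in dims if hi[d] > lo[d]]
    if len(points) == 1 or not splittable:
        # A single point (or only duplicates) is isolated: make a leaf.
        return {"point": points[0], "depth": depth}
    # Step 2: pick a random dimension, uniformly as in Isolation Forest.
    dim = rng.choice(splittable)
    while True:
        # Step 3: random coordinate inside the box along that dimension.
        cut = rng.uniform(lo[dim], hi[dim])
        # Step 4: divide the points according to the cut.
        left = [p for p in points if p[dim] <= cut]
        right = [p for p in points if p[dim] > cut]
        if left and right:  # resample in the rare case cut == hi[dim]
            break
    # Step 5: repeat on both halves until every point is isolated.
    return {
        "dim": dim,
        "cut": cut,
        "left": build_isolation_tree(left, rng, depth + 1),
        "right": build_isolation_tree(right, rng, depth + 1),
    }


def leaf_depths(node):
    """Map each isolated point to the depth of its leaf."""
    if "point" in node:
        return {node["point"]: node["depth"]}
    depths = leaf_depths(node["left"])
    depths.update(leaf_depths(node["right"]))
    return depths


def mean_depths(points, n_trees=100, seed=0):
    """Average leaf depth of each point over a small forest (assumed helper)."""
    rng = random.Random(seed)
    totals = {p: 0 for p in points}
    for _ in range(n_trees):
        tree = build_isolation_tree(points, rng)
        for p, d in leaf_depths(tree).items():
            totals[p] += d
    return {p: t / n_trees for p, t in totals.items()}


# Hypothetical data: a tight grid of points plus one far-away point.
cluster = [(x / 4, y / 4) for x in range(5) for y in range(4)]
outlier = (8.0, 8.0)
depths = mean_depths(cluster + [outlier])
print("outlier mean depth:", depths[outlier])
print("shallowest cluster point:", min(depths[p] for p in cluster))
```

Points are passed as distinct tuples so they can be used as dictionary keys. The outlier should come out with a clearly smaller mean depth than any point in the grid, which is the intuition behind scoring anomalies by how early they get isolated.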

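The collusive displacement from slide 24 can be computed from the per-node point counts shown on slide 28. The sketch below assumes that the candidate sets C are restricted to the subtrees on the path from a leaf to the root, and that deleting such a subtree displaces exactly the points of its sibling subtree, so each ratio Disp(C)/|C| becomes sibling count divided by subtree count. The Node class, the join helper and the example tree are hypothetical; the slides do not show how the speakers implemented this.

```python
class Node:
    """Tree node that only tracks what the score needs (assumed layout)."""

    def __init__(self, count, left=None, right=None):
        self.count = count    # number of points stored below this node
        self.left = left
        self.right = right
        self.parent = None    # set by join(), giving a doubly linked tree


def join(left, right):
    """Create an internal node over two subtrees and link them both ways."""
    parent = Node(left.count + right.count, left, right)
    left.parent = parent
    right.parent = parent
    return parent


def codisp(leaf):
    """Collusive displacement of a leaf within a single tree.

    Assumption: C ranges over the subtrees containing the leaf, and
    Disp(C) equals the number of points in the sibling of that subtree.
    """
    best = 0.0
    node = leaf
    while node.parent is not None:
        parent = node.parent
        sibling = parent.right if parent.left is node else parent.left
        # Deleting the node.count points below `node` moves the sibling up.
        best = max(best, sibling.count / node.count)
        node = parent
    return best


# Hypothetical tree: a cluster of six points and one isolated point c.
a, b, c = Node(1), Node(1), Node(1)
rest = Node(4)                      # stand-in for a subtree with four points
cluster = join(join(a, b), rest)
root = join(cluster, c)

print(codisp(c))  # sibling holds 6 points, C = {c}: 6 / 1
print(codisp(a))  # best ratio is 4 / 2, reached when a and b are removed together
```

In a forest the per-tree values would then be averaged over all trees. Storing the point count in every node is what makes this a single walk from leaf to root.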
Editor's Notes

  1. Why did this problem arise? Technology services and products generate enormous amounts of data, and success is impossible without analyzing it. Smart power grids, self-driving cars, wireless sensor networks, logs, and the Internet of Things are just a few examples of such services. Detecting anomalies in data is one aspect of this analysis and undoubtedly a very important topic. It is usually anomaly detection algorithms that identify breakdowns, failures, and violations of security protocols. Undetected anomalies in any of these products can have dire consequences for the business. These services usually generate so much data that storing it is not feasible.
  2. Take, for example, this Pratt & Whitney turbine. It is a next-generation engine equipped with 5,000 sensors, and it can generate up to 10 gigabytes of data per second. Data like this can only be analyzed with streaming algorithms. That is why today we will discuss streaming algorithms for anomaly detection and, in particular, the algorithm I implemented during my internship at Provectus, called RRCF.
  3. Why are we discussing RRCF today and not some other anomaly detection algorithm? To explain that, I will tell its backstory.
  4. In 2013 a paper was published whose title you can see on the screen. Its authors criticize the lack of a rigorous comparison of anomaly detection algorithms and present the results of their own experiments, in which they compare several popular models: Isolation Forest, several variants of SVM, LOF, and EGMM.
  5. In 2016 the AWS AI Algorithms team saw this paper and decided to modify Isolation Forest for use on data streams. They succeeded, and the algorithm has been used in AWS ever since. Incidentally, most algorithms for anomaly detection in streams appeared around the same time (2016).
  6. Since then the algorithm has been in use at AWS. Previously it was available as a microservice attached to Kinesis; now it is also available as a model in SageMaker. The fact that Amazon has used this algorithm for several years and keeps adding it to new products speaks to its quality and usefulness.
  7. In fact, we are talking about RRCF because I implemented this algorithm for the Hydroserving platform, which lets you run and automatically scale your models.
  8. Now that we know why we are talking about RRCF, let us recap everything so far. Before we move on to RRCF itself and its specifics, let us first look at Isolation Forest to understand what the Amazon team's work was built on.
  9. Isolation Forest is an algorithm whose goal is to build a forest of isolation trees.
  10. To understand how this algorithm finds anomalies, we need to look at how it builds its trees.
  11. This is the description of the algorithm for building a single tree. Let us go through each step.
  12. How the algorithm works at a high level
  13. How the algorithm works at a high level
  14. How the algorithm works at a high level
  15. How the algorithm works at a high level
  16. As you may have noticed, an anomalous point is much more likely to be isolated earlier than points that lie close to one another. That is why the anomaly score is defined as follows:
  17. To adapt any anomaly detection algorithm to streaming use, the survey on anomaly detection in streams suggests considering six issues: (1) In streaming algorithms, points are transient: old points lose their importance and must be discarded. (2) The algorithm needs a notion of a point's arrival time, explicit or implicit. (3) The data stream is infinite. (4) The arrival rate of the data, whether constant or variable, must be taken into account, along with the fact that the next point can be processed only after the previous one has been. (5) Concept drift: the distribution changes over time. (6) Uncertainty in the data forces us to ask whether we really received an outlier or whether it is an error in data acquisition.
  18. Finally, let us turn to RRCF. Whereas the goal of Isolation Forest was to build a forest, the goal of RRCF is to keep the forest up to date. This forest is a sketch of our data. A sketch reflects the distribution of the data while using only a small part of it. When a new point arrives, the algorithm updates the sketch. For the trees to stay up to date, they have to be rebuilt. To make that possible, the authors change the tree construction process: instead of building the whole tree at once, they add operations for inserting and deleting a point.
  19. The forest is maintained like this. When a tree wants to insert a new point, it first removes one of its old points, for example this one. We will discuss later which point gets removed. Once the point is removed, its parent no longer isolates anything, so that cut becomes unnecessary and is removed as well. This is the point deletion operation. Then the tree adds the new point. To do so, it tries to make an isolating cut. If the generated cut really isolates the point, a new node is created in the tree. If not, the point descends to the next level of the tree according to the existing cut and tries to make a cut there. This continues until the new point is isolated.
  20. Unlike Isolation Forest, RRCF does not choose the cut dimension uniformly at random but with probability proportional to the length of the bounding box along that dimension. This matters when one dimension has a much larger spread than another.
  21. Before updating the sketch, the algorithm scores how anomalous the point is.
  22. A few more words about the details of the algorithm. For the trees to differ from one another, which is a necessary condition for any ensemble of models, each tree decides for itself whether to insert a new point. This is done with reservoir sampling. Reservoir sampling is a family of algorithms that produce a fixed-size random subsample from a data stream. Each tree has its own reservoir, and it is the reservoir that decides whether the tree inserts the new point and, if so, which point it removes.
  23. To make the algorithm work on time series, the authors suggest using shingles. A sliding window is taken, and the vectors of the points inside it are concatenated into one. This is what lets the algorithm detect anomalies in time-ordered streams.
  24. The authors of the paper suggest using reservoir sampling or time decay, but classical reservoir sampling is not suited to infinite streams. All that is clear from the paper is that the time decay reservoir is a variant of an exponential reservoir. The authors leave out an important part of the algorithm: the cold start problem. The algorithm cannot score points until it has filled all of its reservoirs and, accordingly, built all of its trees. The paper is purely mathematical and gives no hints about implementation. However, its authors gave a talk at Stanford in which they said that the core of the algorithm is incremental insertion, which makes it possible to compute scores and perform insertions efficiently. How it is implemented is also unclear.
  25. The implementation of this algorithm resembles a variant of a k-d tree. The trees are binary and doubly linked. Both a leaf and an internal node store a bounding box that encloses all points descended from that node. This avoids recomputing the box every time we decide where to make a cut at the node. Each node also stores its number of descendants, which is needed to measure codisplacement. As in a k-d tree, a node stores the dimension and coordinate of its cut.
  26. To measure codisplacement efficiently, we need an operation that determines where a point would end up in the tree