Presentation for the Soft Skills Seminar course @ Telecom ParisTech. Topic is the paper by Domingos and Hulten, "Mining High-Speed Data Streams". Presented by me on 30/11/2017.
Mining High-Speed Data Streams: Hoeffding Trees and VFDT
1. Mining High-Speed Data Streams
Davide Gallitelli
Politecnico di Torino – TELECOM ParisTech
@DGallitelli95
Pedro Domingos
University of Washington
Geoff Hulten
University of Washington
3. 1. Introduction
KDD systems operating continuously and indefinitely
Limited by:
• Time
• Memory
• Sample Size
SPRINT: tested on up to a few million examples. Less than a day’s worth!
6. 2. Hoeffding Trees
Classical DT learners are limited by main memory size
Likely, not all examples are needed to find the best attribute at a node
How to decide how many are necessary? Hoeffding Bound!
«Suppose we have made n independent observations of a variable r with domain R, and computed their mean r̄. The Hoeffding bound states that, with probability 1 − δ, the true mean of the variable is at least r̄ − ε»
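The slide omits the closed form of ε; for n independent observations of a variable with range R it is:

```latex
\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}
```

For an information-gain heuristic over c classes, the range is R = log₂ c, so with two classes R = 1.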
7. 2. Hoeffding Trees 7
How many examples are enough?
• Let G(Xi) be the heuristic measure of choice (Information Gain, Gini Index)
• Xa: the attribute with the highest observed heuristic value after n examples
• Xb: the attribute with the second-highest observed heuristic value after n examples
• We can compute ΔḠ = Ḡ(Xa) − Ḡ(Xb) > ε
• Thanks to the Hoeffding bound, we can infer that ΔG ≥ ΔḠ − ε > 0 with probability 1 − δ, where ΔG is the true difference in heuristic measure
• This means we can split at the node using Xa, and the succeeding examples will be passed down to the new leaves (incremental approach)
8. 2. Hoeffding Trees
HT Algorithm
• Compute the heuristic measure for the attributes and determine the best two, Xa and Xb
• At each leaf, check for the condition ΔḠ = Ḡ(Xa) − Ḡ(Xb) > ε
• If true, create child nodes based on the test at the node; else, collect more examples from the stream.
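A minimal Python sketch of this split test (function names and the example numbers are illustrative, not the paper’s implementation):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of a variable with range R lies within epsilon of the
    mean observed over n examples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n):
    """Split when the observed gain gap exceeds the Hoeffding bound."""
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

# With an observed gap of 0.2 in information gain (range R = 1 for two classes):
should_split(0.6, 0.4, 1.0, 1e-7, 100)    # not yet: epsilon ≈ 0.28 > 0.2
should_split(0.6, 0.4, 1.0, 1e-7, 1000)   # split: epsilon ≈ 0.09 < 0.2
```

Since ε shrinks as 1/√n, any attribute that is genuinely better than the runner-up will eventually trigger a split as examples stream in.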
9. 2. Hoeffding Trees
In a nutshell
• Learning in a Hoeffding tree takes constant time per example (instance), which makes it suitable for data stream mining.
• Requires each example to be read at most once (the tree is built incrementally).
• With high probability, a Hoeffding tree is asymptotically identical to the decision tree built by a batch learner:
E[Δi(HTδ, DT∗)] ≤ δ/p
• Independent of the probability distribution generating the observations
• Built incrementally by sequential reading
• Makes class predictions in parallel
• What happens with ties?
• Memory used with tree expansion
• Number of candidate attributes
goo.gl/gBnm9h
goo.gl/QvZMC7
11. 3. VFDT System
VFDT (Very Fast Decision Tree)
• Hoeffding tree algorithm implementation is VFDT
• VFDT includes refinements to the HT algorithm:
• Tie-breaking algorithm
• Recompute G only after a user-defined number of examples
• Deactivation of the least promising leaves
• Drop of early unpromising attributes (if ΔḠ > ε)
• Bootstrap with a traditional learner on a small subset of data
• Rescan of previously-seen examples
13. 4. Application
A VFDT application : Web Data
• Mining the stream of Web page requests emanating
from the whole University of Washington main
campus.
• Useful to improve Web Caching, by predicting which
hosts and pages will be requested in the near future.
14. 5. Conclusion
Future Work
• Test other applications (such as Intrusion detection)
• Use of non-discretized numeric attributes
• Use of post-pruning
• Use of adaptive δ
• Compare with other incremental algorithms (ID5R or SLIQ/SPRINT)
• Adapt to time-changing domains (concept drift)
• Parallelization
Let’s think about two situations. On the left, the smart city of the future, with thousands of sensors and control systems. On the right, present-day banking systems, which generate millions of transactions per day and are expected to grow even more as e-shopping continues to spread. Thinking about the data produced by these systems, what are its main characteristics?
< change >
Size and Quantity. No more standard big data analytics, but high-speed data stream mining.
Knowledge discovery systems are constrained by three main limited resources: time, memory and sample size. In traditional applications of machine learning and statistics, sample size tends to be the dominant limitation. In contrast, in many (if not most) present-day data mining applications, the bottleneck is time and memory, not examples. The latter are typically in over-supply, in the sense that it is impossible with current KDD systems to make use of all of them within the available computational resources.
Currently, the most efficient algorithms available (e.g., SPRINT or BIRCH) concentrate on making it possible to mine databases that do not fit in main memory by only requiring sequential scans of the disk. But even these algorithms have only been tested on up to a few million examples.
Ideally, we would like to have KDD systems that operate continuously and indefinitely, incorporating examples as they arrive, and never losing potentially valuable information. Incremental algorithms are out there, but they are either highly sensitive to example ordering, potentially never recovering from an unfavorable set of early examples, or produce results similar to batch classification with undesired overhead in computation time.
Introducing VFDT, a decision-tree learning system that overcomes the shortcomings of incremental algorithms. It is I/O-bound, meaning it mines examples in less time than it takes to read them from disk; it is an anytime algorithm, meaning the model is ready to use at any time; and it stores no examples, learning from each one exactly once.
Hoeffding Trees are born from the limitations of classical decision tree learners, which assume all training data can be simultaneously stored in main memory. HT is based on the assumption that, in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node. Given a stream of examples, the first ones will be used to choose the root test; once the root attribute is chosen, the succeeding examples will be passed down to the corresponding leaves and used to choose the appropriate attributes there, and so on recursively. We solve the difficult problem of deciding exactly how many examples are necessary at each node by using a statistical result known as the Hoeffding bound.
So, how do we decide how many examples are enough?
If HTδ is the tree produced by the Hoeffding tree algorithm with desired probability δ given infinite examples (Table 1), DT∗ is the asymptotic batch tree, and p is the leaf probability, then E[Δi(HTδ, DT∗)] ≤ δ/p. The smaller δ/p, the more similar the Hoeffding tree is to a subtree of the asymptotic batch tree.
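Plugging in illustrative numbers (the values of δ and p here are assumptions, not from the paper):

```python
# Illustrative values, not from the paper:
delta = 1e-7  # split-confidence parameter of the Hoeffding tree
p = 0.01      # probability that an example reaches a given leaf

# E[Δi(HTδ, DT*)] <= delta / p: expected fraction of examples on which
# the Hoeffding tree and the asymptotic batch tree disagree.
disagreement_bound = delta / p
```

So even a modest leaf probability keeps the expected disagreement tiny: here at most one example in 100,000.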
The Hoeffding tree algorithm was implemented into Very Fast Decision Tree learner (VFDT), which includes some enhancements for practical use.
In case of ties, potentially many examples would be required to decide between the top attributes with confidence, which is wasteful since they are practically equivalent. VFDT breaks the tie and splits on the current best attribute once ε drops below a user-defined threshold.
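A sketch of that tie-breaking rule (the threshold name `tau` and all the numbers are illustrative):

```python
def split_despite_tie(gain_gap, epsilon, tau=0.05):
    """Split either when the observed gain gap beats the Hoeffding bound,
    or when epsilon itself has shrunk below the tie threshold tau -- at
    that point the top attributes are close enough that the choice
    barely matters."""
    return gain_gap > epsilon or epsilon < tau

split_despite_tie(0.2, 0.1)      # clear winner -> split
split_despite_tie(0.001, 0.03)   # near-tie, but epsilon < tau -> split anyway
split_despite_tie(0.001, 0.1)    # near-tie, too few examples seen -> wait
```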
Recomputing G is actually pretty expensive. In VFDT, a user-defined parameter sets the minimum number of new examples that must be seen at a leaf before G is recomputed.
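A sketch of that grace period (class and parameter names are made up for illustration):

```python
class GracePeriodLeaf:
    """Only signal a recomputation of G every n_min examples, since a
    split decision is unlikely to hinge on any single new example."""
    def __init__(self, n_min=200):
        self.n_min = n_min
        self.since_last_check = 0

    def add_example(self):
        """Returns True when it is time to recompute G at this leaf."""
        self.since_last_check += 1
        if self.since_last_check >= self.n_min:
            self.since_last_check = 0
            return True
        return False
```

With n_min = 200, a stream of 1,000 examples triggers only 5 recomputations of G instead of 1,000.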
Memory was an issue for HT: the more the tree grew, the more memory it needed. VFDT deactivates the least promising leaves, keeping track only of the probability of x falling into leaf l times the observed error rate there.