Low-latency Multi-threaded
Ensemble Learning for
Dynamic Big Data Streams
Diego Marrón (dmarron@ac.upc.edu)
Eduard Ayguadé (eduard.ayguade@bsc.es)
José R. Herrero (josepr@ac.upc.edu)
Jesse Read (jesse.read@polytechnique.edu)
Albert Bifet (albert.bifet@telecom-paristech.fr)
2017 IEEE International Conference on Big Data
December 11-14, 2017, Boston, MA, USA
Introduction Hoeffding Tree Ensembles Evaluations Conclusions
Real-time mining of dynamic data streams
• Unprecedented amount of dynamic big data streams (Volume)
• Data generated at a high rate (Velocity)
• Newly created data rapidly supersedes old data (Volatility)
• This increase in volume, velocity and volatility requires data to be processed on-the-fly, in real time
Real-time dynamic data stream classification
• Real-time classification poses several challenges:
• Deal with potentially infinite streams
• Single pass over each instance
• React to changes in the stream (concept drift)
• Bounded response time:
• Low latency: milliseconds (ms) per instance
• High latency: a few seconds (s) per instance
• Limited CPU time to process each instance
• Preferred methods:
• Hoeffding Tree (HT)
• Ensemble of HT
Hoeffding Tree
• Decision tree suitable for large data streams
• Easy to deploy
• Usually able to keep up with the arrival rate
Hoeffding Tree: Basics
• Builds the tree structure incrementally (on-the-fly)
• The tree structure uses attributes to route an instance to a leaf node
• Leaf node:
• Contains the classifier (Naive Bayes)
• Statistics to decide the next attribute to split on (attribute counters)
• Split decision: needs per-attribute information gain
• Naive Bayes classifier: combines per-attribute probabilities
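The split decision above can be sketched as follows: a leaf is expanded when the gap between the two best attributes' information gains exceeds the Hoeffding bound. This is a minimal illustration, not LMHT's code; the δ default and the log2(#classes) range are typical choices, and the tie-breaking threshold used in practice is omitted.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a random variable
    with the given range lies within this epsilon of the mean
    observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n_seen, n_classes, delta=1e-7):
    """Split when the gap between the two best information gains
    exceeds the bound (info gain ranges over log2(#classes))."""
    eps = hoeffding_bound(math.log2(n_classes), delta, n_seen)
    return (best_gain - second_gain) > eps
```

For example, a 0.5 gain gap after 2000 instances comfortably clears the bound, while a 0.01 gap after 100 instances does not, so the leaf keeps accumulating statistics.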
Ensembles: Random Forest of Hoeffding Trees
• Random Forest uses RandomHT (a variation of HT):
• Split decision uses a random subset of attributes
• For each RandomHT in the ensemble:
• Input: sampling with repetition
• Tree is reset if change (drift) is detected
• Responses are combined to form the final prediction
• Ensembles require more work for each instance
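The sampling-with-repetition and vote-combination steps above can be sketched as online bagging, which approximates bootstrap sampling with per-learner Poisson(1) weights; the drift-detect-and-reset step is only noted in a comment. Class and method names here are illustrative, not LMHT's API.

```python
import math
import random
from collections import Counter

def poisson_knuth(lam, rng):
    """Knuth's Poisson sampler: how many times this instance appears
    in one learner's bootstrap sample."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

class OnlineBaggingEnsemble:
    """`make_learner` builds any base learner exposing
    partial_fit(x, y, weight) and predict(x) (hypothetical names)."""
    def __init__(self, make_learner, n_learners=10, lam=1.0, seed=0):
        self.learners = [make_learner() for _ in range(n_learners)]
        self.lam = lam
        self.rng = random.Random(seed)

    def partial_fit(self, x, y):
        # Sampling with repetition: weight the instance k ~ Poisson(lam)
        # independently per learner (a drift detector would reset a
        # learner's tree at this point).
        for learner in self.learners:
            k = poisson_knuth(self.lam, self.rng)
            if k > 0:
                learner.partial_fit(x, y, weight=k)

    def predict(self, x):
        # Combine responses by majority vote.
        votes = Counter(learner.predict(x) for learner in self.learners)
        return votes.most_common(1)[0][0]
```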
Exploiting CPU Parallelism
• Parallelism: more work in the same amount of time
• Improves throughput per unit of time
• Or improves accuracy by using more CPU-intensive methods
• Modern CPU parallel features: multithreading, SIMD instructions
Contributions
• Very low-latency response time:
• A few microseconds (µs) per instance
• Compared to a state-of-the-art implementation, MOA:
• Same accuracy
• Single HT, on average:
• Response time: 2 microseconds (µs) per instance
• 6.73x faster
• Multithreaded ensemble, on average:
• Response time: 10 microseconds (µs) per instance
• 85x faster
• Up to 70% parallel efficiency on a 24-core CPU
• Highly scalable/adaptive design tested on:
• Intel platforms: i7, Xeon
• ARM SoCs: from server-class to low-end (Raspberry Pi3)
Hoeffding Tree
• Our implementation uses a binary tree
• Split into smaller sub-trees that fit in the CPU L1 cache
• Cache line: 64 bytes (8 × 64-bit pointers)
• Max sub-tree height: 3
• SIMD instructions to accelerate calculations:
• Information gain
• Naive Bayes classifier
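One way to read the layout above: a complete sub-tree of height 3 packs 7 split tests in heap order and routes each instance, in 3 comparisons, to one of 8 exits (matching the 8 pointers that fit in one cache line). This Python sketch only models the indexing under that assumption; the actual performance benefit comes from the cache-friendly memory layout, which Python cannot express.

```python
class CacheSubtree:
    """Illustrative array-packed sub-tree: 7 (attribute, threshold)
    tests stored heap-style (node i has children 2i+1 and 2i+2),
    plus 8 exit pointers to the next sub-tree or leaf."""
    def __init__(self, tests):
        assert len(tests) == 7          # nodes 0..6 of a height-3 tree
        self.tests = tests              # (attr_index, threshold) per node
        self.children = [None] * 8      # 8 exits: one cache line of pointers

    def route(self, x):
        i = 0
        while i < 7:                    # exactly 3 comparisons
            attr, thr = self.tests[i]
            i = 2 * i + 1 if x[attr] <= thr else 2 * i + 2
        return i - 7                    # exit slot 0..7
```

Routing then walks from sub-tree to sub-tree via `children[exit]` until a leaf is reached, so each hop touches one small contiguous block of memory.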
LMHT: Architecture
• N threads (up to the number of CPU hardware threads)
• Thread 1: data load/parser
• N-1 workers for L learners
• Common instance buffer (lockless ring buffer)
LMHT: Achieving low latency
• Lockless data structures: at least one thread always makes progress
• Key to scaling with low latency
• Lockless ring buffer:
• Single writer principle:
• Only the owner can write to it
• Everyone can read it
LMHT: Achieving low latency
• Lockless ring buffer:
• Buffer Head:
• Signals a new instance in the buffer
• Owned by the parser
• Buffer Tail:
• Each worker owns its LastProcessed sequence number
• Buffer Tail: Lowest LastProcessed among all workers
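The head/tail bookkeeping above can be sketched as follows. Only the accounting is modeled: the parser owns the head, each worker owns its LastProcessed counter, and the reusable tail is their minimum. A real lockless implementation depends on atomic operations and memory ordering that Python does not expose, and the method names here are illustrative.

```python
class RingBufferAccounting:
    """Single-writer ring buffer bookkeeping: the parser publishes
    sequence numbers via `head`; each worker advances only its own
    last_processed entry; tail = min(last_processed)."""
    def __init__(self, capacity, n_workers):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = -1                          # owned by the parser
        self.last_processed = [-1] * n_workers  # one counter per worker

    def tail(self):
        return min(self.last_processed)         # slowest worker bounds reuse

    def push(self, instance):
        if self.head - self.tail() >= self.capacity:
            return False                        # full: would overwrite unread slot
        self.head += 1
        self.slots[self.head % self.capacity] = instance
        return True

    def poll(self, worker_id):
        seq = self.last_processed[worker_id] + 1
        if seq > self.head:
            return None                         # nothing new published yet
        self.last_processed[worker_id] = seq
        return self.slots[seq % self.capacity]
```

Note how the buffer stays "full" until every worker has consumed the oldest slot, which is exactly why the slowest learner bounds end-to-end throughput.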
Evaluation
• Reference implementations
• MOA (Java)
• StreamDM (C++)
• Datasets:
• Two real-world: Covertype and Electricity
• 11 Synthetic generators:
• RandomRBF Drift: r1-r6
• Hyperplane: h1, h2
• LED Drift: l1
• SEA: s1, s2
• Hardware Platforms:
• Intel platform:
• Desktop class: i7-5930K
• Server class: Xeon Platinum 8160
• ARM-based SoC:
• High-end: NVIDIA Jetson TX1, Applied Micro X-Gene2
• Low-end: Raspberry Pi3
Single Hoeffding Tree Performance vs MOA
• Single thread
• Same accuracy as MOA
• Average throughput: 525.65 instances per millisecond (ms)
• 6.73x faster than MOA
• 7x faster than StreamDM
• Including instance loading/parsing time
• Except on the Raspberry Pi3, where all instances are already parsed in memory
• Data parsing is currently a bottleneck
• Using data from memory: 3x faster on the Intel i7
Single Hoeffding Tree: Speedup Over MOA
[Figure: single-HT speedup over MOA per dataset (cov, elec, h1, h2, l1, r1-r6, s1, s2; speedup axis 1.0-7.0) for StreamDM (Intel i7) and LMHT on the Intel i7, Jetson TX1, X-Gene2, and Raspberry Pi3]
LMHT Throughput vs MOA
• 85x faster than MOA on average (Intel i7-5930K):
• MOA average: 1.30 instances per millisecond (ms)
• LMHT average: 105 instances per millisecond (ms)
[Figure: per-dataset throughput (instances/ms, axis 0-120) for MOA vs. LMHT; off-scale LMHT bars annotated 130 and 123]
LMHT Throughput vs MOA
• Average throughput on different hardware platforms
[Figure: average throughput (instances/ms) per platform: MOA (i7), LMHT (i7), LMHT (Xeon), LMHT (Jetson-TX1), LMHT (X-Gene2); bar values 2, 13, 29, 105, 109]
LMHT Speedup vs MOA
[Figure: ensemble speedup over MOA per platform: LMHT Intel i7, Xeon, Jetson TX1, X-Gene2; bar values 11, 24, 85, 88]
LMHT Scalability Test: i7-5930K
• i7-5930K: 6 cores / 12 threads
• Threads start competing for resources when using 6+ workers (7+ threads)
[Figure: relative speedup vs. number of worker threads (1-11), one curve per dataset]
LMHT Scalability Test: Xeon Platinum 8160
• Up to 70% parallel efficiency on the Xeon 8160 (24 cores)
[Figure: relative speedup vs. number of worker threads (1-23), one curve per dataset]
Conclusions
• We presented a high-performance, scalable design for real-time data stream classification
• Very low latency: a few microseconds (µs) per instance
• Same accuracy as MOA
• Highly adaptive to a variety of hardware platforms
• From server to edge computing (ARM and Intel)
• Up to 70% parallel efficiency on a 24-core CPU
• On Intel platforms, on average:
• Single HT: 6.73x faster than MOA
• Multithreaded ensemble: 85x faster than MOA
• On ARM-based SoCs, on average:
• Single HT: 2x faster than MOA (i7)
• Similar performance on a Raspberry Pi3 (ARM) as MOA (i7)
• Multithreaded ensemble: 24x faster than MOA (i7)
Future Work
• The parser thread can easily limit throughput:
• Find the appropriate ratio of learners per parser
• Implement counters for all kinds of attributes
• Scale to multi-socket nodes (NUMA architectures)
• Distribute the ensemble across several nodes
Thank you