Low-latency Multi-threaded
Ensemble Learning for
Dynamic Big Data Streams
Diego Marrón (dmarron@ac.upc.edu)
Eduard Ayguadé (eduard.ayguade@bsc.es)
José R. Herrero (josepr@ac.upc.edu)
Jesse Read (jesse.read@polytechnique.edu)
Albert Bifet (albert.bifet@telecom-paristech.fr)
2017 IEEE International Conference on Big Data
December 11-14, 2017, Boston, MA, USA
Introduction Hoeffding Tree Ensembles Evaluations Conclusions
Real-time mining of dynamic data streams
• Unprecedented amount of dynamic big data streams (Volume)
• Data generated at a high rate (Velocity)
• Newly created data rapidly supersedes old data (Volatility)
• This increase in volume, velocity and volatility requires data to be processed on-the-fly, in real time
Real-time dynamic data stream classification
• Real-time classification poses several challenges:
• Deal with potentially infinite streams
• Single pass over each instance
• React to changes in the stream (concept drift)
• Bounded response time:
• Low latency: milliseconds (ms) per instance
• High latency: a few seconds (s) per instance
• Limited CPU time to process each instance
• Preferred methods:
• Hoeffding Tree (HT)
• Ensemble of HT
Hoeffding Tree
• Decision tree suitable for large data streams
• Easy to deploy
• Usually able to keep up with the arrival rate
Hoeffding Tree: Basics
• Builds the tree structure incrementally (on-the-fly)
• The tree structure uses attributes to route an instance to a leaf node
• Leaf node:
• Contains the classifier (Naive Bayes)
• Statistics to decide the next attribute to split on (attribute counters)
• Split decision: needs per-attribute information gain
• Naive Bayes classifier: combines per-attribute probabilities
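The split decision above can be sketched as follows: a leaf is expanded when the gap between the two best attributes' information gains exceeds the Hoeffding bound. This is a minimal illustration, not LMHT's code; the δ default and the log2(#classes) range are typical choices, and the tie-breaking threshold used in practice is omitted.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a random variable
    with the given range lies within this epsilon of the mean
    observed over n samples."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, n_seen, n_classes, delta=1e-7):
    """Split when the gap between the two best information gains
    exceeds the bound (info gain ranges over log2(#classes))."""
    eps = hoeffding_bound(math.log2(n_classes), delta, n_seen)
    return (best_gain - second_gain) > eps
```

For example, a 0.5 gain gap after 2000 instances comfortably clears the bound, while a 0.01 gap after 100 instances does not, so the leaf keeps accumulating statistics.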
Ensembles: Random Forest of Hoeffding Trees
• Random Forest uses RandomHT (a variation of HT):
• Split decision uses a random subset of attributes
• For each RandomHT in the ensemble:
• Input: sampling with repetition
• Tree is reset if change (drift) is detected
• Responses are combined to form the final prediction
• Ensembles require more work for each instance
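The sampling-with-repetition and vote-combination steps above can be sketched as online bagging, which approximates bootstrap sampling with per-learner Poisson(1) weights; the drift-detect-and-reset step is only noted in a comment. Class and method names here are illustrative, not LMHT's API.

```python
import math
import random
from collections import Counter

def poisson_knuth(lam, rng):
    """Knuth's Poisson sampler: how many times this instance appears
    in one learner's bootstrap sample."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

class OnlineBaggingEnsemble:
    """`make_learner` builds any base learner exposing
    partial_fit(x, y, weight) and predict(x) (hypothetical names)."""
    def __init__(self, make_learner, n_learners=10, lam=1.0, seed=0):
        self.learners = [make_learner() for _ in range(n_learners)]
        self.lam = lam
        self.rng = random.Random(seed)

    def partial_fit(self, x, y):
        # Sampling with repetition: weight the instance k ~ Poisson(lam)
        # independently per learner (a drift detector would reset a
        # learner's tree at this point).
        for learner in self.learners:
            k = poisson_knuth(self.lam, self.rng)
            if k > 0:
                learner.partial_fit(x, y, weight=k)

    def predict(self, x):
        # Combine responses by majority vote.
        votes = Counter(learner.predict(x) for learner in self.learners)
        return votes.most_common(1)[0][0]
```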
Exploiting CPU Parallelism
• Parallelism: more work in the same amount of time
• Improves throughput per unit of time
• Or improves accuracy by using more CPU-intensive methods
• Modern CPU parallel features: multithreading, SIMD instructions
Contributions
• Very low-latency response time:
• A few microseconds (µs) per instance
• Compared to a state-of-the-art implementation, MOA:
• Same accuracy
• Single HT, on average:
• Response time: 2 microseconds (µs) per instance
• 6.73x faster
• Multithreaded ensemble, on average:
• Response time: 10 microseconds (µs) per instance
• 85x faster
• Up to 70% parallel efficiency on a 24-core CPU
• Highly scalable/adaptive design tested on:
• Intel platforms: i7, Xeon
• ARM SoCs: from server-class to low-end (Raspberry Pi3)
Hoeffding Tree
• Our implementation uses a binary tree
• Split into smaller sub-trees that fit in the CPU L1 cache
• Cache line: 64 bytes (8 × 64-bit pointers)
• Max sub-tree height: 3
• SIMD instructions to accelerate calculations:
• Information gain
• Naive Bayes classifier
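One way to read the layout above: a complete sub-tree of height 3 packs 7 split tests in heap order and routes each instance, in 3 comparisons, to one of 8 exits (matching the 8 pointers that fit in one cache line). This Python sketch only models the indexing under that assumption; the actual performance benefit comes from the cache-friendly memory layout, which Python cannot express.

```python
class CacheSubtree:
    """Illustrative array-packed sub-tree: 7 (attribute, threshold)
    tests stored heap-style (node i has children 2i+1 and 2i+2),
    plus 8 exit pointers to the next sub-tree or leaf."""
    def __init__(self, tests):
        assert len(tests) == 7          # nodes 0..6 of a height-3 tree
        self.tests = tests              # (attr_index, threshold) per node
        self.children = [None] * 8      # 8 exits: one cache line of pointers

    def route(self, x):
        i = 0
        while i < 7:                    # exactly 3 comparisons
            attr, thr = self.tests[i]
            i = 2 * i + 1 if x[attr] <= thr else 2 * i + 2
        return i - 7                    # exit slot 0..7
```

Routing then walks from sub-tree to sub-tree via `children[exit]` until a leaf is reached, so each hop touches one small contiguous block of memory.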
LMHT: Architecture
• N threads (up to the number of CPU hardware threads)
• Thread 1: data load/parser
• N-1 workers for L learners
• Common instance buffer (lockless ring buffer)
LMHT: Achieving low latency
• Lockless data structures: at least one thread always makes progress
• Key to scaling with low latency
• Lockless ring buffer:
• Single writer principle:
• Only the owner can write to it
• Everyone can read it
LMHT: Achieving low latency
• Lockless ring buffer:
• Buffer Head:
• Signals a new instance in the buffer
• Owned by the parser
• Buffer Tail:
• Each worker owns its LastProcessed sequence number
• Buffer Tail: Lowest LastProcessed among all workers
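The head/tail bookkeeping above can be sketched as follows. Only the accounting is modeled: the parser owns the head, each worker owns its LastProcessed counter, and the reusable tail is their minimum. A real lockless implementation depends on atomic operations and memory ordering that Python does not expose, and the method names here are illustrative.

```python
class RingBufferAccounting:
    """Single-writer ring buffer bookkeeping: the parser publishes
    sequence numbers via `head`; each worker advances only its own
    last_processed entry; tail = min(last_processed)."""
    def __init__(self, capacity, n_workers):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = -1                          # owned by the parser
        self.last_processed = [-1] * n_workers  # one counter per worker

    def tail(self):
        return min(self.last_processed)         # slowest worker bounds reuse

    def push(self, instance):
        if self.head - self.tail() >= self.capacity:
            return False                        # full: would overwrite unread slot
        self.head += 1
        self.slots[self.head % self.capacity] = instance
        return True

    def poll(self, worker_id):
        seq = self.last_processed[worker_id] + 1
        if seq > self.head:
            return None                         # nothing new published yet
        self.last_processed[worker_id] = seq
        return self.slots[seq % self.capacity]
```

Note how the buffer stays "full" until every worker has consumed the oldest slot, which is exactly why the slowest learner bounds end-to-end throughput.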
Evaluation
• Reference implementations
• MOA (Java)
• StreamDM (C++)
• Datasets:
• Two real-world: Covertype and Electricity
• 11 Synthetic generators:
• RandomRBF Drift: r1-r6
• Hyperplane: h1, h2
• LED Drift: l1
• SEA: s1, s2
• Hardware Platforms:
• Intel platform:
• Desktop class: i7-5930K
• Server class: Xeon Platinum 8160
• ARM-based SoC:
• High-end: NVIDIA Jetson TX1, Applied Micro X-Gene2
• Low-end: Raspberry Pi3
Single Hoeffding Tree Performance vs MOA
• Single thread
• Same accuracy as MOA
• Average throughput: 525.65 instances per millisecond (ms)
• 6.73x faster than MOA
• 7x faster than StreamDM
• Including instance loading/parsing time
• Except on the Raspberry Pi3, where all instances are already parsed in memory
• Data parsing is currently a bottleneck
• Using data from memory: 3x faster on the Intel i7
Single Hoeffding Tree: Speedup Over MOA
[Figure: single-HT speedup over MOA per dataset (cov, elec, h1, h2, l1, r1-r6, s1, s2; speedup axis 1.0-7.0) for StreamDM (Intel i7) and LMHT on the Intel i7, Jetson TX1, X-Gene2, and Raspberry Pi3]
LMHT Throughput vs MOA
• 85x faster than MOA on average (Intel i7-5930K):
• MOA average: 1.30 instances per millisecond (ms)
• LMHT average: 105 instances per millisecond (ms)
[Figure: per-dataset throughput (instances/ms, axis 0-120) for MOA vs. LMHT; off-scale LMHT bars annotated 130 and 123]
LMHT Throughput vs MOA
• Average throughput on different hardware platforms
[Figure: average throughput (instances/ms) per platform: MOA (i7), LMHT (i7), LMHT (Xeon), LMHT (Jetson-TX1), LMHT (X-Gene2); bar values 2, 13, 29, 105, 109]
LMHT Speedup vs MOA
[Figure: ensemble speedup over MOA per platform: LMHT Intel i7, Xeon, Jetson TX1, X-Gene2; bar values 11, 24, 85, 88]
LMHT Scalability Test: i7-5930K
• i7-5930K: 6 cores / 12 threads
• Threads start competing for resources when using 6+ workers (7+ threads)
[Figure: relative speedup vs. number of worker threads (1-11), one curve per dataset]
LMHT Scalability Test: Xeon Platinum 8160
• Up to 70% parallel efficiency on the Xeon 8160 (24 cores)
[Figure: relative speedup vs. number of worker threads (1-23), one curve per dataset]
Conclusions
• We presented a high-performance, scalable design for real-time data stream classification
• Very low latency: a few microseconds (µs) per instance
• Same accuracy as MOA
• Highly adaptive to a variety of hardware platforms
• From server to edge computing (ARM and Intel)
• Up to 70% parallel efficiency on a 24-core CPU
• On Intel platforms, on average:
• Single HT: 6.73x faster than MOA
• Multithreaded ensemble: 85x faster than MOA
• On ARM-based SoCs, on average:
• Single HT: 2x faster than MOA (i7)
• Similar performance on a Raspberry Pi3 (ARM) as MOA (i7)
• Multithreaded ensemble: 24x faster than MOA (i7)
Future Work
• The parser thread can easily limit throughput:
• Find the appropriate ratio of learners per parser
• Implement counters for all kinds of attributes
• Scale to multi-socket nodes (NUMA architectures)
• Distribute the ensemble across several nodes
Thank you