
Low-latency Multi-threaded Ensemble Learning for Dynamic Big Data Streams


Real-time mining of evolving data streams involves new challenges when targeting today's application domains such as the Internet of Things: increasing volume, velocity and volatility require data to be processed on-the-fly, with fast reaction and adaptation to changes. This paper presents a high-performance, scalable design for decision trees and ensemble combinations that makes use of the vector SIMD and multicore capabilities available in modern processors to provide the required throughput and accuracy. The proposed design offers very low latency and good scalability with the number of cores on commodity hardware when compared to other state-of-the-art implementations. On an Intel i7-based system, processing a single decision tree is 6x faster than MOA (Java) and 7x faster than StreamDM (C++), two well-known reference implementations. On the same system, using the 6 cores (and 12 hardware threads) available allows an ensemble of 100 learners to be processed 85x faster than MOA while providing the same accuracy. Furthermore, our solution is highly scalable: on an Intel Xeon socket with large core counts, the proposed ensemble design achieves up to a 16x speed-up when employing 24 cores with respect to a single-threaded execution.



  1. Low-latency Multi-threaded Ensemble Learning for Dynamic Big Data Streams
     Diego Marrón (dmarron@ac.upc.edu), Eduard Ayguadé (eduard.ayguade@bsc.es), José R. Herrero (josepr@ac.upc.edu), Jesse Read (jesse.read@polytechnique.edu), Albert Bifet (albert.bifet@telecom-paristech.fr)
     2017 IEEE International Conference on Big Data, December 11-14, 2017, Boston, MA, USA
  2. Real-time mining of dynamic data streams (outline: Introduction, Hoeffding Tree, Ensembles, Evaluations, Conclusions)
     • Unprecedented amount of dynamic big data streams (Volume)
     • Data generated at a high rate (Velocity)
     • Newly created data rapidly supersedes old data (Volatility)
     • This increase in volume, velocity and volatility requires data to be processed on-the-fly in real time
  3. Real-time dynamic data stream classification
     • Real-time classification imposes some challenges:
       • Deal with potentially infinite streams
       • Single pass over each instance
       • React to changes in the stream (concept drift)
     • Bounded response time:
       • Low latency: milliseconds (ms) per instance
       • High latency: a few seconds (s) per instance
     • Limited CPU time to process each instance
     • Preferred methods:
       • Hoeffding Tree (HT)
       • Ensembles of HTs
  4. Hoeffding Tree
     • Decision tree suitable for large data streams
     • Easy to deploy
     • Usually able to keep up with the arrival rate
  5. Hoeffding Tree: Basics
     • Build the tree structure incrementally (on-the-fly)
     • The tree structure uses attributes to route an instance to a leaf node
     • Leaf node:
       • Contains the classifier (Naive Bayes)
       • Statistics to decide the next attribute to split on (attribute counters)
     • Split decision: needs per-attribute information gain
     • Naive Bayes classifier: computes per-attribute probabilities
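The split decision on this slide is the standard Hoeffding tree rule: a leaf splits only when the observed information-gain advantage of the best attribute over the runner-up exceeds the Hoeffding bound ε = sqrt(R² ln(1/δ) / 2n). A minimal sketch (Python for illustration only; the actual LMHT implementation is C++, and the function names here are hypothetical):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean of a quantity with range value_range
    is within epsilon of its observed mean over n samples, w.p. 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains: list, value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute's observed gain beats the runner-up's
    by more than the Hoeffding bound."""
    best, second = sorted(gains, reverse=True)[:2]
    return (best - second) > hoeffding_bound(value_range, delta, n)
```

With more observed instances n, the bound shrinks, so the tree becomes increasingly confident that the apparent best attribute is the truly best one.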
  6. Ensembles: Random Forest of Hoeffding Trees
     • Random Forest uses RandomHT (a variation of HT):
       • The split decision uses a random subset of attributes
     • For each RandomHT in the ensemble:
       • Input: sampling with repetition
       • The tree is reset if a change (drift) is detected
     • Responses are combined to form the final prediction
     • Ensembles require more work per instance
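"Sampling with repetition" on a stream is commonly approximated with Oza-style online bagging, where each learner weights each incoming instance by a Poisson(1) draw; the slide does not state that LMHT uses exactly this scheme, so treat the following as an illustrative sketch (Python, hypothetical names):

```python
import math
import random

def poisson(lam: float, rng: random.Random) -> int:
    """Knuth's method for small lambda: count uniform draws until the
    running product falls below e^(-lambda)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p < threshold:
            return k
        k += 1

def online_bagging_weights(n_learners: int, seed: int = 0) -> list:
    """Per-learner weights for one incoming instance: a weight of k
    simulates the instance appearing k times in that learner's bootstrap."""
    rng = random.Random(seed)
    return [poisson(1.0, rng) for _ in range(n_learners)]

def majority_vote(predictions: list):
    """Combine the members' responses into the final prediction."""
    return max(set(predictions), key=predictions.count)
```

A weight of 0 means the learner skips the instance entirely, which is also why ensemble members drift apart and the combined vote gains robustness.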
  7. Exploiting CPU Parallelism
     • Parallelism: more work in the same amount of time
       • Improves throughput per time unit
       • Or improves accuracy by using more CPU-intensive methods
     • Modern CPUs offer parallel features: multithreading, SIMD instructions
  8. Contributions
     • Very low-latency response time: a few microseconds (µs) per instance
     • Compared to a state-of-the-art implementation, MOA:
       • Same accuracy
       • Single HT, on average:
         • Response time: 2 microseconds (µs) per instance
         • 6.73x faster
       • Multithreaded ensemble, on average:
         • Response time: 10 microseconds (µs) per instance
         • 85x faster
     • Up to 70% parallel efficiency on a 24-core CPU
     • Highly scalable/adaptive design tested on:
       • Intel platforms: i7, Xeon
       • ARM SoCs: from server range to low end (Raspberry Pi3)
  9. Hoeffding Tree: Implementation
     • Our implementation uses a binary tree
     • Split into smaller sub-trees that fit in the CPU L1 cache
       • L1 cache: 64 KB; one cache line holds 8 x 64-bit pointers
       • Max sub-tree height: 3
     • SIMD instructions accelerate the calculation of:
       • Information gain
       • The Naive Bayes classifier
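The per-attribute information gain is a dense reduction over the leaf's attribute counters, which is what makes it a good SIMD target. A scalar sketch of the computation being vectorized (Python for illustration; LMHT applies SIMD to the equivalent C++ loops, and these names are hypothetical):

```python
import math

def entropy(counts: list) -> float:
    """Shannon entropy of a class-count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts: list, per_value_counts: list) -> float:
    """Gain of splitting on one attribute.
    per_value_counts: one class-count vector per attribute value,
    i.e. exactly the attribute counters kept at the leaf."""
    total = sum(parent_counts)
    remainder = sum((sum(v) / total) * entropy(v) for v in per_value_counts)
    return entropy(parent_counts) - remainder
```

The inner sums and the entropy terms are independent per counter, so a SIMD implementation can process a whole cache line of counters per instruction.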
  10. LMHT: Architecture
     • N threads (up to the number of CPU hardware threads)
     • Thread 1: data loading/parsing
     • N-1 worker threads for L learners
     • Common instance buffer (lockless ring buffer)
  11. LMHT: Achieving low latency
     • Lockless data structures: at least one thread always makes progress
       • Key for scaling with low latency
     • The lockless ring buffer follows the single-writer principle:
       • Only the owner can write to a field
       • Everyone can read it
  12. LMHT: Achieving low latency (cont.)
     • Lockless ring buffer:
       • Buffer head:
         • Signals a new instance in the buffer
         • Owned by the parser
       • Buffer tail:
         • Each worker owns its LastProcessed sequence number
         • The tail is the lowest LastProcessed among all workers
  13. Evaluation
     • Reference implementations:
       • MOA (Java)
       • StreamDM (C++)
     • Datasets:
       • Two real-world: Covertype and Electricity
       • 11 synthetic generators:
         • RandomRBF Drift: r1-r6
         • Hyperplane: h1, h2
         • LED Drift: l1
         • SEA: s1, s2
     • Hardware platforms:
       • Intel:
         • Desktop class: i7-5930K
         • Server class: Xeon Platinum 8160
       • ARM-based SoCs:
         • High-end: Nvidia Jetson TX1, Applied Micro X-Gene2
         • Low-end: Raspberry Pi3
  14. Single Hoeffding Tree Performance vs MOA
     • Single thread
     • Same accuracy as MOA
     • Average throughput: 525.65 instances per millisecond (ms)
     • 6.73x faster than MOA
     • 7x faster than StreamDM
     • Including instance loading/parsing time
       • Except on the RPi3: all instances already parsed in memory
     • Data parsing is currently a bottleneck
       • Using data from memory: 3x faster on the Intel i7
  15. Single Hoeffding Tree: Speedup over MOA
     [Chart: per-dataset speedup over MOA for StreamDM (Intel i7) and LMHT on Intel i7, Jetson TX1, X-Gene2 and RPi3]
  16. LMHT Throughput vs MOA
     • 85x faster than MOA on average (Intel i7-5930K):
       • MOA average: 1.30 instances per millisecond (ms)
       • LMHT average: 105 instances per millisecond (ms)
     [Chart: per-dataset throughput in instances/ms, MOA vs LMHT]
  17. LMHT Throughput vs MOA
     • Average throughput on different hardware platforms
     [Chart: average throughput (instances/ms) for MOA (i7) and LMHT on i7, Xeon, Jetson TX1 and X-Gene2; values shown: 2, 13, 29, 105, 109]
  18. LMHT Speedup vs MOA
     [Chart: average speedup over MOA for LMHT on Intel i7, Xeon, Jetson TX1 and X-Gene2; values shown: 11, 24, 85, 88]
  19. LMHT Scalability Test: i7-5930K
     • i7-5930K: 6 cores / 12 threads
     • Threads start competing for resources when using 6+ workers (7+ threads)
     [Chart: relative speedup vs number of worker threads (1-11), one line per dataset]
  20. LMHT Scalability Test: Xeon Platinum 8160
     • Up to 70% parallel efficiency on the Xeon 8160 (24 cores)
     [Chart: relative speedup vs number of worker threads (1-23), one line per dataset]
  21. Conclusions
     • We presented a high-performance scalable design for real-time data stream classification
     • Very low latency: a few microseconds (µs) per instance
     • Same accuracy as MOA
     • Highly adaptive to a variety of hardware platforms
       • From server to edge computing (ARM and Intel)
     • Up to 70% parallel efficiency on a 24-core CPU
     • On Intel platforms, on average:
       • Single HT: 6.73x faster than MOA
       • Multithreaded ensemble: 85x faster than MOA
     • On ARM-based SoCs, on average:
       • Single HT: 2x faster than MOA (i7)
       • A Raspberry Pi3 (ARM) performs similarly to MOA on the i7
       • Multithreaded ensemble: 24x faster than MOA (i7)
  22. Future Work
     • The parser thread can easily limit throughput
       • Find the appropriate ratio of learners per parser
     • Implement counters for all kinds of attributes
     • Scale to multi-socket nodes (NUMA architectures)
     • Distribute the ensemble across several nodes
  23. Thank you
  24. Low-latency Multi-threaded Ensemble Learning for Dynamic Big Data Streams
     Diego Marrón (dmarron@ac.upc.edu), Eduard Ayguadé (eduard.ayguade@bsc.es), José R. Herrero (josepr@ac.upc.edu), Jesse Read (jesse.read@polytechnique.edu), Albert Bifet (albert.bifet@telecom-paristech.fr)
     2017 IEEE International Conference on Big Data, December 11-14, 2017, Boston, MA, USA
