Near realtime analytics - technology choice (@pavlobaron)
 

Slides of the talk I have given and will be giving at some conferences this and next year.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Near realtime analytics - technology choice
    • ✦ pavlo.baron@codecentric.de
      ✦ @pavlobaron
    • Wile E. Coyote
      ✦ pretty slow
      ✦ running on own demand
      ✦ very wide field of vision
      ✦ very long memory
      ✦ purely proactive
      ✦ thoroughly analysing and preparing
      ✦ always loses
    • Road Runner
      ✦ hell fast
      ✦ ever running
      ✦ very narrow field of vision
      ✦ very short memory
      ✦ purely reactive
      ✦ forced to decide immediately
      ✦ always wins
    • Coyote: slow
      ✦ too much mumbo-jumbo, too many tools, totally dependent on ACME
      ✦ needs a complex, partially distributed setup
      ✦ complex decisions, depending on Runner, weather, environment etc.
    • Runner: fast
      ✦ zero hoo-ha, zero tools, just his own body
      ✦ road-bound
      ✦ simple decisions like run | halt | step aside | beep beep
    • Coyote: offline
      ✦ mostly stands around, observing and planning
      ✦ only sprints on demand, when Runner passes by
    • Runner: non-stop
      ✦ never stops fully, just occasionally halts for food and to fool Coyote
      ✦ continuously runs the road in search of food
    • Coyote: wide vision
      ✦ sees the whole environment
      ✦ tries to use the whole environment to catch Runner, predicting his paths
    • Runner: narrow vision
      ✦ only sees what's in front of his nose on the road
      ✦ thanks to speed and short-term predictions, is comfortable with the narrow, momentary vision
    • Coyote: long memory
      ✦ as far as possible, learns from previous failures
      ✦ continuously improves his tricks to catch Runner
    • Runner: short memory
      ✦ ultimate carpe diem
      ✦ predicts Coyote's actions at the last minute, avoiding harm right before the fact
    • Coyote: proactive
      ✦ plans and experiments, looking for new ways to catch Runner
    • Runner: reactive
      ✦ doesn't plan, just reacts to Coyote's actions
    • Coyote: thorough
      ✦ thoroughly analyses the situation
      ✦ thoroughly plans ahead, prepares for one single shot
    • Runner: spontaneous
      ✦ decides immediately and spontaneously, depending on what Coyote does
      ✦ makes the best immediate decision to fool Coyote as thoroughly as possible
    • Coyote: loses
      ✦ no matter how hard he tries, he's never fast or savvy enough to catch Runner
      ✦ never gives up, though
    • Runner: wins
      ✦ doesn't even try to win, but always does thanks to speed and immediate situation analysis followed by reaction, and due to Coyote's continuous failure
      ✦ has fun fooling Coyote every time
    • Coyote is batch. Runner is near realtime.
    • Batch (analytics)
      ✦ is when you have plenty of time for analysis
      ✦ is when you explore patterns and models in historic data
      ✦ is when you try to fit any sort of data into a hypothetical model
      ✦ is when you plan and forecast the future instead of (re)acting immediately
    • Batch (architecture)
      ✦ is when you (synchronously) query previously stored data
      ✦ is when you use main memory primarily for temporary caches
      ✦ is when you do ETL and the like, even on Hadoop's rails
      ✦ is when you split large amounts of historic data into smaller portions for distributed / parallel analysis
    • Batch (technology)
      ✦ is when you build on (R)DBMS or (soft-schema) NoSQL data stores in the classic way
      ✦ is when you store in HDFS and process with Hadoop & Co.
      ✦ is when you generally rely on disks / storage
    • Near realtime (analytics)
      ✦ is when you don't have time
      ✦ is when you analyse data as it comes
      ✦ is when you already have a fixed model, and data flying in fits it 100%
      ✦ is when you (re)act immediately, based on patterns you learned online and in batch analysis
    • Near realtime (architecture)
      ✦ is when you don't query data, but expect / assume it
      ✦ is when you use main memory as primary data storage
      ✦ is when you process event streams
      ✦ is when you distribute and parallelise only independent computations (it's hairy enough even on one machine - explicit loop tiling, skewing etc.)
    • Near realtime (technology)
      ✦ is when you build on a DSMS, event processing systems and the like
      ✦ is when you store (almost) only for archiving reasons
      ✦ is when you don't hit disks or speak of "storage"
      ✦ is when you do your best to avoid horizontal network gossip
      ✦ is when you must go for accelerators such as GPUs in case of complex math
    • Near realtime - non-stop, immediate analytics cannot be done as / in batch.
    • Near realtime is tricky
      ✦ you need to build event-driven, non-blocking, lock-free, reactive programs (buzzword award!)
      ✦ you need to work time-bound, penalising or compensating late events
      ✦ you need to keep everything (sliced, auto-expiring) in main memory
      ✦ you need to completely utilise the resources of one single machine (speaking of mechanical sympathy), without waste
      ✦ you need to fix your model and work with fixed-size (binary) events
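The "time-bound, auto-expiring main memory" point can be sketched as a window that evicts old events on every insert and penalises late events by dropping them. The class and its interface are my illustration, not from the talk, and it assumes near-ordered event arrival:

```java
import java.util.ArrayDeque;

// Sketch of a time-bound, auto-expiring in-memory window. Events older
// than windowMillis are evicted on every insert; late events that fall
// outside the window are rejected outright (the "penalising" strategy).
// Assumes events arrive roughly in timestamp order.
public class ExpiringWindow {
    private final long windowMillis;
    private final ArrayDeque<long[]> events = new ArrayDeque<>(); // {timestamp, value}
    private long newest = Long.MIN_VALUE;

    public ExpiringWindow(long windowMillis) { this.windowMillis = windowMillis; }

    /** Returns false for events too late to fit the window. */
    public boolean add(long timestamp, long value) {
        if (timestamp <= newest - windowMillis) return false; // too late, penalise
        newest = Math.max(newest, timestamp);
        events.addLast(new long[] { timestamp, value });
        while (!events.isEmpty() && events.peekFirst()[0] <= newest - windowMillis)
            events.removeFirst(); // auto-expire from the old end
        return true;
    }

    public long sum() {
        long s = 0;
        for (long[] e : events) s += e[1];
        return s;
    }

    public static void main(String[] args) {
        ExpiringWindow w = new ExpiringWindow(1000);
        w.add(0, 5);
        w.add(500, 7);
        w.add(1200, 3);              // expires the event at t=0
        System.out.println(w.sum()); // 10
    }
}
```

A production version would additionally bound the deque's size and compensate (rather than drop) slightly late events, but the expiry-on-insert shape stays the same.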
    • Scaling near realtime
      ✦ scaling near realtime analytics is pretty hard; the challenges are similar whether parallelising on one machine or scaling out in a distributed way
      ✦ you scale through logical or physical stream splitting, online scatter-gather and the like
      ✦ you keep distributed / parallel computations independent until you have to merge in the next processing stage, and so on
      ✦ you scale through receive-and-forward, fire-and-forget, cascading, pipelining, multicast and redundant (who's first, role-based etc.) processing
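The stream-splitting and merge-in-the-next-stage ideas can be sketched in-process: events are scattered to partitions by key hash, each partition counts independently, and only a later stage gathers the partials. Names and the counting task are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of logical stream splitting with a later merge (gather) stage.
// Each partition works on its own counter; nothing is shared until the
// next processing stage merges the partial results.
public class ScatterGather {
    public static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static Map<String, Long> countByKey(String[] events, int partitions) {
        // scatter: one independent counter per partition
        Map<String, Long>[] partial = new HashMap[partitions];
        for (int i = 0; i < partitions; i++) partial[i] = new HashMap<>();
        for (String e : events)
            partial[partitionOf(e, partitions)].merge(e, 1L, Long::sum);
        // gather: merge only in the next processing stage
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> p : partial)
            p.forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        String[] stream = { "beep", "beep", "halt", "beep" };
        System.out.println(countByKey(stream, 4)); // beep=3, halt=1
    }
}
```

The same shape scales out physically: each partition becomes a process or machine, and the gather stage a downstream consumer.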
    • Surviving near realtime
      ✦ building a restlessly event-oriented, in-memory analytics system brings some challenges
      ✦ disaster recovery: yet again, splitting streams (for storage), redundant (role-based) computation
      ✦ short-term failure recovery: upfront temporary, auto-expiring storage, auto-replay or penalising events
    • Near realtime is limited
      ✦ you need to run most of the analytics on event windows of some size
      ✦ you switch from exact to probabilistic / approximate results
      ✦ you can only predict the near future, cluster based on relatively short time periods and recognise short-term patterns and anomalies only
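One concrete way to trade exactness for bounded memory is reservoir sampling (the algorithm choice is mine, not from the slides): keep a fixed-size uniform sample of an unbounded stream and compute approximate statistics from the sample alone:

```java
import java.util.Random;

// Sketch of approximate results over an unbounded stream: a fixed-size
// reservoir holds a uniform random sample, so statistics computed from
// it approximate the whole stream with bounded memory.
public class Reservoir {
    private final double[] sample;
    private long seen = 0;
    private final Random rnd;

    public Reservoir(int size, long seed) {
        sample = new double[size];
        rnd = new Random(seed);
    }

    public void offer(double value) {
        seen++;
        if (seen <= sample.length) {
            sample[(int) seen - 1] = value;        // fill the reservoir first
        } else {
            long j = (long) (rnd.nextDouble() * seen); // 0 .. seen-1
            if (j < sample.length) sample[(int) j] = value; // replace with prob size/seen
        }
    }

    /** Approximate mean of the whole stream, computed from the sample only. */
    public double approxMean() {
        int n = (int) Math.min(seen, sample.length);
        double s = 0;
        for (int i = 0; i < n; i++) s += sample[i];
        return s / n;
    }

    public static void main(String[] args) {
        Reservoir r = new Reservoir(100, 42);
        for (int i = 0; i < 100_000; i++) r.offer(i % 10); // true mean is 4.5
        System.out.println(r.approxMean());                // close to 4.5, not exact
    }
}
```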
    • Near realtime mining
      ✦ you mine live streams instead of passive data sources
      ✦ typical algorithms such as Apriori, one-class SVM, k-means, regressions etc. are easily possible, but on stream portions only
      ✦ NLP can be done by giving words identifiers and dealing with binary messages instead of text
      ✦ as long as it fits into main memory, it's comparable to classic mining, but much faster
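"k-means, but on stream portions only" can be sketched like this (1-D for brevity; the portioning scheme and constants are illustrative): centroids learned on one portion seed the next, so the model adapts as events fly in:

```java
import java.util.Arrays;

// Sketch of k-means over stream portions: each portion gets one Lloyd
// step (assign points to nearest centroid, then recompute centroids),
// and the resulting centroids are carried into the next portion.
public class PortionKMeans {
    public static double[] step(double[] portion, double[] centroids) {
        double[] sum = new double[centroids.length];
        int[] cnt = new int[centroids.length];
        for (double x : portion) {
            int best = 0; // nearest centroid
            for (int c = 1; c < centroids.length; c++)
                if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[best])) best = c;
            sum[best] += x;
            cnt[best]++;
        }
        double[] next = centroids.clone();
        for (int c = 0; c < next.length; c++)
            if (cnt[c] > 0) next[c] = sum[c] / cnt[c];
        return next;
    }

    public static void main(String[] args) {
        double[] centroids = { 0.0, 10.0 };             // initial guess
        double[] portion1 = { 1, 2, 1.5, 9, 8.5, 9.5 }; // one stream portion
        centroids = step(portion1, centroids);
        System.out.println(Arrays.toString(centroids)); // [1.5, 9.0]
    }
}
```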
    • Near realtime + batch?
      ✦ the combination of both is what can make a winning solution. Example reference architecture: Lambda, but it's even more than that
      ✦ exploratory, offline analytics, baseline analysis, pattern mining, algorithm training and the like you do in batch
      ✦ you apply the batch analytics' results to near realtime and prove or reject hypotheses, detect anomalies, run forecasts, derive trends etc.
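The batch / near-realtime split can be sketched minimally (the baseline and the 3-sigma rule are my illustrative choices): the batch side learns a model from historic data, and the stream side applies it to each incoming event immediately:

```java
// Sketch of the batch + near-realtime combination: a baseline (mean and
// standard deviation) is trained offline on history, then applied per
// event in the stream with an immediate accept/flag decision.
public class LambdaSketch {
    /** Batch stage: learn {mean, stddev} from historic data. */
    public static double[] baseline(double[] history) {
        double mean = 0;
        for (double x : history) mean += x;
        mean /= history.length;
        double var = 0;
        for (double x : history) var += (x - mean) * (x - mean);
        return new double[] { mean, Math.sqrt(var / history.length) };
    }

    /** Near-realtime stage: immediate decision per event, no querying. */
    public static boolean anomalous(double event, double[] baseline) {
        return Math.abs(event - baseline[0]) > 3 * baseline[1];
    }

    public static void main(String[] args) {
        double[] b = baseline(new double[] { 10, 11, 9, 10, 10, 11, 9 });
        System.out.println(anomalous(10.5, b)); // false
        System.out.println(anomalous(25.0, b)); // true
    }
}
```

The batch job reruns periodically and swaps in a fresh baseline; the streaming path never blocks on it.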
    • Near realtime, no batch?
      ✦ it's possible to do some of this completely without batch, just on streams - even more than basic counters and stats
      ✦ you need to keep every single historic event in a data store
      ✦ you need to replay historic events instead of querying / mining your data store
      ✦ don't query your database - let the database stream what it has to you
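The replay idea can be sketched as follows (the interfaces are illustrative): the store appends every historic event and, instead of being queried, streams them back through the same handler the live path uses:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

// Sketch of event replay: the store keeps every historic event and
// streams it to a handler on demand - there is no query API at all.
public class ReplayLog {
    private final List<Double> log = new ArrayList<>();

    public void append(double event) { log.add(event); }

    /** Let the store stream what it has to the handler. */
    public void replay(DoubleConsumer handler) {
        for (double e : log) handler.accept(e);
    }

    public static void main(String[] args) {
        ReplayLog store = new ReplayLog();
        store.append(1.0);
        store.append(2.0);
        store.append(3.0);
        double[] sum = { 0 };
        store.replay(e -> sum[0] += e); // same handler as the live path would use
        System.out.println(sum[0]);     // 6.0
    }
}
```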
    • Near realtime example tools
      ✦ query/store-oriented, passively adapting: Spark/Shark, Impala, Drill, ParStream, Splunk
      ✦ full-blown CEP engines / continuously querying DSMSs: Esper, TIBCO/StreamBase
      ✦ more pragmatic stream processors: Storm, S4, Samza
      ✦ event-oriented, continuous analysers: keen.io, also the speaker's current WIP
      ✦ etc. etc. etc...
    • Near realtime - DIY
      ✦ in the end, you'll have to build it (or core parts of it) yourself
      ✦ you'll have to work with circular / ring buffers and / or zero-overhead queueing software: Disruptor, 0MQ
      ✦ ideally, you keep everything in one single OS process - multi-threading is still hairy enough
      ✦ then managing and using the machine's overall memory is the tricky part
      ✦ for GPUs: OpenCL, Rootbeer
      ✦ embed analytics / statistics into the process
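The ring-buffer point can be sketched minimally (a single-threaded toy, nowhere near Disruptor's lock-free, cache-friendly machinery; the class and the power-of-two capacity convention are my assumptions): capacity is a power of two so the index wraps with a cheap bit mask instead of a modulo:

```java
// Minimal ring buffer sketch. head/tail grow monotonically; the slot
// index is (position & mask), which wraps because capacity is a power
// of two. A real single-producer/single-consumer version would need
// memory barriers (volatile/VarHandle) between the two threads.
public class RingBuffer {
    private final long[] slots;
    private final int mask;
    private long head = 0; // next read position
    private long tail = 0; // next write position

    public RingBuffer(int capacityPowerOfTwo) {
        slots = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public boolean offer(long value) {
        if (tail - head == slots.length) return false; // full
        slots[(int) (tail & mask)] = value;
        tail++;
        return true;
    }

    public Long poll() {
        if (head == tail) return null; // empty
        long v = slots[(int) (head & mask)];
        head++;
        return v;
    }

    public static void main(String[] args) {
        RingBuffer rb = new RingBuffer(4);
        rb.offer(7);
        rb.offer(8);
        System.out.println(rb.poll()); // 7
        System.out.println(rb.poll()); // 8
    }
}
```

Preallocating the slots and reusing them is what avoids allocation and GC pressure on the hot path, which is exactly why this structure shows up in near-realtime systems.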
    • Near realtime - DIY
      ✦ picking the base platform has less to do with personal flavour than with what it offers
      ✦ C is a good and valid choice, but very "manual"
      ✦ Erlang/OTP is great for glue, but hard for analytics and integration. In the end, it's C, but pretty tricky here
      ✦ Node.js is C in the end at this point, but it's not for single-process / multi-threading and is still maturing
      ✦ the JVM is a good compromise. Managed / GC-controlled memory with object wrappers will have to be sacrificed for off-heap memory with primitives, though
      ✦ most of the rest doesn't apply for this sort of task
    • Near realtime - DIY
      ✦ programming paradigms, and thus languages, are the essential secret sauce
      ✦ functional programming is ideal for analytics and event processing
      ✦ (functional) reactive programming, Reactor (as pattern or framework) and Rx are good for building this sort of system
      ✦ JavaScript is partially there; Erlang, Clojure, Scala & Co. are further along, but can be uncontrollable in runtime behaviour
      ✦ pure Java can (later) be a healthy tradeoff, though - now with Rx or Reactor, Netty etc.
    • Time in near realtime
      ✦ realtime still means real time, even if "near"
      ✦ the platform of your choice might not be ideal for hard or soft realtime; the difference is primarily in what happens with late events and under high load
      ✦ Erlang will do its best to trigger a timer. Same with Node.js. But they don't interrupt hard, schedule on their own and thus leave you with an approximation
      ✦ the JVM comes close, but there's still no easy way to interrupt explicitly. Alternative: a hashed wheel, an own scheduler on a dedicated core
      ✦ C is the winner; OS support (RTOS-like) is essential
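The hashed wheel mentioned above can be sketched like this (interface and constants are illustrative): deadlines are bucketed modulo the wheel size, and a timer fires when the tick hand passes its bucket on the right lap:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a hashed wheel timer: buckets[deadline % wheelSize] holds
// pending timers; advancing the tick inspects exactly one bucket, so
// scheduling and expiry are O(1) amortised regardless of timer count.
public class HashedWheel {
    private final List<long[]>[] buckets; // each entry: {deadlineTick, id}
    private long tick = 0;

    @SuppressWarnings("unchecked")
    public HashedWheel(int wheelSize) {
        buckets = new List[wheelSize];
        for (int i = 0; i < wheelSize; i++) buckets[i] = new ArrayList<>();
    }

    public void schedule(long id, long delayTicks) {
        long deadline = tick + delayTicks;
        buckets[(int) (deadline % buckets.length)].add(new long[] { deadline, id });
    }

    /** Advance one tick; returns ids of timers expiring on this tick. */
    public List<Long> advance() {
        tick++;
        List<Long> fired = new ArrayList<>();
        buckets[(int) (tick % buckets.length)].removeIf(t -> {
            if (t[0] <= tick) { fired.add(t[1]); return true; }
            return false; // deadline belongs to a later lap of the wheel
        });
        return fired;
    }

    public static void main(String[] args) {
        HashedWheel wheel = new HashedWheel(8);
        wheel.schedule(42, 2);
        System.out.println(wheel.advance()); // []
        System.out.println(wheel.advance()); // [42]
    }
}
```

The tick resolution bounds the timer error, which is exactly the "approximation" tradeoff the slide describes.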
    • Near realtime + data store?
      ✦ near realtime analytics systems need to store data at different stages: short-term replay, disaster protection, history
      ✦ the trick is to turn around the way you work with the data store
      ✦ your data store knows the model and the queries beforehand, and only waits for events to start streaming historic data satisfying the static query / view
      ✦ most NoSQL stores, but also classic RDBMS, have implantable workers / jobs / coprocessors as a built-in feature: Oracle, Riak, HBase etc.
    • Near realtime business cases
      ✦ anomaly / novelty / outlier detection in any sort of system
      ✦ fraud and attack detection based on patterns
      ✦ situational pricing, product placement
      ✦ stock, inventory control and forecast
      ✦ online bidding, trading
      ✦ automated traffic optimization
      ✦ semi-automated operations
      ✦ immediate visualization and tracing
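The first business case, outlier detection on a live stream, can be sketched fully online with Welford's running mean/variance (the algorithm and the 3-sigma rule are my illustrative choices): each event is judged immediately against the statistics of everything seen so far, with no batch stage and no stored history:

```java
// Sketch of online outlier detection: Welford's algorithm maintains a
// numerically stable running mean and variance in O(1) per event; an
// event deviating more than 3 sigma from the running mean is flagged.
public class OnlineOutliers {
    private long n = 0;
    private double mean = 0, m2 = 0; // m2 = sum of squared deviations

    /** Judge the event, then fold it into the running statistics. */
    public boolean observe(double x) {
        boolean outlier = n > 1 && Math.abs(x - mean) > 3 * Math.sqrt(m2 / n);
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return outlier;
    }

    public static void main(String[] args) {
        OnlineOutliers d = new OnlineOutliers();
        for (int i = 0; i < 100; i++) d.observe(10 + (i % 3)); // values 10, 11, 12
        System.out.println(d.observe(50)); // true: far outside the learned pattern
        System.out.println(d.observe(11)); // false
    }
}
```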
    • Why speed?
      ✦ why be slow if it's possible, with comparable effort, to be fast in making decisions and automating them? If not you, then your competitor
      ✦ since everybody can mine data, speed and quality are the only technical success factors left
      ✦ it's about how fast you can decide based on data. The best way is to start very early, at the source of the data
      ✦ the "new economy" is all about speed, not (only) lobbies
    • ✦ cartoon images were found on the internet and are directly or indirectly the property / copyright of, or related to, Time Warner