Streaming Computing
Some thoughts and technology choices for event-driven processing
Natalino Busa - 29 Aug. 2013
Outline
● Concurrency
● Streaming computing
● Technologies
○ Gigaspaces
○ Storm
○ Akka
● Comparison matrix
● Opportunities
Algorithms: a tribute
Numbers and Algorithms:
9th century Persian Muslim
mathematician Abu Abdullah
Muhammad ibn Musa Al-Khwarizmi,
whose work built upon that of the 7th
century Indian mathematician
Brahmagupta.
We owe a lot to these guys!
Why do we need parallelism?
Data gets bigger,
CPUs don’t get much faster.
BUT
we get more cores per chip.
More cores = more parallelism.
We are happy now, right?
Moore’s law
Every 18 months, the number of CPU
cores doubles.
Another
interpretation:
every 18 months, the number of idle
CPU cores doubles.
More parallelism
We trade:
Time vs (CPU, Memory, I/O)
Modern applications
Scalability:
Vertical: concurrency
(use all the cores, memory and I/O of a given machine)
Horizontal: distribution
(use all the machines in the cluster)
High availability:
Fault tolerance: all levels (local, distributed)
(the Terminator effect: you can stop it, but you can’t kill it)
Streaming applications
Performance:
Efficient use of resources:
CPU and memory, but also OS threads and sockets
Asynchronous:
event driven, reacts on new data
Distributed:
more machines = more performance
the algorithm is partitioned and/or replicated on the cluster
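As a minimal illustration of the last point (not from the deck; names and numbers are illustrative), partitioning routes each event to one of N parallel workers by hashing a key, so independent keys can be processed concurrently on different cores or machines:

```scala
// Sketch: key-hash partitioning of an event stream across N workers.
object Partitioner {
  val workers = 4  // illustrative degree of parallelism

  // Same key always maps to the same worker, so per-key state stays local.
  def partitionOf(key: String): Int =
    math.abs(key.hashCode % workers)

  def main(args: Array[String]): Unit = {
    val events = Seq("user-1" -> "click", "user-2" -> "view", "user-1" -> "buy")
    events.groupBy { case (key, _) => partitionOf(key) }
          .foreach { case (p, evts) => println(s"worker $p handles $evts") }
  }
}
```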
What to increase?
● More CPU: helps when there is computation involved
● More MEMORY: helps when there is more state to keep
● More I/O: helps when there are more messages to transfer
Streaming or batch?
[Diagram: data flows from a source system, through our system, to a target system]
What differentiates Streaming from Batch?
● Granularity of Data
● Granularity of Processing
Granularity impacts:
Throughput, Latency, and the Cost of the system!
The choice is yours
1000 events/sec (1 KB/event), running on 100 cores, all day long

BATCH: Hadoop
“Wait a day, then process”
86.4 M events ≈ 86 GB of data
Latency: 24 hours
Throughput: 1 update/day

STREAMING: Akka
“Do not wait”
Process 1 KB of data each msec
Latency: 1 ms
Throughput: 1000 updates/sec

“Both are valid options. It depends on the application domain and the requirements/specs of the target and source systems”
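A quick back-of-the-envelope check of the batch figures above (a sketch, assuming 86,400 seconds per day and decimal GB):

```scala
// 1000 events/sec at 1 KB each, accumulated for one day.
object BatchMath extends App {
  val eventsPerSec  = 1000
  val kbPerEvent    = 1          // 1 KB per event
  val secondsPerDay = 24 * 3600  // 86,400

  val eventsPerDay = eventsPerSec.toLong * secondsPerDay  // 86,400,000 ≈ 86.4 M events
  val gbPerDay     = eventsPerDay * kbPerEvent / 1e6      // ≈ 86.4 GB (decimal)

  println(f"$eventsPerDay%,d events/day ≈ $gbPerDay%.1f GB/day")
}
```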
Mapping it to existing applications

Granularity of Data   Granularity of Processing   Example system
256 GB                1 CPU                       Traditional DB systems
256 GB                100 CPUs                    Big Data (Hadoop)
1 KB                  1 CPU                       Traditional mail server
1 KB                  100 CPUs                    Web application server
Technologies: Gigaspaces
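A minimal sketch of the Gigaspaces programming model, assuming the XAP OpenSpaces polling-container API; the Payment class and its fields are illustrative, not from the deck:

```scala
import org.openspaces.events.EventDriven
import org.openspaces.events.polling.Polling
import org.openspaces.events.adapter.SpaceDataEvent
import scala.beans.BeanProperty

// Illustrative space class: real space classes need JavaBean-style properties.
class Payment {
  @BeanProperty var id: String = _
  @BeanProperty var status: String = "NEW"
}

// The polling container takes matching Payment objects from the in-memory
// data grid, invokes process(), and writes the returned object back.
@EventDriven @Polling
class PaymentProcessor {
  @SpaceDataEvent
  def process(payment: Payment): Payment = {
    payment.status = "PROCESSED"
    payment
  }
}
```

Processing runs co-located with the in-memory data grid, which is the “memory + application distribution” combination noted in the comparison later on.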
Technologies: Storm
Topology
Supervising
Scaling
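A minimal sketch of these three concerns in code, assuming the 2013-era backtype.storm Java API called from Scala; EventSpout and CounterBolt are illustrative stand-ins, not part of Storm:

```scala
import backtype.storm.{Config, LocalCluster}
import backtype.storm.spout.SpoutOutputCollector
import backtype.storm.task.TopologyContext
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import backtype.storm.tuple.{Fields, Tuple, Values}
import java.util.{Map => JMap}

// Emits a dummy event per call; a real spout would read from a queue or bus.
class EventSpout extends BaseRichSpout {
  var out: SpoutOutputCollector = _
  def open(conf: JMap[_, _], ctx: TopologyContext, c: SpoutOutputCollector) { out = c }
  def nextTuple() { out.emit(new Values("user-" + util.Random.nextInt(3))) }
  def declareOutputFields(d: OutputFieldsDeclarer) { d.declare(new Fields("userId")) }
}

// Counts events per user; fieldsGrouping guarantees a given userId always
// lands on the same bolt task, so local state is safe.
class CounterBolt extends BaseBasicBolt {
  val counts = collection.mutable.Map[String, Long]().withDefaultValue(0L)
  def execute(t: Tuple, c: BasicOutputCollector) {
    counts(t.getStringByField("userId")) += 1
  }
  def declareOutputFields(d: OutputFieldsDeclarer) {}
}

object EventTopology {
  def main(args: Array[String]) {
    // Topology: wire spouts and bolts, partitioned by key.
    val builder = new TopologyBuilder
    builder.setSpout("events", new EventSpout, 2)
    builder.setBolt("counter", new CounterBolt, 4)
           .fieldsGrouping("events", new Fields("userId"))

    val conf = new Config
    conf.setNumWorkers(4)  // scaling unit is a worker JVM: coarse-grained

    new LocalCluster().submitTopology("demo", conf, builder.createTopology())
  }
}
```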
Technologies: Akka
Supervising:
tree of actors
Topology (static and dynamic actors)
Scaling and
distributed processing
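A minimal Akka sketch of the supervision tree above, using the Akka 2.x Scala API; actor names and the message type are illustrative:

```scala
import akka.actor.{Actor, ActorSystem, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.Restart

// Leaf of the tree: a lightweight actor doing the actual per-event work.
class Worker extends Actor {
  def receive = {
    case event: String => println(s"processing $event")
  }
}

// Parent actor: supervises its child and restarts it on failure,
// leaving siblings untouched (one-for-one strategy).
class Supervisor extends Actor {
  override val supervisorStrategy = OneForOneStrategy() {
    case _: Exception => Restart
  }
  val worker = context.actorOf(Props[Worker], "worker")
  def receive = { case msg => worker forward msg }
}

object Main extends App {
  val system = ActorSystem("streaming")
  val supervisor = system.actorOf(Props[Supervisor], "supervisor")
  supervisor ! "event-1"  // asynchronous, event-driven message send
}
```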
Technology matrix

Granularity of Data (rows) vs Granularity of Processing (columns):

         Small    Big
Small    Akka     Akka, Gigaspaces
Big      ?        Storm

System end-to-end throughput:
High   ~10,000 events/sec   Akka
Medium ~100 events/sec      Storm / Gigaspaces
Low    ~10 events/sec       Scripting languages
Big Data in motion
All three are: distributed, fault-tolerant, streaming.

- Storm
++ multi-language
-- not user/admin friendly
-- slow supervising
Processing elements are JVMs; ideal when data is coarse-grained.

- Akka
++ high throughput, fine-grained actors
++ dynamic topologies
-- low-level, but high performance
Processing elements are small and lightweight; ideal for millions of transactions per second.

- Gigaspaces
++ combines memory + application distribution
-- framework API is not very flexible
Processing elements are JVMs; ideal as an all-in-one solution, with little customization.
Opportunity: Lambda Architecture
Logic layer
Software as a Service
e.g. a real-time predictor
from http://www.manning.com/marz/
Opportunity: Batch + Streaming
[Architecture diagram]
● Batch path: Data Warehouses → Batch Computing → In-Memory Distributed DBs
● Streaming path: Messaging Buses → Streaming Computing → In-Memory Distributed DBs
● Front End Services read from the In-Memory Distributed Database and serve an HTML5 Client / Responsive App through low-latency HTTP API services: FETCH (refresh) and PUSH (SSE, notifications)
Thanks
linkedin:
www.linkedin.com/in/natalinobusa
blog:
www.natalinobusa.com
twitter:
@natalinobusa
