Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014


Learn
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.

Limits

Imagine a traditional web analytics application:
Every page view increments the URL’s database row.


First Aid

Queue your writes and write in batches.
Shard your data: Partition horizontally.
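The two first-aid measures above can be sketched in a few lines. This is only an illustration under stated assumptions: the class `BatchedCounter`, the helper `shard_for`, the shard count, and the batch size are hypothetical names and values, and in-memory counters stand in for real database shards.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # assumed number of database shards


def shard_for(url, num_shards=NUM_SHARDS):
    """Horizontal partitioning: a stable hash of the URL picks the shard."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


class BatchedCounter:
    """Queues page views in memory and writes them to the shards in batches."""

    def __init__(self, batch_size=3, num_shards=NUM_SHARDS):
        self.batch_size = batch_size
        self.num_shards = num_shards
        self.pending = Counter()   # queued increments, not yet written
        self.pending_events = 0
        # In-memory stand-ins for the real database shards.
        self.shards = [Counter() for _ in range(num_shards)]
        self.writes = 0            # number of (simulated) database writes

    def record_view(self, url):
        self.pending[url] += 1
        self.pending_events += 1
        if self.pending_events >= self.batch_size:
            self.flush()

    def flush(self):
        # One write per distinct URL instead of one write per page view.
        for url, increment in self.pending.items():
            self.shards[shard_for(url, self.num_shards)][url] += increment
            self.writes += 1
        self.pending.clear()
        self.pending_events = 0

    def count(self, url):
        return self.shards[shard_for(url, self.num_shards)][url]
```

Three page views for two URLs cost two writes here instead of three. The price is that queued views are invisible to readers until the next flush and are lost if the process dies, which leads to the chronic issues on the next slide.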


Chronic Issues
Fault-tolerance is hard.
Applications become more and more complex.
You have to do all the work.


New Tools
Large-scale computation systems such as Hadoop.
Scalable databases such as Cassandra and Riak.
Easy-to-use frameworks such as Storm and Dempsy.


Lambda Architecture
Theoretical, abstract architecture for working with big data.

Speed Layer
Serving Layer
Batch Layer

Goal

Compute arbitrary functions on arbitrary data.
query = function(all data)
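Read literally, the equation says that every query is a pure function of the complete dataset. A minimal sketch, assuming a made-up dataset `ALL_DATA` of page-view events and two hypothetical query functions:

```python
# Made-up sample data: every raw page-view event ever recorded.
ALL_DATA = (
    {"url": "/home", "day": 1},
    {"url": "/home", "day": 2},
    {"url": "/about", "day": 2},
)


def pageviews(all_data, url):
    """One query: how often was this URL viewed?"""
    return sum(1 for event in all_data if event["url"] == url)


def busiest_day(all_data):
    """A different query over the same data: which day had the most views?"""
    views_per_day = {}
    for event in all_data:
        views_per_day[event["day"]] = views_per_day.get(event["day"], 0) + 1
    return max(views_per_day, key=views_per_day.get)
```

Both functions scan all of the data on every call. That is correct but too slow at scale, which is why the layers on the following slides precompute views.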


Properties
Robust and fault-tolerant.
Low latency reads and updates.
Scalable.
Minimal maintenance.


Batch Layer

Stores the immutable master dataset.
Precomputes arbitrary batch views.
Home of batch processing and MapReduce systems such as Hadoop.
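To make "precomputes batch views" concrete, here is a toy map and reduce pass in plain Python. It is not Hadoop code; the function names and the event format are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_phase(event):
    """Map: emit one (url, 1) pair per page-view event."""
    yield event["url"], 1


def reduce_phase(url, ones):
    """Reduce: sum all values emitted for the same URL."""
    return url, sum(ones)


def compute_batch_view(master_dataset):
    """Recompute the whole view from the immutable master dataset."""
    grouped = defaultdict(list)
    for event in master_dataset:
        for key, value in map_phase(event):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because the master dataset is never modified, the view can be thrown away and recomputed from scratch at any time, for example after fixing a bug in the map or reduce function.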

Serving Layer

Read-only random-access to batch views.
Updated by batch layer.
Indexes batch views.
Home of real-time query systems such as Cloudera Impala for Hadoop.

Speed Layer

Compensates for high-latency batch views.
Fast, incremental algorithms.
More complex because of random writes.
Home of Apache HBase or Storm.


Lambda Architecture
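[Diagram: incoming data feeds both the batch layer and the speed layer. The serving layer holds the batch views, the speed layer holds the realtime views, and a query reads from both.]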

[Diagram: timeline of the available data. The batch view covers the older part, the realtime view covers the most recent part.]
Discard the realtime view as soon as it is represented in the batch view.
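The interplay of the two views can be sketched as follows. This is a simplified model, not a reference implementation: the class `LambdaViews`, its method names, and the integer timestamps are assumptions, and the realtime side keeps raw events in a list where a real speed layer would update counters incrementally.

```python
from collections import Counter


class LambdaViews:
    """Answers queries by combining a batch view with a realtime view."""

    def __init__(self):
        self.batch_view = Counter()
        self.realtime_events = []  # (timestamp, url) not yet in the batch view

    def on_event(self, timestamp, url):
        # Speed layer: new data becomes queryable immediately.
        self.realtime_events.append((timestamp, url))

    def on_batch_finished(self, new_batch_view, horizon):
        # Batch layer finished a run covering all events up to `horizon`.
        self.batch_view = Counter(new_batch_view)
        # Discard realtime data that the batch view now represents.
        self.realtime_events = [
            (ts, url) for ts, url in self.realtime_events if ts > horizon
        ]

    def query(self, url):
        recent = sum(1 for _, u in self.realtime_events if u == url)
        return self.batch_view[url] + recent
```

A query returns the same count before and after a batch run finishes, as long as the batch view and the discarded realtime data cover the same events.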


Twitter’s Early Days
[Diagram: a pipeline of alternating queues and workers turns Tweets into URLs, with a Map step and with Hadoop and Cassandra attached.]

Storm
Guaranteed message processing without message brokers.
Horizontal scalability.
Fault-tolerance.
High level of abstraction.
Just works.

Storm Topologies
[Diagram: two spouts emit streams into bolts, and bolts pass streams on to further bolts.]
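The spout and bolt roles can be imitated with plain Python generators. This is a toy model of the dataflow only and does not use Storm's API; the function names and the sample sentences are made up for illustration.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of a stream, here two hard-coded sentences."""
    for sentence in ("storm is fast", "storm is fault tolerant"):
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentences and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count so far)."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology():
    """Wire spout -> split bolt -> count bolt and drain the stream."""
    return list(count_bolt(split_bolt(sentence_spout())))
```

Unlike this single-process sketch, a real topology runs continuously on a cluster, and each spout and bolt may run as many parallel tasks.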


Parallel Tasks
[Diagram: each spout and bolt runs as several parallel tasks (T), and streams connect the tasks.]
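One way to spread a counting bolt over several tasks is to route each tuple by a hash of its key, so that the same word always reaches the same task. The sketch below assumes this routing rule for illustration; the slides do not say how tuples are distributed, and `task_for` and `run_parallel_count` are hypothetical names.

```python
import zlib
from collections import Counter


def task_for(word, num_tasks):
    """Route a word to a task by a stable hash (an assumed routing rule)."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run_parallel_count(words, num_tasks=3):
    """Simulate num_tasks counting tasks, each with its own local state."""
    tasks = [Counter() for _ in range(num_tasks)]
    for word in words:
        tasks[task_for(word, num_tasks)][word] += 1
    return tasks
```

Because every occurrence of a word lands on the same task, each task can count locally without coordinating with the others.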

Know
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.