Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014

1
Learn
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time anal...
Limits

Imagine traditional web analytics software:
Every page view increments the URL’s database row.

3
First Aid

Queue your writes and write in batches.
Shard your data: Partition horizontally.

4
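The two first-aid measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ShardedCounterStore` and `BatchingWriter` are hypothetical names, the "database" is a list of in-memory dictionaries, and the batch size of 100 is arbitrary.

```python
from collections import Counter, defaultdict
import zlib


class ShardedCounterStore:
    """Toy stand-in for a horizontally partitioned page-view table."""

    def __init__(self, num_shards):
        self.shards = [defaultdict(int) for _ in range(num_shards)]
        self.writes = 0  # number of simulated database writes

    def shard_for(self, url):
        # Stable hash, so the same URL always lands on the same shard.
        return zlib.crc32(url.encode("utf-8")) % len(self.shards)

    def increment(self, url, amount):
        self.shards[self.shard_for(url)][url] += amount
        self.writes += 1

    def count(self, url):
        return self.shards[self.shard_for(url)][url]


class BatchingWriter:
    """Queues page views and writes them as one increment per URL."""

    def __init__(self, store, batch_size):
        self.store = store
        self.batch_size = batch_size
        self.pending = Counter()
        self.buffered = 0

    def record_view(self, url):
        self.pending[url] += 1
        self.buffered += 1
        if self.buffered >= self.batch_size:
            self.flush()

    def flush(self):
        for url, amount in self.pending.items():
            self.store.increment(url, amount)
        self.pending.clear()
        self.buffered = 0


store = ShardedCounterStore(num_shards=4)
writer = BatchingWriter(store, batch_size=100)
for _ in range(250):
    writer.record_view("/home")
writer.flush()  # write out the views still sitting in the queue
```

Here 250 page views reach the store as three writes instead of 250. The sketch also hints at the chronic issues on the next slide: if the process dies before `flush()`, the queued views are lost.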
Chronic Issues
Fault-tolerance is hard.
Applications become more and more complex.
You have to do all the work.

5
New Tools
Large scale computation systems such as Hadoop.
Scalable databases such as Cassandra and Riak.
Easy to use framew...
Lambda Architecture
Theoretical, abstract architecture for working with big data.

Speed Layer
Serving Layer
Batch Layer
7
Goal

Compute arbitrary functions on arbitrary data.
query = function ( all data )

8
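As a minimal sketch of the idea `query = function ( all data )`, the snippet below treats the dataset as a plain list of events and every query as an ordinary function over that whole list. The event shape and the function names `page_views` and `unique_visitors` are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of raw events.
all_data = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "bob"},
]


def page_views(data):
    """One possible query function: number of views per URL."""
    return Counter(event["url"] for event in data)


def unique_visitors(data):
    """Another query function: number of distinct users."""
    return len({event["user"] for event in data})


def query(function, data):
    # The goal in its purest form: run any function over all the data.
    return function(data)
```

Running every query over the full dataset is correct but slow once the data grows, which is why the layers on the following slides precompute views.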
Properties
Robust and fault-tolerant.
Low latency reads and updates.
Scalable.
Minimal maintenance.

9
Batch Layer

Speed Layer
Serving Layer

Stores the immutable master dataset.
Precomputes arbitrary batch views.
Home of ba...
Serving Layer

Speed Layer
Serving Layer
Batch Layer

Read-only random-access to batch views.
Updated by batch layer.
Inde...
Speed Layer

Speed Layer
Serving Layer
Batch Layer

Compensates for high-latency batch views.
Fast, incremental algorithms...
Lambda Architecture
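[Diagram: incoming data feeds both the batch layer and the speed layer; the serving layer holds batch views, the speed layer holds realtime views, and a query reads from both.]
13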
Available Data
Batch View
Time

Realtime View
Discard the realtime view as soon as it is represented in the batch view.

Batc...
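The interplay of batch view and realtime view can be sketched as follows. This is a toy model, not a real implementation: the class `LambdaCounts`, its method names, and the use of integer timestamps are all assumptions, and both layers live in one Python object for brevity.

```python
from collections import Counter


class LambdaCounts:
    """Toy page-view counter with a batch view and a realtime view."""

    def __init__(self):
        self.master_dataset = []   # append-only list of (time, url) events
        self.batch_view = Counter()
        self.realtime_view = []    # recent events not yet in the batch view

    def record(self, time, url):
        self.master_dataset.append((time, url))
        # Speed layer: cheap incremental update for low-latency reads.
        self.realtime_view.append((time, url))

    def run_batch(self, cutoff):
        # Batch layer: recompute the view from scratch up to the cutoff.
        self.batch_view = Counter(
            url for t, url in self.master_dataset if t <= cutoff
        )
        # Discard realtime data that the batch view now represents.
        self.realtime_view = [
            (t, url) for t, url in self.realtime_view if t > cutoff
        ]

    def query(self, url):
        # A query merges the batch view with the realtime view.
        recent = sum(1 for _, u in self.realtime_view if u == url)
        return self.batch_view[url] + recent


counts = LambdaCounts()
counts.record(1, "/home")
counts.record(2, "/home")
counts.record(3, "/about")
before_batch = counts.query("/home")
counts.run_batch(cutoff=2)
counts.record(4, "/home")
after_batch = counts.query("/home")
```

Because the batch view is always recomputed from the immutable master dataset, a bug in the incremental speed-layer logic is overwritten by the next batch run.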
Twitter’s Early Days
[Diagram: tweets flow through a chain of queues and workers; labels include Tweets, Queue, Worker, Map, ...]
Storm
Guaranteed message processing without message brokers.
Horizontal scalability.
Fault-tolerance.
High level of abstr...
Storm Topologies
[Diagram: a topology of spouts and bolts connected by streams; labels: Stream, Spout, ⚡️Bolt]

17
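To give a feel for the spout and bolt vocabulary, here is a toy word-count pipeline built from plain Python generators. It deliberately does not use the Storm API: the function names are hypothetical, everything runs in a single process, and nothing here provides Storm's distribution or processing guarantees.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of a stream (here a fixed list of sentences)."""
    for sentence in ["storm is fast", "storm is fault tolerant"]:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes a stream and emits one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps running counts and emits (word, count) updates."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology():
    # Wire spout -> split bolt -> count bolt and keep the latest counts.
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        final[word] = count
    return final
```

In this sketch the "topology" is just the nesting of the three generators; in Storm the wiring is declared separately and each component can run as many parallel tasks.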
Parallel Tasks
[Diagram: each spout and bolt runs as several parallel tasks (T) connected by streams; labels: Task, Spout, ⚡️Bolt, Stream]

18
Demo

Storm in action

19
Know
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analy...
The End.

Questions?

21
A talk by Philipp Nolte from the advanced seminar "Personalisierung mit großen Daten" (Personalization with Big Data).

Transcript of "Realtime Distributed Analysis of Datastreams"

  1. Realtime Distributed Analysis of Datastreams. Philipp Nolte – University of Passau – January 2014
  2. Learn: Why we need fancy Big Data frameworks. What the lambda architecture looks like. How Twitter used to do real-time analytics. Why Twitter created Storm. How Storm works.
  3. Limits: Imagine traditional web analytics software: every page view increments the URL’s database row.
  4. First Aid: Queue your writes and write in batches. Shard your data: partition horizontally.
  5. Chronic Issues: Fault-tolerance is hard. Applications become more and more complex. You have to do all the work.
  6. New Tools: Large-scale computation systems such as Hadoop. Scalable databases such as Cassandra and Riak. Easy-to-use frameworks such as Storm and Dempsy.
  7. Lambda Architecture: Theoretical, abstract architecture for working with big data. [Diagram labels: Speed Layer, Serving Layer, Batch Layer]
  8. Goal: Compute arbitrary functions on arbitrary data. query = function ( all data )
  9. Properties: Robust and fault-tolerant. Low latency reads and updates. Scalable. Minimal maintenance.
  10. Batch Layer: Stores the immutable master dataset. Precomputes arbitrary batch views. Home of batch processing and MapReduce systems such as Hadoop.
  11. Serving Layer: Read-only random access to batch views. Updated by batch layer. Indexes batch views. Home of real-time query systems such as Cloudera Impala for Hadoop.
  12. Speed Layer: Compensates for high-latency batch views. Fast, incremental algorithms. More complex because of random writes. Home of Apache HBase or Storm.
  13. Lambda Architecture: [Diagram labels: Data, Batch Layer, Serving Layer, Batch Views, Speed Layer, Realtime Views, Query]
  14. [Chart labels: Available Data, Time, Batch View, Realtime View] Discard the realtime view as soon as it is represented in the batch view.
  15. Twitter’s Early Days: [Diagram labels: Tweets, Queue, Worker, Map, URLs, Hadoop, Cassandra]
  16. Storm: Guaranteed message processing without message brokers. Horizontal scalability. Fault-tolerance. High level of abstraction. Just works.
  17. Storm Topologies: [Diagram labels: Stream, Spout, ⚡️Bolt]
  18. Parallel Tasks: [Diagram labels: Task (T), Spout, ⚡️Bolt, Stream]
  19. Demo: Storm in action
  20. Know: Why we need fancy Big Data frameworks. What the lambda architecture looks like. How Twitter used to do real-time analytics. Why Twitter created Storm. How Storm works.
  21. The End. Questions?