Realtime Distributed Analysis of Datastreams

A talk by Philipp Nolte from the advanced seminar "Personalisierung mit großen Daten" ("Personalization with Big Data").

Published in: Technology

  1. Realtime Distributed Analysis of Datastreams. Philipp Nolte – University of Passau – January 2014
  2. Learn: Why we need fancy Big Data frameworks. What the lambda architecture looks like. How Twitter used to do real-time analytics. Why Twitter created Storm. How Storm works.
  3. Limits: Imagine a traditional web analytics application: every page view increments a counter in the URL’s database row.
  4. First Aid: Queue your writes and write in batches. Shard your data: partition horizontally.
  5. Chronic Issues: Fault tolerance is hard. Applications become more and more complex. You have to do all the work.
  6. New Tools: Large-scale computation systems such as Hadoop. Scalable databases such as Cassandra and Riak. Easy-to-use frameworks such as Storm and Dempsy.
  7. Lambda Architecture: A theoretical, abstract architecture for working with big data. Layers: Speed Layer, Serving Layer, Batch Layer.
  8. Goal: Compute arbitrary functions on arbitrary data. query = function(all data)
  9. Properties: Robust and fault-tolerant. Low-latency reads and updates. Scalable. Minimal maintenance.
  10. Batch Layer: Stores the immutable master dataset. Precomputes arbitrary batch views. Home of batch processing and MapReduce systems such as Hadoop.
  11. Serving Layer: Read-only random access to batch views. Updated by the batch layer. Indexes batch views. Home of real-time query systems such as Cloudera Impala for Hadoop.
  12. Speed Layer: Compensates for high-latency batch views. Fast, incremental algorithms. More complex because of random writes. Home of Apache HBase or Storm.
  13. Lambda Architecture (diagram labels): Data, Batch Layer, Serving Layer, Batch Views, Speed Layer, Realtime Views, Query.
  14. Available Data: Discard a realtime view as soon as its data is represented in the batch view. (Diagram labels: Time, Batch View, Realtime View.)
  15. Twitter’s Early Days (diagram labels): Tweets, Queues, Workers, Map, URLs, Hadoop, Cassandra.
  16. Storm: Guaranteed message processing without message brokers. Horizontal scalability. Fault tolerance. High level of abstraction. Just works.
  17. Storm Topologies (diagram labels): Spout, Stream, ⚡️Bolt.
  18. Parallel Tasks (diagram): Spouts and ⚡️Bolts, connected by Streams, each run as several parallel Tasks (T).
  19. Demo: Storm in action.
  20. Know: Why we need fancy Big Data frameworks. What the lambda architecture looks like. How Twitter used to do real-time analytics. Why Twitter created Storm. How Storm works.
  21. The End. Questions?
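
As a rough illustration of the lambda architecture slides (goal, layers, and available data), the Python sketch below answers a page-view query by merging a precomputed batch view with a realtime view, and discards the realtime counts once a batch run has absorbed them. All names (`LambdaCounter`, `record`, `run_batch`, `query`) are hypothetical and do not come from the talk or from any framework. It is a single-process toy, assuming the batch run finishes instantly. In a real deployment the batch layer might be Hadoop and the speed layer Storm.

```python
from collections import Counter


class LambdaCounter:
    """Toy page-view counter split into a batch layer and a speed layer."""

    def __init__(self):
        self.master = []                 # append-only master dataset
        self.batch_view = Counter()      # precomputed by the batch layer
        self.realtime_view = Counter()   # updated incrementally by the speed layer

    def record(self, url):
        # New data reaches both layers: it is appended to the master dataset
        # and counted immediately in the realtime view.
        self.master.append(url)
        self.realtime_view[url] += 1

    def run_batch(self):
        # Batch layer: recompute the view from all data,
        # i.e. query = function(all data).
        self.batch_view = Counter(self.master)
        # The batch view now represents everything the realtime view held,
        # so the realtime view can be discarded.
        self.realtime_view = Counter()

    def query(self, url):
        # A query merges the batch view with the realtime view.
        return self.batch_view[url] + self.realtime_view[url]


if __name__ == "__main__":
    counter = LambdaCounter()
    for url in ["/home", "/home", "/about"]:
        counter.record(url)
    counter.run_batch()       # batch view catches up, realtime view is cleared
    counter.record("/home")   # arrives after the batch run
    print(counter.query("/home"))
```

Recomputing the batch view from the full master dataset is deliberately wasteful here. It is what keeps the batch layer simple and lets it correct any errors that accumulated in the incremental realtime view.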

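
The Storm topology slides describe spouts that emit streams and bolts that consume and transform them. The sketch below mimics that shape in plain Python with generators. It is not the Storm API: the names `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical, and the sketch has none of Storm's parallel tasks, message guarantees, or fault tolerance. It only shows how a spout and two bolts chain into a word-count pipeline.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of a stream, emitting one tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Bolt: consumes words and emits (word, running count) pairs."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences):
    # Wire spout -> split bolt -> count bolt and keep the latest count per word.
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest


if __name__ == "__main__":
    print(run_topology(["storm is fast", "Storm works"]))
```

Each stage only sees the stream handed to it, which is the property that lets a real topology run many copies of a spout or bolt as parallel tasks.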