Transcript of "Realtime Distributed Analysis of Datastreams"
Philipp Nolte – University of Passau – January 2014
Why we need fancy Big Data frameworks.
What the lambda architecture looks like.
How Twitter used to do real-time analytics.
Why Twitter created Storm.
How Storm works.
Imagine a traditional web analytics application:
Every page view increments
the URL's database row.
Queue your writes and write in batches.
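The batching idea above can be sketched in a few lines. This is a minimal, illustrative sketch, assuming an in-memory dict stands in for the database; the `BatchingCounter` name and the batch size are made up for the example:

```python
from collections import Counter

class BatchingCounter:
    """Queue page-view increments and flush them to storage in batches."""

    def __init__(self, store, batch_size=100):
        self.store = store            # stand-in for the database: url -> count
        self.batch_size = batch_size
        self.pending = Counter()      # queued increments, not yet written
        self.pending_total = 0

    def increment(self, url):
        self.pending[url] += 1
        self.pending_total += 1
        if self.pending_total >= self.batch_size:
            self.flush()

    def flush(self):
        # One write per distinct URL instead of one write per page view.
        for url, n in self.pending.items():
            self.store[url] = self.store.get(url, 0) + n
        self.pending.clear()
        self.pending_total = 0
```

The trade-off is durability: increments queued in memory are lost if the process dies before `flush()` runs, which is one reason fault tolerance gets hard.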
Shard your data: Partition horizontally.
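Horizontal partitioning usually means routing each key to a shard by hash, so the same key always lands on the same node. A minimal sketch (the function name and shard count are illustrative):

```python
import hashlib

def shard_for(url, num_shards):
    """Pick a shard by hashing the key; a given key always maps to the same shard."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Note that with plain modulo hashing, changing `num_shards` remaps most keys, which is why resharding a live system is painful.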
Fault-tolerance is hard.
Applications become more and more complex.
You have to do all the work.
Large scale computation systems such as Hadoop.
Scalable databases such as Cassandra and Riak.
Easy to use frameworks such as Storm and Dempsy.
Theoretical, abstract architecture for working with big data.
Compute arbitrary functions on arbitrary data.
query = function ( all data )
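The equation above is the whole idea: any query can, in principle, be recomputed from the raw, immutable events. A minimal sketch, assuming page-view events in a list (the dataset and function names are illustrative):

```python
# The master dataset is an append-only log of raw, immutable events.
master_dataset = [
    {"url": "/home", "ts": 1},
    {"url": "/home", "ts": 2},
    {"url": "/about", "ts": 3},
]

def page_views(all_data, url):
    """query = function(all data): recompute from raw events every time."""
    return sum(1 for event in all_data if event["url"] == url)
```

Recomputing from scratch on every query is robust but far too slow at scale, which is exactly what the batch and speed layers below compensate for.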
Robust and fault-tolerant.
Low latency reads and updates.
Stores the immutable master dataset.
Precomputes arbitrary batch views.
Home of batch processing and MapReduce systems such as Hadoop.
Read-only random-access to batch views.
Updated by batch layer.
Indexes batch views.
Home of real-time query systems such as Cloudera Impala for Hadoop.
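The serving layer can stay simple precisely because it is read-only: the batch layer swaps in a whole new view after each run, so no random writes are needed. A minimal sketch (the `ServingLayer` name is illustrative):

```python
class ServingLayer:
    """Read-only, random-access batch view, replaced wholesale by the batch layer."""

    def __init__(self):
        self._view = {}

    def swap_in(self, new_view):
        # The batch layer atomically replaces the whole indexed view.
        self._view = dict(new_view)

    def get(self, url):
        return self._view.get(url, 0)
```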
Compensates for high-latency batch views.
Fast, incremental algorithms.
More complex because of random writes.
Home of Apache HBase or Storm.
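Putting the layers together: the speed layer incrementally counts only the events that arrived after the last batch run, and a query merges its realtime view with the (complete but stale) batch view. A minimal sketch, assuming the same page-view events as above (names are illustrative):

```python
class SpeedLayer:
    """Incrementally counts events that arrived since the last batch run."""

    def __init__(self):
        self.realtime_view = {}

    def handle(self, event):
        # Random incremental write: this is what makes the speed layer harder.
        url = event["url"]
        self.realtime_view[url] = self.realtime_view.get(url, 0) + 1

def query(batch_view, speed_layer, url):
    # Merge the stale-but-complete batch view with the recent realtime view.
    return batch_view.get(url, 0) + speed_layer.realtime_view.get(url, 0)
```

Once a new batch run absorbs those recent events, the realtime view for them can be discarded, keeping the complex incremental state small.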