
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App


  1. Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App. New innovative ideas and core concepts for simple real-time analytic development. Compiled from Internet by @tantrieuf31 http://nguyentantrieu.info
  2. The Lambda Architecture. How it works: an immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel. You implement your transformation logic twice, once in the batch system and once in the stream processing system. You stitch together the results from both systems at query time to produce a complete answer.
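The flow described on this slide can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the event data, `batch_view`, `SpeedLayer`, and `query` are hypothetical names invented here, and plain in-memory counters stand in for the real batch and stream systems.

```python
from collections import Counter

# Hypothetical immutable event log: (user, clicks) records in arrival order.
EVENTS = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 1), ("bob", 1)]


def batch_view(events):
    """Batch layer: recompute totals from scratch over a slice of the log."""
    totals = Counter()
    for user, clicks in events:
        totals[user] += clicks
    return totals


class SpeedLayer:
    """Stream layer: the same logic, written a second time as incremental updates."""

    def __init__(self):
        self.totals = Counter()

    def on_event(self, user, clicks):
        self.totals[user] += clicks


def query(user, batch, speed):
    """Serving layer: stitch both views together at query time."""
    return batch[user] + speed.totals[user]


# Assume the batch job has only processed the first three records so far;
# the stream layer covers the records that arrived afterwards.
batch = batch_view(EVENTS[:3])
speed = SpeedLayer()
for user, clicks in EVENTS[3:]:
    speed.on_event(user, clicks)

print(query("bob", batch, speed))  # prints 3
```

Note that the counting logic appears twice, once in `batch_view` and once in `SpeedLayer.on_event`, which is the duplication the slide refers to.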
  3. And the bad… The problem with the Lambda Architecture is that maintaining code that needs to produce the same result in two complex distributed systems is exactly as painful as it seems like it would be. I don’t think this problem is fixable. Programming in distributed frameworks like Storm and Hadoop is complex. Inevitably, code ends up being specifically engineered toward the framework it runs on. The resulting operational complexity of systems implementing the Lambda Architecture is the one thing that seems to be universally agreed on by everyone doing it.
  4. Too hard to build flexible analytics pipelines. Google is now making a huge point of the fact that it abandoned MapReduce a long time ago, as it was too hard to build flexible analytics pipelines. http://java.dzone.com/articles/google-cloud-dataflow-%E2%80%93-game
  5. http://youtu.be/TnLiEWglqHk - Google I/O 2014 - The dawn of "Fast Data"
  6. Google's new Dataflow. Google's new Dataflow architecture is based on FlumeJava and MillWheel, and it also supports code sharing. Cloud Dataflow is a successor to MapReduce and builds on Google’s internal technologies such as Flume and MillWheel. The new project, which runs on Google's servers, can be considered the natural evolution of MapReduce.
  7. MillWheel: Fault-Tolerant Stream Processing at Internet Scale
  8. FlumeJava: Easy, Efficient Data-Parallel Pipelines
  9. Lightweight Lambda Architecture. The stream processing system could be improved to handle the full problem set in its target domain. → Kappa Architecture + Micro-service
  10. Ideas
      ● Use Kafka or some other system that will let you retain the full log of the data you want to be able to reprocess and that allows for multiple subscribers. For example, if you want to reprocess up to 30 days of data, set your retention in Kafka to 30 days.
      ● When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table.
      ● When the second job has caught up, switch the application to read from the new table.
      ● Stop the old version of the job, and delete the old output table.
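The reprocessing steps above can be walked through with a minimal Python sketch. Everything here is an assumption made for illustration: a list stands in for the retained Kafka log, dictionaries stand in for output tables, and `count_v1`, `count_v2`, `run_job`, and the table names are hypothetical. No real Kafka client calls are used.

```python
# Hypothetical stand-in for a retained log (e.g. 30 days of records).
LOG = ["a", "b", "a", "c", "a"]


def count_v1(record, table):
    """Old job logic: count records by their raw key."""
    table[record] = table.get(record, 0) + 1


def count_v2(record, table):
    """New job logic (an invented change): count by upper-cased key."""
    key = record.upper()
    table[key] = table.get(key, 0) + 1


def run_job(logic, log, table, start=0, end=None):
    """Process log[start:end] into table and return the offset reached."""
    end = len(log) if end is None else end
    for offset in range(start, end):
        logic(log[offset], table)
    return end


tables = {"output_v1": {}, "output_v2": {}}
active_table = "output_v1"  # the table the application currently reads

# The old job is already running and has consumed the whole log.
old_offset = run_job(count_v1, LOG, tables["output_v1"])

# Start a second instance from the beginning of the retained data,
# writing to a new output table.
new_offset = run_job(count_v2, LOG, tables["output_v2"], start=0)

# Once the second job has caught up, switch the application to the new table.
if new_offset >= old_offset:
    active_table = "output_v2"

# Stop the old job (nothing left to run in this sketch) and delete its table.
del tables["output_v1"]

print(active_table, tables[active_table])
```

The design point is that the application never reads a half-built table: the switch happens only after the new job's offset has reached the old job's offset.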
  11. Background. Kafka maintains ordered logs (diagram on the original slide).
  12. A Kafka “topic” is a collection of these logs (diagram on the original slide).
  13. Big picture of Kafka (diagram on the original slide).
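Since the slide diagrams are not reproduced here, the idea of a topic as a collection of ordered logs can be approximated with a small in-memory model. This is a sketch only: the `Topic` class and its methods are hypothetical, and the CRC32-based partition choice is an assumption made for determinism, not Kafka's actual partitioner.

```python
import zlib


class Topic:
    """Toy model: a topic is a list of append-only logs (partitions)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record; records with the same key land in the same log."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        # The offset is the record's position within its own log.
        return partition, len(log) - 1

    def read(self, partition, offset):
        """Return every record in one log from the given offset onwards."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=2)
p0, o0 = topic.append("user-1", "login")
p1, o1 = topic.append("user-1", "click")
p2, o2 = topic.append("user-1", "logout")

# Each subscriber tracks its own offset, so several readers can consume
# the same log independently, including re-reading from offset 0.
print(topic.read(p0, 0))
print(topic.read(p0, 1))
```

Ordering is guaranteed only within a single log in this model, which is why records for one key are routed to one partition.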
  14. Reference links
      1. http://cloudtimes.org/2014/07/07/mapreduce-successor-google-cloud-dataflow-is-a-game-changer-for-hadoop-thunder
      2. http://martinfowler.com/articles/microservices.html
      3. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
      4. http://martinfowler.com/eaaDev/EventSourcing.html
      5. http://martinfowler.com/bliki/CQRS.html
      6. http://www.michael-noll.com/blog/2013/03/13/running-a-multi-broker-apache-kafka-cluster-on-a-single-node/
      7. http://kafka.apache.org
      8. http://www.mc2ads.com