
Big Data Computing Architecture



This deck covers the evolution of big data computing architectures.



  1. Evolution of Big Data Architecture (Gang Tao)
  2. Computing Trend
  3. Non-Functional Requirements • Latency • Throughput • Fault tolerance • Scalability • Exactly-once semantics
  4. Hadoop
  5. Hadoop Ecosystem
  6. MapReduce: move computation to the data
  7. MapReduce
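
To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python (simulating the model, not the Hadoop API): the map step emits (word, 1) pairs, a shuffle groups them by key, and the reduce step sums the counts per key.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values observed for one key."""
    return key, sum(values)

documents = ["big data big ideas", "data moves to computation"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'moves': 1, 'to': 1, 'computation': 1}
```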
  8. Hadoop: the Limitations • MapReduce is hard to use • Latency is high • Data movement is inevitable
  9. Lambda
  10. Lambda Architecture: query = function(all data)
  11. Design Principles • Human fault tolerance – the system must be immune to data loss or data corruption, because at scale either could be irreparable. • Data immutability – store data in its rawest form, immutable and in perpetuity (INSERT/SELECT/DELETE but no UPDATE!). • Recomputation – with the two principles above, it is always possible to (re)compute results by running a function over the raw data.
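
As an illustration of these principles, a toy append-only store in Python (the `EventStore` class is hypothetical, not from any library): inserts and reads are allowed, updates are not, so any derived result can always be rebuilt by recomputing over the raw events.

```python
class EventStore:
    """Append-only store: events can be inserted and read, never updated."""
    def __init__(self):
        self._log = []

    def insert(self, event):
        self._log.append(event)  # raw data is kept in perpetuity

    def select(self):
        return iter(self._log)   # no UPDATE operation exists

def recompute_balance(store):
    """Recomputation: derive state by running a function over raw data."""
    return sum(e["amount"] for e in store.select())

store = EventStore()
store.insert({"amount": 100})
store.insert({"amount": -30})
print(recompute_balance(store))  # 70; can be rebuilt from scratch at any time
```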
  12. Lambda Architecture
  13. Batch Processing
  14. Batch Processing: using only batch processing always leaves you with a portion of unprocessed data.
  15. Realtime Stream Processing
  16. query = function(all data) + function(new data)
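
One way to read the formula: the answer to a query merges a batch view precomputed over all historical data with a realtime view over data that arrived since the last batch run. A minimal sketch, with two dictionaries standing in for hypothetical serving-layer and speed-layer views:

```python
# Batch layer: recomputed periodically over the full immutable dataset.
batch_view = {"page_a": 1000, "page_b": 500}

# Speed layer: incremental counts for events since the last batch run.
realtime_view = {"page_a": 12, "page_c": 3}

def query(page):
    """query = function(all data): merge the batch and realtime views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query("page_a"))  # 1012 = 1000 from batch + 12 from realtime
```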
  17. Lambda Architecture
  18. A Reference Case
  19. Lambda: the Good and the Bad • Good: immutable data, reprocessing • Bad: keeping code written in two different systems perfectly in sync
  20. Overcomplicated Lambda
  21. Batch vs. Stream
  22. Stream & Realtime
  23. Storm Logical View • Spouts and bolts • Move data to the computation
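
To illustrate the logical view, a toy pipeline in plain Python (not the actual Storm API): a spout emits tuples from a source and bolts transform them, so data flows to the computation.

```python
def sentence_spout():
    """Spout: the source of the stream; emits tuples into the topology."""
    for sentence in ["storm moves data", "data flows to bolts"]:
        yield sentence

def split_bolt(stream):
    """Bolt: transforms each incoming tuple (here, splits into words)."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: stateful aggregation over the stream."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire spout -> bolt -> bolt, mirroring a topology's directed edges.
print(count_bolt(split_bolt(sentence_spout())))
```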
  24. Storm Deployment View
  25. Code Comparison
  26. DAG • DAG: Directed Acyclic Graph • Used in Spark, Storm, Flink, etc.
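
A short sketch of what executing a DAG involves: topologically sort the operators so each runs only after its inputs, which is the scheduling idea shared by these engines. The graph below is hypothetical; it uses Python's standard-library graphlib (3.9+).

```python
from graphlib import TopologicalSorter

# Edges map each operator to the operators it depends on.
dag = {
    "read":   [],
    "parse":  ["read"],
    "filter": ["parse"],
    "count":  ["filter"],
    "join":   ["filter", "parse"],
}

# A valid execution order: every node runs after all of its inputs.
order = list(TopologicalSorter(dag).static_order())
print(order)  # e.g. ['read', 'parse', 'filter', 'count', 'join']
```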
  27. Out of Control: Build Complexity with Simplicity
  28. Stream Processing Models: one-at-a-time and micro-batch
  29. Stream Processing Models Compared

      |                          | One-at-a-time | Micro-batch |
      |--------------------------|---------------|-------------|
      | Low latency              | Y             | N           |
      | High throughput          | N             | Y           |
      | At-least-once            | Y             | Y           |
      | Exactly-once             | Sometimes     | Y           |
      | Simple programming model | Y             | N           |
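
The table falls out of the unit of work each model uses, as this sketch suggests (the `process` function is a hypothetical stand-in for real work): one-at-a-time handles each event on arrival for low latency, while micro-batch buffers events and handles them in groups for throughput.

```python
import itertools

def process(events):
    """Stand-in for real work; amortizing its overhead favors batching."""
    return [e * 2 for e in events]

def one_at_a_time(stream):
    """Low latency: each event is processed the moment it arrives."""
    for event in stream:
        yield from process([event])

def micro_batch(stream, batch_size=3):
    """High throughput: buffer events and process them in small batches."""
    it = iter(stream)
    while batch := list(itertools.islice(it, batch_size)):
        yield from process(batch)

events = range(7)
assert list(one_at_a_time(events)) == list(micro_batch(events))
```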
  30. Stream Computing: the Limitations • Queries must be written before the data arrives; there is no built-in way to query past data • Queries cannot be run twice: all results are lost when an error occurs, and the data is already gone by the time a bug is found • Out-of-order events break results: should queries be based on recorded (event) time or on arrival time?
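
On the last point, a small sketch with made-up timestamps shows why the distinction matters: windowing by arrival time and by recorded (event) time give different counts once an event arrives out of order.

```python
from collections import Counter

# Each event records when it happened (event_time) and when it reached
# the system (arrival_time). One event arrives late and out of order.
events = [
    {"event_time": 1, "arrival_time": 1},
    {"event_time": 2, "arrival_time": 2},
    {"event_time": 4, "arrival_time": 7},  # late arrival
    {"event_time": 6, "arrival_time": 6},
]

def window_counts(events, time_key, size=5):
    """Count events per fixed window of `size` time units."""
    return Counter(e[time_key] // size for e in events)

print(window_counts(events, "event_time"))    # Counter({0: 3, 1: 1})
print(window_counts(events, "arrival_time"))  # Counter({0: 2, 1: 2})
```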
  31. Batch/Stream Unification
  32. Stream Processing with Spark
  33. Apache Flink
  34. Flink Distributed Snapshots
  35. Fault Tolerance in Streams • At-least-once: ensure all operators see all events, by replaying the stream on failure • Exactly-once: Flink uses distributed snapshots; Spark uses micro-batches
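
A sketch of the at-least-once mechanism under an assumed setup: the consumer checkpoints its offset only after processing succeeds, so a crash triggers replay from the last checkpoint; an event may then be processed twice, but never skipped.

```python
def run(stream, checkpoint, process, crash_before_checkpoint_at=None):
    """At-least-once: checkpoint the offset only after processing succeeds."""
    for i in range(checkpoint["offset"], len(stream)):
        process(stream[i])
        if i == crash_before_checkpoint_at:
            raise RuntimeError("crashed after processing, before checkpoint")
        checkpoint["offset"] = i + 1

stream = ["a", "b", "c"]
checkpoint, seen = {"offset": 0}, []
try:
    run(stream, checkpoint, seen.append, crash_before_checkpoint_at=1)
except RuntimeError:
    run(stream, checkpoint, seen.append)  # replay from the last checkpoint
print(seen)  # ['a', 'b', 'b', 'c'] -- 'b' is seen twice, but nothing is lost
```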
  36. Jay Kreps
  37. Kappa Architecture
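
In the Kappa architecture (Kreps' alternative to Lambda), the batch layer disappears: the canonical store is a replayable log, and reprocessing means replaying the whole log through a new version of the one streaming job, then switching the serving table. A sketch with hypothetical job functions:

```python
# The canonical store is an immutable, replayable log (think Kafka).
log = [{"page": "a"}, {"page": "b"}, {"page": "a"}]

def job_v1(events):
    """Streaming job, version 1: count views per page."""
    counts = {}
    for e in events:
        counts[e["page"]] = counts.get(e["page"], 0) + 1
    return counts

def job_v2(events):
    """Version 2 with changed logic: count only views of page 'a'."""
    return {"a": sum(1 for e in events if e["page"] == "a")}

serving_table = job_v1(log)   # live output table
candidate = job_v2(log)       # reprocess: replay the log from offset 0
serving_table = candidate     # switch reads to the new table, drop the old one
print(serving_table)          # {'a': 2}
```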
