
Ai big dataconference_jeffrey ricker_kappa_architecture



Topic of presentation: Kappa architecture (and beyond)

The main points of the presentation:
We will discuss the evolution of big data architecture, from batch to Lambda to Kappa. I will walk through how to implement a Kappa architecture with practical examples, focusing on how to reach its full potential and avoid the pitfalls. We will finish by reviewing what lies ahead, including the inevitable consolidation of microservices, GPGPU and Hadoop.

Published in: Engineering

  2. 2. JEFFREY RICKER Co-founder of Ricker Lyman Robotic. Live in New York. Clients include hedge funds, pharmaceutical, retail. Amazon big data. Distributed Instruments. US Defense HPC Modernization.
  3. 3. AGENDA Review history that led to Kappa architecture; investigate the power of Kappa; review what comes next
  4. 4. HISTORY PART 1
  5. 5. “In the beginning was the command line …” Neal Stephenson
  6. 6. MAP REDUCE
  7. 7. MAP REDUCE GOOD: Hadoop distributed file system (HDFS); distributed & massively parallel; move the compute to the data; YARN. NOT SO GOOD: finite data set; batch process; begin to chain jobs together; failure? recovery? idempotent?
  8. 8. WORKFLOWS Oozie Luigi Azkaban Airflow Pinball Cascading Taskflow
  9. 9. APACHE STORM September 2011
  10. 10. AND MORE Apache Spark (2014), Apache Samza (2014), Apache Flink (2015), Apache NiFi (2015), Apache Gearpump (2016), Apache Apex (2016), Kafka Streams (2016), Akka Streams (2016)
  11. 11. STREAMS ARE DIFFERENT Data is infinite; continuous processing; there is no now; eventual consistency vs false sense of consistency; closer to reality
  12. 12. TIME
  13. 13. CONSISTENCY Trade data arrives at end of day (EOD). Processing runs to create EOD status of trades. Corrections exist for previous days. Previous EOD is also changed.
  14. 14. WINDOWING
  15. 15. UNBOUNDED
  17. 17. LAMBDA Enterprise SQL architectures have followed the same pattern for years. Requires maintaining two versions of the same logic. Joining the streaming with the batch is easier said than done.
  18. 18. KAPPA PART 2
  19. 19. APACHE KAFKA
  20. 20. THE MISSING PIECE All distributed computing has three components: (1) data (or state), (2) compute, (3) communication. We had: (1) HDFS + Hive + HBase +++, (2) YARN + Spark + Kubernetes +++, (3) ?
  21. 21. MESSAGING
  22. 22. WHAT IS KAFKA
  23. 23. WHAT IS KAFKA
  24. 24. ADVANTAGES Works as a queue; works as pub-sub; works as a storage system; scales; fast
  25. 25. DEFINITION OF KAPPA Rather than using a relational DB like SQL or a key-value store like Cassandra, the canonical data store in a Kappa Architecture system is an append-only immutable log. From the log, data is streamed through a computational system and fed into auxiliary stores for serving.
  26. 26. DEFINITION
  27. 27. STATE
  28. 28. STATE CHANGE
  29. 29. DIFFERENCE
  30. 30. BASIC CONCEPT Write immutable events to the append-only log. Recreate the state in (multiple) materialized views. Distribute the ability to maintain state in multiple systems in read-optimized formats. “Turn the database inside out”
  31. 31. RESOURCE CONTENTION Intraday: run the business (microservices). Exoday: analyze the business (Hadoop).
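The basic concept can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Kafka and not the presenter's code: `Event`, `AppendOnlyLog`, `balance_view` and `activity_view` are hypothetical names invented for the example. It shows immutable events appended to a log and two independent materialized views rebuilt by replaying that log from the beginning.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An immutable fact; frozen=True forbids mutation after creation."""
    account: str
    amount: int


class AppendOnlyLog:
    """In-memory stand-in for the canonical log (hypothetical, not Kafka)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, from_offset=0):
        # Consumers can replay from any offset, including the beginning.
        return iter(self._events[from_offset:])


def balance_view(log):
    """Materialized view 1: current balance per account."""
    balances = defaultdict(int)
    for event in log.read():
        balances[event.account] += event.amount
    return dict(balances)


def activity_view(log):
    """Materialized view 2: number of events per account."""
    counts = defaultdict(int)
    for event in log.read():
        counts[event.account] += 1
    return dict(counts)


log = AppendOnlyLog()
log.append(Event("alice", 100))
log.append(Event("bob", 50))
log.append(Event("alice", -30))  # a correction is just another event

balances = balance_view(log)
activity = activity_view(log)
print(balances)
print(activity)
```

Because the views are derived, either one can be dropped and rebuilt from the log at any time, and a new view can be added later without touching the writers.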
  32. 32. MULTIPLE CLUSTERS Kafka Microservices Hadoop Streaming Nifi
  35. 35. MICROSERVICE EXAMPLE Microservice HBase
  36. 36. MICROSERVICE EXAMPLE Microservice Kafka HBase
  37. 37. MICROSERVICE EXAMPLE Microservice Kafka HBase Hive Druid
  38. 38. BOUNDARY LAYER Stream process A Kafka Stream process B
  39. 39. DOMAIN KAPPA Data is the current state. Compute changes the state. Stream publishes the state changes.
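One way to read the three roles on this slide is as a single service that owns its data, changes it through compute, and publishes every change to a stream. The sketch below is an assumption about that reading: `InventoryService`, `receive`, `current` and the event fields are hypothetical, and a plain Python list stands in for the stream.

```python
class InventoryService:
    """Hypothetical domain service following the data/compute/stream split."""

    def __init__(self, publish):
        self._stock = {}         # data: the current state
        self._publish = publish  # stream: callback that emits state changes

    def current(self, sku):
        return self._stock.get(sku, 0)

    def receive(self, sku, quantity):
        # compute: apply the change to the current state
        before = self.current(sku)
        after = before + quantity
        self._stock[sku] = after
        # stream: publish the state change for other services to observe
        self._publish({"sku": sku, "before": before, "after": after})
        return after


stream = []  # stand-in for a topic on the log
service = InventoryService(publish=stream.append)
service.receive("widget", 5)
service.receive("widget", -2)
print(service.current("widget"))
print(stream)
```

Other services never read `_stock` directly; they see only what is published to the stream.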
  40. 40. DOMAIN KAPPA
  41. 41. DOMAIN KAPPA
  42. 42. OBSERVED STATE Stateful: the service maintains an in-memory copy of the observed state of the other service by subscribing to the stream of the other service from the beginning. Stateless: the service reads the state from the other service by request-response. Semi-stateless: a hybrid of the other two. The service subscribes to the stream of the other service and keeps a cache of the observable state. The cache is limited in size through timeouts. If the service is missing a state in its local cache, then it reads the observable state from the other service and caches it.
  44. 44. SUMMARY Canonical data store is an append-only immutable log (Kappa is not dependent on Kafka; Kafka is very good for implementing Kappa). From the log, data is streamed through a computational system and fed into auxiliary stores for serving. Auxiliary stores are materialized views. Multiple views of the same data, read-optimized. Meets resource contention requirements of the enterprise.
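The semi-stateless variant can be sketched as a small cache in Python. This is an illustration under assumptions, not the presenter's implementation: `SemiStatelessClient`, `on_stream_event`, `fetch` and `ttl_seconds` are hypothetical names, `fetch` stands in for the request-response call to the other service, and the clock is injectable so the timeout can be exercised without sleeping.

```python
import time


class SemiStatelessClient:
    """Caches observed state from a stream, with request-response fallback."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # hypothetical request-response call
        self._ttl = ttl_seconds  # cache entries expire after this long
        self._clock = clock
        self._cache = {}         # key -> (state, time stored)

    def on_stream_event(self, key, state):
        # Called for each state change received from the other service's stream.
        self._cache[key] = (state, self._clock())

    def get(self, key):
        entry = self._cache.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh cache hit
        # Missing or timed out: read from the other service and cache it.
        self._cache.pop(key, None)
        state = self._fetch(key)
        self._cache[key] = (state, now)
        return state


remote = {"order-1": "SHIPPED", "order-2": "NEW"}
calls = []


def fetch(key):
    calls.append(key)
    return remote[key]


now = [0.0]  # fake clock for the example
client = SemiStatelessClient(fetch, ttl_seconds=10, clock=lambda: now[0])
client.on_stream_event("order-1", "SHIPPED")
first = client.get("order-1")   # served from the cache
second = client.get("order-2")  # cache miss, fetched by request-response
now[0] = 11.0
third = client.get("order-1")   # entry timed out, fetched again
```

This sketch evicts expired entries only when they are read; a production cache would also need a size bound or a background sweep.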
  45. 45. NEXT PART 3
  46. 46. 1. KAFKA WILL EVOLVE Kafka Streams; exactly-once processing; continuous queries
  47. 47. 2. CONVERGENCE
  48. 48. RESOURCE MANAGERS YARN: MapReduce jobs running on the cluster; long-running services like HBase running on the cluster; why not share the resources? Kubernetes: distribute containers across a collection of servers. Mesos: an operating system for the data center.
  49. 49. SCHEDULERS Apache YARN, Kubernetes, Mesos, Docker Swarm, HashiCorp Nomad, Microsoft Apollo
  50. 50. CON/DI/VERGENCE Compute: YARN will expand to run microservices and containers; microservice and container platforms will run Hadoop. Data storage: HDFS or S3 or Ceph or ? Messaging: Kafka alternatives will arise.
  51. 51. 3. GPGPU AI frameworks: TensorFlow, MXNet, Caffe. Databases: Kinetica, MapD, SQream, Blazegraph.
  52. 52. SUPERCOMPUTER Hadoop: 12 m4 nodes x 64 cores = 768 cores. GPU: 1 p2 node x 16 GPU x 2,496 cores = 39,936 cores.
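The core counts on this slide can be checked with a short calculation. The node and core figures are taken from the slide; the variable names are only for illustration, and the ratio compares raw core counts, not per-core performance, since a GPU core and a CPU core are not equivalent.

```python
# Hadoop cluster from the slide: 12 m4 nodes with 64 cores each
hadoop_cores = 12 * 64

# GPU node from the slide: 1 p2 node with 16 GPUs of 2,496 cores each
gpu_cores = 1 * 16 * 2496

# Raw core-count ratio (says nothing about per-core speed)
ratio = gpu_cores / hadoop_cores

print(hadoop_cores, gpu_cores, ratio)  # 768 39936 52.0
```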
  53. 53. A NEW LAYER Data at rest; stream (CPU) processing; GPU processing
  54. 54. THANK YOU