
Big Data Streams Architectures. Why? What? How?


With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that meets low-latency BigData analysis requirements. Apache Kafka and the Kappa Architecture in particular are attracting more and more attention compared to the classic Hadoop-centric technology stack. The new Consumer API has given a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams library are proving to be a synergy in the BigData world.



  1. Big Data Streams Architectures. Why? What? How? Anton Nazaruk, CTO @ VITech+
  2. Big Data in 2016+?
  3. Big Data in 2016+? ● No longer an exotic buzzword ● Mature enough and already adopted by the majority of businesses/companies ● A set of well-defined tools and processes… questionable ● Data Analysis at scale - getting value from your data! ○ Prescriptive - reveals what action should be taken ○ Predictive - analysis of likely scenarios of what might happen ○ Diagnostic - past analysis, shows what happened and why (classic) ○ Descriptive - real-time analytics (stocks, healthcare..)
  4. Big Data analysis challenges ● Integration - ability to have the needed data in the needed place ● Latency - data has to be available for processing immediately ● Throughput - ability to consume/process massive volumes of data ● Consistency - data mutation in one place must be reflected everywhere ● Team collaboration - inconvenient interfaces for inter-team communication ● Technology adoption - a typical technology stack greatly complicates the entire project ecosystem - another world of hiring, deployment, testing, scaling, fault tolerance, upgrades, monitoring, etc.
  5. It’s a challenge!
  6. Evolutionary system
  7. Solution: The Event Log. “What every software engineer should know about real-time data's unifying abstraction” - https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
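The abstraction on this slide is small enough to sketch in code. Below is a minimal, purely illustrative append-only log in Java (the EventLog class and its methods are made up for this sketch, not part of any library): writers append records that receive sequential offsets, and each reader keeps its own offset, which is what lets many consumers share one log and replay it after failures.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of an append-only event log: writers append,
// readers track their own position (offset) and can replay at will.
class EventLog {
    private final List<String> records = new ArrayList<>();

    // Append a record and return the offset it was stored at.
    synchronized long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record stored at a given offset.
    synchronized String read(long offset) {
        return records.get((int) offset);
    }

    // The offset that the next appended record will get.
    synchronized long endOffset() {
        return records.size();
    }
}
```

A reader is just a cursor over offsets, which is why the same log can feed many independent consumers at different speeds and be re-read from any earlier position.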
  8. The Event Log
  9. Reference architecture
  10. Transition case
  11. Unified ordered event log
  12. Kafka ● Fast - a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients ● Scalable - can be elastically and transparently expanded without downtime ● Durable - messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact ● Reliable - has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees
  13. Kafka - high level view
  14. Kafka - building blocks ● Producer - a process that publishes messages to a Kafka topic ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - a part of a topic: the level of parallelism in Kafka. Write/read order is guaranteed at the partition level
  15. Kafka - building blocks ● Producer - a process that publishes messages to a Kafka topic ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - a part of a topic: the level of parallelism in Kafka. Write/read order is guaranteed at the partition level ● Replica - an up-to-date copy of a partition. Each partition is replicated across a configurable number of servers for fault tolerance (like an HDFS block)
  16. Kafka - building blocks ● Producer - a process that publishes messages to a Kafka topic ● Topic - a category or feed name to which messages are published. For each topic, the Kafka cluster maintains a partitioned log ● Partition - a part of a topic: the level of parallelism in Kafka. Write/read order is guaranteed at the partition level ● Replica - an up-to-date copy of a partition. Each partition is replicated across a configurable number of servers for fault tolerance (like an HDFS block) ● Consumer - a process that subscribes to topics and processes published messages
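To make these building blocks concrete, here is a minimal producer sketch against the Kafka Java client (the topic name "events", the record key and the broker address are assumptions for the example): the record key determines the target partition, which is where the per-partition write/read ordering guarantee comes from.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition,
            // so ordering is preserved per key within that partition.
            producer.send(new ProducerRecord<>("events", "user-42", "page_view"));
        }
    }
}
```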
  17. Kafka - building blocks ● Consumer - a process that subscribes to topics and processes published messages
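And a matching consumer sketch, assuming the 0.9-era Java client with its poll(long) call and the same illustrative topic: consumers that share a group.id divide the topic's partitions among themselves, which gives the load-balancing behaviour mentioned on a later slide.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-service");  // consumers in one group share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                // poll() returns the next batch of records, ordered within each partition.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```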
  18. Stream Processing - high level
  19. Stream Processing - possible implementation frameworks: Apache Storm, Apache Spark, Apache Samza, Apache Flink, Apache Flume ...
  20. Stream Processing - possible implementation frameworks ● Pros ○ Automatic fault tolerance ○ Scaling ○ Guarantees of no data loss ○ Stream processing DSL/SQL (joins, filters, count aggregates, etc.) ● Cons ○ Overall system complexity grows significantly ■ A new cluster to maintain/monitor/upgrade/etc. (Apache Storm) ■ Multi-pattern (mixed) data access (Spark/Samza on YARN) ○ Another framework for your team to learn
  21. Stream Processing - microservices
  22. Stream Processing - microservices. Small, independent processes that communicate with each other to form complex applications which utilize language-agnostic APIs. These services are small building blocks, highly decoupled and focused on doing a small task, facilitating a modular approach to system-building. The microservices architectural style is becoming the standard for building modern applications.
  23. Stream Processing - microservices communication. The three most commonly used protocols are: ● Synchronous request-response calls (mainly via HTTP REST APIs) ● Asynchronous (non-blocking IO) request-response communication (Akka, Play Framework, etc.) ● Asynchronous message buffers (RabbitMQ, JMS, ActiveMQ, etc.)
  24. Stream Processing - microservices platforms. Microservices deployment platforms: ● Apache Mesos with a framework like Marathon ● Swarm from Docker ● Kubernetes ● YARN with something like Slider ● Various hosted container services such as ECS from Amazon ● Cloud Foundry ● Heroku
  25. Stream Processing - microservices. Why can’t I just package and deploy my event-processing code on a YARN / Mesos / Docker / Amazon cluster and let it take care of fault tolerance, scaling and other weird things?
  26. Stream Processing - microservices
  27. Stream Processing - microservices communication. The fourth protocol is: ● Asynchronous, ordered and manageable logs of events - Kafka
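One way to picture this fourth protocol is a microservice that sits between two logs: it consumes from an upstream topic and publishes its results to a downstream topic, with Kafka buffering data whenever the service is slow or down. The sketch below is only an illustration; the topic names, group id and the toUpperCase() stand-in for business logic are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// A microservice as a log transformer: read from an input topic,
// apply some local logic, write results to an output topic.
public class EnrichmentService {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "enrichment-service");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("raw-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    String enriched = record.value().toUpperCase();  // stand-in for real business logic
                    producer.send(new ProducerRecord<>("enriched-events", record.key(), enriched));
                }
            }
        }
    }
}
```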
  28. Stream Processing - a new era (Kafka & microservices)
  29. Stream Processing - Kafka ● New Kafka Consumer 0.9+ ○ Light - the consumer client is just a thin JAR without heavy 3rd-party dependencies (ZooKeeper, Scala runtime, etc.) ○ Acts as a load balancer ○ Fault tolerant ○ Simple-to-use API ○ Kafka Streams - an elegant DSL (should be officially released this month)
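Since Kafka Streams was still about to be released when this was presented, the following is a hedged sketch against the early (0.10-era) DSL; the application id, topic names and the ERROR filter are illustrative, and later releases renamed KStreamBuilder to StreamsBuilder.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

// Filter a stream of events inside the service itself - no separate
// processing cluster, just the Kafka Streams library on the classpath.
public class AlertFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "alert-filter");  // also acts as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> events =
                builder.stream(Serdes.String(), Serdes.String(), "events");

        // Keep only error events and write them to a dedicated topic.
        events.filter((key, value) -> value.contains("ERROR"))
              .to(Serdes.String(), Serdes.String(), "alerts");

        new KafkaStreams(builder, props).start();
    }
}
```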
  30. Stream Processing - Kafka & microservices
  31. Stream Processing - Kafka & microservices 1. Language-agnostic logs of events (buffers) 2. No backpressure on consumers (in contrast to synchronous API endpoints) 3. Fault tolerance - no data loss 4. A failed service doesn’t bring the entire chain down 5. Resuming from the last committed offset position 6. No circuit-breaker-like patterns needed 7. Smooth config management across all nodes and services
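Points 3 and 5 on this slide come down to offset management: with auto-commit disabled, a service commits offsets only after a batch has been processed, so a crashed or redeployed instance resumes from the last committed position (at-least-once delivery). A rough sketch under assumed names (topic "orders", group "billing-service"):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReliableWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "billing-service");
        props.put("enable.auto.commit", "false");  // commit only after successful processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());  // business logic placeholder
                }
                // Committed only once the whole batch succeeded; after a crash the
                // service restarts from the last committed offset (at-least-once).
                consumer.commitSync();
            }
        }
    }

    private static void process(String value) {
        System.out.println("processing " + value);
    }
}
```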
  32. Stream Processing - Kafka & microservices
  33. Lambda Architecture
  34. Kappa Architecture
  35. Kappa Architecture
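The defining move of the Kappa Architecture is that re-processing is just reading the same log again with new code. A hedged sketch of one common way to do that with the Kafka consumer (the group id, topic and recomputeView() are assumptions): deploy the new version under a fresh group.id starting from the earliest offset, let it rebuild its output, then switch readers over.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Kappa-style re-processing sketch: when the code changes, start the new
// version under a fresh group.id and let it replay the topic from offset 0,
// rebuilding the derived dataset while the old version keeps serving.
public class ReprocessingJobV2 {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-v2");        // new group => independent, empty offsets
        props.put("auto.offset.reset", "earliest");   // so the job starts from the beginning of the log
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                records.forEach(r -> recomputeView(r.value()));  // rebuild the derived dataset
            }
        }
    }

    private static void recomputeView(String event) { /* updated business logic */ }
}
```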
  36. Architectures comparison (Lambda vs Kappa) ● Processing paradigm: Batch + Streaming vs Streaming ● Re-processing paradigm: every batch cycle vs only when code changes ● Resource consumption: higher vs lower ● Maintenance/support complexity: higher vs lower ● Ability to re-create the dataset for any point in time: no (or very hard) vs yes
  37. Even more interesting comparison (Hadoop-centric system vs Kafka-centric system) ● Data replication: + vs + ● Fault tolerance: + vs + ● Scaling: + vs + ● Random reads: with HBase vs with Elasticsearch/Solr ● Ordered reads: - vs + ● Secondary indices: with Elasticsearch/Solr vs with Elasticsearch/Solr ● Storage for big files (>10M): + vs - ● TCO: higher vs lower
  38. Summary 1. Events Log centric system design - from chaos to structured architecture
  39. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation
  40. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation 3. Microservices as a distributed events processing approach
  41. Summary 1. Events Log centric system design - from chaos to structured architecture 2. Kafka as an Events Log reference storage implementation 3. Microservices as a distributed events processing approach 4. Kappa Architecture as a symbiosis of microservices & Kafka
  42. Useful links 1. “I Heart Logs” by Jay Kreps - http://shop.oreilly.com/product/0636920034339.do 2. http://confluent.io/blog 3. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying 4. “Making Sense of Stream Processing” by Martin Kleppmann 5. http://kafka.apache.org/ 6. http://martinfowler.com/articles/microservices.html
