Kafka Summit SF 2017 - Worldwide Scalable and Resilient Messaging Services with Kafka and Kafka Streams

ChatWork is a worldwide communication service used by 110k+ customer organizations. In 2016, we developed a new scalable infrastructure based on the ideas of CQRS and Event Sourcing, using Kafka and Kafka Streams combined with Akka and HBase. In this session, we talk about the concepts behind this architecture and lessons learned from production use.

Presented at Kafka Summit SF 2017 by Masaru Dobashi and Shingo Omura

  1. Kafka Summit SF 2017 Worldwide Scalable and Resilient Messaging Services by CQRS and Event Sourcing using Akka, Kafka Streams and HBase Shingo Omura, ChatWork Co., Ltd. Masaru Dobashi, NTT DATA Corporation © ChatWork and NTT DATA Corporation. 1
  2. Kafka Summit SF 2017 Agenda • Introduction of Us and Our Service "ChatWork" • Technical Debts That Blocked Our Growth • Our Approach: CQRS + Event Sourcing with Akka, Kafka, HBase • Technical Considerations for Building the Architecture with Kafka © ChatWork and NTT DATA Corporation. 2
  3. Kafka Summit SF 2017 Who am I? Shingo Omura • Senior Software Engineer at ChatWork Co., Ltd. • Specialized in distributed and concurrent computing About ChatWork Co., Ltd. • Founded in 2004, in Japan • 79 employees in total • Raised $15M in funding so far • 3 office locations: Japan, Taipei, U.S. (California) © ChatWork and NTT DATA Corporation. 3
  4. Kafka Summit SF 2017 Who am I? Masaru Dobashi • Senior Software Engineer and Architect of IT Platform • Specialized in distributed computing, open source software and infrastructure About NTT DATA Corporation Common Stock • 142,520 million yen (as of March 31, 2016) Business Area • System integration • Networking system services • Other business activities related to the above © ChatWork and NTT DATA Corporation. 4
  5. Kafka Summit SF 2017 Collaboration between ChatWork and NTT DATA • ChatWork is the project owner. • In this project, NTT DATA provides technical support for the messaging systems and data stores. © ChatWork and NTT DATA Corporation. 5
  6. Kafka Summit SF 2017 ChatWork (http://chatwork.com) We Change World Works • ChatWork is an enterprise-grade global team collaboration platform • Group chat, file sharing, task management and video conferencing, all in one place • All devices supported (PC, Android, iOS), 6 languages supported © ChatWork and NTT DATA Corporation. 6
  7. Kafka Summit SF 2017 Demo © ChatWork and NTT DATA Corporation. 7
  8. Kafka Summit SF 2017 ChatWork (http://chatwork.com) Easy for cross-organizational communication • Chat room and user namespaces are shared across the whole service • You don't need to sign in to multiple organizations • You can add anyone, even from another organization, to chat rooms • Stats • 60% of users use it for internal/external communication • 10% of users use it only for external communication • Typical use cases = business collaboration with their partners • Publishers and writers • Franchise/branch operations • Consulting firms and their clients (accounting, law, etc.) © ChatWork and NTT DATA Corporation. 8
  9. Kafka Summit SF 2017 ChatWork Grows Rapidly 138,000 companies in 205 countries and regions © ChatWork and NTT DATA Corporation. 9
  10. Kafka Summit SF 2017 ChatWork Grows Rapidly 2 billion messages sent globally! The number of messages sent on ChatWork has increased along with user growth. © ChatWork and NTT DATA Corporation. 10
  11. Kafka Summit SF 2017 Characteristics of Our Workload • 95% of message requests are "read" • A large portion of reads are for "recent" messages • But users sometimes jump to very old messages via message links • Every task and file has associated message links © ChatWork and NTT DATA Corporation. 11
  12. Kafka Summit SF 2017 Technical Debts That Blocked Our Growth Cannot scale up anymore • Using the biggest instance type (db.r3.8xlarge) • Should be able to scale out ACID doesn't scale • ACID makes performance hard to tune • We decided to accept a weaker consistency model Monolith is hard • to deploy, to maintain, to optimize © ChatWork and NTT DATA Corporation. 12
  13. Kafka Summit SF 2017 What We Want To Get from the New Messaging Backend Different scalability and resiliency levels • Stateless servers (API servers) • Can be fully elastic automatically, with high throughput and low latency • Fault-tolerant and self-healing • Stateful servers (storage) • No need to be automatically elastic, just scalable when needed • Expected to be fault-tolerant and somewhat resilient • Durability and predictability are important Acceptable consistency level • Eventual consistency is acceptable with a reasonable/tunable delay • Every member in a chatroom should see message events in the same order © ChatWork and NTT DATA Corporation. 13
  14. Kafka Summit SF 2017 Our Approach: CQRS (Command Query Responsibility Segregation) Build the read side and the write side independently Pros: easy to optimize and stay flexible • Data structure • De-normalized data models can be used for read models • Database middleware • Focus on either read-heavy or write-heavy workloads • System capacity • Can control capacity of each side independently Cons: • Complexity confined to data transformation • Operational overhead © ChatWork and NTT DATA Corporation. 14
  15. Kafka Summit SF 2017 Our Approach: CQRS + Event Sourcing Event source • A history of every change in application state • Stored in sequence The write model database can be append-only • An event is a fact: it has already been validated and authorized • Facts are never updated by nature © ChatWork and NTT DATA Corporation. 15 (Diagram: the Write API produces events such as MessageCreated(id=1, room=a), MessageCreated(id=2, room=b), MessageUpdated(id=1, room=a), MessageUpdated(id=1, room=a), MessageDeleted(id=2, room=b), MessageCreated(id=3, room=c); event consumers consume them)
  16. Kafka Summit SF 2017 Our Approach: CQRS + Event Sourcing Easy to build/rebuild the read model eventually • We can apply each event to the read model iteratively • This can be seen as pre-computing query results incrementally • This process can be replayed to rebuild the read model when needed, e.g. after an incident (see the sketch below) © ChatWork and NTT DATA Corporation. 16
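To make the read-model idea on this slide concrete, here is a minimal Scala sketch; the event and model names are hypothetical, not ChatWork's actual code. Domain events are immutable facts, the read model is a fold over the event sequence, and rebuilding it simply means replaying the append-only log.

    sealed trait MessageEvent { def messageId: Long; def roomId: String }
    case class MessageCreated(messageId: Long, roomId: String, body: String) extends MessageEvent
    case class MessageUpdated(messageId: Long, roomId: String, body: String) extends MessageEvent
    case class MessageDeleted(messageId: Long, roomId: String) extends MessageEvent

    // Read model: the latest body per message id, pre-computed incrementally.
    def applyEvent(model: Map[Long, String], e: MessageEvent): Map[Long, String] = e match {
      case MessageCreated(id, _, body) => model + (id -> body)
      case MessageUpdated(id, _, body) => model + (id -> body)
      case MessageDeleted(id, _)       => model - id
    }

    // Rebuilding the read model is just replaying the (append-only) event log.
    val events: Seq[MessageEvent] = Seq(
      MessageCreated(1, "a", "hi"), MessageUpdated(1, "a", "hi!"), MessageCreated(2, "b", "yo"))
    val readModel: Map[Long, String] = events.foldLeft(Map.empty[Long, String])(applyEvent)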
  17. Kafka Summit SF 2017 Overall Architecture © ChatWork and NTT DATA Corporation. 17 (Architecture diagram, including logs and metrics pipelines)
  18. Kafka Summit SF 2017 Why did we choose these products? The main reasons • Akka • Minimal footprint, with performance and resiliency by design • Kafka • High throughput and the flexibility of a pub/sub-based architecture • Kafka Streams • Simplicity of the basic design • HBase • High throughput and stability at large scale © ChatWork and NTT DATA Corporation. 18
  19. Kafka Summit SF 2017 Summary of Actual Performance • Write API • Throughput (in stress test): 40x of current peak with only 2 write-api pods (4 cores & 5GB mem/pod on m4.2xlarge instances) • Latency (in production): 200ms → 80ms • Produce time to Kafka brokers = 20ms (in production) © ChatWork and NTT DATA Corporation. 19
  20. Kafka Summit SF 2017 Summary of Actual Performance • Read API • Throughput (in stress test): current peak with 4 read-api pods (4 cores & 5GB mem/pod on m4.2xlarge instances) • Latency (in production): 70ms → 70ms • HBase's block cache hit rate = 99%!!! (in production) © ChatWork and NTT DATA Corporation. 20
  21. Kafka Summit SF 2017 Summary of Actual Performance • Read Model Updater • Time lag until the read model is updated: 80ms (in production) • Resilient enough • An Akka supervisor can safely restart Kafka Streams with exponential backoff, without stopping pods • The Kafka consumer group itself is also resilient enough • Partition rebalancing happens automatically even when some consumer pods are down (e.g. during a rolling update), so event mutation keeps being processed © ChatWork and NTT DATA Corporation. 21
  22. Kafka Summit SF 2017 Technical Consideration Topics • Kafka as a domain event source • Guaranteeing message ordering • Kafka as a cushioning layer • Offset management • Kafka as an event pipeline • Leveraging the pub/sub model to evolve our services • Kafka in the cloud • Several tips for running Kafka clusters in AWS © ChatWork and NTT DATA Corporation.
  23. Kafka Summit SF 2017 Guaranteeing message ordering • Basic rule: use the partition as the unit of message ordering in Kafka (We use the default partitioner with the chatroom id as the key, as sketched below) • Caution: preserve the (key → partition) mapping even when you increase partitions dynamically! (We operate a fixed number (1,000) of partitions; in practice it can't be changed) Kafka as a domain event source: © ChatWork and NTT DATA Corporation. 23 (Diagram: the partitioner maps chatroom ids to partitions — events in rooms a, d, g, ... go to partition 1; rooms b, e, h, ... to partition 2; rooms c, f, i, ... to partition 3)
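For illustration, a minimal Scala sketch of producing with the default partitioner and the chatroom id as the key; the topic name, broker address and payload are made up for this example. All events of one chatroom hash to the same partition, which is what gives per-room ordering.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer   = new KafkaProducer[String, String](props)
    val chatroomId = "room-42"                               // the record key
    val eventJson  = """{"type":"MessageCreated","id":1}"""  // the serialized domain event
    // The default partitioner hashes the key, so every event of room-42
    // always lands on the same partition and keeps its order.
    producer.send(new ProducerRecord[String, String]("message-events", chatroomId, eventJson))
    producer.close()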
  24. Kafka Summit SF 2017 Our Scale-out Strategy, Preserving the Event Ordering Guarantee • Partition level • Use the default partitioner – if you want throughput, don't put expensive logic in the partitioner • Use a fixed (large enough) number of partitions per topic • Add brokers and re-assign partitions to them to increase throughput • The number of useful brokers is bounded by the number of partitions of the topic • Kafka doesn't support automatic re-assignment of partitions across brokers • Topic level • Introduce a new topic when more throughput is needed (the new topic can be used for new chatroom ids) • and modify the write-api to decide the destination topic (see the sketch below) • Cluster level • Introduce another cluster when more throughput is needed (the new cluster can be used for new chatroom ids) • and modify the write-api to decide the destination cluster Kafka as a domain event source: © ChatWork and NTT DATA Corporation. 24
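The topic-level bullet ("modify the write-api to decide the destination topic") could look like the following hypothetical routing function; the cutoff id and topic names are invented for this sketch. The point is that existing rooms never change topic, so the (key → partition) mapping they rely on is left untouched.

    // Rooms created before the new topic was introduced stay on the old topic;
    // only rooms with ids at or above the (hypothetical) cutoff use the new one.
    val newTopicCutoff = 10000000L
    def destinationTopic(chatroomId: Long): String =
      if (chatroomId < newTopicCutoff) "message-events-v1" else "message-events-v2"

The same idea extends to the cluster level by routing new rooms to a different set of bootstrap servers instead of a different topic.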
  25. Kafka Summit SF 2017 Kafka as a domain event source: Parallelism and Ordering • The parallelism configuration of each component is important to achieve both high throughput and ordering guarantees. • For example, the Kafka producer may resend messages after certain errors, and this can cause reordering when you send data in parallel. To prevent it, you can set the parameter max.in.flight.requests.per.connection to 1. © ChatWork and NTT DATA Corporation. 25
  26. Kafka Summit SF 2017 Kafka as a domain event source: max.in.flight.requests.per.connection • This parameter caps how many requests can be in flight per connection, i.e. the maximum parallelism of sending requests. © ChatWork and NTT DATA Corporation. 26 Conditions: queue == null || queue.isEmpty() || (queue.peekFirst().send.completed() && queue.size() < this.maxInFlightRequestsPerConnection);
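As a concrete example, the ordering-related producer setting discussed above might be configured as below; the retries and acks values are illustrative additions, not taken from the slides.

    import java.util.Properties
    import org.apache.kafka.clients.producer.ProducerConfig

    val props = new Properties()
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1") // one in-flight request, so retries cannot reorder messages
    props.put(ProducerConfig.RETRIES_CONFIG, "3")                        // retries are now safe with respect to ordering
    props.put(ProducerConfig.ACKS_CONFIG, "all")                         // wait for the in-sync replicas before a send is considered complete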
  27. Kafka Summit SF 2017 Kafka as a domain event source: Relation between Generating Message IDs and Producing Messages to Kafka • Atomically generating message IDs and producing messages to Kafka is important for guaranteeing message order in the data pipeline. • But there is a tradeoff between throughput/latency and exactness, because we use a global message ID to represent the message order. © ChatWork and NTT DATA Corporation. 27
  28. Kafka Summit SF 2017 Kafka as a domain event source: What happens if we produce messages independently from ID generation © ChatWork and NTT DATA Corporation. 28 (Diagram: Client A obtains ID 1 and Client B obtains ID 2 from the ID generator; an unfortunate delay on Client A, e.g. due to congestion, means the message with ID 1 is produced to the partition after the message with ID 2)
  29. Kafka Summit SF 2017 Kafka as a domain event source: Actual design and restrictions we accepted • We allow small temporary disorder of messages in Kafka, and decided to re-order messages within each mini-batch on the consumer side when necessary (see the sketch below). • An extreme delay in producing messages should raise an error in our applications. © ChatWork and NTT DATA Corporation. 29 (Diagram: a partition holds IDs 4, 2, 3, 1; messages are consumed as a batch and re-ordered if necessary according to the use case, yielding id/msg pairs 1:A, 2:B, 3:C, 4:D)
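A minimal sketch of the per-mini-batch re-ordering described on this slide; the id-extraction helper is hypothetical, and in reality the id would come from the event payload or a record header. The records returned by one poll() are sorted by the global message id before they are applied.

    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.{ConsumerRecord, ConsumerRecords}

    // Hypothetical helper: pull the global message id out of the JSON payload.
    def extractMessageId(json: String): Long =
      "\"id\"\\s*:\\s*(\\d+)".r.findFirstMatchIn(json).map(_.group(1).toLong).getOrElse(Long.MaxValue)

    // Re-order one consumed mini-batch by message id before applying it.
    def reorder(records: ConsumerRecords[String, String]): Seq[ConsumerRecord[String, String]] =
      records.iterator().asScala.toSeq.sortBy(r => extractMessageId(r.value()))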
  30. Kafka Summit SF 2017 Kafka as a Cushioning Layer • In our design, Kafka plays the role of a "cushioning layer" as well as the hub of the pipeline. • We can reprocess old messages both automatically and manually when we find errors. This is achieved by storing several generations of offsets in the data store and controlling offsets in the applications (see the sketch below). © ChatWork and NTT DATA Corporation. 30
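A hedged sketch of the manual offset control behind this; the offset store and its loader are placeholders, since ChatWork's actual implementation is not shown in the slides. Auto-commit is disabled, offsets live in the application's own data store, and rewinding means seeking back to a stored generation of offsets.

    import java.util.Properties
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.TopicPartition

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "read-model-updater")
    props.put("enable.auto.commit", "false") // offsets are managed by the application, not by Kafka
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // Placeholder: in reality this would read a chosen generation of offsets from the data store.
    def loadStoredOffsets(tps: Seq[TopicPartition]): Map[TopicPartition, Long] =
      tps.map(tp => tp -> 0L).toMap

    val consumer   = new KafkaConsumer[String, String](props)
    val partitions = Seq(new TopicPartition("message-events", 0), new TopicPartition("message-events", 1))
    consumer.assign(partitions.asJava)
    loadStoredOffsets(partitions).foreach { case (tp, offset) => consumer.seek(tp, offset) } // rewind for reprocessing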
  31. Kafka Summit SF 2017 Kafka as a Cushioning Layer: KafkaClientSupplier • You can provide a KafkaClientSupplier implementation to give a KafkaStreams instance a custom consumer & producer. • However, Kafka Streams is not really designed for such use cases, so it may be painful. Furthermore, be careful about updating the offset information too frequently. © ChatWork and NTT DATA Corporation. 31 public KafkaStreams(final TopologyBuilder builder, final StreamsConfig config, final KafkaClientSupplier clientSupplier) public interface KafkaClientSupplier { Producer<byte[], byte[]> getProducer(final Map<String, Object> config); Consumer<byte[], byte[]> getConsumer(final Map<String, Object> config); Consumer<byte[], byte[]> getRestoreConsumer(final Map<String, Object> config); }
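For illustration, a Scala sketch of implementing the interface shown above (0.10.x-era signatures). A custom supplier is where you could hand Kafka Streams clients you control, e.g. to observe offset handling; treat this as a skeleton under those assumptions rather than a recommended pattern, for the reasons noted on the slide. The trailing comment assumes builder and streamProps are your own topology and config.

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.consumer.{Consumer, KafkaConsumer}
    import org.apache.kafka.clients.producer.{KafkaProducer, Producer}
    import org.apache.kafka.common.serialization.{ByteArrayDeserializer, ByteArraySerializer}
    import org.apache.kafka.streams.KafkaClientSupplier

    class CustomClientSupplier extends KafkaClientSupplier {
      override def getProducer(config: JMap[String, AnyRef]): Producer[Array[Byte], Array[Byte]] =
        new KafkaProducer(config, new ByteArraySerializer, new ByteArraySerializer)

      // Wrap or instrument the consumer here (e.g. to track offsets yourself).
      override def getConsumer(config: JMap[String, AnyRef]): Consumer[Array[Byte], Array[Byte]] =
        new KafkaConsumer(config, new ByteArrayDeserializer, new ByteArrayDeserializer)

      override def getRestoreConsumer(config: JMap[String, AnyRef]): Consumer[Array[Byte], Array[Byte]] =
        new KafkaConsumer(config, new ByteArrayDeserializer, new ByteArrayDeserializer)
    }
    // Passed in via: new KafkaStreams(builder, new StreamsConfig(streamProps), new CustomClientSupplier)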
  32. Kafka Summit SF 2017 Kafka as a Cushioning Layer: Error Handling in Multiple Ways (1/2) • In reality, it is difficult to handle errors perfectly in a single layer. • For example, Kafka Streams didn't provide fine-grained error handling at the time we started this project; some errors while processing records could cause application failures. © ChatWork and NTT DATA Corporation. 32
  33. Kafka Summit SF 2017 Kafka as a Cushioning Layer: Error Handling in Multiple Ways (2/2) • Fortunately, since Kafka Streams runs as a plain application (a library in your JVM process), you can wrap it with an Akka supervisor. This lets us handle errors simply using an UncaughtExceptionHandler. © ChatWork and NTT DATA Corporation. 33 public void setUncaughtExceptionHandler(final Thread.UncaughtExceptionHandler eh) streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler { override def uncaughtException(t: Thread, e: Throwable): Unit = { self ! UncaughtExceptionInStream(e) } }) E.g. send actor messages to itself to trigger the back-off function of the supervisor.
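Putting the pieces of this slide together, a hedged Scala sketch using the Akka 2.4/2.5-era backoff API; buildStreams is a hypothetical factory for the configured KafkaStreams instance, and the backoff values are illustrative. The actor owning the stream turns uncaught stream exceptions into a message to itself, rethrows it, and a BackoffSupervisor restarts the actor (and therefore the stream) with exponential backoff, without restarting the pod.

    import scala.concurrent.duration._
    import akka.actor.{Actor, Props}
    import akka.pattern.{Backoff, BackoffSupervisor}
    import org.apache.kafka.streams.KafkaStreams

    case class UncaughtExceptionInStream(e: Throwable)

    class StreamsActor(buildStreams: () => KafkaStreams) extends Actor {
      private var streams: KafkaStreams = _

      override def preStart(): Unit = {
        streams = buildStreams()
        streams.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
          override def uncaughtException(t: Thread, e: Throwable): Unit =
            self ! UncaughtExceptionInStream(e) // hand the failure back to the actor
        })
        streams.start()
      }

      override def postStop(): Unit = if (streams ne null) streams.close()

      def receive: Receive = {
        case UncaughtExceptionInStream(e) => throw e // escalate so the supervisor restarts this actor
      }
    }

    // Wrap the actor in a backoff supervisor instead of restarting the whole pod.
    def supervisorProps(buildStreams: () => KafkaStreams): Props =
      BackoffSupervisor.props(
        Backoff.onFailure(Props(new StreamsActor(buildStreams)), "kafka-streams",
          3.seconds, 1.minute, 0.2)) // min backoff, max backoff, random factor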
  34. Kafka Summit SF 2017 Useful for integrating with other services • Kafka is very useful for integrating with other services • We currently have one event forwarder, which integrates with multiple existing services • We are now adding an event forwarder for an outgoing webhook service • Important: integrated services should be "idempotent" (see the sketch below) • Kafka consumers guarantee only "at-least-once" delivery (sorry, we still use 0.10.0.x) • An integrated service might receive the same event multiple times © ChatWork and NTT DATA Corporation. 34 Kafka as an event pipeline: (Diagram: Write API → Message Event Source → Read Model Updater and Event Forwarder → Mobile Push Service, Fulltext Search Index Service)
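A minimal sketch of what "idempotent" means for an integrated service under at-least-once delivery; the types and the in-memory set are illustrative, and a real service would persist the seen ids. Duplicates of an already-handled event id are simply ignored.

    case class ForwardedEvent(id: Long, payload: String)

    class IdempotentHandler(forward: ForwardedEvent => Unit) {
      private val seen = scala.collection.mutable.Set.empty[Long]
      // At-least-once delivery means the same event can arrive more than once;
      // Set#add returns false for an id we have already processed.
      def handle(event: ForwardedEvent): Unit =
        if (seen.add(event.id)) forward(event)
    }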
  35. Kafka Summit SF 2017 Kafka is a key driver for service evolution • Kafka's pub/sub model gives us the flexibility to evolve services. We can add and improve services step by step. © ChatWork and NTT DATA Corporation. 35 Kafka as an event pipeline: (Diagram: Write API → Message Event Source → Read Model Updater, Event Forwarder, Webhook Forwarder (new) and Audit Forwarder (new) → Mobile Push Service, Fulltext Search Index Service, Webhook Service (new), Audit Service (new))
  36. Kafka Summit SF 2017 Several Tips for Running Kafka in AWS • Be careful with service limits • Use EBS-optimized instances • Bandwidth of EC2 ↔ EBS, of the EC2 instance itself, and of each EBS volume • IOPS limit of EBS • Total allocation size of EBS • Instance types • Kafka brokers: m4.2xlarge (8 cores, 32GB RAM) + 214 GB EBS (gp2) • ≥ 214GB EBS is recommended for gp2-type EBS; this is the minimum size for maximum throughput • ZooKeepers: c4.xlarge (4 cores, 7.5GB RAM) + 128 GB EBS (gp2) • These are shared with HBase; smaller instance types would work with a dedicated ensemble © ChatWork and NTT DATA Corporation. 36 Kafka in Cloud:
  37. Kafka Summit SF 2017 Summary • Why and how we built our messaging backend with CQRS + Event Sourcing using Akka, Kafka and HBase • Technical considerations on how Kafka works as a domain event source, a cushioning layer and an event pipeline • Several tips for running Kafka clusters in AWS © ChatWork and NTT DATA Corporation. 37
  38. Kafka Summit SF 2017 Thank you! Any Questions? © ChatWork and NTT DATA Corporation. 38
