©2017 LinkedIn Corporation. All Rights Reserved.
An introduction to Apache Kafka and Kafka ecosystem at LinkedIn

This talk gives an overview of Apache Kafka and introduces the Kafka ecosystem at LinkedIn. The talk was given at ODSC 2017 in San Francisco.

Published in: Engineering

  1. An Introduction to Apache Kafka and Kafka Ecosystem at LinkedIn
     Dong Lin, Data Infra Streaming @ LinkedIn
     Open Data Science Conference
  2. Agenda
     ▪ Kafka basics (50 min)
     ▪ Kafka ecosystem at LinkedIn (40 min)
     ▪ Hands-on (30 min)
  3. Kafka basics
     ▪ What is Kafka?
       – Motivation and design philosophy
     ▪ Who uses Kafka?
       – Adoption in the open source community and use-cases at LinkedIn
     ▪ What is the fundamental design of Kafka?
       – Partition and replication model
     ▪ How to configure Kafka for your use-case?
       – Tradeoff among performance, persistence, availability and message order
     ▪ What is the development roadmap of Kafka?
       – Recent and upcoming features
  4. Publish/Subscribe Messaging
     • Multiple producers
     • Multiple consumers
     • Scalable and durable
     • Created by LinkedIn
     • Open sourced under Apache
  5. Direct transmission
     [Diagram labels: Web server, PageViewEvent, Hadoop]
  6. Many problems
     • Multiple producers
     • Multiple consumers
     • Destination is slow
     • Destination temporarily unavailable
     • Destination permanent failure
     • Bug in downstream application
     • At least once delivery
     [Diagram labels: Web server, PageViewEvent, Hadoop]
  7. Use a publish-subscribe messaging system
     • Multiple producers
     • Multiple consumers
     • Destination is slow
     • Destination temporarily unavailable
     • Destination permanent failure
     • Bug in downstream application
     • At least once delivery
     [Diagram labels: Web server, Pub/sub system, Hadoop]
  8. Use Kafka
     • Multiple producers
     • Multiple consumers
     • Destination is slow
     • Destination temporarily unavailable
     • Destination permanent failure
     • Bug in downstream application
     • At least once delivery
     [Diagram labels: Web server, Spark streaming; categories: Functionality, Persistent, Delivery semantics, Performance, Availability]
  9. Problem: closely-coupled pipelines
     ▪ O(N^2) pipelines – limited organizational scalability
     ▪ Messages are duplicated in proportion to the number of clients
  10. Solution: publish-subscribe messaging system
      ▪ O(N) pipelines
      ▪ Space efficient
      ▪ Producers are decoupled from consumers
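
As a rough illustration of slides 9 and 10, the sketch below compares the number of point-to-point pipelines with the number of links needed when every system talks only to a central pub/sub system. The function names and the producer/consumer counts are illustrative assumptions, not figures from the talk.

```python
def direct_pipelines(producers: int, consumers: int) -> int:
    """Point-to-point: each producer needs its own pipeline to each consumer."""
    return producers * consumers


def pubsub_pipelines(producers: int, consumers: int) -> int:
    """Pub/sub: each system keeps a single link to the messaging system."""
    return producers + consumers


# Hypothetical sizes, chosen only to show how the two counts diverge.
for n in (5, 10, 50):
    print(n, direct_pipelines(n, n), pubsub_pipelines(n, n))
```

With N producers and N consumers the first count grows as N^2 and the second as 2N, which is the O(N^2) versus O(N) contrast on the slides.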
  11. Kafka as Unix Pipes
      $ cat *.txt | tr A-Z a-z | grep hello
      $ tail -F *.txt | tr A-Z a-z | grep hello
      [Diagram labels: producer, kafka, Hadoop, Samza]
      Reference: http://www.confluent.io/blog
  12. Fan In
  13. Fan Out
  14. Add Branch
  15. Switch Branch
  16. Delete Branch
  17. Parallel Consumption
  19. Companies that use Kafka
      LinkedIn, Yahoo, Twitter, Airbnb, Pinterest, Square, Coursera, Uber, Goldman Sachs, Box, Paypal, Cisco, Dropbox, Spotify, Wikipedia, Microsoft, Netflix, CloudFlare, Hotels.com, …
      Reference: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
  20. Apache projects integrated with Kafka
      • Stream processing: Apache Storm, Apache Samza, Apache Spark Streaming
      • Search and query: Apache Hive, Presto
      • Apache Hadoop
      • …
  21. Kafka volume at LinkedIn
      • 2 trillion messages: produced, per day
      • 5 Gbps inbound: single cluster, unique data
      • 18 Gbps outbound: average 3X consumption, before mirroring
      • 2.5M partitions: largest cluster has 250k partitions, up to 10k partitions per broker
  22. Kafka use-cases at LinkedIn
      • Tracking: member-related activity
      • Metrics: application metrics, service calls
      • Queuing: internal application data, messaging; largest users are Samza and Search
      • Logging: dedicated cluster for application logs going to ELK; high volume, low retention
  24. Design goals
      ▪ Performance
        – High throughput
        – Low latency
        – Scalable
      ▪ Persistence and availability
        – Data should be available in the event of (permanent) server failure
      ▪ Functionality
        – Rewind back in time
      ▪ Strong delivery semantics
        – At-least-once delivery / exactly-once delivery
        – In-order message delivery within a partition
  25. Characteristics
      • High throughput (~300 MBps per machine)
        – Immutable append-only data structure for fast disk access
        – Efficient data transfer via zero copy
        – Most messages are read directly from the page cache
        – Partitioning model for scalability
        – Batching and compression
      • Low latency (~2 ms)
        – Make data universally available in near real-time
      • Strong guarantees about messages
        – Messages strictly ordered within a partition
        – All data persisted on disk with replication
        – Exactly-once delivery
  26. Is disk slow?
  27. Traditional data copy
      ▪ 4 copies
      ▪ 2 context switches
  28. Efficient zero copy
      ▪ 3 copies
      ▪ 0 context switches
      ▪ Only 2 copies if consumers are mostly caught up
  29. Kafka as log
  30. Producer -> Topic -> Consumer
  31. Topic divided into partitions
      • Partitions are distributed and replicated across brokers
      • Parallel produce/consume
      • Messages with the same key go to the same partition
  32. Partition consists of messages with offsets
      • Append only
      • Strict order
      • Messages are assigned incrementing offsets
      [Diagram labels: Old, New]
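
The partition model on slide 32 can be sketched as a tiny in-memory append-only log. This is a toy model for illustration only: the class name `PartitionLog` and its methods are hypothetical and are not Kafka's API, and a real partition lives on disk and is replicated.

```python
from typing import List, Tuple


class PartitionLog:
    """Toy append-only log: each message gets the next incrementing offset."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Append at the end (the "new" side) and return the assigned offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> List[Tuple[int, bytes]]:
        """Read in offset order, starting from any earlier offset (rewind)."""
        chunk = self._messages[offset:offset + max_messages]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
for payload in (b"a", b"b", b"c"):
    log.append(payload)
print(log.read(1))
```

Because existing entries are never modified, a consumer can re-read from an older offset at any time, which is the "rewind back in time" design goal listed earlier in the talk.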
  33. Broker in Kafka
      ▪ Disk, network and CPU load is distributed across brokers in units of partitions
  34. Producer in Kafka
      ▪ Messages with the same key go to the same partition
      ▪ Messages without a key go to a random partition
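
The two producer rules on slide 34 can be sketched as a partition-selection function. This is a simplified stand-in: `choose_partition` is a hypothetical name, and CRC32 is used only because it is available in the Python standard library (Kafka's Java client hashes keys with murmur2).

```python
import random
import zlib
from typing import Optional


def choose_partition(key: Optional[bytes], num_partitions: int) -> int:
    """Pick a partition for a message.

    Keyed messages hash to a stable partition, so per-key order is preserved.
    Messages without a key are spread randomly, as described on the slide.
    """
    if key is None:
        return random.randrange(num_partitions)
    # Assumption: CRC32 stands in for the real client's hash function.
    return zlib.crc32(key) % num_partitions


print(choose_partition(b"member-42", 8) == choose_partition(b"member-42", 8))
```

Note that the mapping depends on the partition count, so adding partitions to a topic changes where existing keys land.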
  35. Consumer in Kafka
      ▪ A consumer can belong to a consumer group (CG)
      ▪ Consumers in the same CG
        – Process messages in parallel
        – Share the consumer offset
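
A minimal sketch of how consumers in one group could split partitions for parallel processing follows. It uses plain round-robin assignment under the assumption that each partition is read by exactly one consumer in the group; `assign_partitions` is a hypothetical helper, not the client library's assignor.

```python
from typing import Dict, Iterable, List


def assign_partitions(partitions: Iterable[int],
                      consumers: Iterable[str]) -> Dict[str, List[int]]:
    """Round-robin partitions over the consumers of one consumer group."""
    ordered = sorted(consumers)
    if not ordered:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


print(assign_partitions(range(4), ["c1", "c2"]))
```

In this model, adding consumers beyond the number of partitions leaves the extra consumers idle, so the partition count caps the parallelism of a group.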
  36. When a broker fails…
  37. Partition replication in Kafka
      ▪ Brokers can fail
        – Controlled: e.g., upgrades/config changes
        – Uncontrolled: disk failure, power outage, out-of-memory, etc.
      ▪ Need high availability
        – Typical failover < 10 ms
      ▪ Need data persistence
  38. Partition replica assignment
      ▪ Replicas are laid out evenly across brokers
      ▪ The first assigned replica is preferred as leader
      ▪ Writes/reads go to the leader, which sends messages to followers
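
The layout rules on slide 38 can be sketched with a simple round-robin placement. This is a simplification under stated assumptions: `assign_replicas` is a hypothetical function, and the actual broker-side assignment also randomizes the starting broker and can take racks into account.

```python
from typing import Dict, List


def assign_replicas(num_partitions: int, brokers: List[int],
                    replication_factor: int) -> Dict[int, List[int]]:
    """Spread replicas evenly; the first replica in each list is the preferred leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout: Dict[int, List[int]] = {}
    for partition in range(num_partitions):
        layout[partition] = [
            brokers[(partition + r) % len(brokers)]
            for r in range(replication_factor)
        ]
    return layout


print(assign_replicas(3, [1, 2, 3], 2))
```

With three partitions over three brokers and two replicas each, every broker leads one partition and follows one other, which is the even spread the slide describes.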
  39. Replication (at a high level)
  40. Replication (at a high level)
  41. Replication (at a high level)
  42. Replication (at a high level)
  44. No one-size-fits-all configuration
  45. Tradeoff between performance and persistence
      • Should the broker send an ack to the producer right after step 1?
      • Higher persistence and lower throughput with acks = -1 in producer config
  46. Tradeoff between performance and message order
      • Should the producer send a new message before the ack of the last message?
      • In-order delivery and lower throughput with max.in.flight.requests.per.connection = 1 in producer config
      [Diagram: Producer and Kafka Broker; message 0 fails and is retried while message 1 is in flight]
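
The reordering on slide 46 can be reproduced with a small simulation. This is a toy model, not the producer's real networking code: `produce` is a hypothetical function, and it assumes a failed message is retried after the other requests already in flight have been written.

```python
from typing import Iterable, List, Set


def produce(messages: Iterable[int], max_in_flight: int,
            fail_first_attempt: Set[int]) -> List[int]:
    """Return the order in which the broker appends the messages."""
    log: List[int] = []
    pending = list(messages)
    failed_once: Set[int] = set()
    while pending:
        # Requests sent before any ack comes back.
        batch, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries: List[int] = []
        for message in batch:
            if message in fail_first_attempt and message not in failed_once:
                failed_once.add(message)
                retries.append(message)  # will be resent on the next round
            else:
                log.append(message)
        pending = retries + pending
    return log


print(produce([0, 1], max_in_flight=2, fail_first_attempt={0}))
print(produce([0, 1], max_in_flight=1, fail_first_attempt={0}))
```

With two requests in flight, message 1 lands before the retry of message 0. With one request in flight, the producer waits for message 0 to succeed first, which keeps order at the cost of throughput.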
  47. Tradeoff between persistence and availability
      • Should we allow messages to be produced if all in-sync replicas are offline?
      • Higher availability and weaker persistence with unclean.leader.election.enable = true in broker config
      [Diagram: Leader, Follower 1 and Follower 2 replica logs at different offsets; reads are served from the leader]
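
The choice on slide 47 can be sketched as a leader-election function. This is a simplified model under stated assumptions: `elect_leader` and its arguments are hypothetical, the offsets are made up, and picking the live replica with the longest log is a simplification rather than the broker's actual selection rule.

```python
from typing import Dict, Optional, Set, Tuple


def elect_leader(log_end: Dict[str, int], in_sync: Set[str], alive: Set[str],
                 committed_end: int,
                 unclean_enabled: bool) -> Tuple[Optional[str], int]:
    """Return (new leader or None, number of committed messages lost)."""
    live_in_sync = sorted(b for b in alive if b in in_sync)
    if live_in_sync:
        return live_in_sync[0], 0  # clean election, nothing is lost
    if unclean_enabled and alive:
        # Assumption: choose the out-of-sync replica that is furthest along.
        leader = max(sorted(alive), key=lambda b: log_end[b])
        return leader, max(0, committed_end - log_end[leader])
    return None, 0  # partition stays offline until an in-sync replica returns


state = {"leader": 9, "follower1": 7, "follower2": 6}  # hypothetical log end offsets
print(elect_leader(state, {"leader"}, {"follower1", "follower2"}, 9, True))
print(elect_leader(state, {"leader"}, {"follower1", "follower2"}, 9, False))
```

In the first call the partition stays writable but two committed messages are lost. In the second call nothing is lost, but the partition is unavailable.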
  48. Tradeoff between availability and cost
      • Do we need more replicas for the topic?
      • Higher availability and higher cost with RF=3 in comparison to RF=2
      [Diagram: a producer writing to three brokers (RF=3) and a producer writing to two brokers (RF=2)]
  50. Kafka provides great performance, availability and data persistence
      Are there other features that will be valuable to users?
  51. Improved support for multi-tenancy
      ▪ SASL/Kerberos and SSL support (KIP-12)
      ▪ Quota (KIP-13)
      ▪ Namespace in Kafka topics (KIP-37)
      ▪ ZooKeeper authentication (KIP-38)
      ▪ End-to-end encryption
  52. Reduced hardware and operational cost
      ▪ Dynamic configuration (KIP-21)
      ▪ Rack-aware replica assignment (KIP-36)
      ▪ Self healing (KIP-46)
      ▪ On-demand data deletion (KIP-107)
      ▪ JBOD support (KIP-112 and KIP-113)
  53. Additional functionality for broader use-cases
      ▪ Kafka Connect for data import/export (KIP-26)
      ▪ Streaming processor (KIP-28)
      ▪ Timestamp in message (KIP-32)
      ▪ Exactly-once delivery and transactional messaging (KIP-98)
  54. Learn more about Kafka
      ▪ Stream processing meetup
      ▪ Kafka Summit
      ▪ Kafka improvement proposals: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
      ▪ LinkedIn engineering blog: https://engineering.linkedin.com/blog
  56. Agenda
      ▪ Kafka basics (50 min)
      ▪ Kafka ecosystem at LinkedIn (40 min)
        – Projects to monitor and manage Kafka servers
        – Projects to monitor and debug Kafka clients
        – Projects to make Kafka easier to use
        – Projects that are built on Kafka
      ▪ Hands-on (30 min)
  57. Projects to monitor and manage Kafka servers
      ▪ cruise-control for automatically balancing partitions across brokers
      ▪ kafka-monitor for monitoring Kafka service availability etc.
      ▪ kafka-audit for monitoring data loss
      ▪ InGraph for monitoring all JMX metrics from Kafka as time-series graphs
  58. Problems before having Cruise Control
      ▪ SREs need to wake up at night to move partitions in case of hardware failure
      ▪ SREs need to manually move partitions to balance load across brokers
      ▪ Reduced availability due to the need to wait for manual recovery
      ▪ The partition movement may impact production traffic
      Open sourced on GitHub in August 2017
  59. Cruise Control Architecture
      ▪ Self-heal from broker failure
      ▪ Balance load across brokers without manual intervention
      ▪ Controlled impact on PROD traffic when moving partitions
  60. Example Cruise Control goals
      ▪ Partitions should be distributed across brokers in a rack-aware manner
      ▪ Broker resource utilization should be below the user-specified threshold
      ▪ Try to evenly distribute resource utilization across brokers
  62. Problems before having Kafka Monitor
      ▪ Some issues are discovered only after a bug report from a Kafka user
      ▪ Cannot quantify the availability and the latency of a Kafka cluster
      ▪ Cannot quantify the availability and the latency of a Kafka mirrored pipeline
  63. Kafka Monitor Architecture
      ▪ Alert on service unavailability
      ▪ Quantify service availability
      ▪ Measure end-to-end latency
      ▪ Detect violation of Kafka semantics
      Our availability SLA is 99.99%
  64. Other Kafka Monitor features
      ▪ Automatically distribute partitions of the monitor topic evenly across brokers
      ▪ Extensible module to export JMX metrics to various stores (e.g. Graphite)
      ▪ Pluggable interface to test Kafka service with your own client implementation
      Open sourced on GitHub in May 2016
  66. Problems before having Kafka Audit
      ▪ Hard to help users identify why their messages are not received
      ▪ Hard to detect and debug message loss in Kafka pipelines
  67. Kafka Audit Architecture
      ▪ Detect message loss
      ▪ Debug message loss
      ▪ Audit Kafka resource usage
  68. Example Kafka Audit UI
      When, where and how many messages are delivered to Kafka
  70. InGraph Architecture
      [Diagram: brokers and clients send metric messages to a metric topic in the Kafka cluster, which feeds InGraph with UI]
  71. Example InGraph UI
  72. Projects to monitor and debug Kafka clients
      ▪ Burrow for monitoring offset lag of consumer groups
      ▪ kafka-audit for monitoring Kafka resource usage per client
  73. Burrow Architecture
      ▪ Detect lagging consumers
      ▪ Detect stalled consumers
      ▪ Detect stopped consumers
      ▪ Detect offset rewind
      ▪ Open sourced on GitHub
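
The idea of offset lag behind slide 73 can be sketched as follows. This is not Burrow's evaluation logic: `consumer_lag` and `looks_stalled` are hypothetical helpers, and the stall check is a deliberately naive rule that only illustrates what "stalled" means.

```python
from typing import Dict, Sequence


def consumer_lag(log_end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: how far the committed offset trails the end of the log."""
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    }


def looks_stalled(committed_history: Sequence[int],
                  lag_history: Sequence[int]) -> bool:
    """Naive check: offsets stopped advancing while there is still lag."""
    return len(set(committed_history)) == 1 and all(lag > 0 for lag in lag_history)


print(consumer_lag({0: 100, 1: 50}, {0: 90, 1: 50}))
print(looks_stalled([90, 90, 90], [5, 8, 10]))
```

A lagging consumer still commits new offsets but falls behind, while a stalled one stops committing altogether. Telling the two apart needs a history of commits rather than a single sample.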
  74. Projects to monitor and debug Kafka clients
      ▪ Burrow for monitoring offset lag of consumer groups
      ▪ kafka-audit for monitoring Kafka resource usage per client
      Attribute the hardware cost in $$ to users of Kafka and reduce unnecessary usage of Kafka
  75. Projects to make Kafka easier to use
      ▪ kafka-rest to allow non-Java clients to produce to and consume from a Kafka cluster
      ▪ schema-registry for conversion between binary data and IndexedRecord
      ▪ li-apache-kafka-clients to support large messages etc.
      ▪ Nuage for users to create and manage properties (e.g. retention time) of their topics by themselves
  76. Kafka Rest Architecture
      ▪ Support non-Java clients
      ▪ No need to maintain client libraries in multiple languages
  78. Schema Registry Architecture
      ▪ Enable efficient binary encoding of schema in the Kafka message
      ▪ Track schema evolution for forward and backward compatibility
      [Diagram: a user application passes an IndexedRecord to LiProducer (with schema cache), which registers the schema with the Schema Registry and sends binary data to the Kafka cluster; LiConsumer (with schema cache) fetches the schema and returns an IndexedRecord to the user application]
  80. Large message support in li-apache-kafka-clients
  82. Put things together
  83. Help yourself with these open source projects
      ▪ Cruise Control (https://github.com/linkedin/cruise-control)
      ▪ Kafka Monitor (https://github.com/linkedin/kafka-monitor)
      ▪ Burrow (https://github.com/linkedin/burrow)
      ▪ li-apache-kafka-clients (https://github.com/linkedin/li-apache-kafka-clients)
      ▪ Future projects open sourced by the LinkedIn streaming team can be found at https://github.com/linkedin/streaming
      All projects are actively maintained and used in the LinkedIn production environment. 100% free of charge!
  84. Projects at LinkedIn that are built on Kafka
      ▪ Stream processing – Apache Samza
      ▪ Change data capture – Brooklin
      ▪ Strongly consistent key-value store – Espresso
      ▪ Efficient key-value store for derived data – Venice
  87. Hands-on
      ▪ Visit goo.gl/D7GFfB
      ▪ Single cluster
        – Download and compile Apache Kafka
        – Set up a cluster of one broker
        – Create and describe a topic
        – Produce and consume using Apache Kafka tools
        – Monitor availability of your cluster using Kafka Monitor
      ▪ Mirrored pipeline
        – Set up another cluster of one broker
        – Set up MirrorMaker (MM) to mirror traffic from the source cluster to the destination cluster
        – Produce to the source cluster and consume from the destination cluster
        – Monitor availability of your pipeline using Kafka Monitor
