8. Kafka at LinkedIn
[Diagram: frontend services publish to a cluster of Kafka brokers; consumers include real-time monitoring, the Hadoop DWH, security systems, and the news feed.]
9. Outline
Introduction to pub-sub
Kafka at LinkedIn
Hadoop and Kafka
Design
Performance
10. Hadoop Data Load for Kafka
[Diagram: in the live data center, frontends publish to a Kafka cluster feeding real-time consumers; a Kafka cluster in the offline data center loads the data into dev and PROD Hadoop clusters.]
11. Multi DC data deployments
[Diagram: multiple live data centers, each with a Kafka cluster and real-time consumers; the data is mirrored to Kafka clusters in offline data centers, which feed Hadoop clusters and the Hadoop DWH.]
16. Outline
Introduction to pub-sub
Kafka at LinkedIn
Hadoop and Kafka
Design
Performance
17. Efficiency #1: simple storage
Each topic has an ever-growing log
A log == a list of files
A message is addressed by a log offset
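The storage model above can be sketched in a few lines. This is an illustrative toy (segments here are in-memory lists standing in for files; the `Log` class and its names are ours, not Kafka's code):

```python
import bisect

class Log:
    """A topic's log: an ever-growing list of segment "files".
    Each segment covers a contiguous offset range; a message is
    addressed purely by its logical offset -- no per-message index."""

    def __init__(self):
        self.segments = [[]]       # each inner list stands in for one file
        self.base_offsets = [0]    # first offset stored in each segment

    def append(self, message, segment_size=2):
        if len(self.segments[-1]) >= segment_size:   # roll a new "file"
            self.base_offsets.append(
                self.base_offsets[-1] + len(self.segments[-1]))
            self.segments.append([])
        self.segments[-1].append(message)

    def read(self, offset):
        # binary-search for the segment whose base offset covers the request
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i][offset - self.base_offsets[i]]
```

Because the offset determines the segment and the position within it, a lookup is a binary search over segment base offsets plus one seek, with no heavyweight index to maintain.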
18. Efficiency #2: careful transfer
Batch send and receive
No message caching in JVM
Rely on file system buffering
Zero-copy transfer: file -> socket
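The zero-copy file-to-socket path uses the sendfile(2) syscall, which Python exposes as `os.sendfile` (Linux/macOS); the kernel moves pages straight from the page cache to the socket buffer without copying through user space. The helper name below is ours:

```python
import os

def serve_file_zero_copy(sock, path):
    """Send a whole file over a socket via sendfile(2):
    no read() into a user-space buffer, no write() back out."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), f.fileno(),
                               offset, size - offset)
            if sent == 0:          # peer closed the connection
                break
            offset += sent
    return offset
```

Combined with the reliance on file-system buffering above, a hot topic is typically served from the page cache end to end.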
19. Multi subscribers
1 file system operation per request
Consumption is cheap
SLA based message retention
Rewindable consumption
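Cheap, rewindable consumption follows from keeping consumption state on the consumer, not the broker. A toy in-memory sketch (names are illustrative; the real log lives on disk under the SLA-based retention above):

```python
class Consumer:
    """Each consumer only remembers the next offset it wants to read.
    Rewinding is just resetting that number -- the broker keeps no
    per-message delivery state for any subscriber."""

    def __init__(self, log):
        self.log = log      # shared, retained message list (toy stand-in)
        self.offset = 0

    def poll(self):
        msg = self.log[self.offset]
        self.offset += 1
        return msg

    def rewind(self, offset):
        self.offset = offset


log = ["m0", "m1", "m2"]               # messages retained under the SLA
c1, c2 = Consumer(log), Consumer(log)  # two independent subscribers
```

Adding a subscriber costs the broker nothing beyond serving reads, which is why multiple subscribers per topic are cheap.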
28. Throughput vs Unconsumed data
[Chart: throughput in msg/s (0-200,000) against unconsumed data in GB (10-1,039); 1 topic, broker flush interval 10K.]
29. State of the system
4 clusters per colo, 4 servers each
850 socket connections per server
20 TB
430 topics
Batched frontend to offline data center latency => 6-10 secs
Frontend to Hadoop latency => 5 min
30. State of the system
Successfully deployed in production at LinkedIn and other startups
Apache incubator inclusion
0.7 Release
• Compression
• Cluster mirroring
Multiple subscribers, decoupling
Some examples: email notification on adding a connection, etc.
Every service at LinkedIn uses Kafka; 10-15 real-time consumers use Kafka.
Give some background on – basically generating business reports
Tracking data is crucial to measure user engagement – unique member visits per day, ad metrics (CTR) for ad publishers
Also data analytics – new models for PYMK
Collapse multiple boxes into one. Show multiple Hadoop clusters
If you track some new data, it will automatically reach Hadoop in a few minutes.
Traditional messaging systems are tuned for very low latency, but not for high volume
One topic is enough
Add compression line here instead
Batching on the frontends and on both Kafka clusters, tuned for throughput
Kafka webpage here
A single producer sends 10 million messages, 200 bytes each.
A single consumer thread.
Flush interval 10K.
ActiveMQ – syncOnWrite=false, KahaDB
RabbitMQ – mandatory=true, immediate=false, file persistence
Producer throughput in messages/second: Kafka produces at a rate of 50K messages/second with a batch size of 1,
and 400K messages/second with a batch size of 50.
These numbers are orders of magnitude higher than ActiveMQ's and at least twice as high as RabbitMQ's.
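The batch-size effect comes from amortizing per-request overhead over many messages. A hedged sketch, not Kafka's actual producer API (`requests` here simply counts wire round-trips):

```python
class BatchingProducer:
    """Buffer messages and ship them in batches, so the fixed cost
    of a network request is paid once per batch, not once per message."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []
        self.requests = 0          # round-trips actually paid for

    def produce(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.requests += 1     # one request carries the whole batch
            self.buffer = []
```

With a batch size of 50, ten thousand messages cost 200 requests instead of 10,000, which is where the 50K vs 400K msg/s gap comes from.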
No producer ACKs
More compact storage format
9 bytes/message overhead in Kafka
144 bytes/message overhead in ActiveMQ
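Those per-message overheads translate directly into storage for the test's 10 million 200-byte messages:

```python
PAYLOAD = 200                 # bytes per test message
MESSAGES = 10_000_000         # messages sent in the test

kafka_total = MESSAGES * (PAYLOAD + 9)       # 9 bytes/message overhead
activemq_total = MESSAGES * (PAYLOAD + 144)  # 144 bytes/message overhead

print(kafka_total)     # 2,090,000,000 bytes (~2.09 GB)
print(activemq_total)  # 3,440,000,000 bytes (~3.44 GB)
```

At this message size, ActiveMQ writes roughly 65% more bytes to disk for the same payload.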
Cost of maintaining heavy index structures
One thread in ActiveMQ spent most of its time accessing a B-Tree to maintain message metadata and state
Kafka consumes at 22K messages/sec, more than 4 times the rate of ActiveMQ and RabbitMQ
More efficient storage format - fewer bytes transferred from server to consumer
The broker in ActiveMQ/RabbitMQ had to maintain delivery status for each message. One of the threads in ActiveMQ kept writing KahaDB pages to disk during the test.
Kafka uses the sendfile API to reduce transmission overhead