These slides cover Apache Kafka. I presented this at the "Spark-Kafka Summit" on 10th May, 2017, arranged by Unicom [http://www.unicomlearning.com/2017/Spark_Kafka_Summit_Bangalore/]. I talked about the Kafka producer, cluster, and consumer. I did not cover security, mirroring, Connect, or Streams.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This talk gives a brief introduction to Apache Kafka and describes its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Testing Kafka components with Kafka for JUnitMarkus Günther
Kafka for JUnit enables developers to start and stop a complete Kafka cluster comprised of Kafka brokers and distributed Kafka Connect workers from within a JUnit test. It also provides a rich set of convenient accessors to interact with such an embedded or external Kafka cluster in a lean and non-obtrusive way.
Kafka for JUnit can be used to both whitebox-test individual Kafka-based components of your application or to blackbox-test applications that offer an incoming and/or outgoing Kafka-based interface.
This presentation gives a brief introduction to Kafka for JUnit, discussing its design principles and code examples to get developers quickly up to speed using the library.
Exactly-once Stream Processing with Kafka Streams – Guozhang Wang
I will present the recent additions to Kafka (0.11.0) that achieve exactly-once semantics within its Streams API for stream processing use cases. This is achieved by leveraging the underlying idempotent and transactional client features. The main focus will be the specific semantics that Kafka distributed transactions enable in Streams, and the underlying mechanics that let Streams scale efficiently.
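As a rough sketch of what enabling these features looks like in configuration: the property names below are real Kafka and Kafka Streams configs, while the transactional id value is hypothetical.

```properties
# Kafka Streams (0.11.0+): opt into exactly-once processing
processing.guarantee=exactly_once

# Equivalent plain-producer settings that Streams configures under the hood
enable.idempotence=true
transactional.id=my-streams-app-txn-1
```

With processing.guarantee set, Streams manages the idempotent producer and transactions itself; the last two lines matter only when using the producer client directly.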
Kafka is the most popular messaging queue.
Key Areas:
What is a Messaging Queue?
Why Messaging Queue?
Kafka – basic terminologies
Kafka – Architecture (Message Flow)
AWS SQS vs Apache Kafka
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc... – Data Con LA
Abstract:
Tracking user events as they happen can challenge anyone providing real-time user interaction. It can demand both huge scale and a lot of processing to support dynamic adjustment of targeted products and services. As an operational data store, Couchbase's data services are capable of processing tens of millions of updates a day. By streaming through systems such as Apache Spark and Kafka into Hadoop, information about these key events can be turned into deeper knowledge. We will review Lambda architectures deployed at sites like PayPal, LivePerson, and LinkedIn that leverage a Couchbase data pipeline.
Bio:
Justin Michaels. With over 20 years of experience deploying mission-critical systems, Justin Michaels' industry experience covers capacity planning, architecture, and industry verticals. Justin brings his passion for architecting, implementing, and improving Couchbase to the community as a Solution Architect. His expertise involves both conventional application platforms and distributed data management systems. He regularly engages with existing and new Couchbase customers in performance reviews, architecture planning, and best practice guidance.
C* Summit 2013: Big Data Analytics – Realize the Investment from Your Big Dat... – DataStax Academy
The term "big data" seems to be everywhere these days. With the ever growing number of attendees at big data and Hadoop events, it’s clear big data is here to stay. But what does that mean for the analytics market, and how does big data fit into the picture? This session, featuring Mark Davis, Sr. Product Architect at Dell, will explore what big data means in a practical sense to the IT department. It will also explore the many ways that big data affects an organization’s picture of performance. Plus, see how big data analytics, using technologies like Cassandra and Hadoop, will converge with traditional business intelligence to create a complete picture of the enterprise's information assets, thereby giving the business a complete and insightful view of its operational efficiency.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms – DataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
Producer Performance Tuning for Apache Kafka – Jiangjie Qin
Kafka is well known for high-throughput ingestion. However, to get the best latency characteristics without compromising on throughput and durability, we need to tune Kafka. In this talk, we share our experiences achieving the optimal combination of latency, throughput, and durability for different scenarios.
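A hedged illustration of the kind of producer properties such tuning touches: the names are real Kafka producer configs, but the values are illustrative starting points, not this talk's actual recommendations.

```properties
acks=all                  # strongest durability; acks=1 trades durability for lower latency
linger.ms=5               # small batching delay: higher throughput, slightly higher latency
batch.size=65536          # larger batches amortize per-request overhead
compression.type=lz4      # less network/disk usage at a modest CPU cost
max.in.flight.requests.per.connection=5
buffer.memory=67108864    # memory available for buffering unsent records
```

The tradeoff the talk describes lives mostly in acks, linger.ms, and batch.size: latency-sensitive workloads shrink the batching knobs, durability-sensitive ones keep acks=all.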
Kafka is a real-time, fault-tolerant, scalable messaging system.
It is a publish-subscribe system that connects various applications through messages exchanged between producers and consumers of information.
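The publish-subscribe model described above can be sketched with a toy in-memory broker. This is a simplified analogue for illustration only, not Kafka itself; all names here are invented.

```python
from collections import defaultdict

class MiniBroker:
    """Toy in-memory analogue of a publish-subscribe broker."""

    def __init__(self):
        self.topics = defaultdict(list)       # topic -> append-only message log
        self.subscribers = defaultdict(list)  # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        self.topics[topic].append(message)    # retained in the topic's log
        for callback in self.subscribers[topic]:
            callback(message)                 # delivered to every subscriber

broker = MiniBroker()
received = []
broker.subscribe("clicks", received.append)
broker.publish("clicks", {"user": 1})
broker.publish("clicks", {"user": 2})
print(len(received))  # 2
```

The key property mirrored here is decoupling: producers publish to a topic without knowing who, if anyone, consumes it.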
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 – Monal Daxini
Keystone - Processing over Half a Trillion events per day with 8 million events & 17 GB per second peaks, and at-least once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next in offering a self-service stream processing infrastructure atop the Kafka based pipeline and support Spark Streaming.
Kat Grigg, Confluent, Senior Customer Success Architect + Jen Snipes, Confluent, Senior Customer Success Architect
This presentation will cover tips and best practices for Apache Kafka. In this talk, we will cover the basic internals of Kafka and how its components integrate, including brokers, topics, partitions, consumers and producers, replication, and ZooKeeper. We will talk about the major categories of operations you need to set up and monitor, including configuration, deployment, maintenance, monitoring, and debugging.
https://www.meetup.com/KafkaBayArea/events/270915296/
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
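The consumer-group mechanics listed above can be illustrated with a simplified assignment function. This is a round-robin sketch for illustration, not Kafka's actual assignor implementations, and all names are invented.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across the consumers of one group, round-robin.

    Invariant mirrored from Kafka: within a consumer group, each
    partition is owned by exactly one consumer at a time.
    """
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

print(assign_partitions(range(6), ["c1", "c2"]))
# {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

Adding a third consumer to the group would redistribute the six partitions two per consumer, which is why partition count caps a group's parallelism.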
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Building zero data loss pipelines with Apache Kafka – Avinash Ramineni
Kafka is playing an increasingly important role in messaging and streaming systems and is becoming the de facto messaging platform in many enterprises. Managing and maintaining Kafka deployments and tuning the data pipelines for high performance and scalability can become a challenging task.
In this session, we will discuss the lessons learned and the best practices for achieving zero data loss pipelines.
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter – Hosted by Confluent
Until recently, the Messaging team at Twitter had been running an in-house-built Pub/Sub system, namely EventBus (built on top of Apache DistributedLog and Apache BookKeeper, and similar in architecture to Apache Pulsar), to cater to our pub/sub needs. In 2018, we made the decision to move to Apache Kafka by migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today: Kafka is now an essential piece of Twitter infrastructure and processes over 200M messages per second. In this talk, we will share the learnings and challenges from our journey moving to Apache Kafka.
Fundamentals and Architecture of Apache Kafka – Angelo Cesaro
Fundamentals and Architecture of Apache Kafka.
This presentation explains Apache Kafka's architecture and internal design giving an overview of Kafka internal functions, including:
Brokers, Replication, Partitions, Producers, Consumers, Commit log, comparison over traditional message queues.
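The producer-to-partition mapping implied above can be sketched as follows. This is a simplified stand-in: Kafka's default partitioner hashes the serialized key with murmur2, while this sketch uses MD5 purely for a stable, reproducible illustration.

```python
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.

    Property mirrored from Kafka: the same key always lands on the
    same partition (for a fixed partition count), which is what gives
    per-key ordering.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p = choose_partition("user-42", 8)
print(p == choose_partition("user-42", 8))  # True: stable per key
```

Note the corollary this makes visible: changing the partition count remaps keys, which is why repartitioning breaks per-key ordering guarantees.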
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp – José Román Martín Gil
Apache Kafka is the data streaming broker most used by companies. It can easily manage millions of messages, and it is the base of many architectures built on events, microservices, orchestration, and, now, cloud environments. OpenShift is the most widely adopted Platform as a Service (PaaS). It is based on Kubernetes, and it helps companies easily deploy any kind of workload in a cloud environment. Thanks to many of its features, it is the base for many architectures built on stateless applications for creating new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you manage and deploy Apache Kafka brokers in OpenShift environments.
These slides will introduce you to Strimzi, a new component on OpenShift for managing your Apache Kafka clusters.
Slides used at OpenShift Meetup Spain:
- https://www.meetup.com/es-ES/openshift_spain/events/261284764/
Stateful stream processing with Kafka and Samza – George Li
Integration with in-memory local state is one of Samza's most interesting features, but how do you maintain and update the local state with fault tolerance and multi-tenancy in mind? How do you test it? We will talk about our solutions and the problems still to be solved.
2. Today's Menu
● Quick Kafka Overview
● Kafka Usage At AppsFlyer
● AppsFlyer First Cluster
● Designing The Next Cluster: Requirements And Changes
● Problems With The New Cluster
● Changes To The New Cluster
● Traffic Boost, Then More Issues
● More Solutions
● And More Failures
● Splitting The Cluster, More Changes And The Current Configuration
● Lessons Learned
● Testing The Cluster
● Collecting Metrics And Alerting
3. "A first sign of the beginning of understanding is the wish to die." – Franz Kafka
5. Kafka Overview
“An open source, distributed, partitioned and replicated commit-log based publish-subscribe messaging system”
6. Kafka Overview
● Topic: Category to which messages are published by the message producers
● Broker: Kafka server process (usually one per node)
● Partitions: Topics are partitioned; each partition is an ordered, immutable sequence of messages. Each message in a partition is assigned a unique ID called its offset
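The partition-and-offset definition above can be captured in a few lines. This is an illustrative model of the concept, not Kafka code.

```python
class Partition:
    """A partition as an ordered, immutable, append-only message log.

    A message's position in the log is its offset: assigned once at
    append time and never changed.
    """

    def __init__(self):
        self._log = []

    def append(self, message):
        self._log.append(message)
        return len(self._log) - 1  # the offset assigned to this message

    def read(self, offset):
        return self._log[offset]   # any consumer can re-read any offset

log = Partition()
print(log.append("first"), log.append("second"))  # 0 1
```

Because consuming is just reading at an offset, many independent consumers can read the same partition at their own pace, each tracking its own position.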
8. AppsFlyer First Cluster
● Traffic: Up to a few hundred million messages
● Size: 4 m1.xlarge brokers
● ~8 topics
● Replication factor 1
● Retention 8-12H
● Default number of partitions: 8
● Vanilla configuration
Main reasons for migration: lack of storage capacity, limited parallelism due to low partition count, and forecast future needs.
9. Requirements for the Next Cluster
● More capacity, to support billions of messages
● Message replication, to prevent data loss
● Support loss of brokers, up to an entire AZ
● Much higher parallelism, to support more consumers
● Longer retention period: 48 hours on most topics
10. The New Cluster Changes
● 18 m1.xlarge brokers, 6 per AZ
● Replication factor of 3
● All partitions distributed between AZs
● Topic partition counts increased (between 12 and 120, depending on parallelism needs)
● 4 network and IO threads
● Default log retention: 48 hours
● Auto leader rebalance enabled
● Imbalance ratio set to the default 15%
Glossary
* Leader: for each partition there is a leader, which serves writes and reads; the other brokers replicate from it
* Imbalance ratio: the highest percentage of leadership a broker can hold; above that, auto rebalance is initiated
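The imbalance-ratio idea can be sketched as a leadership-share check. This is a simplified reading of the setting for illustration only; Kafka's actual config compares against preferred leaders, and all names below are invented.

```python
from collections import Counter

def leadership_share(partition_leaders):
    """Fraction of all partition leaderships held by each broker."""
    counts = Counter(partition_leaders.values())
    total = len(partition_leaders)
    return {broker: n / total for broker, n in counts.items()}

def over_threshold(partition_leaders, num_brokers, ratio=0.15):
    """Brokers leading more than `ratio` above their fair share."""
    fair = 1 / num_brokers
    return [broker for broker, share in leadership_share(partition_leaders).items()
            if share > fair * (1 + ratio)]

# partition -> current leader broker
leaders = {0: "b1", 1: "b1", 2: "b1", 3: "b2", 4: "b3", 5: "b3"}
print(over_threshold(leaders, num_brokers=3))  # ['b1']
```

This mirrors the failure mode described on the next slides: one broker holding half the leaderships while its peers idle, until a rebalance redistributes them.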
12. Problems
● Uneven distribution of leaders, which caused high load on specific brokers and, eventually, consumer lag and broker failures
● Constant rebalancing of broker leaders, which caused failures in the Python producers
13. Solutions
● Increase the number of brokers to 24, to improve broker leadership distribution
● Rewrite the Python producers in Clojure
● Decrease the number of partitions where high parallelism is not needed
15. Problems
● High iowait on the brokers
● Missing ISRs due to overloaded leaders
● Network bandwidth close to thresholds
● Lag in consumers
Glossary
* ISR: In-Sync Replicas
16. More Solutions
● Split into 2 clusters: "launches", which contains 80% of the messages, and all the rest
● Move the launches cluster to i2.2xlarge with local SSD
● Finer tuning of leaders
● Increase the number of IO and network threads
● Enable AWS enhanced networking
17. And a few more...
● Decrease the replication factor to 2 in the launches cluster, to reduce load on leaders, disk capacity, and AZ traffic costs
● Move the 2nd cluster to i2.2xlarge as well
● Upgrade ZooKeeper due to performance issues
18. Lessons Learned
● Minimize the replication factor as much as possible, to avoid extra load on the leaders
● Make sure the leader count is well balanced between brokers
● Balance the partition count to support parallelism
● Split clusters logically, considering traffic and business importance
● Retention (time based) should be long enough to recover from failures
● In AWS, spread the cluster between AZs
● Support dynamic cluster changes by clients
● Create automation for reassignment
● Save the cluster-reassignment.json of each topic for future needs!
● Don't be too cheap on the ZooKeepers
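The cluster-reassignment.json mentioned above follows the format consumed by Kafka's partition-reassignment tooling (kafka-reassign-partitions.sh). The topic name matches the deck's "launches" cluster, but the partition-to-broker mapping below is hypothetical:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "launches", "partition": 0, "replicas": [1, 4]},
    {"topic": "launches", "partition": 1, "replicas": [2, 5]}
  ]
}
```

Keeping these files around, as the slide advises, records the intended replica placement per topic, so a reassignment can be re-applied or audited after broker changes.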
19. Testing the Cluster
● Load test using kafka-producer-perf-test.sh & kafka-consumer-perf-test.sh
● Broker failure while running
● Entire AZ failure while running
● Reassign partitions on the fly
● Kafka dashboard contains: leader election rate, ISR status, offline partition count, log flush time, all-topics bytes in per broker, iowait, load average, disk capacity, and more
● Set appropriate alerts
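The producer perf-test script named above is invoked roughly as follows. Flag names are as in recent Kafka distributions (older releases used different flags), the topic name and sizes are illustrative, and a running cluster is required:

```shell
bin/kafka-producer-perf-test.sh --topic perf-test \
  --num-records 1000000 --record-size 512 --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092 acks=all
```

Running this while killing a broker, or an entire AZ, is exactly the kind of failure drill the bullets above describe: the load stays on while you watch leader elections and ISR status on the dashboard.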
20. Collecting Metrics & Alerting
● Using the Airbnb plugin for Kafka, sending metrics to Graphite
● Internal application that collects the lag for each topic and sends the values to Graphite
● Alerts are set on: lag for each topic, under-replicated partitions, broker topic metrics below threshold, leader re-election
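Per-partition consumer lag, as collected by the internal application above, is simply the broker's log-end offset minus the group's committed offset. A minimal sketch, with invented offset numbers:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition: how far the consumer group trails the log head.

    A partition with no committed offset is treated as starting from 0,
    i.e. the whole retained log counts as lag.
    """
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1500, 1: 900}, {0: 1400, 1: 900})
print(lag)  # {0: 100, 1: 0}
```

Alerting on this value per topic, as the deck does, catches both slow consumers and stuck ones; combined with time-based retention, sustained lag warns you before unread data expires.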