Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Kafka's basic terminology, its architecture, its protocol, and how it works.
Kafka at scale: its caveats, the guarantees it makes, and the use cases it offers.
How we use it @ZaprMediaLabs.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
A brief introduction to Apache Kafka, describing its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Kafka Tutorial - Introduction to Apache Kafka (Part 1), by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka? This slide deck is a tutorial for the Kafka streaming platform. It covers Kafka architecture with small examples from the command line, then expands on this with a multi-server example to demonstrate failover of brokers as well as consumers. It then goes through simple Java client examples for a Kafka producer and a Kafka consumer. We have also expanded the Kafka design section and added references. The tutorial covers Avro and the Schema Registry as well as advanced Kafka producers.
Kafka Tutorial - Introduction to the Kafka streaming platform, by Jean-Paul Azar
Why is Kafka so fast? Why is Kafka so popular? Why Kafka?
Introduction to the Kafka streaming platform. Covers Kafka architecture with small examples from the command line, then expands on this with a multi-server example. Lastly, we added simple Java client examples for a Kafka producer and a Kafka consumer. We have started to expand the Java examples to correlate with the design discussion of Kafka, and have also expanded the Kafka design section and added references.
Case Studies on Big-Data Processing and Streaming - Iranian Java User Group, by Amir Sedighi
During recent years, data science has undergone a big shift towards big data processing. As a result, a change in our methodology seems inevitable. This change, however, does not necessarily translate to a loss of decades of investment in classical data processing technologies and data warehousing. Instead, it supports adapting to the new environment of mass-produced business data by adopting modern practices.
In this talk we review some frameworks and solutions for modern big data processing, along with a few case studies that have been carried out in Iran.
Uber has one of the largest Kafka deployments in the industry. To improve scalability and availability, we developed and deployed a novel federated Kafka cluster setup which hides the cluster details from producers and consumers. Users do not need to know which cluster a topic resides on; clients see a single "logical cluster". The federation layer maps clients to the actual physical clusters and keeps the location of the physical cluster transparent to the user. Cluster federation brings several benefits that support our business growth and ease our daily operations:
Client control. Inside Uber there are a large number of applications and clients on Kafka, and it is challenging to migrate a topic with live consumers between clusters. Coordination with the users is usually needed to shift their traffic to the migrated cluster. Cluster federation enables much more control of the clients from the server side, allowing consumer traffic to be redirected to another physical cluster without restarting the application.
Scalability. With federation, the Kafka service can scale horizontally by adding more clusters when a cluster is full. Topics can freely migrate to a new cluster without notifying the users or restarting the clients. Moreover, no matter how many physical clusters we manage per topic type, users see only one logical cluster.
Availability. With a topic replicated to at least two clusters, we can tolerate a single cluster failure by redirecting the clients to the secondary cluster without performing a region failover. This also gives us much more freedom and alleviates the risk of carrying out important maintenance on a critical cluster: before the maintenance, we mark the cluster as secondary and migrate off the live traffic and consumers.
We will present the details of the architecture and several interesting technical challenges we overcame.
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp, by José Román Martín Gil
Apache Kafka is the most widely used data streaming broker among companies. It can manage millions of messages easily, and it is the base of many architectures built on events, microservices, orchestration, and now cloud environments. OpenShift is the most widespread Platform as a Service (PaaS). It is based on Kubernetes, and it helps companies easily deploy any kind of workload in a cloud environment. Thanks to many of its features, it is the base of many architectures built on stateless applications for new Cloud Native Applications. Strimzi is an open source community that implements a set of Kubernetes Operators to help you manage and deploy Apache Kafka brokers in OpenShift environments.
These slides introduce Strimzi as a new component on OpenShift for managing your Apache Kafka clusters.
Slides used at OpenShift Meetup Spain:
- https://www.meetup.com/es-ES/openshift_spain/events/261284764/
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015, by Monal Daxini
Keystone processes over half a trillion events per day, with peaks of 8 million events and 17 GB per second, with at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state, and at where the pipeline is headed next: offering self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Twitter’s Apache Kafka Adoption Journey | Ming Liu, Twitter (Hosted by Confluent)
Until recently, the Messaging team at Twitter had been running an in-house-built pub/sub system, EventBus (built on top of Apache DistributedLog and Apache BookKeeper, and similar in architecture to Apache Pulsar), to cater to our pub/sub needs. In 2018, we made the decision to move to Apache Kafka, migrating existing use cases as well as onboarding new use cases directly onto Apache Kafka. Fast forward to today: Kafka is now an essential piece of Twitter infrastructure and processes over 200M messages per second. In this talk, we will share the learnings and challenges from our journey moving to Apache Kafka.
In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
Apache Kafka - Scalable Message-Processing and more!, by Guido Schmutz
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably; they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
10. Message Delivery Semantics
● At most once
– Messages may be lost but are never redelivered.
● At least once
– Messages are never lost but may be redelivered.
● Exactly once
– This is what people actually want.
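In client code, the difference between the first two semantics comes down to when the consumer commits its offsets relative to processing. Below is a minimal sketch using the standard Java kafka-clients consumer API (the topic name "test" matches the CLI examples later in the deck; the group id is a placeholder): committing after processing gives at-least-once, while committing before processing would give at-most-once.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemanticsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");      // placeholder group id
        props.put("enable.auto.commit", "false"); // commit manually to control semantics
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("test"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // At-least-once: process first, commit after. A crash between the two
                // re-delivers the batch, so processing must tolerate duplicates.
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                consumer.commitSync();
                // At-most-once would instead commit *before* processing:
                // a crash after the commit silently loses the batch.
            }
        }
    }
}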
11. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

12. Apache Kafka
● Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
– Kafka is super fast.
– Kafka is scalable.
– Kafka is durable.
– Kafka is distributed by design.
15. Apache Kafka
● A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.
17. Apache Kafka
● Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.
19. Apache Kafka
● Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
21. Apache Kafka
● Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
26. Topic
● Topic
● Producer
● Consumer
● Broker

● Kafka maintains feeds of messages in categories called topics.
● Topics are the highest level of abstraction that Kafka provides.
34. Consumer
● Topic
● Producer
● Consumer
● Broker

● We'll call processes that subscribe to topics and process the feed of published messages, consumers.
– Hadoop Consumer
39. Topics
● A topic is a category or feed name to which messages are published.
● The Kafka cluster maintains a partitioned log for each topic.
40. Partition
● A partition is an ordered, immutable sequence of messages that is continually appended to, like a commit log.
● The messages in the partitions are each assigned a sequential id number called the offset (illustrated in the sketch below).
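Offsets are how a consumer addresses its position within a partition. Here is a minimal sketch, assuming the standard Java kafka-clients API and an existing topic "test" (the partition number and the offset 42 are purely illustrative), that rewinds a consumer to a chosen offset:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetSeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "seek-demo"); // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test", 0); // partition 0 of topic "test"
            consumer.assign(List.of(tp)); // manual assignment, no group rebalance
            consumer.seek(tp, 42L);       // jump to offset 42 (illustrative)
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}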
44. Producer
● The producer is responsible for choosing which message to assign to which partition within the topic.
– Round-Robin
– Load-Balanced
– Key-Based (Semantic-Oriented); see the sketch below
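Below is a minimal keyed-producer sketch using the standard Java kafka-clients API (the broker address and topic match the CLI examples that follow; the key "user-42" and the value are placeholders). With the default partitioner, records sharing a key hash to the same partition, which is the key-based strategy above; a null key instead spreads records across partitions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based (semantic) partitioning: the default partitioner hashes the
            // key, so "user-42" always maps to the same partition of "test".
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("test", "user-42", "clicked-checkout");
            RecordMetadata meta = producer.send(record).get(); // block for the ack
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}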
53. Create Topic
● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> Created topic "test".
54. List all Topics
● bin/kafka-topics.sh --list --zookeeper localhost:2181
55. Send some Messages by Producer
● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Hello DatisPars Guys!
How is it going with you?
56. Start a Consumer
● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
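Note that these commands reflect the ZooKeeper-era tooling this deck was written against. In newer Kafka releases the same tools connect directly to a broker with --bootstrap-server instead of --zookeeper, for example:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning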
59. Use Cases
● Messaging
– Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.
● Kafka provides customizable latency.
● Kafka has better throughput.
● Kafka is highly fault-tolerant.
60. Use Cases
● Log Aggregation
– Many people use Kafka as a replacement for a log aggregation solution.
– Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS, perhaps) for processing.
– In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
● Lower latency
● Easier support
61. Use Cases
● Stream Processing
– Storm and Samza are popular frameworks for stream processing. They both use Kafka.
● Event Sourcing
– Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
● Commit Log
– Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
62. Message Format
/**
 * A message. The format of an N byte message is the following:
 * If magic byte is 0
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 4 byte CRC32 of the payload
 *   3. N - 5 byte payload
 * If magic byte is 1
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 1 byte "attributes" identifier to allow annotations on the message
 *      independent of the version (e.g. compression enabled, type of codec used)
 *   3. 4 byte CRC32 of the payload
 *   4. N - 6 byte payload
 */
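To make the layout concrete, here is a small sketch that decodes one such message from a ByteBuffer and verifies its checksum. It is written directly from the comment above, so the field order and sizes assume exactly that format:

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class MessageParserSketch {
    // Parses the layout documented above and verifies the CRC32 of the payload.
    public static byte[] parse(ByteBuffer buf) {
        byte magic = buf.get();          // 1 byte "magic" identifier (format version)
        if (magic == 1) {
            byte attributes = buf.get(); // 1 byte "attributes" (e.g. compression codec)
        }
        long expectedCrc = buf.getInt() & 0xFFFFFFFFL; // 4 byte CRC32, read as unsigned
        byte[] payload = new byte[buf.remaining()];    // N - 5 (or N - 6) byte payload
        buf.get(payload);

        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expectedCrc) {
            throw new IllegalStateException("CRC mismatch: message corrupted");
        }
        return payload;
    }
}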