The first presentation for the Kafka Meetup @ LinkedIn (Bangalore), held on 5th December 2015.
It provides a brief introduction to the motivation for building Kafka and how it works from a high level.
Please download the presentation if you wish to see the animated slides.
A brief introduction to Apache Kafka and a description of its usage as a platform for streaming data. It introduces some of the newer components of Kafka that help make this possible, including Kafka Connect, a framework for capturing continuous data streams, and Kafka Streams, a lightweight stream processing library.
Apache Kafka is an open-source message broker project written in Scala and developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
The session discusses how companies are using Apache Kafka and also covers under-the-hood details like partitions, brokers, and replication.
About Apache Kafka: Apache Kafka is a distributed streaming platform. It provides low-latency, high-throughput, fault-tolerant publish-subscribe pipelines and is able to process streams of events. Kafka provides reliable, millisecond responses to support both customer-facing applications and connecting downstream systems with real-time data.
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
Watch this talk here: https://www.confluent.io/online-talks/how-apache-kafka-works-on-demand
Pick up best practices for developing applications that use Apache Kafka, beginning with a high level code overview for a basic producer and consumer. From there we’ll cover strategies for building powerful stream processing applications, including high availability through replication, data retention policies, producer design and producer guarantees.
We’ll delve into the details of delivery guarantees, including exactly-once semantics, partition strategies and consumer group rebalances. The talk will finish with a discussion of compacted topics, troubleshooting strategies and a security overview.
This session is part 3 of 4 in our Fundamentals for Apache Kafka series.
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
This is the slide deck which was used for the talk 'Change Data Capture using Kafka' at the Kafka Meetup at LinkedIn (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
How are new IoT devices being designed, built and integrated with big data platforms such as Hadoop? Ammeon designs such systems to integrate with these platforms and provides critical support for new device creators bringing their products to market.
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem (Shirshanka Das)
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem, the technical details of LinkedIn’s new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop, and share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer some anecdotal evidence from the rollout that validated some of their decisions and is also shaping the future roadmap of these efforts.
Streaming Data Ingest and Processing with Apache Kafka (Attunity)
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system offering high throughput, reliability and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. In this talk, we’ll explain the motivation for making these changes, discuss the design of Kafka security, and explain how to secure a Kafka cluster. We will cover common pitfalls in securing Kafka, and talk about ongoing security work.
Continuous Processing with Apache Flink - Strata London 2016 (Stephan Ewen)
Talk from the Strata & Hadoop World conference in London, 2016: Apache Flink and Continuous Processing.
The talk discusses some of the shortcomings of building continuous applications via batch processing, and how a stream processing architecture naturally solves many of these issues.
Building Large-Scale Stream Infrastructures Across Multiple Data Centers with... (Confluent)
By Jun Rao
From the Bay Area Apache Kafka September 2016 Meetup.
Abstract: To manage the ever-increasing volume and velocity of data within your company you have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center powered by Apache Kafka. But what needs to be done if one data center is not enough? In this session we describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence. We provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication and mirroring as well as disaster scenarios and failure handling.
Kafka Streams is a new stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a natural DSL for writing stream processing applications. As such it is the most convenient yet scalable option to analyze, transform, or otherwise process data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Spark Streaming or Storm, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 (mumrah)
Apache Kafka is a new breed of messaging system built for the "big data" world. Coming out of LinkedIn (and donated to Apache), it is a distributed pub/sub system built in Scala. It has been an Apache TLP now for several months with the first Apache release imminent. Built for speed, scalability, and robustness, Kafka should definitely be one of the data tools you consider when designing distributed data-oriented applications.
The talk will cover a general overview of the project and technology, with some use cases, and a demo.
Like many other messaging systems, Kafka puts a limit on the maximum message size. A user will fail to produce a message if it is too large. This limit makes a lot of sense, and people usually send to Kafka a reference link which refers to a large message stored somewhere else. However, in some scenarios it would be good to be able to send messages through Kafka without external storage. At LinkedIn, we have a few use cases that can benefit from such a feature. This talk covers our solution for sending large messages through Kafka without additional storage.
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y... (Confluent)
(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases. Data is ingested from a wide variety of sources and the data can move effortlessly between an on premise datacenter, AWS and Google Cloud. The DataHub has evolved continuously over time to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding of new data sources is relatively painless and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
Operational Analytics on Event Streams in Kafka (Confluent)
Speaker: Anirudh Ramanthan, Product Manager, Rockset
Tracking key events and analyzing these event streams are critical to many enterprises. We highlight how organizations are using Apache Kafka® as a fast, reliable event streaming platform alongside Rockset, a serverless search and analytics engine, to create stateful microservices to analyze their event streams.
In this talk, we will discuss a stateful microservices architecture, where events from multiple channels are collected and streamed into Kafka and continuously ingested into Rockset with no explicit schema or metadata specification required. Developers then use serverless compute frameworks, like AWS Lambda, in conjunction with serverless data management from Rockset to build microservices to derive insights on the data from Kafka. Organizations can leverage this pattern to support low-latency queries on event streams, providing immediate insight on their business.
Building an intelligent big data application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra; Jaws, the open sourcing of our own Spark SQL RESTful service; our own contributions to the Spark and Mesos projects; and lessons learned.
DBCC 2021 - FLiP Stack for Cloud Data Lakes (Timothy Spann)
With Apache Pulsar, Apache NiFi, Apache Flink. The FLiP(N) Stack for Event processing and IoT. With StreamNative Cloud.
DBCC International – Friday 15.10.2021
Powered by Apache Pulsar, StreamNative provides a cloud-native, real-time messaging and streaming platform to support multi-cloud and hybrid cloud strategies.
___________________________________________
Meetup#7 | Session 2 | 21/03/2018 | Taboola
_____________________________________________
In this talk, we will present our multi-DC Kafka architecture, and discuss how we tackle sending and handling 10B+ messages per day, with maximum availability and no tolerance for data loss.
Our architecture includes technologies such as Cassandra, Spark, HDFS, and Vertica - with Kafka as the backbone that feeds them all.
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data" ecosystems, the Confluent Data Platform, and Kafka in architecture.
Consensus in Apache Kafka: From Theory to Production (Guozhang Wang)
In this talk I'd like to cover an everlasting story in distributed systems: consensus. More specifically, the consensus challenges in Apache Kafka, and how we addressed them, starting from theory in papers to production in the cloud.
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik (Hosted by Confluent)
Qlik is an industry leader across its solution stack, both on the Data Integration side of things with Qlik Replicate (real-time CDC) and Qlik Compose (data warehouse and data lake automation), and on the Analytics side with Qlik Sense. These two “sides” of Qlik are coming together more frequently these days as the need for “always fresh” data increases across organizations.
When real-time streaming applications are the topic du jour, those companies are looking to Apache Kafka to provide the architectural backbone those applications require. Those same companies turn to Qlik Replicate to put the data from their enterprise database systems into motion at scale, whether that data resides in “legacy” mainframe databases; traditional relational databases such as Oracle, MySQL, or SQL Server; or applications such as SAP and SalesForce.
In this session we will look in depth at how Qlik Replicate can be used to continuously stream changes from a source database into Apache Kafka. From there, we will explore how a purpose-built consumer can be used to provide the bridge between Apache Kafka and an analytics application such as Qlik Sense.
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K... (Timothy Spann)
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
StarCompliance is a leading firm specializing in the recovery of stolen cryptocurrency. Our comprehensive services are designed to assist individuals and organizations in navigating the complex process of fraud reporting, investigation, and fund recovery. We combine cutting-edge technology with expert legal support to provide a robust solution for victims of crypto theft.
Our Services Include:
Reporting to Tracking Authorities:
We immediately notify all relevant centralized exchanges (CEX), decentralized exchanges (DEX), and wallet providers about the stolen cryptocurrency. This ensures that the stolen assets are flagged as scam transactions, making it impossible for the thief to use them.
Assistance with Filing Police Reports:
We guide you through the process of filing a valid police report. Our support team provides detailed instructions on which police department to contact and helps you complete the necessary paperwork within the critical 72-hour window.
Launching the Refund Process:
Our team of experienced lawyers can initiate lawsuits on your behalf and represent you in various jurisdictions around the world. They work diligently to recover your stolen funds and ensure that justice is served.
At StarCompliance, we understand the urgency and stress involved in dealing with cryptocurrency theft. Our dedicated team works quickly and efficiently to provide you with the support and expertise needed to recover your assets. Trust us to be your partner in navigating the complexities of the crypto world and safeguarding your investments.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
5. Kafka Overview
▪ High-throughput distributed messaging system
▪ Kafka guarantees:
– At least once delivery
– Strong ordering
▪ Developed at LinkedIn and open sourced in early 2011
▪ Implemented in Scala and Java
11. How is Kafka used at LinkedIn?
▪ Monitoring (inGraphs)
▪ User tracking
▪ Email and SMS notifications
▪ Stream processing (Samza)
▪ Database Replication
12. Facts and figures
▪ Over 1,300,000,000,000 messages are produced to Kafka every day at LinkedIn
▪ 300 Terabytes of inbound and 900 Terabytes of outbound traffic
▪ 4.5 million messages per second on a single cluster
▪ Kafka runs on ~1300 servers at LinkedIn
22. Kafka at LinkedIn
▪ Multiple data centers
▪ Mirror data
▪ Cluster Types
– Tracking
– Metrics
– Queuing
▪ Data transport from applications to Hadoop, and back
23. Metrics collection
▪ Building Blocks
– Sensors
– RRD
– Front end
▪ Facts & Figures
– 320,000,000 metrics collected per minute
– 530 TB of disk space
– Over 210,000 metrics collected per service
27. How Can You Get Involved?
▪ http://kafka.apache.org
▪ Join the mailing lists
– users@kafka.apache.org
▪ irc.freenode.net - #apache-kafka
▪ Contribute
SRE stands for Site Reliability Engineering.
SRE combines several roles that fit together into one Operations position
Foremost, we are administrators. We manage all of the systems in our area
We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together
And we are also developers. We identify tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
At the end of the day, our job is to keep the site running, always.
Kafka is a distributed, partitioned, replicated commit log. Kafka guarantees at-least-once delivery of messages and strong ordering on a per-partition basis.
Some of the companies powered by Kafka. Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Kafka allows retention of data, which is a huge plus as it makes bootstrapping a new service from a past point in time easy.
There is durability due to redundancy at the partition level.
Horizontally scalable.
Most of the reads that hit the Kafka brokers are served from memory, which results in low-latency reads for a consumer that is relatively caught up.
Custom data expiry rules, as sketched below.
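Such expiry rules are exposed as per-topic retention overrides. A minimal sketch using the Java admin client from later Kafka releases (the broker address, the topic name "A" and the 7-day value are assumed examples):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RetentionExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            try (AdminClient admin = AdminClient.create(props)) {
                // Keep messages in topic "A" for 7 days, overriding the cluster default.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "A");
                ConfigEntry retention = new ConfigEntry("retention.ms", "604800000");
                Map<ConfigResource, Config> update =
                        Collections.singletonMap(topic, new Config(Collections.singletonList(retention)));
                admin.alterConfigs(update).all().get();
            }
        }
    }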
Apache Kafka was built at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams.
There were two major motivations:
1) The first problem was how to transport data between systems. We had lots of data systems and each of these needed reliable feeds of data in a geographically distributed environment
2) The second part of this problem was the need to do richer analytical data processing—the kind of thing that would normally happen in a data warehouse or Hadoop cluster—but with very low latency
It was evident that a system that catered to both the above needs would need to have high throughput and be horizontally scalable as well.
Initially, our approach was very ad hoc: we built custom piping between systems and applications on an as-needed basis and shoe-horned any asynchronous processing into request-response web services. Over time this set-up got more and more complex as we ended up building pipelines between all kinds of different systems.
After we introduced Kafka, the producers and the consumers got completely decoupled and this allowed services to just connect to a central system for all their data production/consumption needs without worrying about the other services which may be consuming/producing this data.
We have many use cases of Kafka at LinkedIn; here are summaries of a few of them.
Every application emits metrics into Kafka, and we have systems that read and store this data to generate graphs and thresholds.
User tracking of all website activities, clicks, page views, and experiments which we turn on for subsets of users. Each time you visit LinkedIn, many different services are called to generate the page you are looking at, and each service sends a message to Kafka with details of that request. We then later analyze all of that data with a Samza job that allows us to build a full call tree for the particular request. We can then use this data to troubleshoot issues on the site.
Samza, by the way, is another open source product developed at LinkedIn that our team supports.
All of the emails that get sent out from LinkedIn go through Kafka at least once, and often a few times. They are often generated in Hadoop and sent to a production system using Kafka, which then decorates the emails with additional information and sends them back into Kafka for another application to read and turn into an actual email.
We stream changes to our search indexes in real time through Kafka to allow us to update search results in real time.
We also use Kafka combined with Apache Samza to standardize things like Job titles, phone numbers and addresses.
We are also currently exploring the use case of using Kafka to replicate databases. The rough idea is that the stream of transactions received by one database can be copied over through Kafka to another database and replayed in the same order to achieve the same state as the first database.
All of the previous use cases I described, and many more, add up to a ton of data: 1.3 trillion messages per day.
As is evident, the total read traffic is almost three times the write traffic. This is where data retention really shines, as Kafka does not have to push the data to consumers every time it is read. The data resides on disk, and any consumer can access it and start reading from the Kafka cluster.
We replicate most of the data between datacenters to keep applications in sync.
Simple data structure
Writes happen at the tail
Messages are in chronological order from head to tail
Easy movement in stream by offset
Allows read scalability
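A minimal sketch of that offset-based movement using the Java consumer from later Kafka releases (the topic "A", partition 0 and offset 42 are assumed examples):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("A", 0);
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, 42L); // jump straight to offset 42 in the stream
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records)
                    System.out.println(record.offset() + ": " + record.value());
            }
        }
    }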
A “message” is a discrete unit of data within Kafka
Clients who send data into Kafka are called Producers
Clients who read data from Kafka are called Consumers
Every message that gets sent to Kafka belongs to a Topic, this allows for different types of data to be sent into a single cluster. The topic is then divided into multiple partitions for parallelism.
These partitions exist across the Kafka servers (brokers) that make up the Kafka cluster.
This diagram depicts how data is written into partitions.
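A minimal producer sketch in Java to make this concrete (the broker address, the topic "A" and the key/value strings are assumed examples; messages with the same key always land in the same partition):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The key ("user42") determines the partition; same key -> same partition -> per-partition ordering.
                producer.send(new ProducerRecord<>("A", "user42", "page_view"));
            }
        }
    }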
Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these—the consumer group.
Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes within a single host, or on separate machines.
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
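A minimal sketch of the group mechanics (the group name is an assumed example): run several copies with the same group.id for queue semantics, or give each copy its own group.id for publish-subscribe:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerGroupExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            props.put("group.id", "example-group"); // consumers sharing this name split the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("A"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1)))
                        System.out.println(record.partition() + "/" + record.offset() + ": " + record.value());
                }
            }
        }
    }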
This shows how the data flows through a cluster.
Kafka is a publish-subscribe messaging system, in which there are four components:
- Broker (what we call the Kafka server)
- Zookeeper (which serves as a data store for information about the cluster and consumers)
- Producer (sends data into the system)
- Consumer (reads data out of the system)
Data is organized into topics (here we show a topic named “A”) and topics are split into partitions (we have partitions 0 and 1 here).
A “message” is a discrete unit of data within Kafka. Producers create messages and send them into the system. The broker stores them, and any number of consumers can then read those messages.
In order to provide scalability, we have multiple brokers. By spreading out the partitions, we can handle more messages in any topic.
This also provides redundancy. We can now replicate partitions on separate brokers. When we do this, one broker is the designated “leader” for each partition. This is the only broker that producers and consumers connect to for that partition. The brokers that hold the replicas are designated “followers” and all they do with the partition is keep it in sync with the leader.
When a broker fails, one of the brokers holding an in-sync replica takes over as the leader for the partition. The producer and consumer clients have logic built-in to automatically rebalance and find the new leader when the cluster changes like this. When the original broker comes back online, it gets its replicas back in sync, and then it functions as the follower.
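A sketch of how such a replicated topic is requested, using the Java admin client from later Kafka releases (topic name, partition count and replication factor are assumed examples):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            try (AdminClient admin = AdminClient.create(props)) {
                // Topic "A": 2 partitions, each kept on 3 brokers (1 leader + 2 followers).
                admin.createTopics(Collections.singleton(new NewTopic("A", 2, (short) 3))).all().get();
            }
        }
    }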
Kafka is incredibly fast for a few reasons:
Most reads never actually hit the disk – usually consumers are caught up.
Disk head seek time is reduced due to linear I/O
On a read Kafka utilizes the sendfile() system call which allows the data to be directly written to a socket without first being loaded into the application. This reduces context switching.
Batching allows higher throughput and better compression
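Batching is controlled through ordinary producer settings. A minimal sketch extending the Properties object from the producer sketch above (the specific values are assumed examples, not tuned recommendations):

    // Additional producer settings that enable the batching and compression described above,
    // added to the same Properties object as in the earlier producer sketch.
    props.put("batch.size", "65536");        // collect up to 64 KB per partition before sending
    props.put("linger.ms", "10");            // wait up to 10 ms to fill a batch
    props.put("compression.type", "snappy"); // compress whole batches rather than single messages
    props.put("acks", "all");                // wait for in-sync replicas, trading latency for durability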
We run Kafka on hardware with lots of disk spindles in a RAID 10 configuration.
We put our Zookeeper clusters on SSDs which brought our average request latency down to zero milliseconds
We monitor Kafka in several different ways with tooling developed by the SRE team.
Lag monitoring: lag is defined as the number of messages between the newest message available in Kafka and the last message the consumer has read.
Under-replicated partitions: this is the count of follower replicas which have fallen behind the leader. This metric is reported per broker. In the healthy state it should always be zero.
Unclean leader elections. When this happens, data has been lost. This occurs when there is a leader failure and no follower was in sync at that time.
Burrow is a tool developed and open sourced by one of the Kafka SREs at LinkedIn. It is our new way of monitoring Lag within Kafka which uses velocity calculations to determine if a consumer is falling behind.
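The arithmetic behind lag monitoring is a per-partition subtraction: lag = newest available offset - last committed offset. A sketch using the Java clients from later Kafka releases (the group name and broker address are assumed examples); Burrow layers its velocity calculations on top of the same numbers:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // example address
            props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            try (AdminClient admin = AdminClient.create(props);
                 KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Offsets the group has committed, per partition.
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets("example-group").partitionsToOffsetAndMetadata().get();
                // Newest offset available in each of those partitions.
                Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());
                for (Map.Entry<TopicPartition, OffsetAndMetadata> e : committed.entrySet()) {
                    long lag = end.get(e.getKey()) - e.getValue().offset();
                    System.out.println(e.getKey() + " lag=" + lag);
                }
            }
        }
    }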
We have also developed tooling to ensure all brokers within a cluster are doing the same amount of work.
In the size-based balance we ensure that each broker has the same amount of data on disk. If they are not within our defined threshold, we move the optimal number of partitions around to make it balanced.
In the partition-based balance we ensure that each broker has the same number of partitions. If they are not within our defined threshold, we move the optimal number of partitions around to make it balanced.
Cluster types:
User activities on LinkedIn sites are tracked. These data flow into the tracking clusters. LinkedIn has multiple colos, and users are served from different colos based on their unique ID. The tracking data goes to the local tracking clusters. We have an aggregator cluster, which gets the data aggregated from the multiple colos using MirrorMaker. The downstream applications which process the tracking data consume from the aggregate clusters.
The OS and applications generate metrics, and these metrics are used for understanding the state of the system. These values are pumped into a separate metrics cluster. More about metrics in the next slide.
Queuing cluster is used for the traditional queuing scenarios when you have multiple applications and you want to coordinate their activities.
We at LinkedIn use Kafka for pumping metrics into our graphing engine – InGraphs.
The basic idea is that we have services which expose a certain set of metrics using MBeans, which are picked up by sensors, processed, and pumped into Kafka. These enriched metrics are all consumed by a service which filters metrics by tags and pushes this data into RRDs. These RRDs are used to generate the graphs which are served to the end user.
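A sketch of the first hop of such a pipeline, reading a single metric over JMX (the host, port and chosen broker MBean are assumed examples; a real sensor would tag the value and produce it into a metrics topic):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SensorSketch {
        public static void main(String[] args) throws Exception {
            // Connect to a JVM that exposes its metrics as MBeans (port 9999 is an example).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                ObjectName messagesIn = new ObjectName(
                        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
                Object rate = mbsc.getAttribute(messagesIn, "OneMinuteRate");
                // A real sensor would wrap this value with tags and send it to Kafka here.
                System.out.println("MessagesInPerSec (1-min rate): " + rate);
            }
        }
    }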
This is just a sample screenshot of final graphs in InGraphs. Different colors correspond to different hosts.
One new use case for Kafka at LinkedIn is for Database replication. In this diagram we show how this is done.
The database on the left streams its transaction log into Kafka. The data replicator consumes the transaction log stream from Kafka and replays them into the database on the right. This is a great method for doing cross-datacenter replication of databases.
One of the obvious advantages over traditional master-slave database replication is the decoupling of the two databases.
To initially start the secondary database you first must create a backup snapshot of the data in DB1, and load it into DB2. After that DB2 can listen to the transaction log stream via Data Replicator and stay in sync.
This also works for a master-master relationship where you stream the transactions originating in the second colo back to the database in the first colo. Additional filtering logic is added to Data Replicator to ensure that a loop is not created; in other words, a transaction originating in colo A needs to be mirrored to colo B but should not be replicated back to colo A.
So how can you get more involved in the Kafka community?
The most obvious answer is to go to kafka.apache.org. From there you can:
1) Join the mailing lists, either on the development or the user side
2) You can also dive into the source repository, and work on and contribute your own tools back.