Kafka - A little introduction

A brief run-through of Kafka and some of its interesting characteristics that make it a great messaging system for collecting and aggregating data.

Technology

Disk/Memory Performance
[Chart: read values/second (log scale, 1 to 1000M) for Disk, SSD, and Memory, comparing random access with sequential access. Annotation: sequential disk read is faster than random memory read. Source: http://queue.acm.org/detail.cfm?id=1563874]

Length    Magic Value    Checksum    Payload
4 bytes   1 byte         4 bytes     n bytes

Input token: Offset: 0, Broker: kafka.local, Topic: Testing
MR Job
Output token: Offset: 130098, Broker: kafka.local, Topic: Testing
Output: Sequence File

Useful Things

• http://incubator.apache.org/kafka/
• https://github.com/pingles/clj-kafka

1. Kafka: A little introduction

3. Pub-Sub Messaging System

8. Distributed

10. Performance

11-14. Disk/Memory Performance: chart of read values/second for Disk, SSD, and Memory, random vs. sequential access; the final build notes that sequential disk read is faster than random memory read. Source: http://queue.acm.org/detail.cfm?id=1563874

15. Persistent

19. Message format: Length (4 bytes), Magic Value (1 byte), Checksum (4 bytes), Payload (n bytes)

23-24. Hadoop job: an input token (Offset: 0, Broker: kafka.local, Topic: Testing) feeds an MR job, which outputs a sequence file and a new token (Offset: 130098, Broker: kafka.local, Topic: Testing)

26. Useful Things • http://incubator.apache.org/kafka/ • https://github.com/pingles/clj-kafka

Editor's Notes

Built by LinkedIn to process and store high-volume activity stream data, but it's really a general-purpose messaging system...

At its heart, it's a pub-sub messaging system...

It starts with a broker.

Publishers connect to the broker

and send their messages.

So we connect some consumers and they can pull messages.
Note that when consumers connect, they receive all messages for a topic, not just those sent since they connected. More on that later...
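
The broker, publisher, and pulling-consumer roles described above can be sketched in a few lines. This is a toy in-memory model, not Kafka's API: `ToyBroker`, `publish`, and `pull` are hypothetical names, and positions here are simple list indices rather than the byte offsets discussed later.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        """Append a message to the topic and return its position."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def pull(self, topic, position=0, max_messages=100):
        """Return (messages, next_position), reading from `position`."""
        log = self._topics[topic]
        batch = log[position:position + max_messages]
        return batch, position + len(batch)


broker = ToyBroker()
broker.publish("Testing", b"first")
broker.publish("Testing", b"second")

# A consumer that connects late still starts at position 0,
# so it sees everything published before it connected.
messages, position = broker.pull("Testing")

# The consumer, not the broker, remembers where it got to.
broker.publish("Testing", b"third")
later, position = broker.pull("Testing", position)
```

The point of the sketch is that the broker never tracks who has read what; each consumer asks for messages from a position of its choosing.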

But it's also distributed, which is to say...

we can have multiple brokers in multiple places and aggregate their data together...
Internally we can also partition within topics to allow parallel consumption, but that's for another talk...

Before we get into what makes it particularly different (persistence), it's useful to understand some of the engineering decisions behind how it works.
Performance is interesting because the behaviour of disks and memory has informed the way Kafka has been built to embrace disk persistence.

Research from an ACM paper.
Values/sec is the number of four-byte integer values read per second from a 1-billion-long (4 GB) array on disk or in memory.
Kafka uses the OS's default page caching rather than custom in-memory stores. Given that all disk writes and reads will be cached by the OS anyway, this means we can avoid paying the caching overhead of objects within the JVM.
Rather than maintaining everything in memory and flushing when necessary, everything is written immediately.
Configurable flushing determines how much data is at risk.
Similar to Varnish.
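
To illustrate the trade-off between writing immediately and flushing on a configurable interval, here is a minimal sketch. It assumes a flush policy of "force to disk every N messages"; `AppendLog` and `flush_every` are hypothetical names for illustration, not Kafka settings.

```python
import os
import tempfile


class AppendLog:
    """Append-only file that hands every write to the OS straight away
    and forces it to disk every `flush_every` messages."""

    def __init__(self, path, flush_every=3):
        self._file = open(path, "ab")
        self._flush_every = flush_every
        self._unflushed = 0
        self.fsync_count = 0

    def append(self, data):
        self._file.write(data)
        self._file.flush()  # hand the bytes to the OS page cache
        self._unflushed += 1
        if self._unflushed >= self._flush_every:
            os.fsync(self._file.fileno())  # force the page cache to disk
            self._unflushed = 0
            self.fsync_count += 1

    def at_risk(self):
        """Messages written but not yet forced to disk."""
        return self._unflushed

    def close(self):
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    log = AppendLog(os.path.join(tmp, "topic.log"), flush_every=3)
    for i in range(7):
        log.append(b"message %d\n" % i)
    fsyncs, at_risk = log.fsync_count, log.at_risk()
    log.close()
```

With seven messages and a flush interval of three, two forced flushes happen and one message is still only in the page cache, which is the data "at risk" if the machine loses power.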

It starts with a topic, a text description of the messages contained within. We use it to decide how to deserialize the message bytes.

So we send a message to the topic. What happens?

Kafka creates a file and persists the message, which is to say it hands it off to the OS to write.
Files are just sets of bytes, nothing clever.
Internally it abstracts the collection of message bytes into a message set, which is then backed by a file.
So what does each message look like...

So, with an n-byte payload, our message length is n + 9 bytes (4 bytes length + 1 byte magic value + 4 bytes checksum + n bytes payload).
With a 91-byte payload we have a 100-byte message,
which means our next message would start at offset 100.
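
The offset arithmetic above can be checked with a small framing sketch. Several details are assumptions not stated on the slide: big-endian byte order, a CRC32 checksum computed over the payload, and a length field that counts the bytes following it. `encode_message` and `decode_message` are hypothetical helper names.

```python
import struct
import zlib

# Length (4 bytes), magic value (1 byte), checksum (4 bytes) = 9 header bytes.
HEADER = struct.Struct(">IBI")


def encode_message(payload, magic=0):
    """Frame a payload as length + magic + checksum + payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    length = 1 + 4 + len(payload)  # assumption: bytes after the length field
    return HEADER.pack(length, magic, checksum) + payload


def decode_message(buf, offset=0):
    """Read one message at `offset`; return (payload, next_offset)."""
    length, magic, checksum = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    end = offset + 4 + length
    payload = bytes(buf[start:end])
    if zlib.crc32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch at offset %d" % offset)
    return payload, end


# Two messages back to back, as they would sit in a file.
log = encode_message(b"x" * 91) + encode_message(b"hello")
first, second_offset = decode_message(log, 0)
second, end_offset = decode_message(log, second_offset)
```

Because each message carries its own length, a reader needs only a byte offset to find a message and to compute where the next one begins.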

And we can see our offsets at the bottom...

So we have the offsets, which let us send all messages to consumers, not just those that were sent after they connected...

It's up to consumers to remember what they've consumed, but this means you can re-consume an entire set of messages easily, which is very useful when integrating with long-term storage like HDFS...
A quick look at the way it works:

Our input to the Hadoop job is a token file that specifies the offset to read from, the topic, etc.
Having read the token, the mapper connects and consumes messages from the given offset.
The mapper outputs two sets of data: the mapped output, such as the message payloads, and an updated token file with the last read offset.
This is the key: successful completion of the job results in new metadata for the next run as well as the output data.
It means that if the job fails we can re-run it and it'll consume from the last consumed offset.
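
The token hand-off can be sketched without Hadoop. In this toy version the token is a plain dictionary, the log is a Python list, and offsets are message indices rather than byte offsets; `run_job` is a hypothetical function, not part of any Kafka or Hadoop API.

```python
def run_job(log, token, process):
    """Consume from token['offset'] onwards.

    Returns (outputs, new_token) only if every message is processed;
    if `process` raises, the caller's token is left untouched.
    """
    offset = token["offset"]
    outputs = []
    for message in log[offset:]:
        outputs.append(process(message))
        offset += 1
    return outputs, {**token, "offset": offset}


log = [b"a", b"b", b"c"]
token = {"broker": "kafka.local", "topic": "Testing", "offset": 0}


def flaky(message):
    raise RuntimeError("simulated job failure")


try:
    run_job(log, token, flaky)
except RuntimeError:
    pass  # no new token was produced, so a re-run starts at the same offset

# A successful run yields both the output data and the next token.
outputs, token = run_job(log, token, bytes.upper)

# The new token is the input to the next run, which sees only new messages.
log.append(b"d")
more, token = run_job(log, token, bytes.upper)
```

The design choice worth noticing is that the new token is produced only alongside the output, so a failed run cannot advance the offset.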

The newly created output becomes the next input.

And this is why Kafka is an interesting messaging system:
suitable for batch and realtime.
