Kafka is a distributed, partitioned, replicated commit-log service that provides the functionality of a messaging system. It offers high throughput and scalability and guarantees message ordering within a partition. The four core APIs allow sending and receiving data streams and implementing connectors. Internally, Kafka is built on logs and uses ZooKeeper (in versions prior to KRaft) for cluster membership, controller election, and topic configuration. It is open-source software available on GitHub.
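To make the commit-log model concrete, here is a toy sketch (not the Kafka API) of a partitioned log: each append returns a (partition, offset) pair, records with the same key land on the same partition, and ordering holds within a partition.

```python
# Hypothetical toy model of a partitioned commit log; illustrates the
# per-partition ordering guarantee, not Kafka's actual implementation.

class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Records with the same key hash to the same partition,
        # so they are totally ordered relative to each other.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        return self.partitions[partition][offset]

log = PartitionedLog(num_partitions=3)
p, off = log.append("user-42", "login")
assert log.read(p, off) == "login"
```

Consumers in Kafka read in exactly this way: by (partition, offset), which is why ordering is guaranteed per partition rather than across the whole topic.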
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft (HostedbyConfluent)
Kafka Connectors are used extensively in data migration solutions, serving as a middle tier when migrating data across databases. In addition, microservice architectures also use Kafka Connectors heavily when communicating with one another while still operating independently on their own data stores. In this talk, we cover these use cases in more detail along with a deep dive into the architecture of the source and sink Kafka Connectors for Cosmos DB.
Real-Time Dynamic Data Export Using the Kafka Ecosystem (confluent)
(Preston Thompson, Braze) Kafka Summit SF 2018
If you collect billions of data points every day and create billions more sending and tracking messages, then you know you need to get your infrastructure right. Our clients use Braze to engage their users over their lifecycle via push notifications, emails, in-app messages and more. Using our Currents product, clients can enable multiple configurable integrations to export this event data in real time to a variety of third-party systems, allowing them to tightly integrate with the rest of their operations and understand the impacts of their engagement strategy.
We use Kafka and the Kafka ecosystem to power this high volume real-time export. As you’d expect in a big data environment, we take data collected from a variety of sources—our SDKs, email partner APIs, our own systems—and produce it to Kafka, with topics for each type of event (about 30 types). Kafka Streams filters and transforms this data according to the configurations set by our clients. Clients can choose which types of events should be sent to which third-party systems. Kafka Connect helps to export the data to third-party systems in real time using custom developed connectors. We run a connector instance for each integration for each customer that consumes from the integration-specific topic. On top of it all, we built a service to manage the pipeline. The service provides configurations to the Streams application and also creates topics for new integrations and uses the Connect REST API to create and manage connectors.
In this talk, I will discuss:
-How we started our journey in designing this large-scale streaming architecture
-Why streaming technologies were necessary to solve our technology and business issues
-The lessons we learned along the way that can help you with your Kafka-based architecture
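The management service described above drives Connect through its REST interface. A minimal sketch of creating a per-customer connector via the Connect REST API (the connector class and naming scheme below are hypothetical placeholders, not Braze's actual connector):

```python
import json
from urllib import request

def connector_payload(customer, integration, topic):
    """Build the JSON body for the Kafka Connect REST API
    (POST /connectors). The connector class and property names
    for the custom sink are illustrative placeholders."""
    return {
        "name": f"{customer}-{integration}-sink",
        "config": {
            "connector.class": "com.example.CustomSinkConnector",  # hypothetical
            "tasks.max": "1",
            "topics": topic,  # integration-specific topic
        },
    }

def create_connector(connect_url, payload):
    # POST the payload to the Connect worker's REST endpoint.
    req = request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)

payload = connector_payload("acme", "webhooks", "events.webhooks.acme")
assert payload["config"]["topics"] == "events.webhooks.acme"
```

Running one such call per customer-integration pair is what "a connector instance for each integration for each customer" amounts to operationally.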
Integrating Apache Kafka and Elastic Using the Connect Framework (confluent)
As a streaming platform, Apache Kafka provides low-latency, high-throughput, fault-tolerant publish and subscribe pipelines and excels at processing streams of real-time events. Kafka provides reliable, millisecond delivery for connecting downstream systems with real-time data.
In this talk, we will show how easy it is to leverage Kafka and the Elasticsearch connector to keep your indices populated with the latest data from the rest of your enterprise, as it changes.
It covers a brief introduction to Apache Kafka Connect, giving insights into its benefits, use cases, and the motivation behind building Kafka Connect, along with a short discussion of its architecture.
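As a concrete illustration, a sink configuration for streaming a topic into Elasticsearch might look like the following. The property names follow the Confluent Elasticsearch sink connector; the topic, endpoint, and name values are placeholders:

```python
# Sketch of a sink connector configuration that keeps Elasticsearch
# indices populated from a Kafka topic. Values are placeholders.
elasticsearch_sink = {
    "name": "orders-elasticsearch-sink",
    "config": {
        "connector.class":
            "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "tasks.max": "1",
        "topics": "orders",                        # source topic(s)
        "connection.url": "http://localhost:9200", # Elasticsearch endpoint
        "key.ignore": "true",                      # let the connector derive ids
        "schema.ignore": "true",                   # index schemaless JSON
    },
}

assert elasticsearch_sink["config"]["topics"] == "orders"
```

Submitted to a Connect worker, this keeps the `orders` index in sync with the topic as new events arrive, which is the "latest data, as it changes" behavior the talk describes.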
Understanding Kafka Produce and Fetch API Calls for High Throughput Applicat... (HostedbyConfluent)
The data team at Cloudflare uses Kafka to process tens of petabytes a day. All this data is moved using the two foundational Kafka API calls: Produce (API key 0) and Fetch (API key 1). Understanding the structure of these calls (and of the underlying RecordSet structure) is key to building high-throughput clients.
The talk describes the basics of the Kafka wire protocol (API keys, correlation IDs) and the structure of the Produce and Fetch calls. It shows how the asynchronous nature of the wire protocol can combine with the structure of the Produce and Fetch calls to increase latency and reduce client throughput; a solution is offered through the use of synchronous single-partition calls.
The RecordSet structure, which is used to encode and store sets (batches) of records, is described, and its implications for Fetch requests are discussed. The relationship between Fetch API calls and "consume" operations is discussed, as is the impact of offset alignment with RecordSet boundaries.
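For orientation, the request header the talk refers to can be sketched in a few lines: a simplified encoding of the v1 header (API key, API version, correlation ID, and a length-prefixed client ID). Real requests also carry a size prefix and an API-specific body, omitted here:

```python
import struct

PRODUCE, FETCH = 0, 1  # API keys, as noted in the abstract

def request_header(api_key, api_version, correlation_id, client_id):
    """Encode a simplified Kafka request header (v1): two big-endian
    int16s, an int32, and a length-prefixed client-id string."""
    cid = client_id.encode()
    return struct.pack(">hhih", api_key, api_version, correlation_id,
                       len(cid)) + cid

header = request_header(FETCH, 11, 42, "demo-client")
api_key, version, corr_id, name_len = struct.unpack_from(">hhih", header)
assert (api_key, corr_id) == (FETCH, 42)
assert header[10:10 + name_len].decode() == "demo-client"
```

The correlation ID is what lets a client match responses to requests on a shared connection; the asynchrony issues the talk describes arise precisely because many in-flight requests share that connection.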
Kafka Connect: Real-time Data Integration at Scale with Apache Kafka, Ewen Ch... (confluent)
Many companies are adopting Apache Kafka to power their data pipelines, including LinkedIn, Netflix, and Airbnb. Kafka’s ability to handle high throughput real-time data makes it a perfect fit for solving the data integration problem, acting as the common buffer for all your data and bridging the gap between streaming and batch systems.
However, building a data pipeline around Kafka today can be challenging because it requires combining a wide variety of tools to collect data from disparate data systems. One tool streams updates from your database to Kafka, another imports logs, and yet another exports to HDFS. As a result, building a data pipeline can take significant engineering effort and has high operational overhead because all these different tools require ongoing monitoring and maintenance. Additionally, some of the tools are simply a poor fit for the job: the fragmented nature of the data integration tools ecosystem leads to creative but misguided solutions, such as misusing stream processing frameworks for data integration purposes.
We describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Confluent building a real-time streaming platform using kafka streams and k... (Thomas Alex)
Jeremy Custenborder from Confluent talked about how Kafka brings an event-centric approach to building streaming applications, and how to use Kafka Connect and Kafka Streams to build them.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
Altitude San Francisco 2018: Scale and Stability at the Edge with 1.4 Billion... (Fastly)
Braze is a customer engagement platform that delivers more than a billion messaging experiences across push, email, apps and more each day. In this session, Jon Hyman will describe the company's challenges during an inflection point in 2015 when the company reached the limitation of their physical networking equipment, and how Braze has since grown more than 7x on Fastly. Jon will also discuss how Braze uses Fastly's Layer 7 load balancing to improve stability and uptime of its APIs.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming (Guozhang Wang)
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
Change data capture with MongoDB and Kafka (Dan Harvey)
In any modern web platform you end up with a need to store different views of your data in many different datastores. I will cover how we have coped with doing this in a reliable way at State.com across a range of different languages, tools and datastores.
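One way to picture the approach: shape each change event into a keyed Kafka record so every downstream datastore can rebuild its own view. The field names below follow MongoDB change stream documents; the topic-routing scheme is an assumption for illustration:

```python
# Sketch of shaping a MongoDB change event into a Kafka record
# (topic, key, value). Field names follow MongoDB change stream
# documents; the "cdc.<db>.<collection>" routing is our own convention.
def change_event_to_record(event):
    # Change streams carry the operation type, the document key,
    # and (for inserts/replacements) the full document.
    key = str(event["documentKey"]["_id"])
    value = {
        "op": event["operationType"],
        "doc": event.get("fullDocument"),
    }
    topic = f"cdc.{event['ns']['db']}.{event['ns']['coll']}"
    return topic, key, value

topic, key, value = change_event_to_record({
    "operationType": "insert",
    "documentKey": {"_id": 7},
    "fullDocument": {"_id": 7, "status": "new"},
    "ns": {"db": "app", "coll": "users"},
})
assert topic == "cdc.app.users" and value["op"] == "insert"
```

Keying by the document id means all changes to one document stay ordered on one partition, which is what makes per-datastore views rebuildable.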
You Must Construct Additional Pipelines: Pub-Sub on Kafka at Blizzard (confluent)
(Stephen Parente + Jeff Field, Blizzard) Kafka Summit SF 2018
Blizzard’s global data platform has become a driving force in both business and operational analytics. As more internal customers onboard with the system, there is increasing demand for custom applications to access this data in near real time. In order to avoid many independent teams with varying levels of Kafka expertise all accessing the firehose from our critical production Kafkas, we developed our own pub-sub system on top of Kafka to provide specific datasets to customers on their own cloud deployed Kafka clusters.
Embracing Database Diversity with Kafka and Debezium (Frank Lyaruu)
There was a time not long ago when we used relational databases for everything. Even if the data wasn’t particularly relational, we shoehorned it into relational tables, often because that was the only database we had. Thank god these dark times are over, and now we have many different kinds of NoSQL databases: document, realtime, graph, column. But that does not solve the problem that the same data might be a graph from one perspective and a collection of documents from another.
It would be really nice if we could access that same data in many different ways, depending on the context of what we want to achieve in our current task.
As software architects, we find this is not easy to solve, but it is definitely possible: we can design an architecture using event sourcing. Capture the data with Debezium, post it to a Kafka topic, use Kafka Streams to model the data the way we like, and store the data in various data stores, so we can synchronize data between data sources.
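The pipeline above can be sketched per event. Debezium wraps each change in an envelope with before/after states and an operation code ("c" create, "u" update, "d" delete); below is a minimal consumer-side materialization in pure Python, standing in for Kafka Streams and the target data store:

```python
# Toy materialization of Debezium-style change events into an
# in-memory view keyed by primary key (a stand-in for a downstream
# datastore). Assumes a primary-key column named "id".
def materialize(state, envelope):
    payload = envelope["payload"]
    row = payload["after"] or payload["before"]  # deletes only carry "before"
    pk = row["id"]
    if payload["op"] == "d":
        state.pop(pk, None)     # delete removes the row from the view
    else:
        state[pk] = payload["after"]  # create/update upsert the new state
    return state

view = {}
materialize(view, {"payload": {"op": "c", "before": None,
                               "after": {"id": 1, "name": "Ada"}}})
materialize(view, {"payload": {"op": "d", "before": {"id": 1, "name": "Ada"},
                               "after": None}})
assert view == {}
```

Running several such materializers against the same topic, one per target datastore, is exactly how the same data can be a graph in one store and a collection of documents in another.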
Agile Data Integration: How is it possible? (confluent)
In this talk, we are going to tell you the story of building the Connection Platform (CoPa). This is an endeavor undertaken at Generali Switzerland over the course of the last year, in a collaboration with Innovation Process Technology. The goal was to design a general purpose, state of the art integration platform, which covers all integration needs of the enterprise. The central data distribution and integration layer are powered by Confluent Kafka. We will throw a spotlight on three different aspects of this platform that, all in their own right, are essential for agile data integration.
First of all, the platform is hosted on the container platform Red Hat OpenShift. Everything is set up in flexible Docker containers, and automated pipelines are used to build, provision, and deploy everything on the platform, from infrastructure to data pipelines.
Confluent Operations Training for Apache Kafka (confluent)
Course Objectives
In this three-day hands-on course, you will learn how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka experts. You will learn how Kafka and the Confluent Platform work, their main subsystems, how they interact, and how to set up, manage, monitor, and tune your cluster.
For more information, please visit www.confluent.io/training/
Utilizing Kafka Connect to Integrate Classic Monoliths into Modern Microservi... (HostedbyConfluent)
Having started with classic monolith applications in the late 90s and adopted a new microservice architecture in 2015, our organization needed a convenient, reliable, and low-cost way to push changes back and forth between them, preferably one that utilized technology already on hand and could exchange information between multiple data stores.
In this session we will explore how Kafka Connect and its various connectors satisfied this need. We will review the two disparate tech stacks we needed to integrate, and the strategies and connectors we used to exchange information. Finally, we will cover some enhancements we made to our own processes including integrating Kafka Connect and its connectors into our CI/CD pipeline and writing tools to monitor connectors in our production environment.
Hands On With Spark: Creating A Fast Data Pipeline With Structured Streaming ... (Lightbend)
In this talk by Gerard Maas, O’Reilly author and Senior Software Engineer at Lightbend, we focus on choosing the right Fast Data stream processing features of Apache Spark, taking a practical, code-driven look at the two APIs available for this: the mature Spark Streaming and its younger sibling, Structured Streaming.
Introduction to Apache Kafka and Confluent... and why they matter (confluent)
Milano Apache Kafka Meetup by Confluent (First Italian Kafka Meetup) on Wednesday, November 29th 2017.
The talk introduces Apache Kafka (including the Kafka Connect and Kafka Streams APIs) and Confluent (the company founded by Kafka's creators), and explains why Kafka is an excellent, simple solution for managing data streams in the context of two of the main driving forces and industry trends: the Internet of Things (IoT) and microservices.
Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle... (HostedbyConfluent)
You have learned about Kafka event sourcing with streams and using Kafka as a database, but you may be having a tough time wrapping your head around what that means and what challenges you will face. Kafka’s exactly once semantics, data retention rules, and stream DSL make it a great database for real-time transaction processing. This talk will focus on how to use Kafka events as a database. We will talk about using KTables vs GlobalKTables, and how to apply them to patterns we use with traditional databases. We will go over a real-world example of joining events against existing data and some issues to be aware of. We will finish covering some important things to remember about state stores, partitions, and streams to help you avoid problems when your data sets become large.
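To make the stream-table join concrete, here is a toy model, not the Streams DSL: the table side is materialized as a dictionary keyed by record key, as a GlobalKTable would be, and each stream event is looked up against it:

```python
# Toy model of a stream-table inner join: "table" plays the role of
# state materialized from a compacted topic; unmatched events are dropped.
def join_stream_to_table(stream_events, table):
    enriched = []
    for key, event in stream_events:
        ref = table.get(key)       # lookup into materialized state
        if ref is not None:        # inner-join semantics: skip misses
            enriched.append({**event, "account": ref})
    return enriched

accounts = {"a1": {"tier": "gold"}}   # state built from a compacted topic
out = join_stream_to_table([("a1", {"amount": 10}), ("a2", {"amount": 5})],
                           accounts)
assert out == [{"amount": 10, "account": {"tier": "gold"}}]
```

The KTable vs. GlobalKTable distinction the talk covers is about where this dictionary lives: a KTable is partitioned across instances (so the join requires co-partitioned keys), while a GlobalKTable is fully replicated to every instance.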
Talk on the fundamentals of parallel computing, ranging from the basics and history of parallel computing to a comparison of recent trends in high-performance computing, delivered at the Indo-German Winter Academy 2009.
Rendez-vous de l’économie FTI Consulting en partenariat avec Odoxa - Les Echo... (FTI Consulting FR)
This month, our "FTI Consulting economic briefing in partnership with Odoxa, Les Echos, and Radio Classique" focuses on the health law.
The results of this survey were published this morning in Les Echos and broadcast on Radio Classique. Among the findings:
• Seven in ten French people support the generalization of third-party payment, the flagship measure of Marisol Touraine's health bill (PLS)
• Yet the French also understand physicians' opposition to this generalization of third-party payment
• Public health: the relaxation of the Evin law and the introduction of plain cigarette packaging divide the French, revealing a fairly strong left-right split
For developers new to MongoDB and Node.js, however, some of the common design patterns are very different from those of an RDBMS and traditional synchronous languages. Developers learning these technologies together may find it a bit bewildering. In reality, however, these tools fit perfectly together and enable a high degree of developer productivity and application performance.
This webinar will walk developers through common MongoDB development patterns in Node.js, such as efficiently loading data into MongoDB using MongoDB's bulk API, iterating through query results, and managing simultaneous asynchronous MongoDB queries to provide the best possible application performance. Working Node.js and MongoDB examples will be used throughout the presentation.
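The batching idea behind the bulk API is language-agnostic. Here is a pure-Python sketch of chunking a large insert into bounded bulk requests (no driver involved; the 1000-operation batch size is an assumption for illustration, not a driver constant):

```python
# Pure-Python sketch of the batching behind a bulk load: drivers cap
# how many operations go into one bulk request, so a large insert is
# split into fixed-size chunks that are sent one request at a time.
def chunked(docs, batch_size=1000):
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

docs = [{"_id": n} for n in range(2500)]
batches = list(chunked(docs))
assert [len(b) for b in batches] == [1000, 1000, 500]
```

Each chunk would then be handed to the driver's bulk operation; fewer round trips per document is what makes bulk loading efficient.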
MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector (MongoDB)
Presented by Luke Lovett, Software Engineer, MongoDB
Experience level: Introductory
MongoDB and Hadoop work powerfully together as complementary technologies. Learn how the Hadoop connector allows you to leverage the power of MapReduce to process data sourced from your MongoDB cluster.
I am sharing a presentation I prepared for a seminar on mobile, in which I explain what it involves to use mobile technology on phones and tablets within an ecosystem full of needs across different contexts of use. It shows that device size and the way we interact with the device are not, by themselves, enough to deliver a good user experience; the context of use, and the considerations it raises about our users' behavior (motivations and/or barriers), are also relevant when conceptualizing our product and defining its profitable objectives.
Topics in the presentation:
- What has the mobile phone replaced?
- Mobile context ≠ tablet context ≠ PC context
- Mental frames
- Contexts of use
- Contextual considerations for going mobile
- Some worldwide studies on smartphone use
Source: www.blog.pucp.edu.pe/ux
Webinar: Simplifying the Database Experience with MongoDB Atlas | MongoDB
MongoDB Atlas is our database as a service for MongoDB. In this webinar you’ll learn how it provides all of the features of MongoDB, without all of the operational heavy lifting, and all through a pay-as-you-go model billed on an hourly basis.
The importance of color in the user experience | Percy Negrete
Color in navigation can help users find what they are looking for faster by grouping related content. Here we reveal the power of color and its influence on users.
In this presentation you will find:
- Introduction
- How can we use colors appropriately?
- Cognitive overload
- Problems with colors
- How people interpret colors
- The Stroop effect
Source: www.blog.pucp.edu.pe/ux
Other realities, other impacts, other metrics: the new bibliometrics
1. The measurement of science
2. Historical milestones in bibliometric evaluation
From Bibliometrics (evaluation of the few, by the few, and for the few) to Webmetrics and Altmetrics: the popularization and democratization of scientific evaluation; evaluation of everyone, by everyone, for everyone, of everything, at all hours and in all places
3. The new bibliometrics:
3.1 Other realities
- New communication media: websites
- New communication media: blogs
- New communication media: Twitter
- New communication media: presentations
- New stores of bibliographic information: repositories
- New stores of bibliographic information: bibliographic reference managers
- Scientific social networks
3.2 Other impacts, other metrics
- Measuring the impact of websites
- Measuring the impact of a blog
- Measuring impact on Twitter
- Measuring the impact of presentations
- Measuring the impact of documents indexed in repositories
- Measuring the impact of documents indexed in the new stores of bibliographic information: bibliographic reference managers
- Measuring impact on scientific social networks
3.3 Other tools
- Building web rankings: macro level, micro level (Google Analytics)
- Google Scholar: the new "house of citations"
- The bibliometric derivatives of Google Scholar: Google Scholar Metrics, Google Scholar Citations
3.4 What future awaits the new metrics? What is the future of the new communication media? How many documents do the new metrics cover?
- What do we know about the new metrics? Common sense; empirical evidence
- What are the new indicators for?
- What impact do they measure? Scientific, professional, educational, social
- What do we know about Google Scholar as a source for scientific evaluation?
4. The risks of the new bibliometrics
- Problems: EPHEMERALITY
- The Google dependency. Problems: technological dependence
- The great danger: MANIPULATION
- Will metrics become an end in themselves?
Will measurement alter the very purpose of science?
A troubling future?
Neurodesign, a trend in experience design | Percy Negrete
Neurodesign is a new way of working and doing research in UX design in which the user is seen as an integrated whole: the user must be considered not only as a physical and mental subject, at the level of sensations and emotions, but also through the subconscious, which is what really drives behavioral decisions.
That is why experience designers should be part psychologist: by understanding not only design ergonomics but also the cognitive principles this discipline is built on, how the human brain works, and how to apply the principles of neurodesign, we can create more persuasive products and influence behavior.
It should be noted that these studies evaluate the brain's physiological and biological reactions to the stimuli presented, and are the closest we can get to evaluating the unconscious.
------------
This presentation was created by Sandra Vilchez and presented at the well-known Encuentro Americano de Diseño 2012 held in Palermo, Buenos Aires, Argentina.
More information about this presentation at: http://blog.pucp.edu.pe/item/165446/neuro-design
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka | Timothy Spann
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Scenic City Summit (2021): Real-Time Streaming in Any and All Clouds, Hybrid and Beyond | Timothy Spann
Scenic city summit real-time streaming in any and all clouds, hybrid and beyond
24-September-2021. Scenic City Summit. Virtual. Real-Time Streaming in Any and All Clouds, Hybrid and Beyond
Apache Pulsar, Apache NiFi, Apache Flink
StreamNative
Tim Spann
https://sceniccitysummit.com/
Building streaming data applications using Kafka [Connect + Core + Streams] b... | Data Con LA
Abstract:- Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing. In this talk you will learn more about: A quick introduction to Kafka Core, Kafka Connect and Kafka Streams through code examples, key concepts and key features. A reference architecture for building such Kafka-based streaming data applications. A demo of an end-to-end Kafka-based streaming data application.
Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alicia Moniz | HostedbyConfluent
As a data professional, you are the glue that makes cross-platform integrations possible. With the increase in adoption of hybrid cloud architectures, Kafka is an increasingly relevant tool for building data pipelines between platforms and accelerating delivery on cloud projects. Early exposure to Kafka on Azure capabilities gives you an edge to build better mousetraps at the design phase.
Customers already running Kafka on premises and are looking to extend Kafka systems to Azure can get started quickly with Confluent Cloud. Additionally, DevOps for self-managed options can be easily scalable with Ansible for Virtual Machines or containers via Azure Kubernetes Services or Azure Container Instances.
This session is presented from the Microsoft Solution Architect perspective by Israel Ekpo, Microsoft Cloud Solution Architect and Alicia Moniz, Microsoft MVP. They will cover use cases and scenarios, along with key Azure integration points and architecture patterns.
Building Streaming Data Applications Using Apache Kafka | Slim Baltagi
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform for building real-time streaming data pipelines and streaming data applications without the need for other tools/clusters for data ingestion, storage and stream processing.
In this talk you will learn more about:
1. A quick introduction to Kafka Core, Kafka Connect and Kafka Streams: What is and why?
2. Code and step-by-step instructions to build an end-to-end streaming data application using Apache Kafka
OSSNA Building Modern Data Streaming Apps | Timothy Spann
OSSNA
Building Modern Data Streaming Apps
https://ossna2023.sched.com/event/1Jt05/virtual-building-modern-data-streaming-apps-with-open-source-timothy-spann-streamnative
Timothy Spann
Cloudera
Principal Developer Advocate
Data in Motion
In my session, I will show you some best practices I have discovered over the last seven years in building data streaming applications, including IoT, CDC, Logs, and more. In my modern approach, we utilize several open-source frameworks to maximize all the best features. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there, we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We make continuous queries against our topics with Flink SQL. We will stream data into various open-source data stores, including Apache Iceberg, Apache Pinot, and others. We use the best streaming tools for the current applications with the open source stack - FLiPN. https://www.flipn.app/ Updates: This will be in-person with live coding based on feedback from the crowd. This will also include new data stores, new sources, and data relevant to and from the Vancouver area. This will also include updates to the platforms and inclusion of Apache Iceberg, Apache Pinot and some other new tech.
https://github.com/tspannhw/SpeakerProfile Tim Spann is a Principal Developer Advocate for Cloudera. He works with Apache Kafka, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Timothy J Spann
Cloudera
Principal Developer Advocate
Hightstown, NJ
Websitehttps://datainmotion.dev/
In this open marketing meeting, we discuss the major features for the Grizzly release, coming April 4, as well as a preview of the Summit and upcoming events.
IBM Message Hub service in Bluemix - Apache Kafka in a public cloud | Andrew Schofield
This talk was presented at the Kafka Meetup London meeting on 20 January 2016. You can find more information about Message Hub here: http://ibm.biz/message-hub-bluemix-catalog
Deep Dive on AWS Lambda - January 2017 AWS Online Tech Talks | Amazon Web Services
AWS Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume - there is no charge when your code is not running. With Lambda, you can run code for virtually any type of application or backend service - all with zero administration. Just upload your code and Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web or mobile app. In this session, we dive deep into AWS Lambda to learn about capabilities, features and benefits.
Learning Objectives:
• Dive deep into AWS Lambda
• Learn about the capabilities, features and benefits of AWS Lambda
• Learn about the different use cases
• Learn how to get started using AWS Lambda
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME | confluent
Confluent Platform is supporting London Metal Exchange’s Kafka Centre of Excellence across a number of projects with the main objective to provide a reliable, resilient, scalable and overall efficient Kafka as a Service model to the teams across the entire London Metal Exchange estate.
Watch this webcast here: https://www.confluent.io/online-talks/whats-new-in-confluent-platform-55/
Join the Confluent Product Marketing team as we provide an overview of Confluent Platform 5.5, which makes Apache Kafka and event streaming more broadly accessible to developers with enhancements to data compatibility, multi-language development, and ksqlDB.
Building an event-driven architecture with Apache Kafka allows you to transition from traditional silos and monolithic applications to modern microservices and event streaming applications. With these benefits has come an increased demand for Kafka developers from a wide range of industries. The Dice Tech Salary Report recently ranked Kafka as the highest-paid technological skill of 2019, a year removed from ranking it second.
With Confluent Platform 5.5, we are making it even simpler for developers to connect to Kafka and start building event streaming applications, regardless of their preferred programming languages or the underlying data formats used in their applications.
This session will cover the key features of this latest release, including:
-Support for Protobuf and JSON schemas in Confluent Schema Registry and throughout our entire platform
-Exactly once semantics for non-Java clients
-Admin functions in REST Proxy (preview)
-ksqlDB 0.7 and ksqlDB Flow View in Confluent Control Center
apidays LIVE Hong Kong 2021 - Multi-Protocol APIs at Scale in Adidas by Jesus de Diego | apidays
apidays LIVE Hong Kong 2021 - API Ecosystem & Data Interchange
August 25 & 26, 2021
Multi-Protocol APIs at Scale in Adidas
Jesus de Diego, API Evangelist at Adidas
Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
Presentation by Colin McCabe, Confluent, Big Data Day LA
Similar to Kafka. seattle data science and data engineering meetup (20)
5. Introduction: What is Kafka?
Distributed, partitioned, replicated commit-log service
Provides the functionality of a messaging system, but with a unique design
Competitive Landscape:
● AWS Kinesis, Azure EventHub
Use Cases:
● Messaging
● Website Activity Tracking
● Logging
● Stream Processing
6. Introduction: Characteristics
Scalability of a filesystem
- High throughput
- Many TB per server
Guarantees of a database
- Messages strictly ordered
- All data persistent
Distributed by default
- Replication
- Partitioning
7. Introduction: APIs
Four core APIs:
Producer API
allows applications to send streams of data to topics in the Kafka cluster.
Consumer API
allows applications to read streams of data from topics in the Kafka cluster.
Connect API
allows implementing connectors that continually pull from some source system or application into
Kafka or push from Kafka into some sink system or application.
Streams API
a generalization of batch processing in a real-time environment, with low-latency requirements.
There are two main challenges in collecting data: the large volume of data, and the variety of sources and destinations. The second challenge is then to analyze the collected data. To overcome these challenges, you need a messaging system.
Kafka is designed for distributed, high-throughput systems and tends to work very well as a replacement for a more traditional message broker. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication and inherent fault tolerance, which makes it a good fit for large-scale message-processing applications.
What is a Messaging System?
A messaging system is responsible for transferring data from one application to another, so the applications can focus on the data without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available: point-to-point and publish-subscribe (pub-sub). Most messaging systems follow the pub-sub pattern.
In a point-to-point system, messages are persisted in a queue. One or more consumers can consume the messages in the queue, but a particular message can be consumed by at most one consumer. Once a consumer reads a message in the queue, it disappears from that queue.
In the publish-subscribe system, messages are persisted in a topic. Unlike the point-to-point system, consumers can subscribe to one or more topics and consume all the messages in those topics. In the publish-subscribe system, message producers are called publishers and message consumers are called subscribers.
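The two patterns above can be sketched with a toy in-memory model (purely illustrative; the class and method names are invented and are not a Kafka or JMS API):

```python
from collections import deque

class PointToPointQueue:
    """Point-to-point: each message is consumed by at most one consumer."""
    def __init__(self):
        self._queue = deque()

    def send(self, msg):
        self._queue.append(msg)

    def receive(self):
        # Reading a message removes it from the queue.
        return self._queue.popleft() if self._queue else None

class PubSubTopic:
    """Publish-subscribe: every subscriber sees every message."""
    def __init__(self):
        self._subscribers = []

    def subscribe(self):
        inbox = deque()
        self._subscribers.append(inbox)
        return inbox

    def publish(self, msg):
        for inbox in self._subscribers:
            inbox.append(msg)
```

With two consumers on a `PointToPointQueue`, a sent message reaches only whichever consumer calls `receive()` first; with two subscribers on a `PubSubTopic`, both inboxes get a copy.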
Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on the disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits
Following are a few benefits of Kafka −
- Reliability − Kafka is distributed, partitioned, replicated and fault tolerant.
- Scalability − The Kafka messaging system scales easily without downtime.
- Durability − Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible, making them durable.
- Performance − Kafka has high throughput for both publishing and subscribing. It maintains stable performance even when many terabytes of messages are stored.
Kafka is very fast and guarantees zero downtime and zero data loss.
Use Cases
Kafka can be used in many Use Cases. Some of them are listed below −
- Metrics − Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log Aggregation − Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka's strong durability is also very useful in the context of stream processing.
Need for Kafka
Kafka is a unified platform for handling all real-time data feeds. It supports low-latency message delivery, guarantees fault tolerance in the presence of machine failures, and can handle a large number of diverse consumers. Kafka is very fast, capable of handling on the order of 2 million writes per second. Kafka persists all data to disk, which essentially means that all writes go to the page cache of the OS (RAM), making it very efficient to transfer data from the page cache to a network socket.
-----------
Kafka includes four core APIs:
The Producer API allows applications to send streams of data to topics in the Kafka cluster.
The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
The Streams API allows transforming streams of data from input topics to output topics.
The Connect API allows implementing connectors that continually pull from some source system or application into Kafka or push from Kafka into some sink system or application.
Kafka exposes all its functionality over a language-independent protocol, which has clients available in many programming languages. However, only the Java clients are maintained as part of the main Kafka project; the others are available as independent open source projects. A list of non-Java clients is available here.
2.1 Producer API
The Producer API allows applications to send streams of data to topics in the Kafka cluster.
Examples showing how to use the producer are given in the javadocs.
To use the producer, you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.1.0</version>
</dependency>
2.2 Consumer API
The Consumer API allows applications to read streams of data from topics in the Kafka cluster.
Examples showing how to use the consumer are given in the javadocs.
To use the consumer, you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.1.0</version>
</dependency>
2.3 Streams API
The Streams API allows transforming streams of data from input topics to output topics.
Examples showing how to use this library are given in the javadocs
Additional documentation on using the Streams API is available here.
To use Kafka Streams you can use the following maven dependency:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>0.10.1.0</version>
</dependency>
2.4 Connect API
The Connect API allows implementing connectors that continually pull from some source data system into Kafka or push from Kafka into some sink data system.
Many users of Connect won't need to use this API directly, though; they can use pre-built connectors without needing to write any code. Additional information on using Connect is available here.
Those who want to implement custom connectors can see the javadoc.
Kafka design fundamentals
Kafka is neither a queuing platform where messages are received by a single consumer out of the consumer pool, nor a publisher-subscriber platform where messages are published to all the consumers. In a very basic structure, a producer publishes messages to a Kafka topic (synonymous with "messaging queue"). A topic is also considered as a message category or feed name to which messages are published. Kafka topics are created on a Kafka broker acting as a Kafka server. Kafka brokers also store the messages if required. Consumers then subscribe to the Kafka topic (one or more) to get the messages. Here, brokers and consumers use Zookeeper to get the state information and to track message offsets, respectively. This is described in the following diagram:
In the preceding diagram, a single-node, single-broker architecture is shown with a topic having four partitions. In terms of components, the diagram shows all five components of the Kafka cluster: ZooKeeper, broker, topic, producer, and consumer.
In Kafka topics, every partition is mapped to a logical log file that is represented as a set of segment files of equal sizes. Every partition is an ordered, immutable sequence of messages; each time a message is published to a partition, the broker appends the message to the last segment file. These segment files are flushed to disk after configurable numbers of messages have been published or after a certain amount of time has elapsed. Once the segment file is flushed, messages are made available to the consumers for consumption.
All the message partitions are assigned a unique sequential number called the offset, which is used to identify each message within the partition. Each partition is optionally replicated across a configurable number of servers for fault tolerance.
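The partition mechanics described above (an ordered, immutable, append-only sequence addressed by sequential offsets, with per-key routing to a fixed partition) can be sketched as a toy in-memory model; this is illustrative only, not Kafka's storage code, and all names are invented:

```python
class Partition:
    """An ordered, immutable sequence of messages.

    Each appended message receives the next sequential offset,
    which uniquely identifies it within this partition.
    """
    def __init__(self):
        self._log = []

    def append(self, msg):
        offset = len(self._log)  # next sequential offset
        self._log.append(msg)
        return offset

    def read(self, offset):
        # Consumers address messages by offset within the partition.
        return self._log[offset]

class Topic:
    """A topic is a set of partitions; a keyed message always lands
    in the same partition, preserving per-key ordering."""
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, msg):
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(msg)
```

Publishing two messages with the same key returns the same partition index and strictly increasing offsets, which is the ordering guarantee the text describes.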
For each partition, one server acts as the leader and zero or more servers act as followers. The leader handles all read and write requests for the partition, while the followers asynchronously replicate data from the leader. Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught up to the leader and always persists the latest ISR set to ZooKeeper. If the leader fails, one of the followers (in-sync replicas) automatically becomes the new leader. In a Kafka cluster, each server plays a dual role: it acts as a leader for some of its partitions and as a follower for others. This balances load within the Kafka cluster.
The Kafka platform builds on lessons learned from both traditional messaging models and adds the concept of consumer groups. Here, each consumer is represented as a process, and these processes are organized into groups called consumer groups.
A message within a topic is consumed by a single process (consumer) within a consumer group; if a single message must be consumed by multiple consumers, those consumers need to be placed in different consumer groups. Consumers always consume messages from a particular partition sequentially and acknowledge the message offset; this acknowledgement implies that the consumer has consumed all prior messages.
Consumers issue asynchronous pull requests to the broker, each containing the offset of the message to be consumed, and receive back a buffer of bytes.
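The consumer-group rule (within a group, each partition is read by exactly one consumer, while different groups independently see all the data) can be sketched with a round-robin assignment; this is a deliberate simplification of Kafka's actual rebalancing protocol, and the function name is invented:

```python
def assign_partitions(partitions, consumers):
    """Round-robin the partitions over the consumers of one group,
    so that each partition has exactly one owner within the group."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        # Partition i goes to consumer i mod (group size).
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

Two different groups each run this assignment independently, so both groups consume every partition, while within a single group no partition (and hence no message) is delivered twice.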
In line with Kafka's design, brokers are stateless: the state of consumed messages is maintained by the consumer itself, and the broker keeps no record of what has been consumed by whom. (If this is poorly implemented, a consumer can end up reading the same message multiple times.) Because the broker does not know whether a message has been consumed, it cannot delete messages on consumption; instead, Kafka defines a time-based SLA (service level agreement) as its message retention policy. Under this policy, a message is automatically deleted once it has been retained in the broker longer than the configured SLA period. This retention policy lets consumers deliberately rewind to an old offset and re-consume data, even though, as with traditional messaging systems, this violates the usual queuing contract.
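The time-based retention policy can be sketched as follows. This is a simplification (real Kafka deletes whole segment files rather than individual messages, and retention is configurable per topic), and the function name and log representation are invented:

```python
def enforce_retention(log, retention_seconds, now):
    """Keep only messages younger than the retention period,
    regardless of whether any consumer has read them.

    `log` is a list of (timestamp, message) pairs; `now` is the
    current time in the same units as the timestamps.
    """
    return [(ts, msg) for ts, msg in log if now - ts <= retention_seconds]
```

Note what the predicate ignores: consumption state. A message can be deleted unread, or survive long after every group has read it, which is exactly the stateless-broker trade-off described above.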
Let's discuss the message delivery semantic Kafka provides between producer and consumer. There are multiple possible ways to deliver messages, such as:
Messages are never redelivered but may be lost (at most once)
Messages may be redelivered but never lost (at least once)
Messages are delivered once and only once (exactly once)
When publishing, a message is committed to the log. If a producer experiences a network error while publishing, it can never be sure whether the error happened before or after the message was committed. Once committed, the message will not be lost as long as at least one of the brokers that replicate its partition remains available. For guaranteed message publishing, the producer can be configured with settings such as the required acknowledgements and the time to wait for messages to be committed.
From the consumer's point of view, all replicas have exactly the same log with the same offsets, and the consumer controls its position in this log. Kafka guarantees at-least-once delivery when the consumer reads messages, processes them, and finally saves its position. If the consumer process crashes after processing messages but before saving its position, the consumer process that takes over the topic partition will receive the first few messages again, even though they have already been processed.
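The "process first, then save position" ordering is what produces at-least-once delivery. A minimal sketch, with the log, consumer position, and crash all simulated in memory:

```python
# Sketch: why "process the message, then save the position" yields
# at-least-once delivery.

def consume(log, saved_position, crash_before_commit_at=None):
    """Process messages starting at saved_position; optionally crash
    right after processing a message but before committing its offset."""
    processed = []
    pos = saved_position
    for offset in range(pos, len(log)):
        processed.append(log[offset])      # process the message
        if offset == crash_before_commit_at:
            return processed, pos          # crash: position not saved
        pos = offset + 1                   # save position (commit)
    return processed, pos

log = ["m0", "m1", "m2", "m3"]

# First consumer crashes after processing m1 but before committing it.
first, committed = consume(log, 0, crash_before_commit_at=1)
# A second consumer takes over from the last committed position and
# re-processes m1: delivered at least once, never lost.
second, _ = consume(log, committed)
```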
-------------------
Kafka Storage
Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of equal size. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. A segment file is flushed to disk after a configurable number of messages has been published or after a certain amount of time has elapsed. Messages are exposed to consumers only after they have been flushed.
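The segment layout described above can be sketched in memory. The segment size here is a tiny invented number of messages; real Kafka segments are sized in bytes:

```python
# Sketch of the storage layout: each partition is a log made of
# equal-size segment files; the broker appends to the last segment
# and rolls to a new one when it is full.

SEGMENT_SIZE = 3  # messages per segment (illustrative only)

class PartitionLog:
    def __init__(self):
        self.segments = [[]]           # list of segment "files"

    def append(self, message):
        last = self.segments[-1]
        if len(last) >= SEGMENT_SIZE:  # segment full: roll a new one
            last = []
            self.segments.append(last)
        last.append(message)

log = PartitionLog()
for i in range(7):
    log.append(f"msg-{i}")
```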
Unlike traditional messaging systems, a message stored in Kafka does not have an explicit message id.
Messages are addressed by their logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map message ids to actual message locations. Message ids are therefore incremental but not consecutive: the id of the next message is computed by adding the length of the current message to its logical offset.
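That arithmetic is simple enough to show directly (the message payloads are invented):

```python
# Sketch: message "ids" are logical byte offsets into the log, so the
# id of the next message is the current id plus the current message's
# length -- monotonically increasing, but not consecutive.

messages = [b"hello", b"kafka", b"log!"]

offsets = []
pos = 0
for msg in messages:
    offsets.append(pos)   # this message's logical offset / id
    pos += len(msg)       # next id = current id + message length
```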
A consumer always consumes messages from a particular partition sequentially, and if the consumer acknowledges a particular message offset, it implies that the consumer has consumed all prior messages. The consumer issues asynchronous pull requests to the broker to have a buffer of bytes ready to consume; each pull request contains the offset of the message to consume. Kafka exploits the sendfile API to efficiently deliver bytes from a log segment file on the broker to the consumer.
----------------------
Kafka Broker
Unlike other messaging systems, Kafka brokers are stateless. This means that the consumer has to maintain how much it has consumed; the consumer tracks this by itself, and the broker does nothing. This design is tricky, and innovative in itself.
It is tricky to delete a message from the broker, as the broker does not know whether the consumer has consumed it or not. Kafka solves this problem with a simple time-based SLA for its retention policy: a message is automatically deleted if it has been retained in the broker longer than a certain period.
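In practice retention is applied at segment granularity. A minimal sketch of the idea, with fabricated timestamps and an illustrative one-week retention period:

```python
# Sketch: time-based retention. The broker never tracks consumption;
# it simply drops segments older than the retention period.

RETENTION_SECONDS = 7 * 24 * 3600   # e.g. one week

def expire_segments(segments, now):
    """Keep only segments whose newest message is within retention."""
    return [s for s in segments if now - s["last_write"] <= RETENTION_SECONDS]

now = 1_000_000_000                 # fabricated "current time"
segments = [
    {"name": "seg-0", "last_write": now - 10 * 24 * 3600},  # 10 days old
    {"name": "seg-1", "last_write": now - 2 * 24 * 3600},   # 2 days old
]
remaining = expire_segments(segments, now)
```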
This design has a big benefit: a consumer can deliberately rewind to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers.
Role of ZooKeeper
A critical dependency of Apache Kafka is Apache ZooKeeper, a distributed configuration and synchronization service. ZooKeeper serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers share information via a ZooKeeper cluster, and Kafka stores basic metadata in ZooKeeper, such as information about topics, brokers, consumer offsets (queue readers), and so on.
Since all the critical information is stored in ZooKeeper, which normally replicates this data across its ensemble, the failure of a Kafka broker or a ZooKeeper node does not affect the state of the Kafka cluster; Kafka restores the state once the node restarts, giving zero downtime for Kafka. Leader election among the Kafka brokers, in the event of leader failure, is also done using ZooKeeper.
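The leader election mentioned here follows the standard ZooKeeper recipe: each candidate creates an ephemeral sequential znode, and the candidate with the lowest sequence number is the leader; when it dies, its ephemeral znode vanishes and the next-lowest takes over. A minimal in-memory sketch (broker names and sequence numbers invented):

```python
# Sketch of ZooKeeper-style leader election among brokers.
# candidates maps each broker to the sequence number of the
# ephemeral sequential znode it created.

def elect(candidates):
    """The candidate holding the lowest sequence number leads."""
    return min(candidates, key=candidates.get)

candidates = {"broker-0": 1, "broker-1": 2, "broker-2": 3}
leader = elect(candidates)
del candidates["broker-0"]      # leader crashes: its ephemeral znode vanishes
new_leader = elect(candidates)  # next-lowest sequence takes over
```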
-------------
ZooKeeper: ZooKeeper serves as the coordination interface between the Kafka brokers and consumers. The ZooKeeper overview given on the Hadoop Wiki site is as follows (http://wiki.apache.org/hadoop/ZooKeeper/ProjectDescription): "ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers (we call these registers znodes), much like a file system." The main differences between ZooKeeper and standard filesystems are that every znode can have data associated with it, and znodes are limited in the amount of data they can hold. ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on.
-------------
Zookeeper and Kafka
Consider a distributed system with multiple servers, each of which is responsible for holding data and performing operations on that data. Some potential examples are a distributed search engine, a distributed build system, or a well-known system like Apache Hadoop. One common problem with all these distributed systems is: how would you determine which servers are alive and operating at any given point in time? Most importantly, how would you do so reliably in the face of the difficulties of distributed computing, such as network failures, bandwidth limitations, variable-latency connections, security concerns, and anything else that can go wrong in a networked environment, perhaps even across multiple data centers? These types of questions are the focus of Apache ZooKeeper, a fast, highly available, fault-tolerant, distributed coordination service. Using ZooKeeper you can build reliable, distributed data structures for group membership, leader election, coordinated workflow, and configuration services, as well as generalized distributed data structures like locks, queues, barriers, and latches. Many well-known and successful projects already rely on ZooKeeper, including HBase, Hadoop 2.0, Solr Cloud, Neo4J, Apache Blur (incubating), and Accumulo.
ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between clients and provides an eventually consistent view of its znodes, which are like files and directories in a traditional file system. It provides basic operations such as creating, deleting, and checking the existence of znodes, and an event-driven model in which clients can watch for changes to specific znodes, for example when a new child is added to an existing znode. ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble, with each server holding an in-memory copy of the distributed file system to service client read requests.
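The znode hierarchy and the one-shot watch model can be sketched in memory (a real client would use a library such as kazoo; the paths here are illustrative):

```python
# Sketch of ZooKeeper's data model: a hierarchical namespace of znodes,
# each of which (unlike a plain filesystem directory) can also hold data,
# and on which clients can register one-shot watches.

class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}     # path -> data
        self.watches = {}           # path -> list of callbacks

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"no such parent znode: {parent}")
        self.nodes[path] = data
        self._fire(parent, ("child_added", path))

    def watch_children(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def _fire(self, path, event):
        # ZooKeeper watches are one-shot: fired once, then removed.
        for cb in self.watches.pop(path, []):
            cb(event)

events = []
tree = ZNodeTree()
tree.create("/brokers")
tree.watch_children("/brokers", events.append)
tree.create("/brokers/ids")     # fires the watch once
tree.create("/brokers/topics")  # no watch left: one-shot semantics
```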
Figure 4 above shows a typical ZooKeeper ensemble, in which one server acts as the leader while the rest are followers. When the ensemble starts, a leader is elected first, and all followers replicate their state from the leader. All write requests are routed through the leader, and changes are broadcast to all followers; this change broadcast is termed atomic broadcast.
Usage of ZooKeeper in Kafka: Kafka uses ZooKeeper for the same reason any distributed system does: coordination. ZooKeeper is used for managing and coordinating the Kafka brokers, and each Kafka broker coordinates with the other brokers through ZooKeeper. Producers and consumers are notified by the ZooKeeper service about the presence of a new broker in the Kafka system or the failure of an existing one; based on these notifications, they decide how to proceed and start coordinating their work with another broker. The overall Kafka system architecture is shown in Figure 5 below.
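The notification mechanism rests on well-known ZooKeeper paths: brokers register themselves under `/brokers/ids` as ephemeral znodes, so a broker's entry simply disappears when it fails. A sketch of that metadata (the host names and values are invented; the paths follow classic ZooKeeper-based Kafka):

```python
# Sketch: the kind of metadata Kafka keeps under well-known ZooKeeper
# paths. Broker registrations under /brokers/ids are ephemeral, so a
# dead broker's entry vanishes and clients learn of the failure.

metadata = {
    "/brokers/ids/0": {"host": "kafka-0.example.com", "port": 9092},
    "/brokers/ids/1": {"host": "kafka-1.example.com", "port": 9092},
    "/brokers/topics/events": {"partitions": {"0": [0, 1], "1": [1, 0]}},
    "/controller": {"brokerid": 0},
}

def live_brokers(md):
    """List broker ids with a registration znode still present."""
    return sorted(int(p.rsplit("/", 1)[1])
                  for p in md if p.startswith("/brokers/ids/"))

before = live_brokers(metadata)
del metadata["/brokers/ids/0"]   # broker 0 dies: ephemeral znode gone
after = live_brokers(metadata)
```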