Data processing use cases, from transformation to analytics, perform tasks that require various combinations of queuing, streaming & lightweight processing steps. Until now, supporting all of those needs has required different systems for each task--stream processing engines, messaging queuing middleware, & streaming messaging systems. That has led to increased complexity for development & operations.
In this session, well discuss the need to unify these capabilities in a single system & how Apache Pulsar was designed to address that. Apache Pulsar is a next generation distributed pub-sub system that was developed & deployed at Yahoo. Streamlios Karthik Ramasamy, will explain how the architecture & design of Pulsar provides the flexibility to support developers & applications needing any combination of queuing, messaging, streaming & lightweight compute.
How Orange Financial combat financial frauds over 50M transactions a day usin...JinfengHuang3
You will learn how Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar. The presentation is shared at Strata Data Conference at New York, US, 2019/09.
Linked In Stream Processing Meetup - Apache PulsarKarthik Ramasamy
Apache Pulsar is a fast, highly scalable, and flexible pub/sub messaging system. It provides guaranteed message delivery, ordering, and durability by backing messages with a replicated log storage. Pulsar's architecture allows for independent scalability of brokers and storage nodes. It supports multi-tenancy, geo-replication, and high throughput of over 1.8 million messages per second in a single partition.
High performance messaging with Apache PulsarMatteo Merli
Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
Matteo Merli and Sijie Guo from Streamlio gave a hands-on workshop on Apache Pulsar. #fast #durable #pubsub #messaging system. A low latency alternative to #kafka.
This is an overview of interesting features from Apache Pulsar. Keep in mind that by the time I did this presentation I did not have used Pulsar yet. It's just my first impressions from the list of features.
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...StreamNative
Nowadays, real-time computation is heavily used in cases such as online product recommendation, online payment fraud detection and etc.. In the streaming pipeline, Kafka is normally used to store a day/week data, but won't store years-long data, as in looking at the trend historically. So, a batch pipeline is needed for historical data computation. Thus, it's where the Lambda architecture comes in. Lambda has been proved to be effective, and a good balance of speed and reliability. We have been running many systems with Lambda architecture for many years. But the biggest detraction to Lambda architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and streaming layers. With that, we have to split our business logic into many segments across different places, which is a challenge to maintain as the business grows and it also increases communication overhead. Secondly, the data are duplicated in two different systems, and we have to move data among different systems for processing. With those challenges, we have been searching for alternatives and found Apache Pulsar a great fit. In this topic, I will show how we solve those problems with Apache Pulsar by making pulsar a unified storage backend for both batch and streaming pipeline, a solution that simplifies the s/w stack, lifts up our work efficiency and lowers the cost at the same time.
Pulsar is a distributed pub/sub messaging platform developed by Yahoo. It provides scalable messaging with persistence, ordering and delivery guarantees. Pulsar is used extensively at Yahoo, handling 100 billion messages per day across 80+ applications. It provides common use cases like messaging queues, notifications and feedback systems. Pulsar's architecture uses brokers for client interactions, Apache BookKeeper for durable storage, and Zookeeper for coordination. Future work includes adding encryption, globally consistent topics, and C++ client support.
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...Streamlio
Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. This presentation from Strata 2017 in New York provides an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.
How Orange Financial combat financial frauds over 50M transactions a day usin...JinfengHuang3
You will learn how Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar. The presentation is shared at Strata Data Conference at New York, US, 2019/09.
Linked In Stream Processing Meetup - Apache PulsarKarthik Ramasamy
Apache Pulsar is a fast, highly scalable, and flexible pub/sub messaging system. It provides guaranteed message delivery, ordering, and durability by backing messages with a replicated log storage. Pulsar's architecture allows for independent scalability of brokers and storage nodes. It supports multi-tenancy, geo-replication, and high throughput of over 1.8 million messages per second in a single partition.
High performance messaging with Apache PulsarMatteo Merli
Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
Matteo Merli and Sijie Guo from Streamlio gave a hands-on workshop on Apache Pulsar. #fast #durable #pubsub #messaging system. A low latency alternative to #kafka.
This is an overview of interesting features from Apache Pulsar. Keep in mind that by the time I did this presentation I did not have used Pulsar yet. It's just my first impressions from the list of features.
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...StreamNative
Nowadays, real-time computation is heavily used in cases such as online product recommendation, online payment fraud detection and etc.. In the streaming pipeline, Kafka is normally used to store a day/week data, but won't store years-long data, as in looking at the trend historically. So, a batch pipeline is needed for historical data computation. Thus, it's where the Lambda architecture comes in. Lambda has been proved to be effective, and a good balance of speed and reliability. We have been running many systems with Lambda architecture for many years. But the biggest detraction to Lambda architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and streaming layers. With that, we have to split our business logic into many segments across different places, which is a challenge to maintain as the business grows and it also increases communication overhead. Secondly, the data are duplicated in two different systems, and we have to move data among different systems for processing. With those challenges, we have been searching for alternatives and found Apache Pulsar a great fit. In this topic, I will show how we solve those problems with Apache Pulsar by making pulsar a unified storage backend for both batch and streaming pipeline, a solution that simplifies the s/w stack, lifts up our work efficiency and lowers the cost at the same time.
Pulsar is a distributed pub/sub messaging platform developed by Yahoo. It provides scalable messaging with persistence, ordering and delivery guarantees. Pulsar is used extensively at Yahoo, handling 100 billion messages per day across 80+ applications. It provides common use cases like messaging queues, notifications and feedback systems. Pulsar's architecture uses brokers for client interactions, Apache BookKeeper for durable storage, and Zookeeper for coordination. Future work includes adding encryption, globally consistent topics, and C++ client support.
Messaging, storage, or both? The real time story of Pulsar and Apache Distri...Streamlio
Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. This presentation from Strata 2017 in New York provides an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.
Nozomi from Yahoo! Japan gave a presentation how Yahoo! Japan uses Apache Pulsar to build their internal messaging platform for processing tens of billions of messages every day. He explains why Yahoo! Japan choose Pulsar and what are the use cases of Apache Pulsar and their best practices.
#PulsarBeijingMeetup
Apache Pulsar, Supporting the Entire Lifecycle of Streaming DataStreamNative
We have long stressed that there is more and more a need for unified messaging and streaming and that Apache Pulsar is the platform that better supports this vision and makes it possible, at a large scale. In his talk, Matteo Merli will show how we can take this messaging & streaming unified paradigm one step further, to fully take advantage of their integration. The result is a drastically simplified architecture: a single system that is able to support the data throughout its entire lifecycle, from when the event is happening down to the historical archiving. The ramifications of this shift are big, as we can see Pulsar is in the perfect spot to enable tighter integration between online and offline worlds.
Elastic Data Processing with Apache Flink and Apache PulsarStreamNative
More and more applications are using Flink for low-latency data processing. Flink unifies batch and stream processing using one computation engine. However in reality, in order to really unify batch and stream processing, it requires a data system offers one unified data representation for both batch and streaming data. Nowadays, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystem and object stores. That means that data scientists still need write two different computing jobs to access same data stored in different data systems.
Apache Pulsar is the next generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from Apache Incubator and become a Top-Level-Project. Pulsar separates messaging serving and data storage into two layers. Such layered architecture provides high throughput and low-latency while ensuring high availability and scalability. Pulsar’s segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system, which can well fit into Flink’s computation model.
In this talk, Sijie Guo from Apache Pulsar PMC, will introduce Pulsar and its layered architecture and segment-centric storage, detailing how this architecture can well integrate with Flink to provide elastic unified batch and stream processing.
Both Apache Pulsar and Apache Flink share a similar view on how the data and the computation level of an application can be “streaming-first” with batch as a special case streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale, and build a real streaming warehouse.
In this talk, Sijie Guo from Apache Pulsar community will given an overview of Apache Pulsar and how it provides the unified data view to fully leverage Apache Flink unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre ZembStreamNative
OVHcloud is the biggest European cloud provider. From dedicated servers to Managed Kubernetes, from VMware® based Hosted Private Cloud to OpenStack-based Public Cloud, we have over 1.4 million customers worldwide.
Internally, we have been running Apache Kafka for years, and despite all the skills obtained operating multiples clusters with millions of messages per second, we decided to shift and build the foundation of our 'topic-as-a-service' product called ioStream on Apache Pulsar.
In this talk, you will have the insights of why we decided to use Apache Pulsar instead of Apache Kafka as the core of ioStream. We will tell you our journey to use Apache Pulsar, from our deployments to the management, what did work and what did not.
Apache Kafka is a distributed streaming platform. It provides a high-throughput distributed messaging system that can handle trillions of events daily. Many large companies use Kafka for application logging, metrics collection, and powering real-time analytics. The current version is 0.8.2 and upcoming versions will include a new consumer, security features, and support for transactions.
Query Pulsar Streams using Apache FlinkStreamNative
This document discusses querying Apache Pulsar streams using Apache Flink. It provides an overview of Pulsar and how it can be used for pub/sub messaging and streaming applications. It then describes how Flink connectors allow querying Pulsar streams through Flink's streaming and table APIs, including support for Pulsar schemas, exactly-once processing, and integration with the Pulsar catalog. The integration provides a unified data processing stack of Pulsar for streaming data storage and Flink for querying and analysis.
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required different systems for each task -- stream processing engines, messaging queuing middleware, and pub/sub messaging systems. This has led to the unnecessary complexity for the development of such applications and operations leading to increased barrier to adoption in the enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and make it easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next generation distributed pub-sub system that was originally developed and deployed at Yahoo and running in production in more than 100+ companies. Karthik will explain how the architecture and design of Pulsar provides the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will provide real life use cases how Apache Pulsar is used for event processing ranging from data processing tasks to web processing applications.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Serverless Event Streaming with Pulsar FunctionsStreamNative
The last few years have seen the emergence of Serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves. At the same time, its ability to elastically scale has simplified operations significantly. Combined together with the ubiquity of their presence across all cloud providers, serverless today has become the leading choice to do event processing at scale for a lot of companies.
In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions where the events are processed as soon as they arrive in a streaming manner and that provides flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share the real world usage of Pulsar Functions.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Integrating Apache Pulsar with Big Data EcosystemStreamNative
In Apache Pulsar Beijing Meetup, Yijieshen gave a presentation of the current state of Apache Pulsar integrating with Big Data Ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink and Presto for unified data processing system.
I Heart Log: Real-time Data and Apache KafkaJay Kreps
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
This document provides an overview of Kafka and Spark Streaming. It discusses key concepts of Kafka like brokers, topics, producers and consumers. It also explains how Spark Streaming works using micro-batches and time window APIs. Finally, it discusses reliability aspects and references for further reading.
A Unified Platform for Real-time Storage and ProcessingStreamNative
In this presentation, Yijie Shen presents how to build a unified platform for real-time storage and processing using Apache Pulsar and Apache Spark. He demonstrates the solution using Apache Pulsar as the Stream Storage and Apache Spark for Processing, and deep-dives into the implementation details of the integration between Apache Pulsar and Apache Spark.
Pulsar Storage on BookKeeper _Seamless EvolutionStreamNative
Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer that does message processing and dispatching, from the storage layer that handles persistent message storage, using Apache Bookkeeper. This separation of concerns leads to a very efficient design, in terms of performance and cost.
Messaging systems that provide guaranteed delivery, when used in production use cases, impose on the underlying storage, demands that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. The strategy of I/O isolation provides better performance from each storage node at less cost, and the separation between computing and storage means that compute nodes can be scaled independently from storage. Irrespective of the choice of storage, Pulsar can be configured to get the best performance for any of those storage configurations.
This paper also discusses how some of the latest technologies like NVMe and Persistent Memory can be leveraged at a very low cost overhead, by Pulsar, without any architectural or design changes, with some data from real use cases. The fundamental choice of using Bookkeeper as the storage layer for Pulsar is validated from our experience.
This document discusses messaging queues and compares Kafka and Amazon SQS. It begins by explaining what a messaging queue is and provides examples of software that can be used, including Kafka, SQS, SNS, and RabbitMQ. It then discusses why messaging queues are useful by allowing for asynchronous and failed processing. The document proceeds to provide details on Kafka, including that it is a distributed streaming platform used by companies like LinkedIn, Twitter, and Netflix. It defines Kafka terminology and discusses how producers and consumers work. Finally, it compares features of SQS and Kafka like order of messages, delivery guarantees, retention, security, costs, and throughput.
Apache Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics which are committed to disks across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real-time and at large volumes with low latency and high throughput.
This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.
Designing Modern Streaming Data ApplicationsArun Kejariwal
Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.
In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare as well as their experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering their perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.
Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across data different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...HostedbyConfluent
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrumgaard | Current 2022
At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk.
Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!?!??
My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka.
https://dzone.com/articles/five-sensors-real-time-with-pulsar-and-python-on-a
https://www.datainmotion.dev/2022/04/flip-py-pi-enviroplus-using-apache.html
https://dzone.com/articles/pulsar-in-python-on-pi
Nozomi from Yahoo! Japan gave a presentation how Yahoo! Japan uses Apache Pulsar to build their internal messaging platform for processing tens of billions of messages every day. He explains why Yahoo! Japan choose Pulsar and what are the use cases of Apache Pulsar and their best practices.
#PulsarBeijingMeetup
Apache Pulsar, Supporting the Entire Lifecycle of Streaming DataStreamNative
We have long stressed that there is more and more a need for unified messaging and streaming and that Apache Pulsar is the platform that better supports this vision and makes it possible, at a large scale. In his talk, Matteo Merli will show how we can take this messaging & streaming unified paradigm one step further, to fully take advantage of their integration. The result is a drastically simplified architecture: a single system that is able to support the data throughout its entire lifecycle, from when the event is happening down to the historical archiving. The ramifications of this shift are big, as we can see Pulsar is in the perfect spot to enable tighter integration between online and offline worlds.
Elastic Data Processing with Apache Flink and Apache PulsarStreamNative
More and more applications are using Flink for low-latency data processing. Flink unifies batch and stream processing using one computation engine. However in reality, in order to really unify batch and stream processing, it requires a data system offers one unified data representation for both batch and streaming data. Nowadays, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystem and object stores. That means that data scientists still need write two different computing jobs to access same data stored in different data systems.
Apache Pulsar is the next generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from Apache Incubator and become a Top-Level-Project. Pulsar separates messaging serving and data storage into two layers. Such layered architecture provides high throughput and low-latency while ensuring high availability and scalability. Pulsar’s segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system, which can well fit into Flink’s computation model.
In this talk, Sijie Guo from Apache Pulsar PMC, will introduce Pulsar and its layered architecture and segment-centric storage, detailing how this architecture can well integrate with Flink to provide elastic unified batch and stream processing.
Both Apache Pulsar and Apache Flink share a similar view on how the data and the computation level of an application can be “streaming-first” with batch as a special case streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale, and build a real streaming warehouse.
In this talk, Sijie Guo from Apache Pulsar community will given an overview of Apache Pulsar and how it provides the unified data view to fully leverage Apache Flink unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
Building a Messaging Solutions for OVHcloud with Apache Pulsar_Pierre ZembStreamNative
OVHcloud is the biggest European cloud provider. From dedicated servers to Managed Kubernetes, from VMware® based Hosted Private Cloud to OpenStack-based Public Cloud, we have over 1.4 million customers worldwide.
Internally, we have been running Apache Kafka for years, and despite all the skills obtained operating multiples clusters with millions of messages per second, we decided to shift and build the foundation of our 'topic-as-a-service' product called ioStream on Apache Pulsar.
In this talk, you will have the insights of why we decided to use Apache Pulsar instead of Apache Kafka as the core of ioStream. We will tell you our journey to use Apache Pulsar, from our deployments to the management, what did work and what did not.
Apache Kafka is a distributed streaming platform. It provides a high-throughput distributed messaging system that can handle trillions of events daily. Many large companies use Kafka for application logging, metrics collection, and powering real-time analytics. The current version is 0.8.2 and upcoming versions will include a new consumer, security features, and support for transactions.
Query Pulsar Streams using Apache FlinkStreamNative
This document discusses querying Apache Pulsar streams using Apache Flink. It provides an overview of Pulsar and how it can be used for pub/sub messaging and streaming applications. It then describes how Flink connectors allow querying Pulsar streams through Flink's streaming and table APIs, including support for Pulsar schemas, exactly-once processing, and integration with the Pulsar catalog. The integration provides a unified data processing stack of Pulsar for streaming data storage and Flink for querying and analysis.
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required different systems for each task -- stream processing engines, messaging queuing middleware, and pub/sub messaging systems. This has led to the unnecessary complexity for the development of such applications and operations leading to increased barrier to adoption in the enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and make it easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next generation distributed pub-sub system that was originally developed and deployed at Yahoo and running in production in more than 100+ companies. Karthik will explain how the architecture and design of Pulsar provides the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will provide real life use cases how Apache Pulsar is used for event processing ranging from data processing tasks to web processing applications.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Serverless Event Streaming with Pulsar FunctionsStreamNative
The last few years have seen the emergence of Serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves. At the same time, its ability to elastically scale has simplified operations significantly. Combined together with the ubiquity of their presence across all cloud providers, serverless today has become the leading choice to do event processing at scale for a lot of companies.
In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions where the events are processed as soon as they arrive in a streaming manner and that provides flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share the real world usage of Pulsar Functions.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Integrating Apache Pulsar with Big Data EcosystemStreamNative
In Apache Pulsar Beijing Meetup, Yijieshen gave a presentation of the current state of Apache Pulsar integrating with Big Data Ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink and Presto for unified data processing system.
I Heart Log: Real-time Data and Apache KafkaJay Kreps
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
This document provides an overview of Kafka and Spark Streaming. It discusses key concepts of Kafka like brokers, topics, producers and consumers. It also explains how Spark Streaming works using micro-batches and time window APIs. Finally, it discusses reliability aspects and references for further reading.
A Unified Platform for Real-time Storage and ProcessingStreamNative
In this presentation, Yijie Shen presents how to build a unified platform for real-time storage and processing using Apache Pulsar and Apache Spark. He demonstrates the solution using Apache Pulsar as the Stream Storage and Apache Spark for Processing, and deep-dives into the implementation details of the integration between Apache Pulsar and Apache Spark.
Pulsar Storage on BookKeeper _Seamless EvolutionStreamNative
Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer that does message processing and dispatching, from the storage layer that handles persistent message storage, using Apache Bookkeeper. This separation of concerns leads to a very efficient design, in terms of performance and cost.
Messaging systems that provide guaranteed delivery, when used in production use cases, impose on the underlying storage, demands that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. The strategy of I/O isolation provides better performance from each storage node at less cost, and the separation between computing and storage means that compute nodes can be scaled independently from storage. Irrespective of the choice of storage, Pulsar can be configured to get the best performance for any of those storage configurations.
This paper also discusses how some of the latest technologies like NVMe and Persistent Memory can be leveraged at a very low cost overhead, by Pulsar, without any architectural or design changes, with some data from real use cases. The fundamental choice of using Bookkeeper as the storage layer for Pulsar is validated from our experience.
This document discusses messaging queues and compares Kafka and Amazon SQS. It begins by explaining what a messaging queue is and provides examples of software that can be used, including Kafka, SQS, SNS, and RabbitMQ. It then discusses why messaging queues are useful by allowing for asynchronous and failed processing. The document proceeds to provide details on Kafka, including that it is a distributed streaming platform used by companies like LinkedIn, Twitter, and Netflix. It defines Kafka terminology and discusses how producers and consumers work. Finally, it compares features of SQS and Kafka like order of messages, delivery guarantees, retention, security, costs, and throughput.
Apache Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics, in a fault-tolerant and scalable way. It is used for building real-time data pipelines and streaming apps. Producers write data to topics which are committed to disks across partitions and replicated for fault tolerance. Consumers read data from topics in a decoupled manner based on offsets. Kafka can process streaming data in real-time and at large volumes with low latency and high throughput.
This document discusses the evolution of Kafka clusters at AppsFlyer over time. The initial cluster had 4 brokers and handled hundreds of millions of messages with low partitioning and replication. A new cluster was designed with more brokers, replication across availability zones, and higher partitioning to support billions of messages. However, this led to issues like uneven leader distribution and failures. Various solutions were implemented like increasing brokers, splitting topics, and hardware upgrades. Ongoing testing and monitoring helped identify more problems and improvements around replication, partitioning, and automation. Key lessons learned included balancing replication and leaders, supporting dynamic changes, and thorough testing of failure scenarios.
Designing Modern Streaming Data ApplicationsArun Kejariwal
Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.
In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare as well as their experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering their perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.
Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across data different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrum...HostedbyConfluent
Let’s Monitor Conditions at the Conference With Timothy Spann & David Kjerrumgaard | Current 2022
At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk.
Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!?!??
My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka.
https://dzone.com/articles/five-sensors-real-time-with-pulsar-and-python-on-a
https://www.datainmotion.dev/2022/04/flip-py-pi-enviroplus-using-apache.html
https://dzone.com/articles/pulsar-in-python-on-pi
(Current22) Let's Monitor The Conditions at the ConferenceTimothy Spann
(Current22) Let's Monitor The Conditions at the Conference
Let's Monitor The Conditions at the Conference
Session Time11:15 am - 12:00 pm Session DateWednesday, 5 October 2022 Session Type:In-Person Location:Ballroom G
Session Description:
At home, I monitor the temperature, humidity, gas levels, ozone, air quality, and other features around my desk. Let's bring this to the different spots around the conference including lunch tables, vendor booths, hotel rooms, and more. I need to know about these readings now, not when I get back home from the conference. We need to get these sensor readings immediately in case we need to turn on a fan or move to another area. We will also see if my talk produces a lot of hot air!? My setup is pretty simple, a raspberry pi, a breakout garden sensor mount, and as many sensors as I am willing to fly to Austin. The software stack is Python and Java, Apache Pulsar, MQTT, HTML, JQuery, and Apache Kafka.
Timothy Spann, StreamNative
Developer Advocate
Tim Spann is a Developer Advocate @ StreamNative where he works with Apache Pulsar, Apache Flink, Apache NiFi, Apache MXNet, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC.
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Nitin S
Scaling search platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem. BloomReach has implemented Sc2, which is an elastic Solr infrastructure for Big Data applications, supporting heterogeneous workloads and hosted in the cloud. It dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing, latency guarantees, and application-specific performance tuning. In addition, it provides various high availability features such as differential real-time streaming, disaster recovery, context aware replication, and automatic shard and replica rebalancing, all with a zero downtime guarantee for all consumers. This infrastructure currently serves hundreds of millions of documents in millisecond response times with a load ranging in the order of 200-300K QPS.
This presentation will describe an innovate implementation of scaling Solr in an elastic fashion. It will review the architecture and take a deep dive into how each of these components interact to make the infrastructure truly elastic, real time, and robust while serving latency needs.
The document discusses Solr Compute Cloud (SC2), an elastic Solr infrastructure developed by BloomReach to address challenges of scaling search platforms for big data applications. SC2 dynamically provisions Solr clusters in the cloud for pipelines and indexing jobs, providing isolation. It ensures latency guarantees, dynamic scaling, high availability and disaster recovery. SC2 addresses issues BloomReach faced with a shared cluster approach like throughput limitations, stability problems and indexing challenges.
This document discusses Typesafe's Reactive Platform and Apache Spark. It describes Typesafe's Fast Data strategy of using a microservices architecture with Spark, Kafka, HDFS and databases. It outlines contributions Typesafe has made to Spark, including backpressure support, dynamic resource allocation in Mesos, and integration tests. The document also discusses Typesafe's customer support and roadmap, including plans to introduce Kerberos security and evaluate Tachyon.
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Lucidworks
Solr Compute Cloud (SC2) is an elastic Solr infrastructure that allows for dynamic provisioning of Solr clusters on demand. This allows each search pipeline or job to have its own isolated cluster, improving stability, throughput, and cost optimization. The key benefits of SC2 are pipeline isolation, dynamic scaling, production cluster safeguards, and built-in high availability and disaster recovery features through technologies like the Solr HAFT service.
Princeton Dec 2022 Meetup_ StreamNative and Cloudera StreamingTimothy Spann
This document provides an overview of streaming data platforms for cloud-native event-driven applications, focusing on Apache Pulsar, Apache NiFi, and Apache Flink. It includes an agenda for a presentation on these topics, background on the presenter Tim Spann, and details about each technology and how they can work together.
BigDataSpain 2016: Introduction to Apache ApexThomas Weise
Apache Apex is an open source stream processing platform, built for large scale, high-throughput, low-latency, high availability and operability. With a unified architecture it can be used for real-time and batch processing. Apex is Java based and runs natively on Apache Hadoop YARN and HDFS.
We will discuss the key features of Apache Apex and architectural differences from similar platforms and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, low latency SLA, high throughput and large scale ingestion.
Apex APIs and libraries of operators and examples focus on developer productivity. We will present the programming model with examples and how custom business logic can be easily integrated based on the Apex operator API.
We will cover integration with connectors to sources/destinations (including Kafka, JMS, SQL, NoSQL, files etc.), scalability with advanced partitioning, fault tolerance and processing guarantees, computation and scheduling model, state management, windowing and dynamic changes. Attendees will also learn how these features affect time to market and total cost of ownership and how they are important in existing Apex production deployments.
https://www.bigdataspain.org/
This document provides an overview of Kafka and event-driven architecture. It discusses traditional SOA approaches, event-driven architecture with microservices, and using Kafka as an event streaming platform. Key concepts of Kafka are explained, including topics, partitions, producers, brokers, consumers, and how messages are published and consumed in Kafka.
Streaming in Practice - Putting Apache Kafka in Productionconfluent
This presentation focuses on how to integrate all these components into an enterprise environment and what things you need to consider as you move into production.
We will touch on the following topics:
- Patterns for integrating with existing data systems and applications
- Metadata management at enterprise scale
- Tradeoffs in performance, cost, availability and fault tolerance
- Choosing which cross-datacenter replication patterns fit with your application
- Considerations for operating Kafka-based data pipelines in production
BloomReach developed an elastic Solr infrastructure called Solr Compute Cloud (SC2) to address the challenges of scaling their search platform. SC2 allows search pipelines and indexing jobs to dynamically provision isolated Solr clusters from an API to run in, improving throughput, stability and availability. It utilizes a Solr HAFT service to replicate data between clusters and provide disaster recovery by cloning clusters. This elastic approach isolates workloads, allows individual scaling and prevents performance issues caused by shared clusters.
Pulsar - flexible pub-sub for internet scaleMatteo Merli
Pub-Sub messaging is a very convenient abstraction that allows system and application developers to decouple components and let them communicate, by acting as durable buffer for transient data, or as a persistent log from where to recover after crashes. This talk will present an overview of Apache Pulsar, the reasons that led to its development and how it enabled many teams at Yahoo and to build scalable and reliable applications. Apache Pulsar has become the defacto pub-sub messaging at Yahoo serving 100+ applications and processing 100’s of billions of messages for over 3+ years.
In this talk, we will explore in detail different categories of use cases that highlight how Pulsar can be applied to solve a broad range of problems thanks to its flexible messaging model that supports both queuing and streaming semantics with a focus on durability and transaction guarantees.
Intro to Apache Kafka I gave at the Big Data Meetup in Geneva in June 2016. Covers the basics and gets into some more advanced topics. Includes demo and source code to write clients and unit tests in Java (GitHub repo on the last slides).
Intro to Apache Apex (next gen Hadoop) & comparison to Spark StreamingApache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex features with Spark Streaming. We will discuss how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
This document discusses various Azure Platform Services including storage, caching, relaying, queuing, and topics. Storage in Azure provides blobs, drives, tables and queues for structured storage needs. Caching services improve application performance. Service Bus provides relaying for connectivity between applications and queuing/topics for messaging with publish/subscribe capabilities. Platform as a Service (PaaS) allows building and hosting applications on Azure's scalable infrastructure.
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
This talk will be a deep dive into ingesting unbounded file data and streaming data from Kafka into Hadoop. We will also cover data enrichment and dimensional compute. Customer use-case and reference architecture.
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
With a current zoo of technologies and different ways of their interaction it's a big challenge to architect a system (or adopt existed one) that will conform to low-latency BigData analysis requirements. Apache Kafka and Kappa Architecture in particular take more and more attention over classic Hadoop-centric technologies stack. New Consumer API put significant boost in this direction. Microservices-based streaming processing and new Kafka Streams tend to be a synergy in BigData world.
The document discusses optimizing an Apache Pulsar deployment to handle 10 PB of data per day for a large customer. It estimates the initial cluster size needed using different storage options in Google Cloud Platform. It then describes four optimizations made - eliminating the journal, using direct I/O, compression, and improving the C++ client - and recalculates the cluster size after each optimization. The optimized deployment uses 200 VMs each with 24 local SSDs to meet the requirements.
The engineering teams within Splunk have been using several technologies Kinesis, SQS, RabbitMQ and Apache Kafka for enterprise wide messaging for the past few years but have recently made the decision to pivot toward Apache Pulsar, migrating both existing use cases and embedding it into new cloud-native service offerings such as the Splunk Data Stream Processor (DSP).
Creating Data Fabric for #IOT with Apache PulsarKarthik Ramasamy
In this presentation, we go into details about how a distributed data fabric using Apache Pulsar can be created in an IoT environment - acquire, filter, move and distribute from myriad of devices to cloud.
This document discusses exactly once and effectively once semantics in Heron stream processing. It explains that exactly once is not truly possible and effectively once uses an optimistic approach with distributed snapshots and markers to provide exactly once semantics. The document compares optimistic and pessimistic approaches, and describes how Heron implements the optimistic approach using state managers and a checkpoint manager with pluggable state stores. It also discusses challenges with bolt state, sink state, and ensuring idempotent or transactional writes.
Tutorial - Modern Real Time Streaming ArchitecturesKarthik Ramasamy
Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights, and there has been a proliferation of messaging and streaming frameworks that enterprises utilize to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeperKarthik Ramasamy
Enterprises are increasingly building applications to take advantage of real-time workloads, and Kubernetes is fast becoming the de-facto scheduler and orchestrating system for distributed systems. In this session, Karthik Ramasamy will show how easy it is to deploy Apache Pulsar for queuing and streaming and Twitter Heron for stream processing using Kubernetes and show how end-to-end data pipelines are built. Pulsar and Heron are at the core of Streamlio's end-to-end real-time platform, which is ideally suited to enterprises building next generation real time pipelines
Data pipelines are hard to build and maintain. This is due to complexity of big data open source ecosystem that has numerous software each specializing in solving one piece of the puzzle. In this talk, we will focus on three key open source software Apache Pulsar, Apache Heron and Apache BookKeeper and how are integrated to make it easy to build data pipelines.
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. In order
to meet this challenge, Twitter designed an end to end real-time stack consisting of DistributedLog, the distributed and replicated messaging system system, and Heron, the streaming system for real time computation. DistributedLog is a replicated log service that is built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter’s publish-subscribe system. Twitter Heron is the next generation streaming system built from ground up to address our scalability and reliability needs. Both the systems have been in production for nearly two years and is widely used at Twitter in a range of diverse applications such as search ingestion pipeline, ad analytics, image classification and more. These slides will describe Heron and DistributedLog in detail, covering a few use cases in-depth and sharing the operating experiences and challenges of running large-scale real time systems at scale.
This paper describes the use of Storm at Twitter. Storm is a realtime fault-tolerant and distributed stream data processing system.Storm is currently being used to run various critical computations in Twitter at scale, and in real-time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This paper also describes how queries (aka.topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results
from an empirical evaluation demonstrating the resilience of
Storm in dealing with machine failures. Storm is under active
development at Twitter and we also present some potential
directions for future work.
Storm is a realtime fault-tolerant and distributed stream data processing system. Storm is used to run various critical computations in Twitter at scale, and in real-time. This talk was presented at the SIGMOD 2014 conference that accepted our paper Storm@Twitter. It describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This presentation also describes how queries (aka. topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
4. EVENT/STREAM DATA PROCESSING
4
✦ Events are analyzed and processed as they arrive
✦ Decisions are timely, contextual and based on fresh data
✦ Decision latency is eliminated
✦ Data in motion
Ingest/
Buffer
Analyze Act
15. APACHE PULSAR
15
Bookie Bookie Bookie
Broker Broker Broker
Producer Consumer
SERVING
Brokers can be added independently
Traffic can be shifted quickly across brokers
STORAGE
Bookies can be added independently
New bookies will ramp up traffic quickly
16. APACHE PULSAR - BROKER
16
✦ Broker is the only point of interaction for clients (producers and consumers)
✦ Brokers acquire ownership of group of topics and “serve” them
✦ Broker has no durable state
✦ Provides service discovery mechanism for client to connect to right broker
25. Multi-tiered storage and serving
25
Partition
Broker Broker Broker
. . . . . . . . . . . .
Processing
(brokers)
Warm
Storage
Cold
Storage
Tailing reads: served from
in-memory cache
Catch-up reads: served
from persistent storage
layer
Historical reads: served
from cold storage
26. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
26
Legacy Architectures
# Storage co-resident with processing
# Partition-centric
# Cumbersome to scale--data
redistribution, performance impact
Logical
View
Apache Pulsar
# Storage decoupled from processing
# Partitions stored as segments
# Flexible, easy scalability
Partition
Processing
& Storage
Segment 1 Segment 3Segment 2 Segment n
Partition
Broker
Partition
(primary)
Broker
Partition
(copy)
Broker
Partition
(copy)
Broker Broker Broker
Segment 1
Segment 2
Segment n
. . .
Segment 2
Segment 3
Segment n
. . .
Segment 3
Segment 1
Segment n
. . .
Segment 1
Segment 2
Segment n
. . .
Processing
(brokers)
Storage
27. PARTITIONS VS SEGMENTS - WHY SHOULD YOU CARE?
27
✦ In Kafka, partitions are assigned to brokers “permanently”
✦ A single partition is stored entirely in a single node
✦ Retention is limited by a single node storage capacity
✦ Failure recovery and capacity expansion require expensive “rebalancing”
✦ Rebalancing has a big impact over the system, affecting regular traffic
32. • Two independent clusters,
primary and standby
• Configured tenants and
namespaces replicate to standby
• Data published to primary is
asynchronously replicated to
standby
• Producers and consumers
restarted in second datacenter
upon primary failure
Asynchronous replication example
32
Producers
(active)
Datacenter 1
Consumers
(active)
Pulsar Cluster
(primary)
Datacenter 2
Producers
(standby)
Consumers
(standby)
Pulsar Cluster
(standby)
Pulsar
replication
ZooKeeper ZooKeeper
33. ZooKeeper
• Each topic owned by one
broker at a time, i.e. in one
datacenter
• ZooKeeper cluster spread
across multiple locations
• Broker commits writes to
bookies in both datacenters
• In event of datacenter failure,
broker in surviving datacenter
assumes ownership of topic
Synchronous replication example
33
Producers
Datacenter 1
Consumers
Pulsar Cluster
Datacenter 2
Producers
Consumers
37. PULSAR PRODUCER
37
PulsarClient client = PulsarClient.create(
“http://broker.usw.example.com:8080”);
Producer producer = client.createProducer(
“persistent://my-property/us-west/my-namespace/my-topic”);
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
// Message was persisted
});
38. PULSAR CONSUMER
38
PulsarClient client = PulsarClient.create(
"http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
"persistent://my-property/us-west/my-namespace/my-topic",
"my-subscription-name");
while (true) {
// Wait for a message
Message msg = consumer.receive();
System.out.println("Received message: " + msg.getData());
// Acknowledge the message so that it can be deleted by broker
consumer.acknowledge(msg);
}
39. SCHEMA REGISTRY
39
✦ Provides type safety to applications built on top of Pulsar
✦ Two approaches
✦ Client side - type safety enforcement up to the application
✦ Server side - system enforces type safety and ensures that producers and consumers remain synced
✦ Schema registry enables clients to upload data schemas on a topic basis.
✦ Schemas dictate which data types are recognized as valid for that topic
40. PULSAR SCHEMAS - HOW DO THEY WORK?
40
✦ Enforced at the topic level
✦ Pulsar schemas consists of
✦ Name - Name refers to the topic to which the schema is applied
✦ Payload - Binary representation of the schema
✦ Schema type - JSON, Protobuf and Avro
✦ User defined properties - Map of strings to strings (application specific - e.g git hash of the schema)
41. SCHEMA VERSIONING
41
PulsarClient client = PulsarClient.builder()
.serviceUrl(“http://broker.usw.example.com:6650")
.build()
Producer<SensorReading> producer = client.newProducer(JSONSchema.of(SensorReading.class))
.topic(“sensor-data”)
.sendTimeout(3, TimeUnit.SECONDS)
.create()
Scenario What happens
No schema exists for the topic Producer is created using the given schema
Schema already exists; producer
connects using the same schema
that’s already stored
Schema is transmitted to the broker, determines that it is already stored
Schema already exists; producer
connects using a new schema that is
compatible
Schema is transmitted, compatibility determined and stored as new schema
43. HOW TO PROCESS DATA MODELED AS STREAMS
43
✦ Consume data as it is produced (pub/sub)
✦ Light weight compute - transform and react to data as it arrives
✦ Heavy weight compute - continuous data processing
✦ Interactive query of stored streams
49. TRADITIONAL REAL TIME SYSTEMS
49
DEVELOPER EXPERIENCE
✦ Powerful API but complicated
✦ Does everyone really need to learn functional programming?
✦ Configurable and scalable but management overhead
✦ Edge systems have resource and management constraints
50. TRADITIONAL REAL TIME SYSTEMS
50
OPERATIONAL EXPERIENCE
✦ Multiple systems to operate
✦ IoT deployments routinely have thousands of edge systems
✦ Semantic differences
✦ Mismatch and duplication between systems
✦ Creates developer and operator friction
51. LESSONS LEARNT - USE CASES
51
✦ Data transformations
✦ Data classification
✦ Data enrichment
✦ Data routing
✦ Data extraction and loading
✦ Real time aggregation
✦ Microservices
Significant set of processing tasks are exceedingly simple
52. EMERGENCE OF CLOUD - SERVERLESS
52
✦ Simple function API
✦ Functions are submitted to the system
✦ Runs per events
✦ Composition APIs to do complex things
✦ Wildly popular
53. SERVERLESS VS STREAMING
53
✦ Both are event driven architectures
✦ Both can be used for analytics and data serving
✦ Both have composition APIs
๏ Configuration based for serverless
๏ DSL based for streaming
✦ Serverless typically does not guarantee ordering
✦ Serverless is pay per action
54. STREAM NATIVE COMPUTE USING FUNCTIONS
54
✦ Simplest possible API -function or a procedure
✦ Support for multi language
✦ Use of native API for each language
✦ Scale developers
✦ Use of message bus native concepts - input and output as topics
✦ Flexible runtime - simple standalone applications vs managed system applications
APPLYING INSIGHT GAINED FROM SERVERLESS
55. PULSAR FUNCTIONS
55
SDK LESS API
import java.util.function.Function;
public class ExclamationFunction implements Function<String, String> {
@Override
public String apply(String input) {
return input + "!";
}
}
56. PULSAR FUNCTIONS
56
SDK API
import org.apache.pulsar.functions.api.PulsarFunction;
import org.apache.pulsar.functions.api.Context;
public class ExclamationFunction implements PulsarFunction<String, String> {
@Override
public String process(String input, Context context) {
return input + "!";
}
}
57. PULSAR FUNCTIONS
57
✦ Function executed for every message of input topic
✦ Support for multiple topics as inputs
✦ Function output goes into output topic - can be void topic as well
✦ SerDe takes care of serialization/deserialization of messages
๏ Custom SerDe can be provided by the users
๏ Integration with schema registry
58. PROCESSING GUARANTEES
58
✦ ATMOST_ONCE
๏ Message acked to Pulsar as soon as we receive it
✦ ATLEAST_ONCE
๏ Message acked to Pulsar after the function completes
๏ Default behavior - don’t want people to loose data
✦ EFFECTIVELY_ONCE
๏ Uses Pulsar’s inbuilt effectively once semantics
✦ Controlled at runtime by user
59. DEPLOYING FUNCTIONS - BROKER
59
Broker 1
Worker
Function
wordcount-1
Function
transform-2
Broker 1
Worker
Function
transform-1
Function
dataroute-1
Broker 1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
60. DEPLOYING FUNCTIONS - WORKER NODES
60
Worker
Function
wordcount-1
Function
transform-2
Worker
Function
transform-1
Function
dataroute-1
Worker
Function
wordcount-2
Function
transform-3
Node 1 Node 2 Node 3
Broker 1 Broker 2 Broker 3
Node 4 Node 5 Node 6
61. DEPLOYING FUNCTIONS - KUBERNETES
61
Function
wordcount-1
Function
transform-1
Function
transform-3
Pod 1 Pod 2 Pod 3
Broker 1 Broker 2 Broker 3
Pod 7 Pod 8 Pod 9
Function
dataroute-1
Function
wordcount-2
Function
transform-2
Pod 4 Pod 5 Pod 6
62. BUILT-IN STATE MANAGEMENT IN FUNCTIONS
62
✦ Functions can store state in inbuilt storage
๏ Framework provides a simple library to store and retrieve state
✦ Support server side operations like counters
✦ Simplified application development
๏ No need to standup an extra system
63. DISTRIBUTED STATE IN FUNCTIONS
63
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;
public class CounterFunction implements PulsarFunction<String, Void> {
@Override
public Void process(String input, Context context) throws Exception {
for (String word : input.split(".")) {
context.incrCounter(word, 1);
}
return null;
}
}
64. PULSAR - DATA IN AND OUT
64
✦ Users can write custom code using Pulsar producer and consumer API
✦ Challenges
๏ Where should the application to publish data or consume data from Pulsar?
๏ How should the application to publish data or consume data from Pulsar?
✦ Current systems have no organized and fault tolerant way to run applications that ingress and egress
data from and to external systems
65. PULSAR IO TO THE RESCUE
65
Apache Pulsar ClusterSource Sink
70. APACHE PULSAR VS. APACHE KAFKA
70
Mul$-tenancy
A single cluster can support many
tenants and use cases
Seamless Cluster Expansion
Expand the cluster without any
down $me
High throughput & Low Latency
Can reach 1.8 M messages/s in a
single par$$on and publish latency
of 5ms at 99pct
Durability
Data replicated and synced to disk
Geo-replica$on
Out of box support for geographically
distributed applica$ons
Unified messaging model
Support both Topic & Queue
seman$c in a single model
Tiered Storage
Hot/warm data for real $me access and
cold event data in cheaper storage
Pulsar Func$ons
Flexible light weight compute
Highly scalable
Can support millions of topics, makes data
modeling easier
71. APACHE PULSAR IN PRODUCTION @SCALE
71
4+ years
Serves 2.3 million topics
700 billion messages/day
500+ bookie nodes
200+ broker nodes
Average latency < 5 ms
99.9% 15 ms (strong durability guarantees)
Zero data loss
150+ applica6ons
Self served provisioning
Full-mesh cross-datacenter replica6on - 8+ data centers
73. 73
✓ Understanding How Pulsar Works
https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-
works
✓ How To (Not) Lose Messages on Apache Pulsar Cluster
https://jack-vanlightly.com/blog/2018/10/21/how-to-not-lose-messages-on-an-
apache-pulsar-cluster
MORE READINGS
74. MORE READINGS
74
✓ Unified queuing and streaming
https://streaml.io/blog/pulsar-streaming-queuing
✓ Segment centric storage
https://streaml.io/blog/pulsar-segment-based-architecture
✓ Messaging, Storage or Both
https://streaml.io/blog/messaging-storage-or-both
✓ Access patterns and tiered storage
https://streaml.io/blog/access-patterns-and-tiered-storage-in-apache-pulsar
✓ Tiered Storage in Apache Pulsar
https://streaml.io/blog/tiered-storage-in-apache-pulsar