1) Yahoo Japan uses Apache Pulsar as a centralized messaging platform connecting various internal services.
2) Pulsar is now being used to build a large scale log pipeline where computing platforms publish logs/metrics and monitoring platforms consume them.
3) This architecture leverages Pulsar to decouple producers and consumers, enabling scalability and resiliency across the platform.
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ... (StreamNative)
We will introduce HerdDB, a distributed database written in Java.
We will see how a distributed database can be built using Apache BookKeeper as a write-ahead commit log.
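The write-ahead-log idea the talk builds on can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the HerdDB or BookKeeper API: every mutation is appended to a durable log before it is applied to the table state, so a restarted node can rebuild identical state by replaying the log.

```python
# Conceptual sketch of a WAL-backed table (NOT the BookKeeper API).
# The "log" list stands in for a replicated BookKeeper ledger.

class WalTable:
    def __init__(self, log=None):
        self.log = log if log is not None else []
        self.state = {}
        for op, key, value in self.log:   # recovery: replay the log in order
            self._apply(op, key, value)

    def _apply(self, op, key, value):
        if op == "put":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)

    def put(self, key, value):
        self.log.append(("put", key, value))   # 1. durable append first
        self._apply("put", key, value)         # 2. then apply to state

    def delete(self, key):
        self.log.append(("delete", key, None))
        self._apply("delete", key, None)


table = WalTable()
table.put("k1", "v1")
table.put("k2", "v2")
table.delete("k1")

# A "restarted" node rebuilds the same state from the log alone.
recovered = WalTable(log=list(table.log))
```

Because state is derived purely from the log, replication of the log (what BookKeeper provides) is enough to replicate the database.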
Interactive querying of streams using Apache Pulsar (Jerry Peng, StreamNative)
As applications become more reliant on real-time data, streaming/messaging platforms have become increasingly popular and crucial to any data pipeline. Currently, many streaming/messaging platforms are used only to access the most recent events in a stream; however, there is tremendous value to be unlocked if the full history of streams can be queried interactively. Pulsar SQL is a query layer built on top of Apache Pulsar, a next-generation messaging platform, that enables users to dynamically query all streams, old and new, stored inside Pulsar. Users can thus unlock insights from both new and historical streams of data in a single system. Pulsar SQL leverages Presto and Apache Pulsar's unique architecture to execute queries in a highly scalable fashion, regardless of the number of topic partitions that make up the streams. In this talk, we will examine the use cases and advantages of interactively querying events within a streaming/messaging platform, and how Pulsar enables users to do so in a user-friendly and efficient manner.
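To make the idea concrete, a Pulsar SQL session exposes topics as tables under a `pulsar` catalog, keyed by tenant/namespace. The topic and column names below are hypothetical; only `__publish_time__` (an internal column Pulsar SQL attaches to every row) comes from the actual system:

```sql
-- List the topics in a namespace as tables
SHOW TABLES IN pulsar."public/default";

-- Query old and new events alike, filtered by publish time
SELECT user_id, event_type, __publish_time__
FROM pulsar."public/default"."clickstream"
WHERE __publish_time__ > timestamp '2020-01-01 00:00:00'
LIMIT 10;
```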
Introducing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pu... (StreamNative)
Kafka-on-Pulsar has been one of the most anticipated features in the Pulsar ecosystem. The Kafka-on-Pulsar project was initiated by StreamNative and the OVHCloud team quickly joined the project to collaborate on its development. Kafka-on-Pulsar enables Kafka applications to leverage Pulsar’s powerful features, such as streamlined operations with enterprise-grade multi-tenancy, without modifying code.
In this webinar, Sijie Guo from StreamNative and Pierre Zemb from OVHCloud will introduce KoP and discuss the following:
1. What are the key benefits?
2. What is the protocol handler and how does it work?
3. How is KoP implemented?
4. What are the new use cases it unlocks?
5. Watch a Live Demo!
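The protocol handler in point 2 plugs into the broker purely through configuration. A sketch of what enabling KoP in `broker.conf` looks like (setting names and paths vary by version; treat these as illustrative and consult the KoP documentation for the exact keys):

```
# Load the Kafka protocol handler plugin
messagingProtocols=kafka
protocolHandlerDirectory=./protocols
# Address the broker advertises to Kafka clients
kafkaListeners=PLAINTEXT://127.0.0.1:9092
```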
Pulsar Storage on BookKeeper: Seamless Evolution (StreamNative)
Apache Pulsar has a distinct architecture from other messaging systems. There is a clear separation of the compute layer, which does message processing and dispatching, from the storage layer, which handles persistent message storage using Apache BookKeeper. This separation of concerns leads to a very efficient design, in terms of both performance and cost.
Messaging systems that provide guaranteed delivery, when used in production, place demands on the underlying storage that are very different from simple benchmark scenarios that test write throughput. Pulsar, with both I/O isolation and separation of concerns, performs better than other messaging systems in production use cases. I/O isolation extracts better performance from each storage node at lower cost, and the separation between compute and storage means that compute nodes can be scaled independently of storage nodes. Whatever the choice of storage, Pulsar can be configured to get the best performance out of that configuration.
This paper also discusses how some of the latest technologies, such as NVMe and persistent memory, can be leveraged by Pulsar at very low cost overhead, without any architectural or design changes, with data from real use cases. Our experience validates the fundamental choice of BookKeeper as the storage layer for Pulsar.
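The I/O isolation described above comes from BookKeeper's split between its write-ahead journal and its long-term ledger storage, which can be placed on separate devices so fsync-heavy journal appends never contend with backlog reads. A sketch of the relevant bookie settings in `bk_server.conf` (paths are examples only; check the BookKeeper configuration reference for your version):

```
# Journal on a fast, fsync-friendly device (e.g. NVMe)
journalDirectories=/mnt/nvme/bk-journal
# Ledger storage striped across cheaper, larger disks
ledgerDirectories=/mnt/hdd1/bk-ledgers,/mnt/hdd2/bk-ledgers
```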
Nozomi from Yahoo! Japan gave a presentation on how Yahoo! Japan uses Apache Pulsar to build its internal messaging platform, which processes tens of billions of messages every day. He explains why Yahoo! Japan chose Pulsar, the use cases it serves, and their best practices.
#PulsarBeijingMeetup
How Orange Financial combat financial frauds over 50M transactions a day usin... (JinfengHuang3)
You will learn how Orange Financial combats financial fraud across over 50M transactions a day using Apache Pulsar. The presentation was shared at the Strata Data Conference in New York, US, in September 2019.
Kafka on Pulsar: bringing native Kafka protocol support to Pulsar (Sijie & Pierre, StreamNative)
Apache Pulsar is a distributed, open-source pub-sub messaging system. It offers many advantages over Kafka, such as multi-tenancy, geo-replication, decoupled storage, and even SQL and FaaS directly integrated. The only thing missing for wide adoption is support for the de-facto standard for streaming: Kafka. And this is how our story begins.
In this talk, Sijie Guo from StreamNative and Pierre Zemb from OVHcloud will share the journey of building Kafka-on-Pulsar (KoP) to bring native Kafka protocol support to Pulsar. Before joining forces on KoP, OVHcloud implemented a Kafka proxy in Rust capable of translating the Kafka protocol to the Pulsar protocol on the fly, and encountered some challenges. After realizing that StreamNative was working on bringing the Kafka protocol natively to the Pulsar broker via a pluggable protocol handler mechanism, OVHcloud joined forces with StreamNative to bring Kafka protocol support to Pulsar brokers.
At the end of this talk, you will know more about the inner workings of Kafka and Pulsar. You'll also get feedback from both companies on their initial proofs of concept and the current implementation.
Scaling customer engagement with Apache Pulsar (StreamNative)
Iterable's platform is used by marketers to reach hundreds of millions of users every day, and those numbers are growing quickly. Iterable's infrastructure is built with pub-sub messaging at its core, so the reliability, scalability, and flexibility of that system are business critical.
In this talk we'll discuss why Iterable chose Pulsar as a pub-sub messaging system, as well as how Iterable is taking advantage of some of the more recently added features in Pulsar. We'll also talk about some of the challenges we encountered, where we think Pulsar can improve, and some contributions we've made to the open source community around Pulsar.
Lessons from managing a Pulsar cluster (Nutanix) (StreamNative)
In this presentation, we will cover:
- How to performance test and optimize a Pulsar cluster. We will present how we load tested Pulsar with Locust and, following this, how we tuned our configurations for our use cases.
- Event sourcing patterns with Apache Pulsar: the Avro schema usage, compatibility choices, and schema evolution on Pulsar topics that worked for us.
- Bonus: how we source Apache Flink from Apache Pulsar and run our workflows.
By attending this webinar, you can expect to come away with:
- How to performance test a Pulsar cluster for your use case.
- How to leverage the highly configurable broker and BookKeeper to suit your needs.
- Event sourcing patterns on top of Apache Pulsar.
- Avro schema usage, compatibility choices, and evolution.
- Familiarity with the Pulsar connector for Flink and possible use cases.
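The schema-compatibility choices mentioned above can be illustrated in plain Python (this is a schematic sketch, not the Avro library or Pulsar's schema registry): adding a new field *with a default* keeps the change backward compatible, because a reader holding the new schema can still decode records written under the old one by filling in the default.

```python
# Schematic model of Avro-style backward-compatible schema evolution.
# Schemas are simplified to a list of field descriptors.

OLD_SCHEMA = {"fields": [{"name": "user_id"},
                         {"name": "amount"}]}

NEW_SCHEMA = {"fields": [{"name": "user_id"},
                         {"name": "amount"},
                         {"name": "currency", "default": "USD"}]}  # added field

def read_with_schema(record, schema):
    """Resolve a record against a reader schema, filling in defaults."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError("missing field %r and no default" % field["name"])
    return out

old_record = {"user_id": 42, "amount": 9.99}        # written under OLD_SCHEMA
decoded = read_with_schema(old_record, NEW_SCHEMA)  # read under NEW_SCHEMA
```

Dropping a field or adding one without a default would make `read_with_schema` raise for old records, which is exactly the kind of change a BACKWARD compatibility rule on a topic is there to reject.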
At an Apache Pulsar meetup, Jia Zhai from StreamNative presented KoP (Kafka-on-Pulsar), which brings native Kafka protocol support to the Pulsar broker. He gave a demo showing how Kafka clients and Pulsar clients can work seamlessly on the same data, and how Kafka Connectors can work on a Pulsar cluster.
Query Pulsar Streams using Apache Flink (StreamNative)
Both Apache Pulsar and Apache Flink share a similar view on how the data and computation levels of an application can be "streaming-first", with batch as a special case of streaming. With Apache Pulsar's segmented-stream storage and Apache Flink's steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community will share the latest integrations between Apache Pulsar and Apache Flink. He will explain how Apache Flink can integrate with and leverage Pulsar's built-in, efficient schemas to allow Flink SQL users to query Pulsar streams in real time.
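As a hypothetical illustration of what "Flink SQL over Pulsar" looks like in practice, a Pulsar topic can be declared as a Flink table and queried directly. Connector option names differ across connector versions, and the topic and columns here are invented, so treat this purely as a sketch:

```sql
CREATE TABLE clicks (
  user_id BIGINT,
  url STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector'   = 'pulsar',
  'service-url' = 'pulsar://localhost:6650',
  'topics'      = 'persistent://public/default/clicks',
  'format'      = 'avro'
);

SELECT url, COUNT(*) AS hits
FROM clicks
GROUP BY url;
```

Because the table schema can be derived from the schema Pulsar already stores with the topic, the declaration stays thin and the query runs continuously over the stream.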
Serverless Event Streaming with Pulsar Functions (StreamNative)
The last few years have seen the emergence of serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves, while its ability to scale elastically has simplified operations significantly. Combined with its ubiquity across all cloud providers, serverless has become the leading choice for event processing at scale for many companies.
In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions, where events are processed as soon as they arrive, in a streaming manner, with flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share real-world usage of Pulsar Functions.
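The "very simple programming model" is essentially one function per event. In Pulsar's Python runtime, a plain function that maps an input event to an output event is enough; the runtime wires the topics around it. A minimal sketch (the deployment command and topic names are placeholders, and the exact `pulsar-admin` flags vary by version):

```python
# exclaim.py: a minimal Pulsar Function in "native" Python style.
# The runtime calls process() once per event on the input topic and
# publishes the return value to the output topic. Deployed with
# something along the lines of:
#   pulsar-admin functions create --py exclaim.py \
#       --inputs in-topic --output out-topic ...

def process(input):
    """Append an exclamation mark to each incoming string event."""
    return input + "!"
```

Because the function holds no connection or threading logic of its own, the runtime is free to run it as a thread, a process, or a container, which is exactly the deployment flexibility described above.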
Pulsar is a great technology, but it is also a new, less well-known technology competing against incumbent technologies, which is always a bit of a tough sell.
In this talk, we will go over the whole end-to-end process of how we researched, advocated, built, integrated, and established Apache Pulsar at Instructure in less than a year. We will share details of how Pulsar's capabilities differentiate it, how we deploy Pulsar, and how we focused on an ecosystem of tools to accelerate adoption. We will also discuss one major motivating use case: change-data-capture for hundreds of database servers at scale.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_... (StreamNative)
Nowadays, real-time computation is heavily used in cases such as online product recommendation and online payment fraud detection. In the streaming pipeline, Kafka is normally used to store a day or a week of data, but not years of data for looking at trends historically, so a batch pipeline is needed for historical data computation. This is where the Lambda architecture comes in. Lambda has proved effective and strikes a good balance of speed and reliability, and we have run many systems on the Lambda architecture for many years. But its biggest drawback is the need to maintain two distinct (and possibly complex) systems for the batch and streaming layers. We have to split our business logic into many segments across different places, which is a challenge to maintain as the business grows and also increases communication overhead. Secondly, the data is duplicated in two different systems, and we have to move data between systems for processing. Facing these challenges, we searched for alternatives and found Apache Pulsar a great fit. In this talk, I will show how we solve those problems by making Pulsar a unified storage backend for both the batch and streaming pipelines, a solution that simplifies the software stack, improves our work efficiency, and lowers cost at the same time.
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum... (StreamNative)
The highest message delivery guarantee that Apache Pulsar provides is 'exactly-once' producing to a single partition via the Idempotent Producer: users are guaranteed that every message produced to a single partition through an Idempotent Producer is persisted exactly once, without data loss. However, there is no atomicity when a producer attempts to produce messages to multiple partitions. On the consumer side, acknowledgment is a best-effort operation that can result in message redelivery, so consumers may receive duplicate messages; Pulsar guarantees only 'at-least-once' consumption. This creates inconvenience and complexity when you use Pulsar to build mission-critical services (such as billing services).
We introduce transaction support in the Pulsar 2.8.0 release to simplify building reliable and fault-resilient services using Apache Pulsar and Pulsar Functions. It provides the capability to achieve end-to-end exactly-once semantics for streaming jobs in other stream processing engines.
This presentation dives deep into the details of Pulsar transactions and how they are applied to Pulsar Functions and other processing engines to achieve transactional event streaming. We will cover how Pulsar transactions work and how Pulsar Functions offer transaction support on top of them.
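The gap transactions close can be modeled in a few lines of plain Python (this is a conceptual model of the semantics, not the Pulsar client API): messages produced to several partitions inside a transaction become visible to consumers atomically on commit, and not at all on abort.

```python
# Conceptual model of transactional multi-partition produce.

class Partition:
    def __init__(self):
        self.visible = []   # what consumers are allowed to read
        self.pending = {}   # txn_id -> messages buffered for that txn

class Txn:
    _next_id = 0

    def __init__(self):
        Txn._next_id += 1
        self.id = Txn._next_id
        self.parts = set()

    def produce(self, partition, msg):
        """Buffer a message; it is invisible until the txn commits."""
        partition.pending.setdefault(self.id, []).append(msg)
        self.parts.add(partition)

    def commit(self):
        """All-or-nothing: publish every buffered message on every partition."""
        for p in self.parts:
            p.visible.extend(p.pending.pop(self.id, []))

    def abort(self):
        for p in self.parts:
            p.pending.pop(self.id, None)


p0, p1 = Partition(), Partition()
txn = Txn()
txn.produce(p0, "debit:10")
txn.produce(p1, "credit:10")
before = (list(p0.visible), list(p1.visible))  # nothing visible yet
txn.commit()                                   # both appear together
```

Without transactions, the two `produce` calls could each succeed or fail independently, which is precisely the missing atomicity the abstract describes for billing-style workloads.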
Both Apache Pulsar and Apache Flink share a similar view on how the data and computation levels of an application can be "streaming-first", with batch as a special case of streaming. With Apache Pulsar's segmented-stream storage and Apache Flink's steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale and build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community will give an overview of Apache Pulsar and how it provides a unified data view that fully leverages Apache Flink's unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
At Clever Cloud, we are working on extremely light virtual machines that run WebAssembly binaries. Since it is WASM, we can write the code in many languages. We use a custom unikernel to run this WASM as Function-as-a-Service, with one VM per function execution. These VMs can run on events from messages coming through Pulsar or from HTTP invocations; execution is on-demand, as only the consumers stay up. This could be a new model: Pulsar Functions with real isolation for multi-tenancy use cases. This talk will show the use case, explain the virtualization underneath, and demonstrate multi-tenancy in action.
How Apache Pulsar Helps Tencent Process Tens of Billions of Transactions Effi... (StreamNative)
As the largest provider of Internet products and services in China, Tencent serves billions of users and over a million merchants—and these numbers are growing fast! Tencent’s enterprises generate a huge volume of financial transactions, placing a tremendous load on their billing service, which processes hundreds of millions of dollars in revenue each day.
Because Tencent had been unable to scale its current billing service to handle its rapidly growing business, the possibility of data loss had become an escalating concern. To ensure data consistency, the company decided to redesign its system’s transaction processing pipeline. After evaluating the pros and cons of several messaging systems, Tencent chose to implement its billing service using Apache Pulsar. As a result, Tencent can now run their billing service on a very large scale with virtually no data loss.
In this talk, Ningguo Chen, Chief Architect of Tencent Billing, will share their journey of adopting Pulsar in their core transaction processing engine to process tens of billions of events every day. He will also discuss the problems they encountered in using Pulsar and the improvements they made to meet their scale.
Integrating Apache Pulsar with Big Data Ecosystem (StreamNative)
At the Apache Pulsar Beijing Meetup, Yijieshen gave a presentation on the current state of integrating Apache Pulsar with the big data ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink, and Presto for a unified data processing system.
I Heart Log: Real-time Data and Apache Kafka (Jay Kreps)
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM (Yahoo! Developer Network)
Slides from LINE Developer Meetup #68 - Big Data Platform, covering the HDFS major version upgrade and the production rollout of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/
Kafka on Pulsar:bringing native Kafka protocol support to Pulsar_Sijie&PierreStreamNative
Apache Pulsar is a distributed and open-source pub-sub messaging system. It offers many advantages over Kafka, such as multi-tenant, geo-replication, decoupled storage or even SQL and FaaS directly integrated. The only thing missing for wide adoption is support for the de-facto standard for streaming: Kafka. And this is how our story begins.
In this talk, Sijie Guo from StreamNative and Pierre Zemb from OVHcloud will share the journey on building Kafka-on-Pulsar (KoP) to bring native Kafka protocol support to Pulsar. Before joining the force on building KoP, OVHcloud implemented a Kafka proxy in Rust capable of transforming the Kafka protocol to that Pulsar on the fly and encountered some challenges. After realizing that StreamNative was working on bringing the Kafka protocol natively to Pulsar broker via a pluggable protocol handler mechanism. OVHCloud joined forces with StreamNative to work on brining Kafka protocol support to Pulsar brokers.
At the end of this talk, you will know more about the inner workings of Kafka and Pulsar. You'll also get feedback from both companies from their initial proofs of concepts and the current implementation.
Scaling customer engagement with apache pulsarStreamNative
Iterable's platform is used by marketers to reach hundreds of millions of users every day, and those numbers are quickly growing. Iterable's infrastructure is built with pub-sub messaging at it's core, so the reliability, scalability and flexibility provided by that system are business critical.
In this talk we'll discuss why Iterable chose Pulsar as a pub-sub messaging system, as well as how Iterable is taking advantage of some of more recently added features in Pulsar. We'll also talk about some of the challenges we encountered, where we think Pulsar can improve, and some contributions we've made to the open source community around Pulsar.
Lessons from managing a Pulsar cluster (Nutanix)StreamNative
In this presentation, we will cover:
- How to performance test and optimize a Pulsar cluster. We will present how we load tested Pulsar with locust and, following this, how we tuned our configurations for our use cases.
- Event sourcing pattern with Apache Pulsar. Avro schema usage, compatibility choices and schema evolution on pulsar topics that worked for us.
- Bonus: How we source Apache Flink from apache pulsar and run our workflows.
By attending this webinar, you can expect to come away with:
- How to performance test a Pulsar cluster for your use case.
- How to leverage the highly configurable broker and Bookkeeper to suit your needs.
- Event sourcing patterns on top of Apache Pulsar.
- Avro schema usage, compatibility choices, and evolution.
- Familiarise with pulsar connector for Flink and possible use cases.
In Apache Pulsar Meetup, Jia Zhai from StreamNative presents KoP (Kafka-on-Pulsar) which bring native Kafka protocol support on Pulsar broker. He gave a demo about how to use Kafka clients and Pulsar clients can work seamlessly on same data, and how Kafka Connectors can work on a Pulsar cluster.
Query Pulsar Streams using Apache FlinkStreamNative
Both Apache Pulsar and Apache Flink share a similar view on how the data and the computation level of an application can be “streaming-first” with batch as a special case streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale, and build a real streaming warehouse.
In this talk, Sijie Guo from the Apache Pulsar community will share the latest integrations between Apache Pulsar and Apache Flink. He will explain how Apache Flink can integrate and leverage Pulsar’s built-in efficient schemas to allow users of Flink SQL query Pulsar streams in realtime.
Serverless Event Streaming with Pulsar FunctionsStreamNative
The last few years have seen the emergence of Serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves. At the same time, its ability to elastically scale has simplified operations significantly. Combined together with the ubiquity of their presence across all cloud providers, serverless today has become the leading choice to do event processing at scale for a lot of companies.
In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions where the events are processed as soon as they arrive in a streaming manner and that provides flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share the real world usage of Pulsar Functions.
Pulsar is a great technology, but it is also a new, less well-known technology competing against incumbent technologies, which is always a bit of a tough sell.
In this talk, we will go over the whole end-to-end process of how we researched, advocated, built, integrated, and established Apache Pulsar at Instructure in less than a year. We will share details of how Pulsar's capabilities differentiate it, how we deploy Pulsar, and how we focused on an ecosystem of tools to accelerate adoption. We will also discuss one major motivating use case of change-data-capture for hundreds of databases servers at scale.
Jay Kreps is a Principal Staff Engineer at LinkedIn where he is the lead architect for online data infrastructure. He is among the original authors of several open source projects including a distributed key-value store called Project Voldemort, a messaging system called Kafka, and a stream processing system called Samza. This talk gives an introduction to Apache Kafka, a distributed messaging system. It will cover both how Kafka works, as well as how it is used at LinkedIn for log aggregation, messaging, ETL, and real-time stream processing.
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...StreamNative
Nowadays, real-time computation is heavily used in cases such as online product recommendation, online payment fraud detection and etc.. In the streaming pipeline, Kafka is normally used to store a day/week data, but won't store years-long data, as in looking at the trend historically. So, a batch pipeline is needed for historical data computation. Thus, it's where the Lambda architecture comes in. Lambda has been proved to be effective, and a good balance of speed and reliability. We have been running many systems with Lambda architecture for many years. But the biggest detraction to Lambda architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and streaming layers. With that, we have to split our business logic into many segments across different places, which is a challenge to maintain as the business grows and it also increases communication overhead. Secondly, the data are duplicated in two different systems, and we have to move data among different systems for processing. With those challenges, we have been searching for alternatives and found Apache Pulsar a great fit. In this topic, I will show how we solve those problems with Apache Pulsar by making pulsar a unified storage backend for both batch and streaming pipeline, a solution that simplifies the s/w stack, lifts up our work efficiency and lowers the cost at the same time.
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...StreamNative
The highest message delivery guarantee that Apache Pulsar provides is 'exactly-once', producing at a single partition via Idempotent Producer. Users are guaranteed that every message produced to a single partition via an Idempotent Producer will be persisted exactly once, without data loss. However, there is no 'atomicity' when a producer attempts to produce messages to multiple partitions. From the consumer side, acknowledgment is a best-effort operation, which results in message redelivery, hence consumer will receive duplicate messages. Pulsar only guarantees 'at-least-once' consumption for consumers. It creates inconvenience and brings in complexity when you use Pulsar to build mission-critical services (such as billing services).
We introduce Transaction support in Pulsar 2.8.0 release, to simplify the process of building reliable and fault resilient services using Apache Pulsar and Pulsar Functions. It provides the capability to achieve end-to-end exactly-once for streaming jobs in other stream processing engines.
This presentation deep dives into the details of Pulsar transaction and how Pulsar transaction is applied to Pulsar Functions and other processing engines to achieve transactional event streaming. We will cover how Pulsar transaction works and how Pulsar Functions offers transaction support using Pulsar transaction.
Both Apache Pulsar and Apache Flink share a similar view on how the data and the computation level of an application can be “streaming-first” with batch as a special case streaming. With Apache Pulsar’s Segmented-Stream storage and Apache Flink’s steps to unify batch and stream processing workloads under one framework, there are numerous ways of integrating the two technologies to provide elastic data processing at massive scale, and build a real streaming warehouse.
In this talk, Sijie Guo from Apache Pulsar community will given an overview of Apache Pulsar and how it provides the unified data view to fully leverage Apache Flink unified computation runtime for elastic data processing. He will share the latest integrations between Apache Pulsar and Apache Flink, especially around effectively-once processing and schema integration.
At Clever Cloud, we are working on extremely light virtual machines to run WebAssembly binaries. Since it's WASM, we can write code in many languages. We use a custom unikernel to run this WASM as Function-as-a-Service, using one VM per function execution. These VMs can run on events from messages coming through Pulsar, or from HTTP invocations; the run is on-demand, as only the consumers stay up. This can be a new model: Pulsar functions with real isolation for multi-tenancy use cases. This talk will show the use case, explain the virtualization underneath and demonstrate the multi-tenancy use case.
How Apache Pulsar Helps Tencent Process Tens of Billions of Transactions Effi...StreamNative
As the largest provider of Internet products and services in China, Tencent serves billions of users and over a million merchants—and these numbers are growing fast! Tencent’s enterprises generate a huge volume of financial transactions, placing a tremendous load on their billing service, which processes hundreds of millions of dollars in revenue each day.
Because Tencent had been unable to scale its current billing service to handle its rapidly growing business, the possibility of data loss had become an escalating concern. To ensure data consistency, the company decided to redesign its system’s transaction processing pipeline. After evaluating the pros and cons of several messaging systems, Tencent chose to implement its billing service using Apache Pulsar. As a result, Tencent can now run their billing service on a very large scale with virtually no data loss.
In this talk, Ningguo Chen, the Chief Architect of Tencent Billing, will share their journey of adopting Pulsar in their core transaction processing engine to process tens of billions of events every day. He will also discuss the problems they encountered in using Pulsar and the improvements they made to meet their scale.
Integrating Apache Pulsar with Big Data EcosystemStreamNative
At the Apache Pulsar Beijing Meetup, Yijie Shen gave a presentation on the current state of Apache Pulsar's integration with the big data ecosystem. He explains why and how Pulsar fits into current big data computing and query engines, and how Pulsar integrates with Spark, Flink and Presto to form a unified data processing system.
I Heart Log: Real-time Data and Apache KafkaJay Kreps
This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
Learning plan
Skillsets required for Koha maintenance
Hardware requirements
Software requirements
Koha release schedules
Types of Koha implementation
Methods of Koha installation
How to update with changes in Koha.
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMYahoo! Developer Network
Slides from LINE Developer Meetup #68 - Big Data Platform, covering the HDFS major version upgrade and the rollout of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/
What will $0.08 get you in storage? Typically, not much. But $0.08 will change the way you think about storage and cause you to question everything storage vendors have told you. Find out more in this presentation.
The Implementing AI: High Performance Architectures webinar, hosted by KTN and eFutures, was the fourth event in the Implementing AI summer webinar series.
Every business is increasing the use of artificial intelligence to gain efficiency and to make better decisions. These new demands for data processing are not well delivered by traditional computer architectures. Enterprises, developers, data scientists, and researchers need new platforms that unify all AI workloads, simplifying infrastructure and accelerating ROI. This has led to the development of high performance and specialised hardware devices to meet these new demands.
The focus of this webinar was the impact of processing AI data on data centres, particularly from the technology perspective. The webinar had four presentations from experts, covering opportunities, implementation techniques and case studies, followed by a panel Q&A session.
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...HostedbyConfluent
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam K Dey | Current 2022
Robinhood’s mission is to democratize finance for all. Data-driven decision making is key to achieving this goal. The data needed is hosted in various OLTP databases, and replicating it in near real time, in a reliable fashion, to the data lakehouse powers many critical use cases for the company. At Robinhood, CDC is not only used for ingestion into the data lake but is also being adopted for inter-system message exchanges between different online microservices.
In this talk, we will describe the evolution of change data capture based ingestion in Robinhood not only in terms of the scale of data stored and queries made, but also the use cases that it supports. We will go in-depth into the CDC architecture built around our Kafka ecosystem using open source system Debezium and Apache Hudi. We will cover online inter-system message exchange use-cases along with our experience running this service at scale in Robinhood along with lessons learned.
Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of our data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from Zookeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to assure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We’ll also talk about setting up Kafka for no data loss.
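As a concrete illustration of the "no data loss" configuration this talk covers, a common combination of broker and producer settings looks like the sketch below. Exact values depend on cluster size and Kafka version, so treat these as a starting point rather than a recipe:

```properties
# Broker-side defaults (sketch): keep three replicas, require two in sync,
# and never elect an out-of-sync replica as leader
default.replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false

# Producer-side: wait for all in-sync replicas and retry idempotently
acks=all
enable.idempotence=true
```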
Apache Pulsar Development 101 with PythonTimothy Spann
Apache Pulsar Development 101 with Python PS2022_Ecosystem_v0.0
There is always a fear that a speaker cannot make it. Since I was the MC for the ecosystem track, I put together a backup talk just in case. Here it is: never seen or presented.
“Quantum” Performance Effects: beyond the CoreC4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2Sbd5Ws.
Sergey Kuksenko talks about how (and how much) CPU microarchitecture details may influence application performance. Could it be visible to end-users? How to avoid misjudgment when estimating code performance? The CPU is a huge topic, which is why the talk is limited to the parts located outside the computational core (mostly caches and memory access). Filmed at qconsf.com.
Sergey Kuksenko works as a Java Performance Engineer at Oracle. His primary goal is making the Oracle JVM faster, digging into the JVM runtime, JIT compilers, and class libraries. His favorite area is the interaction of Java with modern hardware, which he has been working on since 2005, when he worked at Intel on the Apache Harmony performance team.
Dive to get an idea about Apache Kafka.
Apache Kafka is an open-source stream-processing software platform originally developed at LinkedIn and later donated to the Apache Software Foundation.
Updates on webSpoon and other innovations from Hitachi R&DHiromu Hota
Updates on webSpoon and introduction of SpoonGit (Git client integrated with Spoon) at PCM17 (10th Pentaho Community Meeting in Mainz, Germany, Nov 11, 2017)
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise ...Filipe Miranda
New Generation of IBM Power Systems Delivering value with Red Hat Enterprise Linux - Learn about the new IBM Power8 architecture, about Red Hat Enterprise Linux 7 for Power Systems, and additional information from EnterpriseDB on how to migrate from Oracle to PostgreSQL.
UPDATED!
This presentation provides an overview of DataCore's Software-defined Storage Platform and insights into DataCore's latest world-record setting performance achievements on the SPC-1 benchmark. DataCore Parallel I/O, which is at the heart of DataCore's technology, is a unique approach to increasing storage performance by orders of magnitude without the need to acquire more and more hardware.
What to Expect for Big Data and Apache Spark in 2017 Databricks
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, Matei Zaharia will cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, he will talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
Speaker: Matei Zaharia
Video: http://go.databricks.com/videos/spark-summit-east-2017/what-to-expect-big-data-apache-spark-2017
This talk was originally presented at Spark Summit East 2017.
ApacheCon @Home 2020
StreamPipes is an open source self-service IoT toolbox to enable non-technical users to connect, analyze and explore IoT data streams.
https://streampipes.apache.org/
Similar to Large scale log pipeline using Apache Pulsar_Nozomi (20)
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022StreamNative
So, you are a responsible software engineer building microservices for Apache Kafka, and life is good. Eventually, you hear the community talking about the outstanding experience they are having with Apache Pulsar features. They talk about infinite event stream retention, a rebalance-free architecture, native support for event processing, and multi-tenancy. Exciting, right? Most people would want to migrate their code to Pulsar, especially knowing that Pulsar also supports Kafka clients natively via the protocol handler known as KoP, which enables the Kafka client APIs on Pulsar. But, as said before, you are responsible, and you don't believe in fairy tales, just like you don't believe that migrations like this happen effortlessly. This session will discuss the architecture behind protocol handlers, what it means to have one enabled on Pulsar, and how KoP works. It will detail the effort required to migrate a microservice written for Kafka to Pulsar, and whether the code needs to change at all.
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...StreamNative
This talk describes Klaviyo’s internal messaging system, an asynchronous application framework built around Pulsar that provides a set of high-quality tools for building business-critical asynchronous data flows in unreliable environments. This framework includes: a Pulsar ORM and schema migrator for topic configuration; a retry/replay system; a versioned schema registry; a consumer framework oriented around preventing message loss in hostile environments while maximizing observability; an experimental “online schema change” for topics; and more. Development of this system was informed by lessons learned during heavy use of datastores like RabbitMQ and Kafka, and frameworks like Celery, Spark, and Flink. In addition to the capabilities of this system, this talk will also cover (sometimes painful) lessons learned about the process of converting a heterogeneous async-computing environment onto Pulsar and a unified model.
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...StreamNative
In this talk, learn how Toast leverages our Envoy control-plane to manage blue-green deploys of Pulsar consumers, and how this has helped drive adoption across the engineering organization. Dive into the history of Pulsar at Toast, starting from its introduction in 2019 to provide event-driven architecture across a rapidly scaling restaurant software platform. We will detail some of the hurdles that we encountered gaining buy-in across a diverse set of teams, and dive deep into how we enforce best practices and integrate with our service control plane.
Distributed Database Design Decisions to Support High Performance Event Strea...StreamNative
Event streaming architectures launched a reexamination of applications and systems architectures across the board. We live in a world where answers are needed now in a constant real-time flow. Yet beyond the event streaming system itself, what are the corequisites to ensure our large scale distributed database systems can keep pace with this always-on, always-current real time flow of data? What are the requirements and expectations for this next tech cycle?
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022StreamNative
Pulsar Functions is a succinct framework provided by Apache Pulsar to conduct real-time data processing. Its use cases include ETL pipeline, event-driven applications, and simple data analytics. While Pulsar Functions already provides an extremely simple programming interface, we want to further lower the barrier for users to access real-time data. Since SQL is one of the universal languages in the technology world and well accepted by the vast majority of data engineers, we decided to add a SQL expressing layer on top of Pulsar Functions runtime. In this talk, we will discuss the architecture and implementation of this new service. We will see how SQL syntax, Pulsar Functions, and Function Mesh can work together to deliver a unique user development experience for real-time data jobs in the cloud environment. We will also walk through use cases like filtering, routing, and projecting messages as well as integrating with the Pulsar IO Connectors framework.
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022StreamNative
Starting with version 2.10, the Apache ZooKeeper dependency has been eliminated and replaced with a pluggable framework that enables you to reduce the infrastructure footprint of Apache Pulsar by leveraging alternative metadata and coordination systems based on your deployment environment. In this talk, walk through the steps required to utilize the existing etcd service running inside Kubernetes to act as Pulsar's metadata store, thereby eliminating the need to run ZooKeeper entirely, leaving you with a Zookeeper-less Pulsar.
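Per the talk description, the switch is driven by the pluggable metadata store configuration in broker.conf. The keys below exist in Pulsar 2.10+, but the etcd endpoint is a placeholder and the exact URL syntax should be verified against your Pulsar version's documentation:

```properties
# broker.conf (Pulsar 2.10+): use etcd instead of ZooKeeper as the
# metadata store. "my-etcd:2379" is a placeholder endpoint.
metadataStoreUrl=etcd:http://my-etcd:2379
configurationMetadataStoreUrl=etcd:http://my-etcd:2379
```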
Apache Pulsar is a highly available, distributed messaging system that provides guarantees of no message loss and strong message ordering with predictable read and write latency. In this talk, learn how this can be validated for Apache Pulsar Kubernetes deployments. Various failures are injected using Chaos Mesh to simulate network and other infrastructure failure conditions. There are many questions that are asked about failure scenarios, but it could be hard to find answers to these important questions. When a failure happens, how long does it take to recover? Does it cause unavailability? How does it impact throughput and latency? Are the guarantees of no message loss and strong message ordering kept, even when components fail? If a complete availability zone fails, is the system configured correctly to handle AZ failures? This talk will help you find answers to these questions and apply the tooling and practices to your own testing and validation.
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...StreamNative
Despite what the Ghostbusters said, we’re going to go ahead and cross (or, join) the streams. This session covers getting started with streaming data pipelines, maximizing Pulsar’s messaging system alongside one of the most flexible streaming frameworks available, Apache Flink. Specifically, we’ll demonstrate the use of Flink SQL, which provides various abstractions and allows your pipeline to be language-agnostic. So, if you want to leverage the power of a high-speed, highly customizable stream processing engine without the usual overhead and learning curves of the technologies involved (and their interconnected relationships), then this talk is for you. Watch the step-by-step demo to build a unified batch and streaming pipeline from scratch with Pulsar, via the Flink SQL client. This means you don’t need to be familiar with Flink, (or even a specific programming language). The examples provided are built for highly complex systems, but the talk itself will be accessible to any experience level.
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022StreamNative
Apache Pulsar depends upon message acknowledgments to provide at-least-once or exactly-once processing guarantees. With these guarantees, any transmission between the broker and its producers and consumers requires an acknowledgment. But what happens if an acknowledgment is not received? Resending the message introduces the potential for duplicate processing and increases the likelihood of out-of-order processing. Therefore, it is critical to understand Pulsar's message redelivery semantics in order to prevent either of these conditions. In this talk, we will walk you through the redelivery semantics of Apache Pulsar and highlight some of the control mechanisms available to application developers to control this behavior. Finally, we will present best practices for configuring message redelivery to suit various use cases.
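The lost-acknowledgment scenario described above can be modeled in a few lines. This is an illustrative simulation, not the Pulsar client or broker implementation; names like `Subscription` and `ack_timeout_expired` are invented for the sketch:

```python
# Illustrative model of at-least-once redelivery: a message delivered but
# not acknowledged before the ack timeout goes back onto the backlog, so
# the consumer can observe (and must tolerate) duplicates.

from collections import deque

class Subscription:
    def __init__(self):
        self.backlog = deque()
        self.unacked = {}   # message id -> payload awaiting ack

    def publish(self, msg_id, payload):
        self.backlog.append((msg_id, payload))

    def receive(self):
        msg_id, payload = self.backlog.popleft()
        self.unacked[msg_id] = payload
        return msg_id, payload

    def acknowledge(self, msg_id):
        self.unacked.pop(msg_id, None)

    def ack_timeout_expired(self):
        # every unacked message returns to the backlog for redelivery
        for msg_id, payload in self.unacked.items():
            self.backlog.append((msg_id, payload))
        self.unacked.clear()

sub = Subscription()
sub.publish(1, "charge-card")
seen = []

m, _ = sub.receive(); seen.append(m)   # delivered, but the ack is "lost"
sub.ack_timeout_expired()              # broker redelivers
m, _ = sub.receive(); seen.append(m)
sub.acknowledge(m)

assert seen == [1, 1]   # the same message was processed twice
```

This is why consumers of at-least-once subscriptions are typically written to be idempotent.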
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...StreamNative
Lakehouses are quickly growing in popularity as a new approach to Data Platform Architecture bringing some of the long-established benefits from OLTP world to OLAP, including transactions, record-level updates/deletes, and changes streaming. In this talk, we will discuss Apache Hudi and how it unlocks possibilities of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will present: - What Lakehouses are, and why they are needed. - What Apache Hudi is and how it works. - Provide a use-case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar.
Understanding Broker Load Balancing - Pulsar Summit SF 2022StreamNative
Pulsar is a horizontally scalable messaging system, so the traffic in a logical cluster must be balanced across all the available Pulsar brokers as evenly as possible, in order to ensure full utilization of the broker layer. You can use multiple settings and tools to control the traffic distribution which requires a bit of context to understand how the traffic is managed in Pulsar. In this talk, we will walk you through the load balancing capabilities of Apache Pulsar, and highlight some of the control mechanisms available to control the distribution of load across the Pulsar brokers. Finally, we will discuss the various loading shedding strategies that are available. At the end of the talk, you will have a better understanding of how Pulsar's broker level auto-balancing works, and how to properly configure it to meet your workload demands.
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022StreamNative
In today’s world, we are seeing a big shift toward the Cloud. With this shift comes a big shift in the expectations we have for a messaging system, especially when the messaging system is presented as managed service in a large-scale, multi-tenant environment. For any large-scale enterprise, it’s very important to evaluate messaging system and be confident before expanding complex distributed data systems like Apache Pulsar from on-premise to elastically scalable, fully managed services on cloud services. We must consider aspects such as: migration from and integration with large-scale on-premise clusters, security, cost efficiency, and the cloud friendliness of the architecture, modeling cost and capacity, tenant isolation, deployment robustness, availability, monitoring, etc. Not every messaging system is built to be cloud-native and run as a managed service with cost efficiency. We have been running large-scale Apache Pulsar at Yahoo for the last 8 years on various platforms and hardware configurations while meeting application SLAs and serving more than 1M topics in a cluster. In this talk, we will talk about Pulsar’s journey in Yahoo! from an on-premise platform to a hybrid cloud and on-premise system. We will talk about Pulsar’s architecture and features that make Pulsar a good cloud-native messaging-system choice for any enterprise.
Event-Driven Applications Done Right - Pulsar Summit SF 2022StreamNative
Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person.
Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022StreamNative
Our services team creates, builds, and maintains the as-a-service offering for base platform services within our organization. Several thousand applications use these custom services daily, generating more than 700 million requests per minute. One of these services was our publish/subscribe offering, BQ, with a custom SDK and custom metrics, based on Apache Pulsar. BQ is the core communication service within our organization, handling more than 200M RPM. All the core processes of the organization depend on this service for operation: the CDC of any of our RDBMS or NoSQL offerings, all the eventing efforts of the organization, async communication between apps, notification systems, etc. The backend of the solution was Apache Pulsar running on EC2 on AWS, and on top of that we built several components as wrappers of the actual backend, creating our own SDKs and abstractions and in many ways extending the features provided by Pulsar. We had a multi-cluster setup 100% on AWS, with custom Pulsar Docker images running on large ASG setups, along with our own wrapping and admin APIs and DBs. All of this, in turn, made the solution volatile.
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022StreamNative
There is an increasing need to unleash analytical capabilities directly to the end-users to democratize decision-making. User-Facing Analytics is a new frontier that will shape the products of tomorrow and push the limits of existing technology. It demands a solution that will scale to millions of users to provide fast, real-time insights. In this session, Xiang will talk about his journey to build Apache Pinot to tackle the analytics problem space with the architectural changes and technology inventions made over the past decade. He will also talk about how other big data companies such as LinkedIn, Uber, and Stripe power their user-facing analytical applications.
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022StreamNative
Welcome and Opening Remarks - Pulsar Summit SF 2022StreamNative
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...StreamNative
Milvus is an open-source vector database that leverages a novel data fabric to build and manage vector similarity search applications. As the world's most popular vector database, it has already been adopted in production by thousands of companies around the world, including Lucidworks, Shutterstock, and Cloudinary. With the launch of Milvus 2.0, the community aims to introduce a cloud-native, highly scalable and extendable vector similarity solution, and the key design concept is log as data.
Milvus relies on Pulsar as its log pub/sub system. Pulsar helps Milvus reduce system complexity by loosely decoupling each microservice, and makes the system stateless by disaggregating log storage from computation, which also makes it further extendable. In this talk, we will introduce the overall design, the implementation details of Milvus, and its roadmap.
Takeaways:
1) Get a general idea about what is a vector database and its real-world use cases.
2) Understand the major design principles of Milvus 2.0.
3) Learn how to build a complex system with the help of a modern log system like Pulsar.
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...StreamNative
MQTT (Message Queuing Telemetry Transport) is a messaging protocol based on the pub/sub model, with the advantages of a compact message structure, low resource consumption, and high efficiency, making it suitable for IoT applications with low bandwidth and unstable network environments.
This session will introduce MQTT on Pulsar, which allows developers and users of the MQTT protocol to use Apache Pulsar. I will share the architecture, principles, and future plans of MoP to help you understand Apache Pulsar's capabilities and practices in the IoT industry.
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Large scale log pipeline using Apache Pulsar_Nozomi
1. Large scale log pipeline using Apache Pulsar
Yahoo Japan Corporation
Nozomi Kurihara
June 18th, 2020
2. Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved. 2
Who am I?
Nozomi Kurihara
• Software engineer at Yahoo! JAPAN (April 2012 ~)
• Working on internal messaging platform using Apache Pulsar
• Committer of Apache Pulsar
• (Hobby: Board / video games!)
3. Agenda
1. Apache Pulsar at Yahoo! JAPAN
- About Yahoo! JAPAN
- Why Pulsar was chosen
- Architecture and performance
- Use cases
2. Large scale log pipeline
4. Apache Pulsar at Yahoo! JAPAN
5. Yahoo! JAPAN
https://www.yahoo.co.jp/
6. Yahoo! JAPAN – 3 numbers
• 100+ services
• 150,000+ servers (physical)
• 49,010,000+ login users per month (as of June 2019)
7. Pulsar at Yahoo! JAPAN
• We have used Apache Pulsar as a centralized messaging platform for 3.5 years
• One Pulsar maintainer team; many other teams (services) use Pulsar as “tenants”
[Diagram: Services A, B, and C each run their own producers and consumers, publishing to and consuming from Topics A, B, and C on a shared Pulsar cluster operated by the Pulsar team.]
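The tenant/topic structure shown here corresponds to Pulsar's topic naming scheme, `persistent://<tenant>/<namespace>/<topic>`. The parser below is only a sketch for illustration, not part of any Pulsar client library:

```python
# Split a fully qualified Pulsar topic name into its components.
# Tenant and namespace are the two levels that make multi-tenancy work:
# quotas, permissions, and policies attach to them, not to single topics.

def parse_topic(full_name):
    scheme, rest = full_name.split("://", 1)          # persistent | non-persistent
    tenant, namespace, topic = rest.split("/", 2)
    return {"persistence": scheme, "tenant": tenant,
            "namespace": namespace, "topic": topic}

t = parse_topic("persistent://service-a/logs/topic-a")
assert t["tenant"] == "service-a"
assert t["namespace"] == "logs"
assert t["topic"] == "topic-a"
```

The service names used here are illustrative, matching the diagram rather than real Yahoo! JAPAN tenants.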
8. Pulsar at Yahoo! JAPAN - Users
More and more services start to use Pulsar!
• 270+ tenants
• 4400+ topics
• ~50K publishes/s
• ~150K consumes/s
Typical use cases:
• Notification
• Job queueing
• Log pipeline
9. Pulsar community in Japan
TechBlog
- https://techblog.yahoo.co.jp/entry/20200312818173/
- https://techblog.yahoo.co.jp/entry/20200413827977/
- https://techblog.yahoo.co.jp/entry/2020060330002394/
Apache Pulsar Meetup Japan (in Tokyo)
- https://japan-pulsar-user-group.connpass.com/
10. Why Pulsar was chosen
11. Why did Yahoo! JAPAN choose Pulsar?
Large number of customers → High performance & scalability
Large number of services → Multi-tenancy
Sensitive/mission-critical messages → Security & Durability
Multiple data centers → Geo-replication
Pulsar meets all requirements!
12. Multi-tenancy
Share 1 Pulsar with all YJ services → low hardware and labor costs
[Diagram: before, Services A, B, and C each operate their own MQ; after, each service simply uses its own topic on the single Pulsar cluster operated by the Pulsar team]
13. Multi-tenancy – self-service
Users can create/configure/delete their topics by themselves
→ management of topics is delegated to users
Internal Web UI tool to manage topics (will be replaced with pulsar-manager):
[Screenshots: create tenant, create namespace, see topic stats]
14. Architecture and performance
15. Clusters in Yahoo! JAPAN
Two clusters, East and West, connected via geo-replication.
Each cluster runs:
• 20 WebSocket proxies
• 15 Brokers
• 10 Bookies
• 5 ZooKeepers
16. Performance – experimental settings
• Pulsar version: 2.3.2 (Broker) / 2.4.1 (Client)
• Tool: openmessaging-benchmark
• Message size: 1 KB
• Partitions: 1, 16, 32
• Attempted rate: 100,000 / 500,000 msg/s
• Server spec:
  - Broker: 2.00 GHz x2 CPU, 768 GB memory, SATA SSD 240 GB x2 (RAID1), 10GBase-T NIC
  - Bookie: 2.00 GHz x2 CPU, 768 GB memory, Journal: SATA SSD 240 GB x2 (RAID1), Ledger: SATA HDD 10 TB x12 (RAID1+0), 10GBase-T NIC
17. Performance – experimental results
- 16 and 32 partitions achieve 500,000 msg/s, whereas 1 partition does not
- The maximum publish rate with 1 partition appears to be about 200,000 msg/s
18. Tuning example (Bookie)
Problem:
• As the number of users increases, so do the writes to the journal SSD
• That shortens the SSD's lifespan (we actually saw frequent SSD failures)
Solution:
• Increase journalMaxGroupWaitMSec from 1 to 2
→ Writes decreased by 30%, at the cost of slightly higher latency
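For reference, the change is a single line in the Bookie configuration file. A sketch of the bookkeeper.conf fragment (only the journalMaxGroupWaitMSec setting is from this deck; the comments are ours):

```properties
# Wait up to 2 ms to group-commit journal entries before forcing a write
# (default: 1 ms). Larger values batch more entries per flush, reducing
# SSD writes at the cost of slightly higher publish latency.
journalMaxGroupWaitMSec=2
```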
19. Use cases
20. Case 1 – Notification of contents update
• Partner companies push various content files (weather, map, news, etc.) to Yahoo! JAPAN's FTP server
• ① When content is updated, a notification is sent to a topic
• ② Services A, B, and C receive the notification via their Consumers
• ③ They then fetch the content files from the file server
21. Case 2 – Job queuing in mail service
• Heavy jobs such as mail indexing are executed asynchronously
• Mail BE servers (Producers) register jobs to a Pulsar topic, and re-register a job if it fails
• Handlers for indexing (Consumers) take jobs from Pulsar and process them at their own pace
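The register / take / re-register flow can be sketched with a plain in-memory queue standing in for the Pulsar topic (the function names, retry limit, and toy indexing job are all illustrative, not the actual mail-service code):

```python
import queue

def register_job(topic: queue.Queue, job: dict) -> None:
    """Producer side: register a job on the topic."""
    topic.put(job)

def handle_jobs(topic: queue.Queue, process, max_retries: int = 3) -> list:
    """Consumer side: take jobs at our own pace; re-register a job if it fails."""
    results = []
    while not topic.empty():
        job = topic.get()
        try:
            results.append(process(job))
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < max_retries:
                register_job(topic, job)  # re-register the failed job
    return results

# Toy "indexing" handler that fails on its first attempt
def index_mail(job: dict) -> str:
    if job.get("attempts", 0) == 0 and job["flaky"]:
        raise RuntimeError("transient failure")
    return f"indexed:{job['mail_id']}"

topic = queue.Queue()
register_job(topic, {"mail_id": 42, "flaky": True})
print(handle_jobs(topic, index_mail))  # ['indexed:42']
```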
22. Case 3 – Kafka alternative
We have an internal FaaS system based on Apache OpenWhisk
Problem: the FaaS team had to maintain its own Apache Kafka
Solution: migrate from Kafka to our internal Pulsar
The Pulsar Kafka Wrapper requires only a few configuration changes (pom.xml, topic names, etc.):
<dependency>
- <groupId>org.apache.kafka</groupId>
- <artifactId>kafka-clients</artifactId>
- <version>0.10.2.1</version>
+ <groupId>org.apache.pulsar</groupId>
+ <artifactId>pulsar-client-kafka</artifactId>
+ <version>2.4.0</version>
</dependency>
23. Large scale log pipeline
24. Situation
[Diagram: service developers deploy apps to computing PFs (PaaS, CaaS, FaaS, …) and monitor their logs/metrics]
25. Yamas
• Metrics monitoring / alerting platform (SaaS)
• Originally developed at Verizon Media
• Will be open-sourced soon!
26. Scale
• Total log volume: 1.4 to 3.8 TB/h
• Peak traffic: 10+ Gbps
• The number of PFs will keep increasing
27. Legacy architecture
[Diagram: apps on each computing PF (PaaS, CaaS, …) run both a Yamas agent and a Splunk agent, which send data directly to the monitoring PFs (Yamas, Splunk)]
☹ Need to install a dedicated “agent” for each Monitoring PF
☹ Difficult to scale out
☹ Traffic spikes directly influence the Monitoring PFs
28. Motivation
Remove the dedicated agent for each monitoring PF:
- No platform-specific knowledge or extra components needed
- Easier troubleshooting
Decouple sender/receiver PFs by introducing a message queueing layer:
- Scalability
- Resiliency
29. New architecture
[Diagram: apps on each computing PF (PaaS, CaaS, …) share a single Pulsar producer, which publishes to per-consumer topics (Splunk topic, Yamas topic) in Pulsar; Pulsar consumers deliver the data to the monitoring PFs (Splunk, Yamas)]
☺ Single library
☺ Easy to scale out
☺ Traffic spikes are mitigated by the queueing layer
30. Topic design – 3 patterns
① Producer-centric: messages are filtered/transformed on the Producer side
  ☺ Consumers don't care about Producers
  ☹ Producers care about Consumers
② Consumer-centric: messages are filtered/transformed on the Consumer side
  ☺ Producers don't care about Consumers
  ☹ Consumers care about Producers
③ Function: messages are filtered/transformed by a Function inside Pulsar
  ☺ Producers and Consumers don't care about each other
  ☹ Extra load: traffic, computing, storage, etc.
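Pattern ③ maps naturally onto Pulsar Functions. A minimal sketch in Python, assuming the native-function style where a plain process(input) is deployed between the producer-side and consumer-side topics; the message fields and the drop-debug-logs rule are illustrative, not from this deck:

```python
import json

def process(input):
    """Filter/transform a raw log message inside Pulsar so that neither
    Producers nor Consumers need to know about each other.
    Returning None publishes nothing to the output topic."""
    msg = json.loads(input)
    if msg.get("level") == "debug":
        return None  # drop debug logs before they reach the monitoring PF
    return json.dumps({
        "time": msg["time"],
        "domain": msg["domain"],
        "message": msg["body"]["message"],
    })

# Illustrative invocation (Pulsar would call process() once per message)
raw = json.dumps({
    "time": "2018-10-25T08:36:47.000Z",
    "domain": "paas",
    "body": {"message": "hello splunk"},
})
print(process(raw))
```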
31. Topic format and message format
Topic name: {consumer_pf}/{region}/{message_type}-{num}
  consumer_pf: splunk, yamas, … / region: west, east / message_type: log, metric, …
Examples:
  Pulsar (west): splunk/west/log-0, splunk/west/log-1, splunk/west/metric-0, yamas/west/metric-0, …
  Pulsar (east): splunk/east/log-0, splunk/east/log-1, splunk/east/metric-0, yamas/east/metric-0, …
Message format:
{
  "time": "2018-10-25T08:36:47.000Z",
  "producer": "paas-producer.example.com",
  "origin": "app.space.org.cluster.dc.nwseg",
  "domain": "paas",
  "body": {
    "message": "hello splunk",
    …
  }
}
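The naming rule can be captured in a small helper (the function name is ours, not from the deck):

```python
def topic_name(consumer_pf: str, region: str, message_type: str, num: int) -> str:
    """Build a topic name following {consumer_pf}/{region}/{message_type}-{num}."""
    return f"{consumer_pf}/{region}/{message_type}-{num}"

print(topic_name("splunk", "west", "log", 0))    # splunk/west/log-0
print(topic_name("yamas", "east", "metric", 0))  # yamas/east/metric-0
```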
32. Use case: Pulsar stats on Yamas
A Pulsar producer fetches /admin/v2/broker-stats/topics and publishes the stats to the Yamas topic, so Pulsar itself is monitored on Yamas.
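A sketch of how such a producer might turn the broker-stats response into per-topic metric messages for Yamas. The nested payload shape below is a simplified assumption (namespace → topic → stats), not the exact schema of /admin/v2/broker-stats/topics:

```python
def flatten_broker_stats(stats: dict) -> list:
    """Emit one flat record per numeric (topic, stat) pair,
    tagged with its namespace, ready to publish as a metric message."""
    records = []
    for namespace, topics in stats.items():
        for topic, topic_stats in topics.items():
            for name, value in topic_stats.items():
                if isinstance(value, (int, float)):
                    records.append({
                        "namespace": namespace,
                        "topic": topic,
                        "metric": name,
                        "value": value,
                    })
    return records

# Hypothetical, simplified stats payload
sample = {
    "splunk/west": {
        "persistent://splunk/west/log-0": {"msgRateIn": 1200.5, "storageSize": 4096},
    }
}
print(flatten_broker_stats(sample))
```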
33. Conclusion
34. Conclusion
Conclusion:
• Yahoo! JAPAN uses Pulsar as a centralized platform for various services
• Recently we started using Pulsar to build a large scale log pipeline where
computing PFs publish their logs/metrics and monitoring PFs consume them
• Pulsar plays an important role in connecting various PFs and making the
whole system scalable and resilient
Future plan:
• More Producer PFs and Consumer PFs
• Visualize SLIs (message delivery rate, latency, etc.)