© 2020 SPLUNK INC.
Interactive Querying of Streams Using Apache Pulsar™
Jerry Peng
Pulsar Summit | June 2020
Principal Software Engineer | jerryp@splunk.com
Apache {Pulsar, Heron, Storm} committer and PMC member
Agenda
1) General use cases
2) Existing architectures
3) Apache Pulsar overview
4) Pulsar SQL
5) Concrete use case (zhaopin.com)
6) Demo!
7) Questions?
What are Streams?
Continuous flows of data…
Almost all data originate in this form
Interactive Querying of Streams?
Querying both latest and historical data
How is it useful?
● Speed (i.e. data-driven processing)
○ Act faster
● Accuracy
○ In many contexts the wrong decision may be made if you do not have visibility
that includes the most current data
○ For example, historical data is useful for predicting that a user is interested in buying a
particular item, but if the analytics don’t also know that the user purchased
that item two minutes ago, they will make the wrong recommendation
● Simplification
○ Single place to go to access current and historical data
General use cases
Debugging
● Errors and exceptions
● Troubleshooting systems and networks
● Have we seen these errors before?
Monitoring (Audit logs)
● Answering the “What, When, Who,
Why”
● Suspicious access patterns
● Example
○ Auditing CDC logs in financial institutions
Exploring
● Raw or enriched data
● Access is much simpler if all the data is in one location
Lots of use cases
● Data analytics
● Business Intelligence
● Real-time dashboards
● etc…
Stream processing patterns
[Diagram: stages labeled Messaging, Compute, and Storage, covering Data Ingestion, Data Storage, Data Processing / Querying, Results Storage, and Data Serving]
Existing Solutions
[Diagram: a data stream handled by separate systems. Messaging: Cloud Pub/Sub. Real-time compute: Apache Storm, Apache Flink, Apache Heron, etc. Storage: HDFS, cloud storage. Querying: Apache Hadoop MR, Apache Spark, Presto, etc.]
Problems with existing solutions
● Multiple Systems
● Duplication of data
○ Data consistency. Where is the source of truth?
● Latency between data ingestion and when data is queryable
THIS IS WHERE APACHE PULSAR AND PULSAR SQL COME IN…
Apache Pulsar™
Flexible Messaging + Streaming System
backed by durable log storage
Apache Pulsar as an Event Store
Apache Pulsar Overview
Architecture
Multi-layer, scalable architecture
Independent layers for processing, serving and storage
Messaging and processing built on Apache Pulsar
Storage built on Apache BookKeeper
[Diagram: producers and consumers connect to the messaging layer (brokers); event storage is provided by bookies; function processing runs on workers]
Segment-Centric Storage
● In addition to partitioning, messages are
stored in segments (based on time and
size)
● Segments are independent of each
other and spread across all storage
nodes
● What this means for Pulsar SQL?
○ Allows the SQL engine to read from multiple
bookies and leverage the disk I/O and
bandwidth of multiple machines
even if the data is in one partition
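To make the parallel-read point concrete, here is a minimal Python sketch. It is a toy model, not Pulsar or BookKeeper code: the round-robin placement rule, the bookie names, and both function names are assumptions made for illustration. It shows that segments of a single partition, once spread across bookies, let a reader divide its work among several machines.

```python
from collections import defaultdict


def place_segments(num_segments, bookies, replicas=2):
    """Assign each segment of ONE partition to `replicas` bookies, round-robin.

    Illustrative assumption only; real ensemble selection is more involved.
    """
    return {
        seg: [bookies[(seg + r) % len(bookies)] for r in range(replicas)]
        for seg in range(num_segments)
    }


def plan_parallel_read(placement):
    """Pick one replica per segment, always choosing the least-loaded bookie."""
    load = defaultdict(int)
    plan = {}
    for seg, replicas in sorted(placement.items()):
        target = min(replicas, key=lambda b: load[b])
        plan[seg] = target
        load[target] += 1
    return plan, dict(load)


bookies = ["bookie-1", "bookie-2", "bookie-3", "bookie-4"]
placement = place_segments(num_segments=8, bookies=bookies)
plan, load = plan_parallel_read(placement)
print(load)  # segments read per bookie
```

In this toy setup the eight segments of one partition end up evenly divided among the four bookies, which is the effect the slide describes.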
Writes
● Every segment/ledger has an ensemble
● Each entry in a ledger has a
○ Write quorum
■ Nodes of the ensemble to which it is written (usually all)
○ Ack quorum
■ Nodes of the write quorum that must respond for that
entry to be acknowledged (usually a majority)
● What this means for Pulsar SQL?
○ Allows users to configure the number of
replicas the SQL engine can read from
○ Trade-off between read bandwidth and
storage cost
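The relationship between ensemble, write quorum, and ack quorum can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it uses round-robin striping of entries over the ensemble, and the function names and bookie names are hypothetical rather than BookKeeper APIs.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store a given entry, assuming round-robin striping."""
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def is_acknowledged(acks_received, ack_quorum):
    """An entry counts as written once enough bookies have responded."""
    return acks_received >= ack_quorum


ensemble = ["bookie-1", "bookie-2", "bookie-3"]
for entry_id in range(4):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
```

With a write quorum equal to the ensemble size, every bookie holds every entry, so a reader can choose among more replicas at the price of more storage. That is the read bandwidth versus storage cost trade-off named above.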
Apache BookKeeper™ Internals
● Separate I/O paths for reads and writes
● Optimized for writing, tailing reads, and
catch-up reads
● What this means for Pulsar SQL?
○ Queries often involve scanning
the data
○ Read-ahead cache in BookKeeper allows
for fast sequential reads
Tiered Storage
Unlimited topic storage capacity
Achieves the true “stream-storage”:
keep the raw data forever in stream
form
Tiered Storage
● Leverage cloud storage services to offload cold data (completely transparent to clients)
● Extremely cost effective. Backends: S3 (GCS and HDFS coming)
● Example: retain all data for 1 month, and offload all messages older than 1 day to S3
● What this means for Pulsar SQL?
○ Pulsar SQL can query not only data stored in bookies but also data offloaded
to a cloud storage service
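The example policy above (retain one month, offload after one day) can be modeled as a simple age check. The sketch below is illustrative only: `storage_tier` is a hypothetical helper, "1 month" is approximated as 30 days, and real offloading is driven by Pulsar's own policies rather than code like this.

```python
from datetime import datetime, timedelta


def storage_tier(segment_closed_at, now,
                 offload_after=timedelta(days=1),
                 retention=timedelta(days=30)):
    """Classify a segment by age: still on bookies, offloaded, or expired."""
    age = now - segment_closed_at
    if age > retention:
        return "expired"
    if age > offload_after:
        return "cloud"
    return "bookkeeper"


now = datetime(2020, 6, 15, 12, 0)
for age in (timedelta(hours=3), timedelta(days=7), timedelta(days=45)):
    print(age, storage_tier(now - age, now))
```

A query spanning the last week would, in this model, touch both the "bookkeeper" and the "cloud" tiers, which is why it matters that Pulsar SQL can read from both.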
Schema Registry
● Stores information on the data structure (kept in BookKeeper)
● Enforces data types on a topic
● Allows for compatible schema evolution
● JSON, Avro, and Protobuf supported
● What this means for Pulsar SQL?
○ Allows data to be structured so
that it becomes queryable with
SQL
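As a rough illustration of how a topic schema can make data queryable, the following Python sketch turns a record schema into a list of SQL columns. Everything here is an assumption for illustration: the `User` schema is made up, the type mapping is a plausible subset rather than the connector's actual table, and `columns_from_schema` is a hypothetical helper. Column names are lowercased because the query output in this deck shows lowercase column names.

```python
# Assumed mapping from simple schema types to Presto SQL types.
PRESTO_TYPES = {
    "string": "VARCHAR",
    "int": "INTEGER",
    "long": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
}


def columns_from_schema(schema):
    """Derive (column name, SQL type) pairs from a flat record schema."""
    return [(f["name"].lower(), PRESTO_TYPES[f["type"]]) for f in schema["fields"]]


schema = {
    "type": "record",
    "name": "User",  # hypothetical schema
    "fields": [
        {"name": "firstName", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}
columns = columns_from_schema(schema)
print(columns)
```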
Pulsar SQL
Interactive SQL queries over data stored in Pulsar
Query old and real-time data
Pulsar SQL / 2
● Based on Presto by Facebook — https://prestodb.io/
● Presto is a distributed query execution engine
● Fetches the data from multiple sources (HDFS, S3, MySQL, …)
● Full SQL compatibility
Pulsar SQL / 3
● Pulsar connector for Presto
○ Read data directly from BookKeeper — bypass Pulsar Broker
■ Can also read data offloaded to Tiered Storage (S3, GCS, etc.)
○ Many-to-many data reads
■ Data is split even on a single partition — multiple workers can read data in
parallel from single Pulsar partition
■ Time-based indexing: use the publish time (the __publish_time__ column) in predicates to reduce the data
read from disk
Pulsar SQL Architecture
Benefits
● Do not need to move data into another
system for querying
● Read data in parallel
○ Performance not impacted by
partitioning
○ Increase throughput by increasing
write quorum
● Newly arrived data can be queried
immediately
Compared to other message buses?
● Other messaging platforms have Presto integrations
● They typically use a consumer to read data from brokers
● Each topic/partition is served by a single broker (limiting disk I/O and
network bandwidth)
User interaction
Connect with the CLI client
$ ./bin/pulsar sql
List Pulsar clusters
presto> show catalogs;
Catalog
---------
pulsar
system
(2 rows)
List Pulsar namespaces
presto> show schemas in pulsar;
Schema
-----------------------
information_schema
public/default
public/functions
sample/standalone/ns1
List Pulsar topics
presto> show tables in pulsar."public/default";
Table
----------------
generator_test
(1 row)
User interaction
Query data in topic
presto> select * from pulsar."public/default".generator_test;
firstname | middlename | lastname | email | username | password | telephonenumber | age | companyemail |
-------------+-------------+-------------+----------------------------------+--------------+----------+-----------------+-----+-------------------------------------+
Genesis | Katherine | Wiley | genesis.wiley@gmail.com | genesisw | y9D2dtU3 | 959-197-1860 | 71 | genesis.wiley@interdemconsulting.eu |
Brayden | | Stanton | brayden.stanton@yahoo.com | braydens | ZnjmhXik | 220-027-867 | 81 | brayden.stanton@supermemo.eu |
Benjamin | Julian | Velasquez | benjamin.velasquez@yahoo.com | benjaminv | 8Bc7m3eb | 298-377-0062 | 21 | benjamin.velasquez@hostesltd.biz |
Michael | Thomas | Donovan | donovan@mail.com | michaeld | OqBm9MLs | 078-134-4685 | 55 | michael.donovan@memortech.eu |
Brooklyn | Avery | Roach | brooklynroach@yahoo.com | broach | IxtBLafO | 387-786-2998 | 68 | brooklyn.roach@warst.biz |
Skylar | | Bradshaw | skylarbradshaw@yahoo.com | skylarb | p6eC6cKy | 210-872-608 | 96 | skylar.bradshaw@flyhigh.eu |
.
.
.
Demo
Performance
Setup
• 3 Nodes
• 12 CPU cores
• 128 GB RAM
• 2 X 1.2 TB NVMe disks
Results
• JSON (Compressed)
• ~60 Million Rows / Second
• Avro (Compressed)
• ~50 Million Rows / Second
Improving query efficiency
● Query by partition
○ Scanning large amounts of data can be costly and time-consuming
○ If the data is keyed and hashed to a specific partition, you can query just that partition
○ For example, if you have tweets keyed by author ingested in Pulsar:
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1
● Query by publish time
○ Ledgers/segments are naturally sorted by publish time
○ Only data within the requested publish-time range is read
○ Select a range of publish times to minimize the data that needs to be read
SELECT tweet.author, tweet.content
FROM pulsar."public/default".tweets
WHERE tweet.author = 'jerry' AND __partition__ = 1 AND __publish_time__ > timestamp '2020-06-15 09:00:00'
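Why a publish-time predicate reduces the data read can be sketched as interval pruning over time-ordered segments. This Python model is an assumption-laden illustration: the `Segment` class and `segments_to_scan` function are hypothetical, and the actual connector may locate the starting position differently (for example, by searching within a ledger).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Segment:
    segment_id: int
    first_publish: datetime
    last_publish: datetime


def segments_to_scan(segments, start=None, end=None):
    """Keep only segments whose publish-time span overlaps [start, end]."""
    selected = []
    for seg in segments:
        if start is not None and seg.last_publish < start:
            continue  # entirely older than the requested range
        if end is not None and seg.first_publish > end:
            continue  # entirely newer than the requested range
        selected.append(seg.segment_id)
    return selected


segments = [
    Segment(0, datetime(2020, 6, 15, 6, 0), datetime(2020, 6, 15, 7, 59)),
    Segment(1, datetime(2020, 6, 15, 8, 0), datetime(2020, 6, 15, 9, 30)),
    Segment(2, datetime(2020, 6, 15, 9, 31), datetime(2020, 6, 15, 11, 0)),
]
print(segments_to_scan(segments, start=datetime(2020, 6, 15, 9, 0)))
```

With the 09:00 lower bound from the query above, the oldest segment is skipped without being read; without any bound, all three segments would be scanned.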
Case study: Job search analytics at zhaopin.com
Background
● About ZhaoPin
○ Chinese job search website (similar to LinkedIn, Indeed, etc.)
● Background
○ ZhaoPin is already a heavy user of Apache Pulsar
○ Using Pulsar to power their enterprise event bus
■ Data involving job position searches, job posts, and resume searches
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
Use cases
● Debugging search results
○ “When the search results do not meet expectations…”
● Analyzing and improving search results
○ “Analyze the search criteria associated with a position that a job
seeker applied for, such as when the position was first exposed
to that user, in order to improve the search service.”
● Analyzing search logs
○ “Analyze search logs from different perspectives and generate
charts that summarize data in different ways, such as by city,
vocation, or keyword ranking. In this way, the search service can
be improved by making it more specific.”
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
Why Pulsar SQL
● ZhaoPin was already using Pulsar
● Pulsar SQL allows queries using SQL syntax
● Pulsar can store a large amount of data and is
easy to scale up
Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
How to get started?
Quick Start guide:
https://pulsar.apache.org/docs/en/sql-getting-started/
Future work
● Performance tuning
● Store data in columnar format
○ Improve compression ratio
○ Materialize relevant columns
● Support different indices
Questions?
Email: jerryp@splunk.com

More Related Content

What's hot

Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
StreamNative
 
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
StreamNative
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
StreamNative
 

What's hot (20)

Apache Pulsar at Yahoo! Japan
Apache Pulsar at Yahoo! JapanApache Pulsar at Yahoo! Japan
Apache Pulsar at Yahoo! Japan
 
Five years of operating a large scale globally replicated Pulsar installation...
Five years of operating a large scale globally replicated Pulsar installation...Five years of operating a large scale globally replicated Pulsar installation...
Five years of operating a large scale globally replicated Pulsar installation...
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
Building a FaaS with pulsar
Building a FaaS with pulsarBuilding a FaaS with pulsar
Building a FaaS with pulsar
 
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
Introducing HerdDB - a distributed JVM embeddable database built upon Apache ...
 
Transaction preview of Apache Pulsar
Transaction preview of Apache PulsarTransaction preview of Apache Pulsar
Transaction preview of Apache Pulsar
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)
 
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
Exactly-Once Made Easy: Transactional Messaging in Apache Pulsar - Pulsar Sum...
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
 
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
Unify Storage Backend for Batch and Streaming Computation with Apache Pulsar_...
 
Using the JMS 2.0 API with Apache Pulsar - Pulsar Virtual Summit Europe 2021
Using the JMS 2.0 API with Apache Pulsar - Pulsar Virtual Summit Europe 2021Using the JMS 2.0 API with Apache Pulsar - Pulsar Virtual Summit Europe 2021
Using the JMS 2.0 API with Apache Pulsar - Pulsar Virtual Summit Europe 2021
 
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
How Splunk Mission Control leverages various Pulsar subscription types_Pranav...
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
 
Architecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructureArchitecture of a Kafka camus infrastructure
Architecture of a Kafka camus infrastructure
 
Scaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/dayScaling Apache Pulsar to 10 PB/day
Scaling Apache Pulsar to 10 PB/day
 
Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P...
Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P...Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P...
Take Kafka-on-Pulsar to Production at Internet Scale: Improvements Made for P...
 
Interactive Analytics on Pulsar with Pulsar SQL - Pulsar Virtual Summit Europ...
Interactive Analytics on Pulsar with Pulsar SQL - Pulsar Virtual Summit Europ...Interactive Analytics on Pulsar with Pulsar SQL - Pulsar Virtual Summit Europ...
Interactive Analytics on Pulsar with Pulsar SQL - Pulsar Virtual Summit Europ...
 
Query Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache FlinkQuery Pulsar Streams using Apache Flink
Query Pulsar Streams using Apache Flink
 
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...Scenic City Summit (2021):  Real-Time Streaming in any and all clouds, hybrid...
Scenic City Summit (2021): Real-Time Streaming in any and all clouds, hybrid...
 
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
Apache BookKeeper State Store: A Durable Key-Value Store - Pulsar Summit NA 2021
 

Similar to Interactive querying of streams using Apache Pulsar_Jerry peng

Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
Stitch Fix Algorithms
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 

Similar to Interactive querying of streams using Apache Pulsar_Jerry peng (20)

PSUG 1 - 2024-01-22 - Onboarding Best Practices
PSUG 1 - 2024-01-22 - Onboarding Best PracticesPSUG 1 - 2024-01-22 - Onboarding Best Practices
PSUG 1 - 2024-01-22 - Onboarding Best Practices
 
Improving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch FixImproving ad hoc and production workflows at Stitch Fix
Improving ad hoc and production workflows at Stitch Fix
 
Apache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing SystemApache Pulsar: The Next Generation Messaging and Queuing System
Apache Pulsar: The Next Generation Messaging and Queuing System
 
Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013Real world business workflow with SharePoint designer 2013
Real world business workflow with SharePoint designer 2013
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
Big Data on Cloud Native Platform
Big Data on Cloud Native PlatformBig Data on Cloud Native Platform
Big Data on Cloud Native Platform
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 
How Splunk Is Using Pulsar IO
How Splunk Is Using Pulsar IOHow Splunk Is Using Pulsar IO
How Splunk Is Using Pulsar IO
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKSPostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
 
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKSPostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
PostgreSQL-as-a-Service with Crunchy PostgreSQL for PKS
 

More from StreamNative

Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
StreamNative
 

More from StreamNative (20)

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
 
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...
 
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
 
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
 
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
 
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
Message Redelivery: An Unexpected Journey - Pulsar Summit SF 2022
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022Understanding Broker Load Balancing - Pulsar Summit SF 2022
Understanding Broker Load Balancing - Pulsar Summit SF 2022
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
 
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
Pulsar's Journey in Yahoo!: On-prem, Cloud and Hybrid - Pulsar Summit SF 2022
 
Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022Event-Driven Applications Done Right - Pulsar Summit SF 2022
Event-Driven Applications Done Right - Pulsar Summit SF 2022
 
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
Pulsar @ Scale. 200M RPM and 1K instances - Pulsar Summit SF 2022
 
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
Data Democracy: Journey to User-Facing Analytics - Pulsar Summit SF 2022
 
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022
 
Welcome and Opening Remarks - Pulsar Summit SF 2022
Welcome and Opening Remarks - Pulsar Summit SF 2022Welcome and Opening Remarks - Pulsar Summit SF 2022
Welcome and Opening Remarks - Pulsar Summit SF 2022
 
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
Log System As Backbone – How We Built the World’s Most Advanced Vector Databa...
 
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
MoP(MQTT on Pulsar) - a Powerful Tool for Apache Pulsar in IoT - Pulsar Summi...
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理
cyebo
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
Fuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertaintyFuzzy Sets decision making under information of uncertainty
Fuzzy Sets decision making under information of uncertainty
RafigAliyev2
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Pre-ProductionImproveddsfjgndflghtgg.pptx

Interactive querying of streams using Apache Pulsar_Jerry peng

  • 1. © 2020 SPLUNK INC. Interactive Querying of Streams Using Apache Pulsar™ Jerry Peng Pulsar Summit | June 2020 Principal Software Engineer | jerryp@splunk.com Apache {Pulsar, Heron, Storm} committer and PMC member
  • 2. © 2020 SPLUNK INC. Agenda 1) General use cases 2) Existing architectures 3) Apache Pulsar overview 4) Pulsar SQL 5) Concrete use case (zhaopin.com) 6) Demo! 7) Questions?
  • 3. © 2020 SPLUNK INC. What are Streams? Continuous flows of data… Almost all data originate in this form
  • 4. © 2020 SPLUNK INC. Interactive Querying of Streams? Querying both latest and historical data
  • 5. © 2020 SPLUNK INC. How is it useful? ● Speed (i.e. data-driven processing) ○ Act faster ● Accuracy ○ In many contexts the wrong decision may be made if you do not have visibility that includes the most current data ○ For example, historical data is useful to predict a user is interested in buying a particular item, but if my analytics don’t also know that the user just purchased that item two minutes ago they’re going to make the wrong recommendation ● Simplification ○ Single place to go to access current and historical data
  • 6. © 2020 SPLUNK INC. Debugging ● Errors and exceptions ● Troubleshooting systems and networks ● Have we seen these errors before? General use cases
  • 7. © 2020 SPLUNK INC. Monitoring (Audit logs) ● Answering the “What, When, Who, Why” ● Suspicious access patterns ● Example ○ Auditing CDC logs in financial institutions General use cases
  • 8. © 2020 SPLUNK INC. Exploring ● Raw or enriched data ● Really simplifies access if data is all in one location General use cases
  • 9. © 2020 SPLUNK INC. Lots of use cases ● Data analytics ● Business Intelligence ● Real-time dashboards ● etc… General use cases
  • 10. © 2020 SPLUNK INC. Stream processing patterns [Diagram. Layers: Messaging, Compute, Storage. Stages: Data Ingestion, Data Storage, Data Processing / Querying, Results Storage, Data Serving]
  • 11. © 2020 SPLUNK INC. Existing Solutions [Diagram. A data stream feeds separate systems. Messaging: Cloud Pub/Sub. Real-time compute: Apache Storm, Apache Flink, Apache Heron, etc. Storage: HDFS, Cloud Storage. Querying: Apache Hadoop MR, Apache Spark, Presto, etc.]
  • 12. © 2020 SPLUNK INC. Problems with existing solutions ● Multiple Systems ● Duplication of data ○ Data consistency. Where is the source of truth? ● Latency between data ingestion and when data is queryable
  • 13. © 2020 SPLUNK INC. THIS IS WHERE APACHE PULSAR AND PULSAR SQL COME IN…
  • 14. © 2020 SPLUNK INC. Apache Pulsar™ Flexible Messaging + Streaming System backed by a durable log storage
  • 15. © 2020 SPLUNK INC. Apache Pulsar as an Event Store
  • 16. © 2020 SPLUNK INC. Apache Pulsar Overview
  • 17. © 2020 SPLUNK INC. Architecture ● Multi-layer, scalable architecture ● Independent layers for processing, serving and storage ● Messaging and processing built on Apache Pulsar ● Storage built on Apache BookKeeper [Diagram. Producers and consumers connect to the messaging layer (brokers); event storage is provided by bookies; function processing runs on workers]
  • 18. © 2020 SPLUNK INC. Segment Centric Storage ● In addition to partitioning, messages are stored in segments (based on time and size) ● Segments are independent of each other and spread across all storage nodes ● What does this mean for Pulsar SQL? ○ Allows the SQL engine to read from multiple bookies and leverage the disk I/O and bandwidth of multiple machines even if the data is in one partition
  • 19. © 2020 SPLUNK INC. Writes ● Every segment/ledger has an ensemble ● Each entry in ledger has a ○ Write quorum ■ Nodes of the ensemble to which it is written (usually all) ○ Ack quorum ■ Nodes of the write quorum that must respond for that entry to be acknowledged (usually a majority) ● What this means for Pulsar SQL? ○ Allows users to configure the number of replicas SQL engine can read from ○ Trade off between read bandwidth and storage cost
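The ensemble, write quorum, and ack quorum on the slide above can be illustrated with a small model. This is a minimal sketch, assuming entries are striped round-robin across the ensemble; the function names `write_set` and `is_acknowledged` and the bookie names are hypothetical and are not BookKeeper APIs.

```python
from typing import List


def write_set(entry_id: int, ensemble: List[str], write_quorum: int) -> List[str]:
    """Return the bookies that hold a given entry (simplified round-robin model)."""
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def is_acknowledged(acks_received: int, ack_quorum: int) -> bool:
    """An entry counts as written once enough bookies of its write set respond."""
    return acks_received >= ack_quorum


# Hypothetical three-bookie ensemble with a write quorum of 2.
ensemble = ["bookie-1", "bookie-2", "bookie-3"]
for entry_id in range(3):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
```

In this model each entry has as many readable replicas as the write quorum, which is the read-bandwidth versus storage-cost trade-off the slide describes.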
  • 20. © 2020 SPLUNK INC. Apache BookKeeper™ Internals ● Separate IO path for reads and writes ● Optimized for writing, tailing reads, catch-up reads ● What does this mean for Pulsar SQL? ○ Queries often involve scanning the data ○ Read-ahead cache in BK allows for fast sequential reads
  • 21. © 2020 SPLUNK INC. Tiered Storage Unlimited topic storage capacity Achieves the true “stream-storage”: keep the raw data forever in stream form
  • 22. © 2020 SPLUNK INC. Tiered Storage ● Leverage cloud storage services to offload cold data (completely transparent to clients) ● Extremely cost effective; backends: S3 (coming: GCS, HDFS) ● Example: retain all data for 1 month, offload all messages older than 1 day to S3 ● What does this mean for Pulsar SQL? ○ Pulsar SQL can query not only data stored in bookies but also data offloaded to a cloud storage service
  • 23. © 2020 SPLUNK INC. Schema Registry ● Store information on the data structure (stored in BookKeeper) ● Enforce data types on topic ● Allow for compatible schema evolutions ● JSON, Avro, and Protobuf supported ● What does this mean for Pulsar SQL? ○ Allows data to be structured so that it becomes queryable with SQL
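The retention example on this slide (keep one month, offload after one day) can be expressed as a simple age-based rule. The sketch below is an illustration only: it treats "1 month" as 30 days, and the function `storage_tier` and its return labels are hypothetical, not Pulsar configuration keys.

```python
from datetime import timedelta

# Assumed thresholds taken from the slide's example; "1 month" is modelled as 30 days.
OFFLOAD_AFTER = timedelta(days=1)
RETENTION = timedelta(days=30)


def storage_tier(age: timedelta) -> str:
    """Classify where a message of the given age would live under the example policy."""
    if age >= RETENTION:
        return "expired"      # past retention, no longer queryable
    if age >= OFFLOAD_AFTER:
        return "cloud"        # offloaded to cloud storage such as S3
    return "bookkeeper"       # still held on bookies


for hours in (2, 48, 24 * 45):
    print(hours, "hours old ->", storage_tier(timedelta(hours=hours)))
```

Both the "bookkeeper" and "cloud" tiers remain visible to a query, which is why the offload is transparent to clients.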
  • 24. © 2020 SPLUNK INC. Pulsar SQL ● Interactive SQL queries over data stored in Pulsar ● Query old and real-time data
  • 25. © 2020 SPLUNK INC. Pulsar SQL / 2 ● Based on Presto by Facebook: https://prestodb.io/ ● Presto is a distributed query execution engine ● Fetches the data from multiple sources (HDFS, S3, MySQL, …) ● Full SQL compatibility
  • 26. © 2020 SPLUNK INC. Pulsar SQL / 3 ● Pulsar connector for Presto ○ Read data directly from BookKeeper, bypassing the Pulsar broker ■ Can also read data offloaded to Tiered Storage (S3, GCS, etc.) ○ Many-to-many data reads ■ Data is split even on a single partition: multiple workers can read data in parallel from a single Pulsar partition ■ Time-based indexing: use “publishTime” in predicates to reduce the data being read from disk
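The idea that a single partition is split so several workers can read it in parallel can be sketched as dividing a range of entry IDs into contiguous chunks. This is a simplified illustration; `split_entries` is a hypothetical helper, not the connector's actual split logic.

```python
from typing import List, Tuple


def split_entries(first_entry: int, last_entry: int, num_splits: int) -> List[Tuple[int, int]]:
    """Divide the inclusive entry range into up to num_splits contiguous ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    total = last_entry - first_entry + 1
    if total <= 0:
        return []
    base, extra = divmod(total, num_splits)
    ranges = []
    start = first_entry
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first splits
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Ten entries of one partition shared among three workers.
print(split_entries(0, 9, 3))
```

Each range can then be assigned to a different worker, so read throughput is not limited by the number of partitions.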
  • 27. © 2020 SPLUNK INC. Pulsar SQL Architecture
  • 28. © 2020 SPLUNK INC. Benefits ● Do not need to move data into another system for querying ● Read data in parallel ○ Performance not impacted by partitioning ○ Increase throughput by increasing write quorum ● Newly arrived data can be queried immediately
  • 29. © 2020 SPLUNK INC. Compared to other message buses? ● Other messaging platforms have Presto integrations ● They typically use a consumer to read data from brokers ● Topic/partition served by a single broker (limiting disk IO and network bandwidth)
  • 30. © 2020 SPLUNK INC. User interaction
    Connect with CLI client
      $ ./bin/pulsar sql
    List Pulsar cluster
      presto> show catalogs;
       Catalog
      ---------
       pulsar
       system
      (2 rows)
    List Pulsar namespaces
      presto> show schemas in pulsar;
       Schema
      -----------------------
       information_schema
       public/default
       public/functions
       sample/standalone/ns1
    List Pulsar topics
      presto> show tables in pulsar."public/default";
       Table
      ----------------
       generator_test
      (1 row)
    Pulsar SQL
  • 31. © 2020 SPLUNK INC. User interaction
    Query data in topic
      presto> select * from pulsar."public/default".generator_test;
       firstname | middlename | lastname  | email                        | username  | password | telephonenumber | age | companyemail
      -----------+------------+-----------+------------------------------+-----------+----------+-----------------+-----+-------------------------------------
       Genesis   | Katherine  | Wiley     | genesis.wiley@gmail.com      | genesisw  | y9D2dtU3 | 959-197-1860    | 71  | genesis.wiley@interdemconsulting.eu
       Brayden   |            | Stanton   | brayden.stanton@yahoo.com    | braydens  | ZnjmhXik | 220-027-867     | 81  | brayden.stanton@supermemo.eu
       Benjamin  | Julian     | Velasquez | benjamin.velasquez@yahoo.com | benjaminv | 8Bc7m3eb | 298-377-0062    | 21  | benjamin.velasquez@hostesltd.biz
       Michael   | Thomas     | Donovan   | donovan@mail.com             | michaeld  | OqBm9MLs | 078-134-4685    | 55  | michael.donovan@memortech.eu
       Brooklyn  | Avery      | Roach     | brooklynroach@yahoo.com      | broach    | IxtBLafO | 387-786-2998    | 68  | brooklyn.roach@warst.biz
       Skylar    |            | Bradshaw  | skylarbradshaw@yahoo.com     | skylarb   | p6eC6cKy | 210-872-608     | 96  | skylar.bradshaw@flyhigh.eu
       . . .
    Pulsar SQL
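The same query can also be issued from a program rather than the CLI. The following is a hedged sketch: it assumes the third-party `presto-python-client` package (imported as `prestodb`) is installed and that a Pulsar SQL worker is listening on `localhost:8081`; the host, port, user name, and the helpers `qualified_table` and `run_query` are illustrative assumptions, not part of the talk.

```python
def qualified_table(tenant: str, namespace: str, topic: str) -> str:
    """Build the catalog.schema.table name used in the CLI session above."""
    return f'pulsar."{tenant}/{namespace}".{topic}'


def run_query(sql: str, host: str = "localhost", port: int = 8081):
    """Run a query against a Pulsar SQL worker (assumed host/port; needs a live cluster)."""
    import prestodb  # third-party: pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host=host,
        port=port,
        user="demo",            # hypothetical user name
        catalog="pulsar",
        schema="public/default",
    )
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()


table = qualified_table("public", "default", "generator_test")
sql = f"SELECT firstname, age FROM {table} LIMIT 5"
print(sql)
# rows = run_query(sql)  # uncomment when a Pulsar SQL worker is reachable
```

Only the query string is built unconditionally; the network call is left commented out so the snippet runs without a cluster.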
  • 32. © 2020 SPLUNK INC. Demo
  • 33. © 2020 SPLUNK INC. Performance ● Setup ○ 3 nodes ○ 12 CPU cores ○ 128 GB RAM ○ 2 × 1.2 TB NVMe disks ● Results ○ JSON (compressed): ~60 million rows / second ○ Avro (compressed): ~50 million rows / second
  • 34. © 2020 SPLUNK INC. Improving query efficiency
    ● Query by partition
      ○ Scanning large amounts of data may be costly and time-consuming
      ○ If the data is keyed and hashed to a specific partition, you can simply query that specific partition
      ○ For example, if you have tweets keyed by author ingested in Pulsar:
          SELECT tweet.author, tweet.content
          FROM pulsar."public/default".tweets AS tweet
          WHERE tweet.author = 'jerry' AND __partition__ = 1
    ● Query by publish time
      ○ Ledgers/segments are naturally sorted by publish time
      ○ Only data within the requested publish-time range will be read
      ○ Select a range of publish times to minimize the data that needs to be read:
          SELECT tweet.author, tweet.content
          FROM pulsar."public/default".tweets AS tweet
          WHERE tweet.author = 'jerry' AND __partition__ = 1
            AND __publish_time__ > timestamp '2020-06-15 09:00:00'
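Because segments are ordered by publish time, a predicate on publish time lets whole segments be skipped before any data is read. The sketch below models that pruning step; the `Segment` type, its fields, and `prune_segments` are hypothetical names for illustration and do not reflect the connector's internal classes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    ledger_id: int
    first_publish: datetime  # publish time of the oldest entry in the segment
    last_publish: datetime   # publish time of the newest entry in the segment


def prune_segments(
    segments: List[Segment],
    after: Optional[datetime] = None,
    before: Optional[datetime] = None,
) -> List[Segment]:
    """Keep only segments that may contain entries with after < publish time < before."""
    kept = []
    for seg in segments:
        if after is not None and seg.last_publish <= after:
            continue  # every entry is too old
        if before is not None and seg.first_publish >= before:
            continue  # every entry is too new
        kept.append(seg)
    return kept


segments = [
    Segment(1, datetime(2020, 6, 15, 7, 0), datetime(2020, 6, 15, 8, 0)),
    Segment(2, datetime(2020, 6, 15, 8, 0), datetime(2020, 6, 15, 9, 30)),
    Segment(3, datetime(2020, 6, 15, 9, 30), datetime(2020, 6, 15, 11, 0)),
]
kept = prune_segments(segments, after=datetime(2020, 6, 15, 9, 0))
print([seg.ledger_id for seg in kept])
```

With the lower bound from the slide's query, the first segment is skipped entirely, and only the entries inside the remaining segments still need to be filtered row by row.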
  • 35. © 2020 SPLUNK INC. Case study: Job search analytics at zhaopin.com
  • 36. © 2020 SPLUNK INC. Background ● About ZhaoPin ○ Chinese job search website (similar to LinkedIn, Indeed, etc.) ● Background ○ ZhaoPin is already a heavy user of Apache Pulsar ○ Using Pulsar to power their enterprise event bus ■ Data involving job position searches, job posts, and resume searches Case study: Job search analytics at zhaopin.com Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
  • 37. © 2020 SPLUNK INC. ● Debugging search results ○ “When the search results do not meet expectations…” ● Analyzing and improving search results ○ “Analyze the search criteria associated with a position that a job seeker applied for, such as when the position was first exposed to that user, in order to improve the search service.” ● Analyzing search logs ○ “Analyze search logs from different perspectives and generate charts that summarize data in different ways, such as by city, vocation, or keyword ranking. In this way, the search service can be improved by making it more specific.” Use cases Case study: Job search analytics at zhaopin.com Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
  • 38. © 2020 SPLUNK INC. ● ZhaoPin is already using Pulsar ● Pulsar SQL allows queries using SQL syntax ● Pulsar SQL can store a large amount of data and is easy to scale up Why Pulsar SQL Case study: Job search analytics at zhaopin.com Source: https://streamnative.io/blog/tech/2020-05-07-zhaopin-tech-blog/
  • 39. © 2020 SPLUNK INC. Quick Start guide: https://pulsar.apache.org/docs/en/sql-getting-started/ How to get started?
  • 40. © 2020 SPLUNK INC. ● Performance tuning ● Store data in columnar format ○ Improve compression ratio ○ Materialize relevant columns ● Support different indices Future work
  • 41. © 2020 SPLUNK INC. Questions? Email: jerryp@splunk