Why Splunk Chose Pulsar_Karthik Ramasamy

© 2019 SPLUNK INC.
Why Splunk Chose Pulsar
June 2019
Karthik Ramasamy
Splunk

Karthik Ramasamy
Senior Director of Engineering
@karthikz
streaming @splunk | ex-CEO of @streamlio | co-creator of @heronstreaming | ex @Twitter | Ph.D

Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements
regarding future events or plans of the company. We caution you that such statements
reflect our current expectations and estimates based on factors currently known to us
and that actual events or results may differ materially. The forward-looking statements
made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, it may not contain current or
accurate information. We do not assume any obligation to update any
forward-looking statements made herein.
In addition, any information about our roadmap outlines our general product direction
and is subject to change at any time without notice. It is for informational purposes only,
and shall not be incorporated into any contract or other commitment. Splunk undertakes
no obligation either to develop the features or functionalities described or to include any
such feature or functionality in a future release.
Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the
United States and other countries. All other brand names, product names or trademarks belong to their respective owners. © 2020
Splunk Inc. All rights reserved.

Agenda
1) Introduction to Splunk
2) Streaming system requirements
3) How Pulsar satisfies the requirements
4) Apache Pulsar at Splunk
5) Questions

New technologies are enabling and fueling digitization
Cloud, 5G, IoT, AI, Mobility, Virtualization, Robotic Process Automation, Blockchain, VR, Platforms

Data is Transforming Everything
The way we work, live and play

The Data-to-Everything Platform
[Figure labels: Data Lakes; Master Data Management; ETL; Point Data Management Solutions; Data Silos; Business Processes; IT; Security; DevOps]

Core of Emerging Use Cases
Messaging / Streaming Systems
✦ Streaming data transformation
✦ Data distribution
✦ Real-time analytics
✦ Real-time monitoring and notifications
✦ IoT analytics
✦ Event-driven workflows
✦ Interactive applications
✦ Log processing and analytics

Streaming System Requirements
✦ Scalability
✦ Durability
✦ Fault Tolerance
✦ High Availability
✦ Sharing & Isolation
✦ Messaging Models
✦ Client Languages
✦ Persistence
✦ Type Safety
✦ Deployment in k8s

Streaming System Requirements
✦ Adoption
✦ Ecosystem
✦ Community
✦ Licensing
✦ Disaster Recovery
✦ Operability
✦ TCO
✦ Observability

Requirement #1 - Scalability
✦ Traffic can vary widely while the system is in production
✦ System needs to scale up with no effect on publish/consume throughput and latency
✦ Support for linear increase/decrease in publish/consume throughput as nodes are added or removed
✦ Automatically spread load to new machines as new nodes are added
✦ Scalability across different dimensions - serving and storage

Scalability
[Figure: producers and consumers connect to the messaging layer (brokers); event storage is a separate layer of bookies; function processing runs on its own workers]
✦ Independent layers for processing, serving and storage
✦ Messaging and processing built on Apache Pulsar
✦ Storage built on Apache BookKeeper

Requirement #2 - Durability
✦ Splunk applications need different types of durability
✦ Persistent Durability - No data loss in the presence of node failures or entire cluster failure - e.g. security & compliance
✦ Replicated Durability - No data loss in the presence of a limited number of node failures - e.g. machine logs
✦ Transient Durability - Data may be lost in the presence of failures - e.g. metrics data

Durability
[Figure: Producer → Broker → three Bookies; each bookie writes the entry to its journal and fsyncs it]
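The write path on this slide can be illustrated with a toy model. This is a minimal sketch, not BookKeeper's implementation: the `Bookie` and `Broker` classes and the `ack_quorum` parameter are hypothetical names chosen for illustration, and appending to a list stands in for a journal write followed by fsync. It only shows the idea that a publish is acknowledged once enough bookies have made the entry durable.

```python
from dataclasses import dataclass, field


@dataclass
class Bookie:
    """Toy storage node: an entry counts as durable once it is in the journal."""
    name: str
    alive: bool = True
    journal: list = field(default_factory=list)

    def add_entry(self, entry: bytes) -> bool:
        if not self.alive:
            return False
        self.journal.append(entry)  # stands in for journal write + fsync
        return True


@dataclass
class Broker:
    """Toy broker: sends each entry to every bookie and counts successful syncs."""
    bookies: list
    ack_quorum: int  # hypothetical knob: syncs required before acking the producer

    def publish(self, entry: bytes) -> bool:
        synced = sum(1 for b in self.bookies if b.add_entry(entry))
        return synced >= self.ack_quorum
```

With three bookies and `ack_quorum=2`, a publish still succeeds after one bookie fails and is rejected once two have failed, which is the behavior "Replicated Durability" asks for.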

Requirement #3 - Fault Tolerance
✦ Ability of the system to function under component failures
✦ Ideally without any manual intervention up to a certain degree

Pulsar Fault Tolerance
[Figure: a serving layer of three brokers above a storage layer of four bookies; each bookie holds a different subset of segments 1 to n]
✦ Broker Failure
  ✦ Topic reassigned to available broker based on load
  ✦ Can construct the previous state consistently
  ✦ No data needs to be copied
✦ Bookie Failure
  ✦ Immediate switch to a new node
  ✦ Background process copies segments to other bookies to maintain replication factor
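The bookie-failure bullets can be sketched as a small recovery routine. This is an illustrative model only, assuming a hypothetical `placement` map from bookie name to the segment ids it stores; the function name `rereplicate` and the "least loaded first" target choice are assumptions, not BookKeeper's actual auto-recovery logic.

```python
def rereplicate(placement, failed, replication_factor):
    """Return a new placement with the failed bookie's segments copied elsewhere.

    placement maps bookie name -> set of segment ids stored on that bookie.
    The input mapping is left untouched.
    """
    survivors = {b: set(segs) for b, segs in placement.items() if b != failed}
    for segment in sorted(placement.get(failed, set())):
        holders = [b for b, segs in survivors.items() if segment in segs]
        # Candidate targets: surviving bookies without a copy, least loaded first.
        candidates = sorted(
            (b for b in survivors if segment not in survivors[b]),
            key=lambda b: (len(survivors[b]), b),
        )
        missing = replication_factor - len(holders)
        for target in candidates[:max(missing, 0)]:
            survivors[target].add(segment)
    return survivors
```

Only the segments that lost a replica are copied; segments the failed bookie never held are left where they are.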

Requirement #4 - High Availability
✦ System should continue to function in the cloud or on-prem in the following conditions, if applicable
✦ When two nodes/instances fail
✦ When an availability zone or a rack fails

Pulsar High Availability
[Figure: a serving layer of three brokers above a storage layer of three bookies spread across Zone A, Zone B and Zone C; each bookie holds a different subset of segments 1 to n]
✦ Node Failures
  ✦ Broker failures
  ✦ Bookie failures
  ✦ Handled the same way as the respective component failures
✦ Zone/Rack Failures
  ✦ Bookies provide rack awareness
  ✦ Brokers replicate data to different racks/zones
  ✦ In the presence of a zone/rack failure, data is available in other zones
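Rack or zone awareness can be sketched as a placement rule that spreads replicas over zones before reusing one. This is a simplified illustration, not the placement policy BookKeeper ships: `pick_ensemble`, `readable_after_zone_failure`, and the `bookies_by_zone` layout are hypothetical names used only for this example.

```python
def pick_ensemble(bookies_by_zone, replicas):
    """Pick `replicas` bookies, cycling over zones so copies land in different zones.

    bookies_by_zone maps zone name -> list of bookie names in that zone.
    Returns a list of (zone, bookie) pairs.
    """
    zones = sorted(z for z, bs in bookies_by_zone.items() if bs)
    pools = {z: sorted(bookies_by_zone[z]) for z in zones}
    chosen = []
    while len(chosen) < replicas and any(pools.values()):
        for zone in zones:
            if pools[zone] and len(chosen) < replicas:
                chosen.append((zone, pools[zone].pop(0)))
    return chosen


def readable_after_zone_failure(ensemble, failed_zone):
    """True if at least one replica lives outside the failed zone."""
    return any(zone != failed_zone for zone, _ in ensemble)
```

With one replica per zone, losing any single zone still leaves two copies readable, which is the property the slide describes.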

Requirement #5 - Sharing and Isolation
✦ System should have the capabilities to
✦ Share many applications on the same cluster for cost and manageability purposes
✦ Isolate different applications on their own machines in the same cluster when needed

Sharing and Isolation
[Figure: one Apache Pulsar cluster shared by three tenants, with storage quotas of 10 TB, 7 TB and 5 TB shown. Product Safety (ETL, Fraud Detection): Account History, User Clustering, Risk Classification. Marketing (Campaigns, ETL): Budgeted Spend, Demographic Classification, Location Resolution. Data Serving (Microservice): Customer Authentication.]
✦ Software isolation: storage quotas, flow control, back pressure, rate limiting
✦ Hardware isolation: constrain some tenants on a subset of brokers/bookies

Requirement #6 - Client Languages
Clients officially supported by the project: Java, Python, Go, C++, C

Requirement #7 - Multiple Messaging Models
✦ Splunk applications require different consuming models
✦ Collect once and deliver once, e.g. process an S3 file and ingest it into an index
✦ Receive data once and deliver many times, e.g. multiple pipelines sharing the same data for different types of processing
✦ Avoid two systems, if possible - from a cost and operations perspective
✦ Avoid any additional infra-level code, if possible, that emulates one semantics on top of another system

Pulsar Messaging Models
Messaging
• Exclusive Subscription
• Failover Subscription
Queuing
• Shared Subscription
• Key Shared Subscription
Native support avoids two systems and extra infrastructure code that requires maintenance
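The difference between the four subscription types can be sketched as a dispatch rule for a single subscription. This is a simplified model, not the Pulsar client API: the `dispatch` function and the mode strings are hypothetical, and `zlib.crc32` is only a stand-in for whatever key hashing the broker actually uses. Separate subscriptions on the same topic would each receive their own full copy of the messages, which is how "receive once, deliver many times" is covered.

```python
import zlib
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Assign (key, payload) messages to the consumers of ONE subscription.

    Returns {consumer: [payload, ...]}.
    """
    out = {c: [] for c in consumers}
    if mode in ("exclusive", "failover"):
        # One active consumer receives everything, in order. With failover,
        # the remaining consumers are standbys that take over if it disconnects.
        for _, payload in messages:
            out[consumers[0]].append(payload)
    elif mode == "shared":
        # Queue semantics: messages are spread round-robin over all consumers.
        rr = cycle(consumers)
        for _, payload in messages:
            out[next(rr)].append(payload)
    elif mode == "key_shared":
        # Queue semantics with per-key ordering: one key always maps to one consumer.
        for key, payload in messages:
            idx = zlib.crc32(key.encode()) % len(consumers)
            out[consumers[idx]].append(payload)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```

In this model, "shared" maximizes parallelism at the cost of ordering, while "key_shared" keeps all messages for a given key on one consumer.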

Requirement #8 - Persistence
[Figure: producers publish to a topic and consumers read from it; the topic's data spans hot storage and cold storage]
✦ Offload cold data to lower-cost storage (e.g. cloud storage, HDFS)
✦ Manual or automatic (configurable threshold)
✦ Transparent to publishers and consumers
✦ Allows near-infinite event storage at low cost, e.g. for compliance and security
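Threshold-based offload with transparent reads can be sketched as follows. This is a toy model under stated assumptions: the `TieredTopic` class, its `threshold_bytes` parameter, and the "oldest segment first" policy are hypothetical and are not taken from Pulsar's tiered-storage code.

```python
from collections import OrderedDict


class TieredTopic:
    """Toy topic whose segments live in hot storage until a size threshold is hit."""

    def __init__(self, threshold_bytes):
        self.threshold_bytes = threshold_bytes  # hypothetical automatic-offload threshold
        self.hot = OrderedDict()  # segment id -> bytes, oldest first
        self.cold = {}            # stands in for cloud storage or HDFS

    def append_segment(self, seg_id, data):
        self.hot[seg_id] = data
        self._offload_if_needed()

    def _offload_if_needed(self):
        # Move the oldest segments to cold storage until hot data fits the
        # threshold; the newest segment always stays hot.
        while len(self.hot) > 1 and sum(map(len, self.hot.values())) > self.threshold_bytes:
            seg_id, data = self.hot.popitem(last=False)
            self.cold[seg_id] = data

    def read(self, seg_id):
        # Readers use one call; which tier holds the data is hidden from them.
        return self.hot[seg_id] if seg_id in self.hot else self.cold[seg_id]
```

The `read` method is the point of the sketch: callers never need to know whether a segment has been offloaded.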

Requirement #9 - Type Safety
✦ Splunk applications are varied
✦ One class requires a fixed schema
✦ Another class requires a fixed schema with evolution
✦ Another class needs the flexibility of no schema, or a schema handled at the application level
✦ Avoid bringing in another system for schema management
✦ Support for multiple different types -

Pulsar Schema Registry
✦ Provides type safety to applications built on top of Pulsar
✦ Server side - system enforces type safety and ensures that producers and consumers remain synced
✦ Schema registry enables clients to upload data schemas on a per-topic basis
✦ Schemas dictate which data types are recognized as valid for that topic
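Server-side enforcement of a per-topic schema can be sketched with a small in-memory registry. This is an illustration only: the `SchemaRegistry` class, its `upload` and `validate` methods, and the use of Python types as a schema are hypothetical stand-ins, not Pulsar's schema registry API or its supported schema formats.

```python
class SchemaRegistry:
    """Toy per-topic registry where a schema is {field name: Python type}."""

    def __init__(self):
        self._schemas = {}

    def upload(self, topic, schema):
        # Clients register the schema that is valid for one topic.
        self._schemas[topic] = dict(schema)

    def validate(self, topic, record):
        # Records are checked against the topic's schema before being accepted.
        schema = self._schemas.get(topic)
        if schema is None:
            return True  # topics without a schema accept anything
        return (record.keys() == schema.keys()
                and all(isinstance(record[k], t) for k, t in schema.items()))
```

A topic with no uploaded schema accepts any record, which models the "no schema, or handled at the application level" class of applications.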

Requirement #10 - Ease of Deployment in k8s
✦ Splunk uses k8s for orchestration
✦ System should be easily deployable in k8s
✦ Surface area of the system exposed outside k8s should be minimal - one single end point backed by
✦ Should be able to segregate the nodes receiving external traffic
✦ Should be flexible to deploy from CI/CD pipelines for testing and development

Pulsar Deployment in k8s
[Figure: two layouts side by side, Aggregated Deployment and Segregated Deployment; in each, a load balancer (LB) fronts three proxies ahead of the nodes that hold segments 1 to n]

Requirement #11 - Operability
✦ System should be online and continue to serve production traffic in the following scenarios
  ✦ OS upgrades
  ✦ Security patches
  ✦ Disk swapping
  ✦ Upgrading
✦ Self-adjusting components
  ✦ Bookies switch themselves to read-only when the disk is 90% full
  ✦ Load manager to balance traffic across brokers
Requirement #12 - Disaster Recovery
✦ Critical enterprise data flows through Splunk products
✦ Customers expect continuous availability in cloud / on-premise
✦ Required to handle data center failures seamlessly
✦ Pulsar provides both
✦ Asynchronous Replication
✦ Synchronous Replication

Disaster Recovery - Async Replication
✦ Two independent clusters, in a primary/standby or primary/primary configuration
✦ Configured tenants and namespaces replicate to standby
✦ Data published to primary is asynchronously replicated to standby
✦ Producers and consumers restarted in second datacenter upon primary failure
✦ With replicated subscriptions, consumers start close to where they left off
[Figure: Datacenter A runs the primary Pulsar cluster with active producers and consumers; Datacenter B runs the standby Pulsar cluster with standby producers and consumers; Pulsar replication links the two clusters, each with its own ZooKeeper]

Requirement #13 - Performance & TCO
✦ Splunk application requirements vary widely
✦ real-time (< 10 ms)
✦ near real-time (< a few minutes)
✦ high throughput (ability to handle multiple PB/day in a single cluster)
✦ Conducted a detailed performance study comparing Pulsar with Kafka

Performance
✦ Pulsar consistently provides 5x-50x lower latency
✦ Pulsar uses 20-30% fewer brokers + bookies as it efficiently exploits available disk bandwidth
✦ Pulsar uses 50-60% fewer CPU cores with complete control of memory
✦ Pulsar single-partition throughput is 5x higher, with 5x-50x lower latency

Pulsar is 1.5-2x lower in capex, with a 5-50x improvement in latency, and 2-3x lower in opex due to its layered architecture

Requirement #14 - Observability
✦ When in production, we need visibility into the overall health of the system and its components
✦ System should expose detailed, relevant metrics
✦ Should be easy to debug and troubleshoot

Pulsar Observability
✦ System overview metrics
✦ Messaging metrics
✦ Topic metrics
✦ Function metrics
✦ Broker metrics
✦ Bookie metrics
✦ Proxy metrics
✦ JVM metrics
✦ Log metrics
✦ ZooKeeper metrics
✦ Container metrics
✦ Host metrics

Requirement #18 - Licensing
✦ Apache License 2.0
✦ Affiliated with vendor-neutral institutions - Apache/CNCF
✦ Avoid vendor-controlled components, if needed
✦ Vendor could change the license later

Apache Pulsar vs Apache Kafka
✦ Multi-tenancy: A single cluster can support many tenants and use cases
✦ Seamless Cluster Expansion: Expand the cluster without any downtime
✦ High throughput & Low Latency: Can reach 1.8 M messages/s in a single partition and publish latency of 5 ms at the 99th percentile
✦ Durability: Data replicated and synced to disk
✦ Geo-replication: Out-of-the-box support for geographically distributed applications
✦ Unified messaging model: Supports both Topic & Queue semantics in a single model
✦ Tiered Storage: Hot/warm data for real-time access and cold event data in cheaper storage
✦ Pulsar Functions: Flexible lightweight compute
✦ Highly scalable: Can support millions of topics, which makes data modeling easier
✦ Licensing: Apache 2.0 - no vendor-specific licensing
✦ Multiprotocol Handlers: Support for AMQP, MQTT and Kafka
✦ OSS: Several core features of Pulsar are in Apache, as compared to Kafka

Apache Pulsar at Splunk
✦ Apache Pulsar as a service is running in production, processing several billion messages/day
✦ Apache Pulsar is integrated as the message bus with Splunk DSP 1.1.0 - core streaming product
✦ Apache Pulsar is being introduced in other initiatives as well

Splunk DSP
A real-time stream processing solution that collects, processes and delivers data to Splunk and other destinations in milliseconds
Splunk Data Stream Processor
✦ Turn Raw Data Into High-value Information: Aggregate, Format, Normalize, Transform, Filter, Enhance
✦ Protect Sensitive Data: Detect Data Patterns or Conditions, Mask Sensitive Data
✦ Distribute Data To Splunk Or Other Destinations: Data Warehouse, Public Cloud, Message Bus

DSP solves the challenges
Insights, Data Visibility, Control
✦ Data-driven decision making is challenged by multiple instances, subsidiaries, on-premise + cloud/multi-cloud
✦ Massive amounts of data make it hard to collect, protect and deliver the right data to the right users and systems
✦ Generate business-critical insights faster to remain competitive in a data-driven environment

DSP Architecture
[Figure labels: REST Client, Forwarders, Data Source; HEC, S2S, Batch; Apache Pulsar; Stream Processing Engine; Splunk Indexer, External Systems]
Apache Pulsar is at the core of DSP

Closing Remarks
Future Work
✦ Auto-partitioning
✦ Pluggable metadata store
✦ Enhancing the state store
Current Work
✦ Improved Go client
✦ Support for batch connectors
✦ Pulsar k8s operator
✦ Critical bug fixes
Splunk is committed to advancing Apache Pulsar - as it is used by our core products and cloud services
