© 2019 SPLUNK INC.
Scaling Apache Pulsar to 10 PB/day
June 2021
Karthik Ramasamy
Splunk
Karthik Ramasamy
Senior Director of Engineering
@karthikz
streaming @splunk | ex-CEO of @streamlio | co-creator of @heronstreaming | ex @Twitter | Ph.D
Forward-Looking Statements

During the course of this presentation, we may make forward-looking statements regarding future events or plans of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results may differ materially. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, it may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements made herein.

In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release.

Splunk, Splunk>, Data-to-Everything, D2E and Turn Data Into Doing are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names or trademarks belong to their respective owners. © 2020 Splunk Inc. All rights reserved.
Agenda
1) Introduction to Splunk & DSP
2) Requirements, Use Cases & Deployment
3) Initial Cluster Size Estimation
4) Optimizations
5) Conclusion
The Data-to-Everything Platform
[Diagram] The platform spans data lakes, master data management, ETL, point data management solutions, data silos and business processes, serving IT, security and DevOps.
Splunk DSP
A real-time stream processing solution that collects, processes and delivers data to Splunk and other destinations in milliseconds.
[Diagram] Splunk Data Stream Processor: detect data patterns or conditions; mask and protect sensitive data; aggregate, format, normalize, transform, filter and enhance events; turn raw data into high-value information; distribute data to Splunk or other destinations such as a data warehouse, public cloud or message bus.
DSP - Bird’s Eye View
[Diagram] Data sources (HEC, S2S, forwarders, REST clients, batch) feed Apache Pulsar, which buffers events for the stream processing engine; processed data is delivered to the Splunk indexer and to external systems. Apache Pulsar is at the core of DSP.
Customer Requirements & Deployment
Customers
✦ DSP is deployed at several customers
✦ Some of those customers are marquee customers with a large volume of data
✦ One such marquee customer is in finance and payments
Use Cases
✦ Microservices and applications emit logs
✦ Logs contain rich information
✦ Process these logs and extract monitoring & tracing information
✦ Filter these logs depending on log volume and whether they carry high value that justifies retention
✦ Compute real-time business metrics
Data Requirements
✦ Environment - Google Cloud Platform
✦ Use of n1-standard-32 VMs
✦ Raw data ingestion of 10 PB/day, which translates to ~120 GB/sec (see the arithmetic below)
✦ Data retention of 3 hours
✦ Need to handle the entire traffic load when a zone fails
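As a sanity check, my arithmetic (assuming decimal units; not taken from the deck):

raw_pb_per_day = 10
gb_per_sec = raw_pb_per_day * 1_000_000 / 86_400   # 10 PB/day spread over 86,400 seconds
print(round(gb_per_sec, 1))                         # ~115.7 GB/sec, which the deck rounds to ~120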
DSP Deployment
[Diagram] Log publishers send data into the DSP ingest cluster, which runs the Apache Pulsar cluster; three separate DSP compute clusters run Pipeline 1, Pipeline 2 and Pipeline 3, consuming from Pulsar and delivering to Splunk Enterprise and Splunk Observability.
DSP Deployment
✦ Separation of ingestion and computation
✦ Pipeline isolation and no noisy neighbor issues
✦ Troubleshooting a single pipeline gets easier
✦ Might not need over-provisioning except for peak load + fudge factor (as compared to deploying a single cluster)
VM Configuration - n1-standard-32
✦ 32 vCPUs
✦ 120 GB of memory
✦ Max number of PDs (EBS equivalent) - 128
✦ Max total PD size - 257 TB
✦ Max egress network bandwidth - 32 Gbps (4 GBps)
✦ Max 24 L-SSDs for a total of 9 TB
Storage Options in GCP
✦ P-HDD (persistent disk, HDD-backed)
✦ P-SSD (persistent disk, SSD-backed)
✦ L-SSD (local SSD)
Initial Estimation
Apache Pulsar Requirements
✦ Replication factor of 3
✦ Need to handle 120 GBps of raw traffic (a sketch of the derivation follows this list)
✦ Need to handle 360 GBps of storage write bandwidth
✦ With the journal, required write bandwidth is 720 GBps
✦ Total storage required for retention - 3.9 PB
✦ Total ingress network bandwidth - 480 GBps
✦ Total egress network bandwidth - 1200 GBps
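A hedged sketch of where these figures come from (my arithmetic; the 2x journal and 4X/10X network multipliers come from the following slides, and the exact unit conventions are an assumption):

raw = 120                                  # GBps of raw traffic
replicas = 3

payload_writes = raw * replicas            # 360 GBps written to the bookies' data disks
with_journal = payload_writes * 2          # 720 GBps once every entry is also written to a journal

retention_sec = 3 * 3600
storage_pb = raw * replicas * retention_sec / 1_000_000   # ~3.9 PB retained across the cluster

ingress = 4 * raw                          # 480 GBps: producer->broker plus broker->bookies
egress = 10 * raw                          # 1200 GBps: broker fan-out plus bookie writes to persistent disks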
Pulsar Cluster Size Estimation
✦ The size of a Pulsar cluster for a given workload depends on three parameters (a sizing sketch follows this list):
✦ Storage Density - aggregate storage capacity needed in the cluster, proportional to the data retention
✦ Storage Bandwidth - aggregate write and read throughput needed for data ingestion and consumption; heavily depends on the storage media
✦ Network Bandwidth - aggregate network bandwidth available in the cluster for input and output traffic
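A minimal sizing sketch of how the three dimensions combine into a VM count (the per-VM limits are the ones quoted on the following slides; the function name and rounding are mine, and the results are ballpark figures rather than an exact reproduction of the deck's numbers):

import math

def estimate_vms(write_gbps, storage_tb, egress_gbps,
                 vm_write_gbps, vm_storage_tb=9, vm_net_gbps=4):
    # The cluster must satisfy all three constraints, so take the max.
    by_bandwidth = math.ceil(write_gbps / vm_write_gbps)
    by_density = math.ceil(storage_tb / vm_storage_tb)
    by_network = math.ceil(egress_gbps / vm_net_gbps)
    return max(by_bandwidth, by_density, by_network)

# Example: P-HDD with journal (200 MB/s of write throughput per VM)
print(estimate_vms(720, 3900, 1200, vm_write_gbps=0.2))   # ~3600, close to the deck's 3686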
Estimating Storage Bandwidth
[Diagram] The producer sends 1X of data to a broker; the broker writes 3X (one copy to each of three bookies); each bookie writes the entry to both its journal and its data disk (2X per bookie), so the cluster performs 6X of storage writes for every 1X of raw data.
Estimating Network Bandwidth
[Diagram] With persistent disks, for every 1X of raw data the broker sees 1X ingress from the producer and 4X egress (3X to the bookies plus 1X to the consumer); each bookie sees 1X ingress and 2X egress for its journal and data writes to network-attached PDs. In aggregate the cluster needs 4X ingress and 10X egress.
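Plugging in the ~120 GBps raw rate (my arithmetic; the per-role breakdown above is my reading of the diagram):

raw = 120
ingress = 4 * raw    # 480 GBps: producer->broker plus broker->bookies
egress = 10 * raw    # 1200 GBps: broker fan-out to bookies and consumers, plus the bookies'
                     # journal/data writes, which traverse the network because GCP persistent
                     # disks are network-attached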
Estimating VMs using P-HDD
[Chart] VMs with journal: storage bandwidth requires ~3686 VMs (max of 200 MB/sec write throughput per VM), storage density ~444 VMs (max of 9 TB per instance), and network bandwidth ~300 VMs (max of 4 GBps egress and ingress bandwidth). Dominated by storage bandwidth.
Estimating VMs using P-SSD
[Chart] VMs with journal: storage bandwidth requires ~1843 VMs (max of 400 MB/sec write throughput per VM), storage density ~444 VMs (max of 9 TB per instance), and network bandwidth ~300 VMs (max of 4 GBps egress and ingress bandwidth). Dominated by storage bandwidth.
Estimating Network Bandwidth using L-SSD
[Diagram] With local SSDs, journal and data writes stay on the VM and do not traverse the network. The broker sees 1X ingress and 4X egress, and each bookie sees 1X ingress; in aggregate the cluster needs 4X ingress and 5X egress.
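In numbers (my arithmetic from the slide's multipliers, at ~120 GBps raw): 4 × 120 = 480 GBps of ingress and 5 × 120 = 600 GBps of egress, versus 1200 GBps of egress when journal and data go to network-attached persistent disks.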
Estimating VMs using L-SSD
[Chart] VMs with journal: storage bandwidth requires ~868 VMs (max of 850 MB/sec write throughput per VM), storage density ~444 VMs (max of 9 TB per instance), and network bandwidth ~120 VMs (max of 4 GBps egress and ingress bandwidth). Dominated by storage bandwidth.
Estimation of VMs - Comparison
[Chart] VMs with journal: P-HDD 3686, P-SSD 1843, L-SSD 868.
Optimizations
Optimization #1 - Eliminating Journal
✦ Different types of durability:
✦ Persistent Durability - no data loss in the presence of node failures or an entire cluster failure
✦ Replicated Durability - no data loss in the presence of a limited number of node failures
✦ Transient Durability - data loss in the presence of failures
Since all the data is machine logs, we implemented replicated durability.
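Eliminating the journal halves the device write path; a quick recalculation from the earlier figures (my arithmetic):

payload_writes = 360               # GBps: 3 replicas of the 120 GBps raw stream
with_journal = 720                 # GBps when each bookie also writes a journal copy
without_journal = payload_writes   # back to 360 GBps with replicated durability
# Storage-bandwidth-dominated VM counts roughly halve, e.g. P-HDD 3686 -> 1843 on the next chart.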
Replicated Durability
[Diagram] The broker writes each entry to the data disk of three bookies; no journal is written.
Estimating VMs
[Chart] VMs with journal vs. without journal: P-HDD 3686 -> 1843, P-SSD 1843 -> 922, L-SSD 868 -> 444. P-HDD and P-SSD remain dominated by storage bandwidth; L-SSD becomes dominated by storage density.
Optimization #2 - Direct I/O
✦ The overhead of the page cache in a container environment is pretty high
✦ The kernel needs to track the page-cache usage quota per container
✦ This translates into maintaining additional data structures and lookups (older kernels had O(n^2) lookup time for getting pages in and out)
✦ Bypassed the page cache for the BookKeeper entry log, using JNI (a sketch of the idea follows this list):
✦ We already have in-memory caches (write and read-ahead)
✦ We have better control over what to cache and when to evict
✦ Avoid double buffering
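DSP does this from Java through JNI; purely as an illustration of the O_DIRECT idea, here is a minimal Linux-only Python sketch (the file path is a placeholder, and this is not the BookKeeper implementation):

import mmap
import os

BLOCK = 4096   # O_DIRECT requires block-aligned offsets, lengths and buffers

# O_DIRECT asks the kernel to bypass the page cache for this file descriptor.
fd = os.open("/data/entrylog.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)

buf = mmap.mmap(-1, BLOCK)     # anonymous mmap memory is page-aligned
buf.write(b"\x00" * BLOCK)     # one block-sized, aligned payload
os.write(fd, buf)              # the write goes to the device, not the page cache
os.close(fd)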
Performance of Direct I/O
[Chart] Write throughput per VM before -> after direct I/O: P-HDD 200 -> 300 MB/s, P-SSD 400 -> 600 MB/s, L-SSD 850 -> 1600 MB/s.
Estimating VMs
[Chart] VMs with journal / without journal / with direct I/O: P-HDD 3686 / 1843 / 1228, P-SSD 1843 / 922 / 614, L-SSD 868 / 444 / 444. P-HDD and P-SSD are dominated by storage bandwidth; L-SSD is dominated by storage density.
Optimization #3 - Use of Compression
[Diagram] The producer compresses data before sending; compressed data flows through the broker and is stored on the bookies; the consumer receives compressed data and uncompresses it.
Employing compression
✦ Compression ratio of 4x (a client-side configuration sketch follows this list)
✦ Need to handle 360 GBps -> 90 GBps of storage write bandwidth
✦ Total storage required for retention - 3.9 PB -> 975 TB
✦ Total ingress network bandwidth - 480 GBps -> 120 GBps
✦ Total egress network bandwidth - 1200 GBps -> 240 GBps
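Compression happens in the client. The deck's pipelines use the Pulsar C++ client; purely for illustration, a roughly equivalent setup with the Python pulsar-client (service URL, topic name and codec are placeholders, not the production configuration):

import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer(
    "persistent://public/default/ingest-logs",
    compression_type=pulsar.CompressionType.LZ4,   # producer-side compression; consumers decompress transparently
    batching_enabled=True,                         # larger batches also compress better
    batching_max_publish_delay_ms=10,
)
producer.send(b"sample log line")
client.close()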
Estimating VMs
[Chart] VMs with journal / without journal / with direct I/O / with compression: P-HDD 3686 / 1843 / 1228 / 308, P-SSD 1843 / 922 / 614 / 154, L-SSD 868 / 444 / 444 / 111. P-HDD and P-SSD are dominated by storage bandwidth; L-SSD is dominated by storage density.
Surviving Zone Failure
[Diagram] Segments are replicated across bookies in Zone A, Zone B and Zone C, so copies of every segment exist in multiple zones (storage), while brokers in each zone continue serving.
✦ Zone/rack failures
✦ Bookies provide rack awareness
✦ Brokers replicate data to different racks/zones
✦ In the presence of a zone/rack failure, data is available in the other zones
✦ One zone failure means two zones should be capable of handling the entire traffic
✦ Requires 50% additional VMs (the arithmetic is sketched below)
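The 50% figure follows from the capacity math (my arithmetic): if the full load needs N VMs, each of the two surviving zones must be able to carry N/2 on its own, so all three zones are provisioned at N/2 each, i.e. 3 × N/2 = 1.5N.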
Estimating VMs
[Chart] VMs with journal / without journal / with direct I/O / with compression / to survive zone failure: P-HDD 3686 / 1843 / 1228 / 308 / 616, P-SSD 1843 / 922 / 614 / 154 / 308, L-SSD 868 / 444 / 444 / 111 / 222.
Optimization #4 - C++ Client CPU & Memory Usage
✦ Better round robin across partitions - maximizing the batch size per partition (a toy sketch of the idea follows this list)
✦ Having bigger batches reduces the CPU usage for the client, broker and bookies
✦ Increases the compression factor
✦ Reduced client memory usage
✦ Optimizations to minimize memory allocation overhead
✦ Implemented a memory limit in the C++ producer
✦ Simplifies the user configuration - one single setting instead of multiple queue sizes and complex math
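A toy sketch of the batch-aware routing idea (the class name and limits are invented for illustration; the real change is in the C++ client): stay on one partition until its batch fills, instead of rotating partitions on every message.

class BatchAwareRouter:
    def __init__(self, num_partitions, max_batch_msgs=1000, max_batch_bytes=128 * 1024):
        self.num_partitions = num_partitions
        self.max_batch_msgs = max_batch_msgs
        self.max_batch_bytes = max_batch_bytes
        self.current = 0
        self.msgs = 0
        self.bytes = 0

    def route(self, msg_size):
        # Switch partitions only when the current batch is full, so each
        # partition receives large batches instead of single messages.
        if self.msgs >= self.max_batch_msgs or self.bytes + msg_size > self.max_batch_bytes:
            self.current = (self.current + 1) % self.num_partitions
            self.msgs = 0
            self.bytes = 0
        self.msgs += 1
        self.bytes += msg_size
        return self.current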
Finally …
Running 200 n1-standard-32 VMs for the Pulsar cluster, with 24 L-SSDs per VM.
OSS Contributions
✦ GitHub Pull Request 8283 - C++ Client is allocating buffer bigger than needed
✦ GitHub Pull Request 8331 - C++ Client back-pressure is done on batches rather than number of messages
✦ GitHub Pull Request 8395 - C++ Implement batch aware producer router
✦ GitHub Pull Request 9679 - C++ Implemented memory limit in C++ producer
Future Work
✦ Typical operations involve:
✦ Upgrade to a new version of Pulsar
✦ Upgrade to a new OS version
✦ Apply new security patches to the OS
✦ New Pulsar version - in-place upgrades
✦ OS & security patches - applying one VM at a time is too slow for a large cluster
Thank You