Key Considerations in Productionizing Streaming Applications
Vikram Agrawal, Prateek Srivastava
Copyright 2018 © Qubole
Agenda
● Stream Processing Paradigm
● Deep-dive into Structured Streaming
● Productionizing Streaming Applications
● Streaming Lens
Data Processing Architecture
○ Data is pushed into flat files, HDFS, or databases
○ ETL batch jobs process the raw data for various end goals
Stream Processing
○ Message buses such as Kafka/Kinesis/RabbitMQ become part of the architecture
○ The business needs to process data in real time instead of in a nightly batch job
Stream Processing Use Cases
● Real-time transformations such as aggregations and deduplication
● Data enrichment using joins with other tables/streams
● Ingest into a data lake (such as S3) for further processing or archival
● Ingest into a data warehouse (Redshift, ES) for ad-hoc analysis
● Real-time dashboards/reporting (Druid etc.)
● CEP rule processing or model scoring (fraud detection etc.)
How to Decide on the Streaming Engine
● SLAs and use cases
○ Latency
■ Ingestion/reporting use cases can tolerate a few seconds of latency
■ Model scoring has tighter requirements (in ms)
○ Throughput
■ Current and future incoming data rate
○ Complexity of analytics
■ Real-time transformation requirements: joins and format conversion vs. filters and selections
● Community support
○ Technical skills required to adopt a new technology
● Production readiness
○ Time required to build the streaming application
○ Fault tolerance: exactly-once/at-least-once delivery guarantees
Why Spark Structured Streaming
● Latency
○ Micro-batch execution for “sub-second to a few seconds” latency is GA
○ Continuous execution for “ms” latency is in beta
● Functionality
○ Built on top of Spark DataFrame APIs; takes advantage of the SQL core engine code and memory optimizations
○ Stream-stream joins, stream-batch joins, late data handling, sliding-window aggregation, data format conversion, deduplication, etc.
○ Connectors to and from various sources and sinks
○ Exactly-once/at-least-once semantics
● Throughput
○ Scalable and mature processing engine
○ Can easily handle tens of millions of records per second
● API abstractions
○ Developer friendly: interoperability between batch and streaming code
Structured Streaming
Spark’s Functionality
Structured Streaming - Under the Hood
Abstraction of repeated queries:
• Data streams as an unbounded table
• A streaming query is a batch-like operation on this table
• After the user-specified trigger interval, the query repeats on the new records in the data stream
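The "unbounded table" abstraction can be sketched in a few lines of plain Python. This is a toy simulation of the idea, not Spark code; the names `run_triggers` and `query` are illustrative:

```python
# Toy simulation of the "unbounded table" model: each trigger runs
# the same batch-like query, but only over the rows that arrived
# since the previous trigger.

def run_triggers(arrivals, query):
    """arrivals: list of record batches, one per trigger interval."""
    table = []          # the conceptually unbounded input table
    last_offset = 0     # how far previous triggers have read
    results = []
    for batch in arrivals:
        table.extend(batch)             # new records append to the table
        new_rows = table[last_offset:]  # only rows this trigger has not seen
        last_offset = len(table)
        results.append(query(new_rows)) # batch-like query on the new rows
    return results

# Example: count the records seen in each trigger interval.
counts = run_triggers([[1, 2, 3], [4], [5, 6]], query=len)
# counts == [3, 1, 2]
```

The same `query` function could be run over the whole `table` in a single batch, which is the interoperability between batch and streaming code mentioned earlier.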
Micro Batch Model
1. The input data source provider (say Kafka) determines the range of records for the batch.
2. Spark creates an optimized plan for the execution.
3. The plan is converted into tasks and executed by workers. The actual data read from the source and write into the final destination happen in the execution phase.
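Step 1 above can be made concrete with a toy offset-range calculator, loosely modeled on how a Kafka-like source bounds each micro-batch (the function name is illustrative; `maxOffsetsPerTrigger` is the real Spark option the cap is analogous to):

```python
def next_batch_range(committed_offset, latest_offset, max_per_trigger=None):
    """Determine the [start, end) offset range for the next micro-batch.

    committed_offset: last offset already processed and committed.
    latest_offset:    current tip of the stream.
    max_per_trigger:  optional cap, analogous to Kafka's maxOffsetsPerTrigger.
    """
    start = committed_offset
    end = latest_offset
    if max_per_trigger is not None:
        # Cap the batch so a large backlog is drained over several
        # micro-batches instead of one huge batch.
        end = min(end, start + max_per_trigger)
    return (start, end)

# A backlog of 5000 records with a 1000-offset cap:
# first batch covers (0, 1000), the next (1000, 2000), and so on.
```

Capping the batch size keeps the batch running time predictable, which matters later when comparing it against the trigger interval.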
Stateless Streaming - Ingest in S3
Each micro-batch consists of only the new records in that batch:
Batch 1 [1-4] → File 1, Batch 2 [5-8] → File 2, Batch 3 [9-10] → File 3
Stateful Streaming - Running Sum
Each micro-batch consists of the new input records plus the previous micro-batches’ sum saved in a state store:
Batch 1 [1-4] → State = 10, Batch 2 [5-8] → State = 36, Batch 3 [9-10] → State = 55
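The running-sum example can be mimicked in plain Python (a sketch of the idea, not Spark's actual state-store API): each micro-batch combines its new records with the sum saved by previous batches.

```python
def run_stateful_sum(batches):
    """Process micro-batches, carrying a running sum in a 'state store'."""
    state = 0            # stands in for the persisted state store
    snapshots = []
    for batch in batches:
        state += sum(batch)      # new input records + previous batches' sum
        snapshots.append(state)  # state saved after each micro-batch
    return snapshots

# Matches the slide: [1-4] -> 10, [5-8] -> 36, [9-10] -> 55
states = run_stateful_sum([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]])
```

In real Structured Streaming the state store is fault-tolerant and checkpointed, so the running sum survives restarts; that is what makes the exactly-once guarantee possible for stateful queries.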
Productionizing Streaming Applications
Concerns in productionizing a streaming application:
• Ease of composition and experimentation
• Data accuracy and consistency
• Higher performance
• Replay/reprocess data
• Lower TCO
• Optimized for faster downstream processing
• Portability
• Monitoring, insights & alerts
Problem Statement
● What should be the right cluster configuration for my streaming job?
● The data ingestion rate is variable. How can I autoscale my cluster?
● How can I know if my streaming application is healthy?
● How should I partition my input data source?
● The time lag between the last processed event and the tip of the input stream is increasing. What can I do?
Spark Lens
● A performance tuning tool for Apache Spark
● Introduced the concept of the critical path of a Spark job to understand its scalability limit
● Open-sourced by Qubole
● https://github.com/qubole/sparklens
Spark Lens in Structured Streaming = Streaming Lens
● Batch Running Time: the actual time taken to process a micro-batch
● Trigger Interval: specified by the user while writing the streaming query; can be used as a proxy for the SLA
● Critical Path Time: the time to complete the micro-batch if we had provisioned infinite executors
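Critical Path Time can be made concrete with a toy scheduling model (a simplification of what Sparklens computes; the function names are illustrative): with infinite executors a stage can finish no faster than its single longest task, while with a fixed number of task slots the tasks queue up.

```python
import heapq

def batch_time(stage_task_durations, executor_slots):
    """Approximate wall-clock time with a fixed number of task slots,
    scheduling each stage's tasks greedily (longest first)."""
    total = 0.0
    for tasks in stage_task_durations:
        slots = [0.0] * executor_slots          # finish time of each slot
        for t in sorted(tasks, reverse=True):
            finish = heapq.heappop(slots) + t   # assign to earliest-free slot
            heapq.heappush(slots, finish)
        total += max(slots)                     # stage ends with its last task
    return total

def critical_path_time(stage_task_durations):
    """Lower bound with infinite executors: each stage takes as long as its
    single longest task; stages still run one after another."""
    return sum(max(tasks) for tasks in stage_task_durations)

# Two stages: four 5s tasks, then tasks of 8s, 1s, 1s.
stages = [[5, 5, 5, 5], [8, 1, 1]]
# critical_path_time(stages) -> 13: no executor count can beat this.
```

This is why the heuristic later compares Critical Time against the trigger interval: if even the infinite-executor bound is close to the SLA, adding executors cannot help and parallelism (partitioning) must increase instead.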
Approach
● Sampling and analyzing some micro-batches at regular intervals gives a fair idea of the health of the streaming pipeline.
● The trigger interval is a measure of the SLA the pipeline is expected to meet. Batch running time should be safely lower than the trigger interval.
● If the critical path time is safely lower than the trigger interval, throwing more resources at the application can help meet the SLA specified by the trigger interval.
Trigger Interval vs. Batch Processing Time vs. Critical Path
[Chart: batch processing time and critical path time plotted against the trigger interval (SLA). Zones: under-utilized; desired zone; over-utilized (upscale to achieve SLA); autoscale cannot help (repartition).]
StreamingLens Heuristic

Condition I | Condition II | Pipeline State
Batch Running Time < 0.4 * Trigger Interval | - | OVERPROVISIONED or UNDER-UTILIZED
0.4 * Trigger Interval < Batch Running Time < 0.8 * Trigger Interval | - | DESIRED
Batch Running Time > 0.8 * Trigger Interval | Critical Time < 0.7 * Trigger Interval | UNDER-PROVISIONED or OVER-UTILIZED
Batch Running Time > 0.8 * Trigger Interval | Critical Time >= 0.7 * Trigger Interval | UNHEALTHY
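The heuristic translates directly into a small classifier. The thresholds (0.4, 0.7, 0.8) are taken from the table above; the function name is illustrative, not StreamingLens's actual API:

```python
def pipeline_state(batch_time, critical_time, trigger_interval):
    """Classify a micro-batch per the StreamingLens heuristic.
    All arguments are in the same time unit (e.g., seconds)."""
    if batch_time < 0.4 * trigger_interval:
        return "OVERPROVISIONED"    # or under-utilized
    if batch_time < 0.8 * trigger_interval:
        return "DESIRED"
    # Batch time is uncomfortably close to the SLA; the critical
    # time decides whether more executors can still help.
    if critical_time < 0.7 * trigger_interval:
        return "UNDER-PROVISIONED"  # or over-utilized: upscale helps
    return "UNHEALTHY"              # upscaling can't help; repartition

# With a 60s trigger interval: a 30s batch is in the desired zone;
# a 55s batch with a 30s critical time just needs more executors;
# a 55s batch with a 50s critical time needs more partitions.
```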
Pipeline State | Inference | Recommendations
OVERPROVISIONED | The stream may be lagging due to inaccurately configured source properties or trigger interval; the cluster may be over-provisioned. | If the stream is lagging, increase the load on the source by raising thresholds like maxOffsetsPerTrigger (for Kafka) or maxFilesPerTrigger (for the file source). Reduce the trigger interval if required. If the stream is not lagging, downscale the cluster if required to reduce costs.
DESIRED | - | -
UNDER-PROVISIONED | Tasks are getting queued up. We can increase the number of parallel tasks to meet the trigger interval. | Increase the number of executors.
UNHEALTHY | Increasing executors won't help; we need to increase parallelism and create more tasks. Possibility of skew. | Depends on the source: for a Kafka source, increase Kafka partitions; for a Kinesis source, increase Kinesis shards; if the query has aggregations, increasing shuffle partitions may help.
Experiments
Setup 1
● Query operations: aggregation based on timestamp
● Executors: a single 8-core executor
● Shuffle partitions: 100
● Trigger interval: 60 secs
● Rate: 5000 rows per second
Insight
The cluster is over-provisioned.
Recommendation
1. Downscale (if you can't reduce the number of executors, pick a lower-capacity machine), and/or
2. Reduce the trigger interval (get more real-time updates), and/or
3. Process more data (check your configs, increase the ingestion rate, etc.)
Next step: try increasing the input data rate
Setup 2
● Query operations: aggregation based on timestamp
● Executors: a single 8-core executor
● Shuffle partitions: 100
● Trigger interval: 60 secs
● Rate: 20000 rows per second
Insight
The cluster is under-provisioned, with a high risk of missing the SLA.
Recommendation
1. Upscale, or
2. Have smaller tasks (i.e., more partitions), or
3. Process the same tasks in less time: pick a better machine
Next step: increase the number of executors and increase shuffle partitions
Setup 3
● Query operations: aggregation based on timestamp
● Executors: three 8-core executors
● Shuffle partitions: 200
● Trigger interval: 60 secs
● Rate: 20000 rows per second
Next Steps
● Open-source StreamingLens
● Things to do
○ Incorporate “time lag” in our recommendations
○ Convert recommendation → action by implementing SLA-aware streaming autoscaling for better cost control
Contributions are welcome!
Other Open Source Contributions
● Spark Lens - https://github.com/qubole/sparklens
● Kinesis Data Source - https://github.com/qubole/kinesis-sql
● S3-SQS Input Data Source for Better Performance - https://github.com/apache/bahir/pull/91
● RocksDB State Storage - https://github.com/itsvikramagr/rocksdb-state-storage
Thank you!
