Hai Lu
Staff Software Engineer
@ Stream Infra
Samza Portable Runner for Beam
Agenda
1. Introduction: About Apache Samza
2. Deep Dive: Samza Portable Runner for Beam
3. Use Cases of Stream Processing in Python
4. Future Work
Apache Samza
Samza is a distributed stream processing framework for building stateful applications that process data at scale in real time.
• Developed at LinkedIn in 2013
• Apache Top-Level Project since Dec 2014
• Users: LinkedIn, Intuit, Slack, TripAdvisor, Optimizely, Redfin, VMware ...
Apache Samza Stats
• Messages: over 1.5 trillion messages processed daily at LinkedIn
• Jobs: 3K+ jobs running in prod at LinkedIn
• Availability: 99.99% availability for Samza running on YARN
Apache Samza: operating at large scale
Samza focuses on operating at a very large scale:
● YARN HA
● Incremental checkpointing and state changelog
● Host-affinity
● Standby container
Stream Processing Ecosystem at LinkedIn
[Architecture diagram: app services and data stores (Kafka, Espresso, Oracle, MySQL) are ingested via Brooklin as change streams; near real time processing (Apache Samza & Beam) consumes them and writes results to stores such as Venice, Pinot, Couchbase, and HDFS.]
Samza & Beam
Streaming API & Model
• State-of-the-art data processing API &
model
• Portability framework in support of multiple
languages
• SQL & DSL, Libraries for ML use cases
Streaming Engine
• Large-scale distributed stream processing
• Scalable and durable local state
• Fault-tolerance and fast recovery
Agenda
1. Introduction: About Apache Samza
2. Deep Dive: Samza Portable Runner for Beam
3. Use Cases of Stream Processing in Python
4. Future Work
Samza Portable Runner: Motivation
Make stream processing available everywhere (Python, Go, etc.)!
• Make stream processing more accessible: many internal projects at LinkedIn are built in Python
• Support machine learning use cases, where Python is dominant
• Potentially support languages beyond Java and Python, e.g. Go
Samza & Beam: High-level
Python SDK
Beam
Pipeline
Samza
Runner
Java SDK Go SDK
Beam Portability Framework
• Users develop streaming application
with Beam SDKs
• Applications get translated into
language independent Beam
pipeline (protobuf)
• Pipelines get executed in the Samza
runner
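To make the flow concrete, here is a minimal sketch of a Beam Python pipeline submitted to a portable runner. The job server endpoint, environment type, and the toy transforms are assumptions for illustration; they are not taken from the deck.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Assumed options: point the Python SDK at a portable runner's job server.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:11440",   # placeholder job server address
    "--environment_type=LOOPBACK",      # run the SDK worker in-process (handy for local testing)
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "a"])                  # toy bounded source
     | "Window" >> beam.WindowInto(window.GlobalWindows())       # Window.Into
     | "PairWithOne" >> beam.Map(lambda w: (w, 1))               # ParDo
     | "CountPerKey" >> beam.CombinePerKey(sum)                  # GroupByKey/Combine
     | "Print" >> beam.Map(print))
```

The user code stays the same whichever runner executes the portable pipeline; only the options differ.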
Beam Pipeline to Samza: Translation
[Diagram: the Samza Runner / Job Server translates the Beam pipeline into a Samza DAG that runs on the Samza cluster.]
• IO.Read → UnboundedSourceSystem / BoundedSourceSystem: consume events and generate watermarks
• ParDo → DoFnOp: SimpleDoFnRunner / ExecutableStageDoFnRunner; injects Samza watermark, state, and timers
• Window.Into → WindowAssignOp: invokes the window assignment function
• GroupByKey/Combine → GroupByKeyOp: GroupAlsoByWindowViaWindowSetNewDoFn; Samza merge, partitionBy, watermark, state, and timers
The legacy runner and the portable runner share as much code as possible.
Beam Pipeline to Samza: Translation
Portable runner translation
● Impulse
● ExecutableStage
● Flatten
● GroupByKey
● IO
Beam Pipeline to Samza: Translation
For portable pipelines, IOs are translated to Java Read/Write:
• Convergence on using the same IO clients (e.g. Kafka clients)
• Leverage existing Java components
• More efficient to do IO on the Java side
Kafka IO in Samza Python
• From the Python side, users see Python only (sketch below)
• Under the hood, "ReadFromKafka" is translated into a protobuf transform and passed to the Samza runner on the Java side
• Topic name, cluster name, and configs are enough information for the Java side to construct the Kafka consumers
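As a rough illustration only, a Kafka read from Python could look like the sketch below; it follows the shape of Beam's cross-language Kafka IO rather than the exact transform shown on the slide, and the broker address and topic are placeholders.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder portable-runner options, as in the earlier sketch.
options = PipelineOptions(["--runner=PortableRunner", "--job_endpoint=localhost:11440"])

with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromKafka" >> ReadFromKafka(
           consumer_config={"bootstrap.servers": "broker:9092"},   # placeholder cluster
           topics=["PageViewEvent"])                                # placeholder topic
     | "Values" >> beam.Map(lambda kv: kv[1])                       # keep only the message value
     | "Print" >> beam.Map(print))
```

On the Java side the runner only needs the topic, cluster, and consumer configs from this transform to build the actual Kafka consumers.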
Beam Pipeline to Samza: Translation
Samza Table API: an extension of IO
• Local table (RocksDB)
• Remote table (e.g. Couchbase, Venice)
• Easy stream-table join support
• Out-of-the-box support for rate limiting, caching, etc.
Table Write in Samza Python
[Diagram: Kafka input records (K1, V1), ... flow through TableWrite and are written as (K1, Entry 1), ... to a remote/local table.]
• Sink the processing results to a remote/local table
Stream Table Join in Samza Python
[Diagram: Kafka input (K1, V1) flows through StreamTableJoinOp, which looks up entries (K1, Entry 1), (K2, Entry 2), ... in a remote/local table and emits (K1, [V1, Entry1]) as the PTransform output.]
• Table read is provided as a stream-table join
• Useful for enriching the events to be processed
Table API in Samza Python
• Similar to Kafka IO, tables are also translated to execute on the Java side (sketch below)
• Consistent APIs across Python and Java
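The deck does not show the Python table API itself, so the following is a hypothetical stand-in that only illustrates the stream-table join semantics from the previous slide; the DoFn, the in-memory dict standing in for the table, and the sample data are all invented.

```python
import apache_beam as beam

class LookupAndJoin(beam.DoFn):
    """Stand-in for the stream-table join: for each keyed event,
    look up an entry in a (remote/local) table and emit (key, [value, entry])."""

    def __init__(self, table):
        self._table = table                      # a dict standing in for RocksDB/Couchbase/Venice

    def process(self, kv):
        key, value = kv
        entry = self._table.get(key)             # the real runner performs the table read in Java
        if entry is not None:
            yield (key, [value, entry])

profile_table = {"K1": "Entry 1", "K2": "Entry 2"}   # placeholder table contents

with beam.Pipeline() as p:
    (p
     | "KafkaInput" >> beam.Create([("K1", "V1")])      # stands in for the Kafka input
     | "StreamTableJoin" >> beam.ParDo(LookupAndJoin(profile_table))
     | "Print" >> beam.Map(print))                      # prints ('K1', ['V1', 'Entry 1'])
```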
Beam Pipeline to Samza: Runtime
A Tale of Two Languages
[Diagram: the Beam pipeline is submitted to the job server and translated to Samza High Level APIs in a Java process; the Samza tasks talk to the SDK worker in a Python process over gRPC.]
• Pipeline is submitted to the job server and translated into Samza High Level APIs
• Each Samza container starts two processes: one Java, one Python
• The Python worker executes UDFs (user-defined functions) (environment options sketch below)
• Samza in Java handles everything else (IO, state, shuffle, join, etc.)
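For reference, how the Python SDK worker process is launched is controlled by Beam's standard environment options; the values below are an assumed example, not the configuration used in the deck.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Assumed example of portable-runner options controlling the SDK worker process:
#  - LOOPBACK reuses the submitting Python process (convenient for local testing)
#  - PROCESS lets the runner fork a worker via a boot executable
#  - DOCKER runs the worker inside a container image
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:11440",                                 # placeholder address
    "--environment_type=PROCESS",
    '--environment_config={"command": "/opt/beam/sdk_worker/boot"}',  # placeholder boot path
])
```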
Samza Deployment: Standalone Cluster
[Diagram: the job is submitted from Python; the Python processes (SDK workers) create Samza processors (Java), each assigned a set of partitions such as (p0, p2), (p1, p3), (p4, p5), (p6, p7), coordinated through ZooKeeper.]
• Samza Java is started from a Python process; the Python process manages lifecycles
• Samza is used as a library
• Coordination (leader election, partition assignment) is done by ZooKeeper
Samza Deployment: YARN Cluster
[Diagram: an Application Master manages multiple Samza containers inside a YARN cluster.]
• YARN is used for resource management
• Isolation, multi-tenancy, fault tolerance
• Work in progress to support YARN mode in Samza Python
Authentication
In a multi-tenant environment, the data channels need to be protected.
● A simple solution is to enable SSL/TLS plus mutual authentication.
● But SSL/TLS could introduce unnecessary overhead.
Authentication
• Simple token-based authentication (sketch below)
• Use the loopback address so that the port and data packets are not exposed to the external network
• Rely on the file system for ACLs
• Pull Request #8597
[Diagram: the Java process (1) writes a token file with rw------- permissions owned by the user account; the Python process (2) reads it and (3) presents the token on every connection, which is then (4) authenticated. Unauthorized users connecting with a wrong token get a request denial.]
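A minimal sketch of this idea, assuming a Unix-like host and a loopback TCP channel; the token path, port, and framing are placeholders and not the actual implementation from the pull request.

```python
import os
import secrets
import socket

TOKEN_PATH = "/tmp/samza-worker.token"   # placeholder; a real job would use a per-job path
PORT = 54321                             # placeholder loopback port

def write_token() -> bytes:
    """Java-side equivalent: create a token file readable only by the job's user (rw-------)."""
    token = secrets.token_hex(32).encode()
    fd = os.open(TOKEN_PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(token)
    return token

def serve_once(expected: bytes) -> None:
    """Accept one loopback connection and check the presented token."""
    with socket.create_server(("127.0.0.1", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            presented = conn.recv(len(expected))
            conn.sendall(b"OK" if presented == expected else b"DENIED")

def connect() -> None:
    """Worker side: read the token from the protected file and present it on connect."""
    with open(TOKEN_PATH, "rb") as f:
        token = f.read()
    with socket.create_connection(("127.0.0.1", PORT)) as conn:
        conn.sendall(token)
        print(conn.recv(16))
```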
Performance
Deploying Samza Python to test performance...
● Initial result: ~85 QPS per container (with saturated single-core CPU usage)
Performance: without bundling
• A per-message round trip between Java and Python drags down the performance
• Bundling (batching) is the key to improving throughput
[Diagram: each input event crosses from Java to Python and the processed event crosses back, one message at a time.]
Performance: with bundling
[Diagram: input events are grouped into bundles before crossing from Java to Python, and processed events come back per bundle.]
Performance: Samza Runloop
[Diagram: the Samza runloop (Java) buffers input events and sends bundled events to Python.]
• Events are buffered/bundled for up to N messages or up to T seconds before being sent to Python (sketch below)
• The buffer is backed by local state (RocksDB) for persistence
• It would be nice to avoid implementing this in every runner
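A minimal sketch of that flush policy (up to N messages or T seconds, whichever comes first); it is written in Python purely for illustration, since the actual runloop lives in the Java runner, and the RocksDB-backed persistence is omitted.

```python
import time

class Bundler:
    """Buffer events; flush once the bundle reaches max_size or max_wait_secs elapses."""

    def __init__(self, max_size, max_wait_secs, flush):
        self.max_size = max_size
        self.max_wait_secs = max_wait_secs
        self.flush = flush            # callback, e.g. "send the bundle to the Python SDK worker"
        self.buffer = []
        self.started = None

    def add(self, event):
        if not self.buffer:
            self.started = time.monotonic()
        self.buffer.append(event)
        self._maybe_flush()

    def on_timer(self):
        # Called periodically by the runloop so a sparse stream still gets flushed.
        self._maybe_flush()

    def _maybe_flush(self):
        if not self.buffer:
            return
        full = len(self.buffer) >= self.max_size
        expired = time.monotonic() - self.started >= self.max_wait_secs
        if full or expired:
            self.flush(self.buffer)
            self.buffer = []

# Example: bundles of up to 100 events or 1 second.
bundler = Bundler(100, 1.0, lambda b: print(f"flush {len(b)} events"))
for event in range(250):
    bundler.add(event)
```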
Performance
bundle size | QPS (k)
          1 | 0.085
          5 | 0.34
         20 | 1.1
         50 | 1.7
        100 | 2.3
        200 | 2.9
        500 | 3.7
       3000 | 4.1
       5000 | 5.6
      10000 | 7.8
      50000 | 8.3
Agenda
1. Introduction: About Apache Samza
2. Deep Dive: Samza Portable Runner for Beam
3. Use Cases of Stream Processing in Python
4. Future Work
Near real time Image OCR with Kafka IO
[Pipeline diagram: image data (image urls) in Kafka comes from an internal service; the pipeline downloads the images from the urls, loads them into memory, runs Python OCR* over a portable TensorFlow model, and writes (image url, text) back to Kafka.]
OCR*: Optical Character Recognition
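A rough sketch of the OCR step as a Beam DoFn; the download helper, the run_ocr stand-in for the TensorFlow model, and the sample URL are placeholders, not the actual LinkedIn pipeline.

```python
import urllib.request
import apache_beam as beam

def run_ocr(image_bytes: bytes) -> str:
    """Placeholder for the portable TensorFlow OCR model; returns the recognized text."""
    return "<recognized text>"

class OcrFromUrl(beam.DoFn):
    def process(self, image_url: str):
        # Download the image from the url and load it into memory.
        with urllib.request.urlopen(image_url) as resp:
            image_bytes = resp.read()
        # Run OCR and emit (image url, text).
        yield (image_url, run_ocr(image_bytes))

with beam.Pipeline() as p:
    (p
     | "ImageUrls" >> beam.Create(["https://example.com/image.png"])  # stands in for the Kafka input
     | "OCR" >> beam.ParDo(OcrFromUrl())
     | "Results" >> beam.Map(print))                                  # stands in for the Kafka sink
```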
Near real time Model Training with Samza Python
[Pipeline diagram: feature data in Kafka is windowed and grouped by key, then fed to the training algorithm, which reads and updates coefficients in a coefficients store; in parallel, data is ETL'd to HDFS for offline training, which pushes coefficients to the same store less frequently.]
Near real time activity analysis/monitoring
[Pipeline diagram: activity logs in Kafka → extract key & timestamp → fixed window → Count.perKey (combine) → process → monitoring/alerting system.]
• Capture (abnormal) activities and send them to a monitoring/alerting system (sketch below)
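A minimal sketch of the windowed counting stage, assuming 60-second fixed windows and a toy threshold for flagging abnormal activity; the log parsing and the alerting sink are stand-ins.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.combiners import Count

THRESHOLD = 1  # toy threshold; counts above this are treated as abnormal

def extract_key_and_timestamp(log_line: str):
    """Placeholder parser: turn 'member_id,epoch_seconds' into a timestamped element."""
    member_id, ts = log_line.split(",")
    return window.TimestampedValue(member_id, float(ts))

with beam.Pipeline() as p:
    (p
     | "ActivityLogs" >> beam.Create(["m1,1.0", "m1,2.0", "m2,3.0"])   # stands in for Kafka
     | "ExtractKeyAndTimestamp" >> beam.Map(extract_key_and_timestamp)
     | "FixedWindow" >> beam.WindowInto(window.FixedWindows(60))
     | "CountPerKey" >> Count.PerKey()
     | "Abnormal" >> beam.Filter(lambda kv: kv[1] > THRESHOLD)
     | "Alert" >> beam.Map(print))   # stands in for the monitoring/alerting system
```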
Agenda
1. Introduction: About Apache Samza
2. Deep Dive: Samza Portable Runner for Beam
3. Use Cases of Stream Processing in Python
4. Future Work
Future
Samza Portable Runner
● YARN cluster support for the portable runner
● Stateful ParDo and other features for Python
● Jupyter Notebook / iBeam integration
● Go language support
● Looking into cross-language pipelines from Beam
Future
Samza Runner in general
● Beam SQL
● Schema Aware
● Async API
● ...
Thank you
