Apache Samza 0.10.0
What’s coming up in the next Samza release
LinkedIn
Navina R
Committer @ Apache Samza
New Features in Samza 0.10.0
 Dynamic Configuration & Control
◦ Coordinator Stream
◦ Broadcast Stream
 Host affinity in Samza
 New Consumer: Kinesis
 New Producers: Kinesis, HDFS,
ElasticSearch
 Upgraded RocksDB
Dynamic Configuration & Control
1. Coordinator Stream
2. Broadcast Stream
How does Config work today?
[Diagram: a submitted job's config flows to the RM, then to the AM, then to containers C0-C2 via the command line]
Job deployment in Yarn:
 The job package (including its config) is localized on the Resource Manager (RM)
 The RM allocates a container for the Application Master (AM) and passes the config parameters as command-line arguments to the run-am script
 Similarly, the AM passes config to the containers on allocation
(Checkpoints are written to a separate checkpoint stream.)
Problems
[Same deployment diagram as the previous slide]
 Escaping / unescaping quotes is cumbersome (SAMZA-700)
 Limits the number of arguments that can be set through the shell command line (SAMZA-337, SAMZA-333)
 Dynamic config changes are not possible: every config change requires re-submitting (restarting) the job (SAMZA-348)
 System config such as checkpoints is handled differently from user-defined config (SAMZA-348)
Solution: Coordinator Stream
[Diagram: the job is submitted to the RM; the AM hosts the Job Coordinator (JC); containers C0-C2 request config from the JC via HTTP; the JC bootstraps config from the Coordinator Stream]
Coordinator Stream (CS)
 Single partition
 Log-compacted
 Each job has its own CS
Job Coordinator (JC)
 Exposes an HTTP endpoint for containers to query for the Job Model
 Bootstraps from the CS and then continues consuming from it
Samza job deployment using the Job Coordinator & Coordinator Stream: the JC bootstraps its config from the stream
Data in Coordinator Stream
Coordinator Stream (CS) contains:
◦ Checkpoints for the input streams
 Containers periodically write checkpoints to the CS, instead of to a separate checkpoint topic
◦ Task-to-changelog partition mapping
◦ Container Locality Info (required for Host
Affinity)
 Containers write their location (machine-name) to
CS
◦ User-defined configuration
 Entire configuration is written to the CS when the
job is started
◦ Migration-related messages (illustrative entries sketched below)
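To make this concrete: since the CS is a keyed, log-compacted stream, each of the items above is just a key/value message, and compaction keeps only the latest value per key. The entries below are purely illustrative; the message-type names and field layout are hypothetical, not Samza's actual wire format (0.10.0 serializes these messages with a JSON serde).

["set-config", "task.window.ms"]        -> {"value": "60000"}
["set-checkpoint", "partition 0"]       -> {"kafka.MyInputStream#0": "42315"}
["set-changelog-mapping", "task-0"]     -> {"changelog-partition": "0"}
["set-container-host", "container-0"]   -> {"host": "Host-E"}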
Coordinator Stream: Benefits
[Diagram (repeated): Samza job deployment using the Job Coordinator & Coordinator Stream; containers request config from the JC via HTTP, and the JC bootstraps config from the stream]
 Config can be easily serialized / deserialized
 Checkpoints & user-defined configs are stored similarly
 Config changes can be made by writing to the CS*
 The JC can be used to coordinate job execution*
* Work in progress
Coordinator Stream: Tools / Migration
Tools:
 Command-line tool to write config
changes to coordinator stream
Migration:
 The JobRunner in 0.10.0 automatically migrates 0.9.1 checkpoints and changelog mappings to the Coordinator Stream
Broadcast Stream
Stream consumed by all Tasks in the job
Motivation
 Dynamically configure job behavior
 Acts as a custom control channel for an application
A typical input stream
[Diagram: MyInputStream Partition-0 through Partition-3, each consumed by exactly one of Task-0 through Task-3]
task.inputs = $system-name.$stream-name
task.inputs = kafka.MyInputStream
Each stream partition is consumed by only one task
Broadcast Stream
[Diagram: MyInputStream Partition-0 through Partition-3 go to Task-0 through Task-3 as before; MyBroadcastStream Partition-0 is consumed by all four tasks]
task.inputs = $system-name.$stream-name
task.global.inputs = $system-name.$stream-name#$partition-number
task.inputs = kafka.MyInputStream
task.global.inputs = kafka.MyBroadcastStream#0
One broadcast stream partition is consumed by ALL tasks
Broadcast Stream
[Diagram: as before, but MyBroadcastStream now has Partition-0 through Partition-2; more than one broadcast partition can be consumed by all tasks]
task.inputs = $system-name.$stream-name
task.global.inputs = $system-name.$stream-name#[$partition-range]
task.inputs = kafka.MyInputStream
task.global.inputs = kafka.MyBroadcastStream#[0-1]
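To illustrate how a task might tell broadcast messages apart from regular input, here is a minimal StreamTask sketch in Java. The stream names match the diagrams above; the control-message handling inside the branches is hypothetical.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyStreamTask implements StreamTask {
  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    if ("MyBroadcastStream".equals(stream)) {
      // Every task instance sees this message: e.g. swap a model or
      // toggle a feature flag (hypothetical control logic).
      applyControlMessage(envelope.getMessage());
    } else {
      // Regular per-partition data from MyInputStream.
      processDataMessage(envelope.getMessage());
    }
  }

  private void applyControlMessage(Object message) { /* hypothetical */ }
  private void processDataMessage(Object message) { /* hypothetical */ }
}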
Host-Affinity
Making Samza aware of container locality
A Stateful Job
[Diagram: Task-0 through Task-3 run in containers on Host-A, Host-B and Host-C; each task owns a local state-store partition (P0-P3) that is also backed by a changelog stream]
Stable state
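For reference, a stateful job like the one in the diagram declares its local store and changelog in config. A minimal sketch, assuming a RocksDB-backed store; the store name, serde and changelog topic are placeholders:

serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
stores.my-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
stores.my-store.key.serde=string
stores.my-store.msg.serde=string
stores.my-store.changelog=kafka.my-store-changelog

Inside the task, the store is then obtained via context.getStore("my-store") (for example in an InitableTask's init()) and cast to a KeyValueStore.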
Fault Tolerance in a Stateful Job
[Diagram as before]
Task-0 & Task-1, running in the container on Host-A, fail
Fault Tolerance in a Stateful Job
[Diagram: the failed tasks now run in a container on Host-E instead of Host-A]
Yarn allocates the tasks to a container on a different host!
Fault Tolerance in a Stateful Job
[Diagram: the changelog partitions are replayed from offset 0 upward]
Local state is restored by consuming the changelog from the earliest offset!
Fault Tolerance in a Stateful Job
[Diagram as before]
Once the state is restored, the job continues processing input. Back to the stable state!
Problems
[Diagram as before]
 State stores are not persisted if the container fails
◦ Tasks need to restore the state stores from the changelog before continuing with input processing
 The Samza AppMaster is not aware of the host locality of a container
◦ The container gets relocated to a new host
 Excessive start-up times when a job is restarted
Motivation
 During upgrades and job failures:
◦ Local state built up in the task is lost
◦ Samza is not aware of the container locality
◦ Job start-up time is large (hours)
 The job is no longer “near-realtime”
 Multiple stateful jobs starting up at the same time will DDoS Kafka, saturating the Kafka clusters
Solution: Host Affinity in Samza
 Host affinity: the ability of Samza to allocate a container to the same machine across job restarts/deployments
 Host affinity is best-effort:
◦ Cluster load may vary
◦ The machine may be non-responsive
◦ The container must shut down cleanly for its local state to be reusable
Host Affinity in Samza
[Diagram: same stateful job, with the Coordinator Stream holding locality entries: Container-0 -> Host-E, Container-1 -> Host-B, Container-2 -> Host-C]
Container locality is persisted in the Coordinator Stream
Host Affinity in Samza
[Diagram as before]
Task-0 & Task-1, running in the container on Host-E, fail
Host Affinity in Samza
[Diagram as before; the AM/JC is shown next to the Coordinator Stream]
The tasks failed, but the local state stores remain on the host!
Host Affinity in Samza
[Diagram: the AM/JC reads the locality entries, asks the RM for Host-E, and the RM allocates a container on Host-E]
The Job Coordinator is aware of container locality!
Host Affinity in Samza
[Diagram: the restarted container writes its locality (Container-0 -> Host-E) to the Coordinator Stream again and reuses its local store]
The state store does not have to be restored from the earliest offset!
Host Affinity in Samza
[Diagram as before]
The job is back to the stable state pretty quickly!
Host Affinity in Samza
 Enable host affinity:
◦ yarn.samza.host-affinity.enabled=true
 Enable continuous scheduling in Yarn (see the config sketch below)
 Useful for stateful jobs
 Does not affect stateless jobs
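Putting the two settings together, a minimal config sketch. The Samza property is the one from the slide above; the Yarn-side property is the FairScheduler's continuous-scheduling switch and is an assumption to be verified against your Hadoop version.

# Samza job config
yarn.samza.host-affinity.enabled=true

# Yarn cluster config (yarn-site.xml), FairScheduler continuous scheduling
# (assumed property name; verify for your Hadoop version)
yarn.scheduler.fair.continuous-scheduling-enabled=true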
Upgraded RocksDB
Upgraded RocksDB
 New RocksDB JNI version (3.13.1+) supports TTL
 Impact:
◦ Removes the need to write custom code to delete expired records (see the config sketch below)
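As an illustration, a TTL-backed store might be configured roughly as below. The store name and TTL value are placeholders, and the TTL property name is an assumption to be checked against the 0.10.0 configuration docs.

stores.my-ttl-store.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
# Assumed property for the RocksDB TTL, in milliseconds (1 hour here)
stores.my-ttl-store.rocksdb.ttl.ms=3600000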
New Features in Samza 0.10.0
 Dynamic Configuration & Control
◦ Coordinator Stream
◦ Broadcast Stream
 Host affinity in Samza
 New Consumer: Kinesis
 New Producers: Kinesis, HDFS,
ElasticSearch
 Upgraded RocksDB
Thanks!
 Expected release date – Nov 2015
 Thanks to all the contributors!
 Contact Us:
◦ Mailing List – dev@samza.apache.org
◦ Twitter - #samza, @samzastream
Questions?

Editor's Notes

  • #3 We have come up with a way to dynamically configure and control your Samza jobs. The two features relevant to this are the coordinator stream and the broadcast stream. I will discuss the motivation and design for these in the next few slides. We will also illustrate what we mean by host affinity in Samza and why it is important. There have been many new contributors since the last release who have added significant value to our codebase, such as new system producers and consumers. We have verified and merged producers for two systems – HDFS (Eli Reisman) and ElasticSearch (Dan Harvey). A Kinesis producer/consumer is a very popular ask among Samza users that are predominantly based on AWS. It was recently prototyped as part of a Google Summer of Code project. We are eagerly looking forward to the patch and are planning to release it as a beta version in 0.10.0. 0.10.0 will use an upgraded version of RocksDB. I will be discussing the design details of the coordinator stream, broadcast stream and host affinity. The rest of the feature additions should be straightforward from the website docs!
  • #5 Let’s look at how Samza job configuration works today. When a Samza job is deployed on Yarn, we submit an application request to the RM. The job tarball, which includes the config, is localized on the RM. The RM passes the config to the AM when executing “run-am” on AM container start-up. Similarly, the AM starts each container using the “run-container.sh” command with the config included on the command line.
  • #6 Passing the config as part of the command line has certain drawbacks. * Escaping / unescaping quotes becomes tedious, preventing us from using complicated config values. * There is a size limit on varargs when Yarn exports configuration – when Yarn launches a container using launch_container.sh, it exports all environment variables, including the Samza config, as variables on the command line. There is a limit on the varargs length on the machine, usually ~128KB; this is problematic for jobs with large configurations. * No support for dynamic configuration changes – config is immutable once the job starts; features such as auto-scaling require dynamic reconfiguration of the job. * User-defined and programmatic configuration are handled differently – checkpoint configuration is in a stream and can be over-written, whereas user-defined config is actionable only during job start-up. * Lack of persistent configuration between job executions – we cannot validate a configuration for a job without persistent configuration. Certain changes to job configuration may be equivalent to resetting the job itself.
  • #7 The Coordinator Stream is basically a single-partition, log-compacted stream that acts as a “config log”. Each job has its own CS. Job Coordinator – a component that reads the entire config from the bootstrap stream and exposes the config to the containers through an HTTP endpoint. The term “Coordinator Stream” is somewhat overloaded, in the sense that it carries a lot of job- and system-related configuration and can potentially be used by the JC to make smarter decisions about container execution/placement. For example, it will now contain checkpoint information: containers write checkpoints directly to the coordinator stream. When a container comes up, it queries the JC for the “JobModel”, which defines the hierarchy of the job execution. A JobModel is composed of one or more ContainerModels, and each ContainerModel is composed of a set of TaskModels. Each TaskModel contains the checkpoint information for the input streams being processed by that task instance. In this way, the JC exposes the job topology using a uniform data model. So now, when you deploy a job, the RM brings up the AM container, which has the JC embedded within it. The JC bootstraps from the CS and builds a JobModel.
  • #8 Callout: Container Locality Information will be explained as part of host affinity.
  • #9 Serialized/deserialized – can support more complex config definitions; we currently use a JSON serde, which gives more flexibility in parsing configs. There is no distinction between system- and job-related configuration. Config changes can be made by writing directly to the CS; we already have a command-line tool for that -> dynamic config changes for the job. In the future, we want to enable the JC to control the container life-cycle and the job execution instead of the AM.
  • #10 CALL OUT: Migration works for Kafka-based systems ONLY! If there are Samza jobs that use a different stream system for checkpoints, the checkpoint/changelog migration has to be performed manually; also, remove the “task.checkpoint.factory” configuration before restarting the job with 0.10.0.
  • #11 Trident in Storm allows you to perform a broadcast function, where every tuple is replicated to all target partitions. Broadcast streams in Samza are analogous to the broadcast function in Storm. Here, we allow a stream partition to be consumed by all task instances in the job.
  • #12 Use cases: change the algorithm or tests that are run in the Samza job, such as PMML; act as a custom control channel for an application; trigger a global behavior change in a job.
  • #13 Today in Samza, we have modeled tasks such that each stream partition is consumed by only one task instance in the job. This ensures that you don’t process the same partition’s messages multiple times. Explain the diagram & config statement.
  • #14 Explain the diagram and config
  • #15 Call Out -> Can broadcast more than 1 partition in the stream
  • #16 Deployment modes: Local – a single container process that handles all input partitions, running on one machine; Standalone* – a developer/deployment tool starts the containers on a set of machines; Yarn – the SamzaAppMaster interacts with the Resource Manager (RM) and the Node Manager (NM) to manage resource allocation and provide fault tolerance. Before getting into the details of host affinity, I want to provide an overview of how stateful jobs behave and how fault tolerance is handled using Yarn.
  • #17 Explain the diagram
  • #19 Call Out -> * Container allocated on a different host * State needs to be restored
  • #23 This impact is amplified not only when a container is lost on a host or pre-empted from a host by Yarn, but also during job upgrades. The job is no longer “near-realtime”, since it needs to catch up with the large backlog of input messages that accumulated while the state was being restored. Multiple stateful jobs starting up at the same time (say you just upgraded Yarn and that bounced all the running jobs) will DDoS Kafka, saturating the Kafka clusters. An ATC job at LinkedIn recently faced this issue.
  • #24 If the container does not shut down cleanly, the OFFSET file with a checksum is not generated, and hence the local store state will not get reused.
  • #25 Containers write task checkpoints directly to the Coordinator Stream. Additionally, when a container starts up, it writes its machine name to the CS.
  • #27 Call out: local state remains on the machine (for a period of time); the JC has a global view of the locality of the container. The AM knows that it has to try placing the container on Host-E before defaulting to some other available host.
  • #28 Now the AM knows to specifically ask the RM for Host-E. As long as the RM can allocate a container on the same host, host affinity is successful. In scenarios where the machine cannot allocate the requested resources, the AM brings the container up on any other free host returned by the RM.
  • #29 Note: the new container again writes its locality to the coordinator stream. An “OFFSET” file is stored on the local disk with a checksum to ensure that the data is not corrupted.
  • #31 In Yarn, we leverage the continuous scheduling feature in the Fair Scheduler to make this work. This requires some configuration on the Yarn cluster.