Apache Samza

distributed stream processing
@humbertostreb

Samza overview
An open-source distributed stream processing framework created by LinkedIn
- sub-second latency
- handles large amounts of state
- fault tolerance
- at-least-once delivery: no messages are lost, though some may be reprocessed after a failure
- partitioned and distributed at every level
- processor isolation
- pluggable

Architecture
Streaming: Kafka
Execution: YARN
Processing: Samza

Kafka
- The stream may be sharded into one or more partitions.
- Each partition is independent from the others, and is replicated
across multiple machines.
- Each partition consists of a sequence of messages in a fixed order.
- Each message has an offset, which indicates its position in that
sequence.
- A Samza job can start consuming the sequence of messages from
any starting offset.
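
To make the offset model concrete, here is a minimal sketch of a single partition as an append-only list. The Partition class and its method names are hypothetical and only illustrate the idea; this is not Kafka's client API.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy model of one partition: an ordered, append-only sequence."""
    messages: list = field(default_factory=list)

    def append(self, message):
        """Append a message and return the offset it was assigned."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        """Yield (offset, message) pairs starting at the given offset."""
        for position in range(offset, len(self.messages)):
            yield position, self.messages[position]


partition = Partition()
for event in ["a", "b", "c", "d"]:
    partition.append(event)

# A consumer may start from any offset, here offset 2.
replayed = list(partition.read_from(2))
```

Because an offset is just a position in the sequence, replaying from an earlier offset returns the same messages in the same order.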

YARN
- ResourceManager
- NodeManager
- ApplicationMaster

Streams
A stream is composed of immutable messages of a similar type or
category
- when a job consumes more than one stream, the next message is chosen
round-robin across the streams by default, but this can be overridden
- streams can be prioritised through configuration
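
The selection rule can be sketched as follows. This is a simplified model, assuming that a higher-priority stream is always drained before a lower-priority one and that streams of equal priority are served round-robin; choose_messages and its parameters are hypothetical names, not Samza's chooser API.

```python
from collections import deque


def choose_messages(buffers, priorities=None):
    """Yield (stream, message) pairs from several buffered streams.

    buffers: dict of stream name -> list of pending messages.
    priorities: optional dict of stream name -> int, higher is served first.
    Streams that share a priority are served round-robin.
    """
    priorities = priorities or {}
    queues = {name: deque(msgs) for name, msgs in buffers.items()}
    while any(queues.values()):
        # Only streams that still hold messages compete for the next slot.
        pending = [name for name, queue in queues.items() if queue]
        top = max(priorities.get(name, 0) for name in pending)
        for name in pending:
            if priorities.get(name, 0) == top:
                yield name, queues[name].popleft()


buffers = {"clicks": ["c1", "c2"], "views": ["v1", "v2", "v3"]}
default_order = list(choose_messages(buffers))
prioritised = list(choose_messages(buffers, priorities={"views": 1}))
```

With no priorities the two streams alternate; giving "views" a higher priority serves all of its messages before any from "clicks".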

Job
A job is code that performs a logical
transformation on a set of input
streams to append output messages to
a set of output streams.

Partitions
Each stream is broken into one or
more partitions. Each partition in
the stream is a totally ordered
sequence of messages.

Task
A job is scaled by breaking it into
multiple tasks. The task is the unit of
parallelism of the job, just as the
partition is to the stream. Each task
consumes data from one partition for
each of the job’s input streams.
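
As an illustration of that mapping, the sketch below groups the partitions of every input stream by partition number, so that task N reads partition N of each stream. The function name, the task naming scheme and the stream names are assumptions made for the example.

```python
def group_by_partition(stream_partitions):
    """Map task name -> list of (stream, partition) pairs it consumes.

    stream_partitions: dict of stream name -> number of partitions.
    """
    tasks = {}
    for stream, count in stream_partitions.items():
        for partition in range(count):
            task_name = f"Partition {partition}"
            tasks.setdefault(task_name, []).append((stream, partition))
    return tasks


# Two hypothetical input streams with three partitions each.
tasks = group_by_partition({"PageViews": 3, "AdClicks": 3})
```

Here the job gets three tasks, and the task for partition 1 consumes partition 1 of both streams.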

Containers
Containers are the unit of physical
parallelism, and a container is
essentially a Unix process (or Linux
cgroup). Each container runs one or
more tasks.
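
One possible way to spread tasks over containers is a plain round-robin assignment, sketched below. The assign_tasks helper and the task names are hypothetical; the actual grouping strategy is configurable and may differ.

```python
def assign_tasks(task_names, container_count):
    """Distribute tasks over containers round-robin (illustrative only)."""
    containers = [[] for _ in range(container_count)]
    for index, task in enumerate(sorted(task_names)):
        containers[index % container_count].append(task)
    return containers


containers = assign_tasks(
    ["task-0", "task-1", "task-2", "task-3", "task-4"], container_count=2
)
```

With five tasks and two containers, the first container runs three tasks and the second runs two.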

SamzaContainer startup steps
1 - Get last checkpointed offset for each input stream partition
2 - Create a “reader” thread for every input stream partition
3 - Start metrics reporters to report metrics
4 - Start a checkpoint timer to save your task’s input stream offsets
every so often
5 - Start a window timer to trigger your task’s window method, if it is
defined
6 - Instantiate and initialize your StreamTask once for each input
stream partition
7 - Start an event loop that takes messages from the input stream reader
threads, and gives them to your StreamTasks
8 - Notify lifecycle listeners during each one of these steps
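
The core of this sequence (steps 1, 4, 6 and 7) can be sketched in a heavily simplified, single-threaded form. Everything here is an assumption made for illustration: run_container, CountingTask and the dictionary-based checkpoint are hypothetical, there are no reader threads, metrics, window timer or lifecycle listeners, and the checkpoint is written once at the end rather than on a timer.

```python
class CountingTask:
    """Hypothetical task that records the messages it processes."""

    def __init__(self):
        self.seen = []

    def process(self, message):
        self.seen.append(message)


def run_container(partitions, checkpoint, task_factory):
    """Run a toy container over in-memory partitions.

    partitions: dict of partition id -> list of messages.
    checkpoint: dict of partition id -> next offset to read (updated in place).
    """
    # Step 1: get the last checkpointed offset for each partition (default 0).
    offsets = {p: checkpoint.get(p, 0) for p in partitions}
    # Step 6: instantiate one task per input stream partition.
    tasks = {p: task_factory() for p in partitions}
    # Step 7: event loop that hands each message to the matching task.
    for p, messages in partitions.items():
        for offset in range(offsets[p], len(messages)):
            tasks[p].process(messages[offset])
            offsets[p] = offset + 1
    # Step 4 (simplified): save the offsets once instead of on a timer.
    checkpoint.update(offsets)
    return tasks


partitions = {0: ["a", "b", "c"], 1: ["x", "y"]}
checkpoint = {0: 2}  # partition 0 was already processed up to offset 2
tasks = run_container(partitions, checkpoint, CountingTask)
```

Partition 0 resumes at its checkpointed offset and only sees "c", while partition 1 has no checkpoint and starts from the beginning.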

Checkpointing
Samza writes checkpoints to a separate Kafka topic called
__samza_checkpoint_<job-name>_<job-id>
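
The sketch below builds a topic name following the pattern shown above and simulates what a periodic checkpoint implies after a crash: the restarted task resumes from the last saved offset, so messages processed after that checkpoint are seen again. The helper names, the job name and the checkpoint interval are hypothetical.

```python
def checkpoint_topic(job_name, job_id):
    """Build a checkpoint topic name following the pattern in the slide."""
    return f"__samza_checkpoint_{job_name}_{job_id}"


def process_until(messages, start, stop, checkpoint_every):
    """Process messages[start:stop] and return (processed, last_checkpoint)."""
    processed = []
    last_checkpoint = start
    for offset in range(start, stop):
        processed.append(messages[offset])
        # Save the next offset to read every `checkpoint_every` messages.
        if (offset + 1) % checkpoint_every == 0:
            last_checkpoint = offset + 1
    return processed, last_checkpoint


messages = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]
# The first run "crashes" after offset 4 has been processed.
first_run, saved = process_until(messages, 0, 5, checkpoint_every=3)
# The restart resumes from the saved offset, so m3 and m4 are processed twice.
second_run, _ = process_until(messages, saved, len(messages), checkpoint_every=3)
topic = checkpoint_topic("my-job", 1)
```

No message is skipped, but m3 and m4 are reprocessed, which is why tasks should tolerate duplicates.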

State Management
- fast access to state through a local database
- fault tolerance by sending a local store's
writes to a replicated changelog, combined
with checkpointing
- out-of-the-box support for RocksDB
(key-value)
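
The changelog idea can be illustrated with a toy key-value store: every write goes to the local store and is also appended to a log, and a new instance rebuilds its state by replaying that log. ChangelogStore is a hypothetical class, a dict stands in for the local database and a plain list for the replicated changelog.

```python
class ChangelogStore:
    """Toy local key-value store backed by a changelog (illustrative only)."""

    def __init__(self, changelog):
        self.local = {}             # stands in for the local database
        self.changelog = changelog  # stands in for a replicated topic

    def put(self, key, value):
        """Write locally and record the write in the changelog."""
        self.local[key] = value
        self.changelog.append((key, value))

    def get(self, key, default=None):
        return self.local.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild the local state by replaying the changelog in order."""
        store = cls(changelog)
        for key, value in changelog:
            store.local[key] = value
        return store


changelog = []
store = ChangelogStore(changelog)
store.put("user-1", 3)
store.put("user-2", 1)
store.put("user-1", 4)

# Simulate losing the machine: a new store is rebuilt from the changelog.
recovered = ChangelogStore.restore(changelog)
```

Reads stay local and fast, while the changelog is only consulted when state has to be rebuilt on another machine.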

Event Loop
- synchronous tasks run on a single thread by default, but this can be
changed through configuration
- asynchronous tasks are always invoked from a single thread, while their
callbacks can be triggered from a different thread
Samza makes sure that checkpointing is performed automatically
only after the async calls have completed.
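
A rough model of the asynchronous case, assuming a thread pool for the async work: messages are submitted from one loop thread, callbacks fire on worker threads, and a checkpoint is recorded only once every call in the current batch has completed. AsyncLoop, handler and the batch size are hypothetical names chosen for the example, not Samza's API.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncLoop:
    """Toy event loop that checkpoints only after async calls complete."""

    def __init__(self):
        self.lock = threading.Lock()
        self.completed = []    # offsets whose callback has fired
        self.checkpoints = []  # offsets saved as checkpoints

    def on_complete(self, offset):
        # Called from a worker thread, so the shared list is guarded.
        with self.lock:
            self.completed.append(offset)

    def run(self, messages, handler, batch_size=2):
        with ThreadPoolExecutor(max_workers=4) as pool:
            for start in range(0, len(messages), batch_size):
                batch = range(start, min(start + batch_size, len(messages)))
                # The handler is always submitted from this single loop thread.
                futures = [
                    pool.submit(handler, messages[o], o, self.on_complete)
                    for o in batch
                ]
                # Wait for every in-flight call before checkpointing.
                for future in futures:
                    future.result()
                self.checkpoints.append(batch[-1] + 1)


def handler(message, offset, callback):
    """Hypothetical async work: signal completion through the callback."""
    callback(offset)


loop = AsyncLoop()
loop.run(["a", "b", "c", "d", "e"], handler)
```

The checkpointed offset never runs ahead of a message whose callback has not fired yet.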

Metrics
Samza has its own library to expose metrics, with counters, gauges and
timers.
Metrics can be exposed through JMX, a Kafka topic, and so on.
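
For readers unfamiliar with the three metric types, here is a minimal model of each one. The classes and the ToyRegistry below are hypothetical and do not reproduce Samza's metrics API; the group and metric names are made up for the example.

```python
class Counter:
    """Running count, e.g. messages processed."""

    def __init__(self):
        self.count = 0

    def inc(self, amount=1):
        self.count += amount


class Gauge:
    """Point-in-time value, e.g. current buffer size."""

    def __init__(self, value=0):
        self.value = value

    def set(self, value):
        self.value = value


class Timer:
    """Collects durations in milliseconds."""

    def __init__(self):
        self.samples = []

    def update(self, millis):
        self.samples.append(millis)

    def average(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


class ToyRegistry:
    """Keeps metrics by (group, name) so a reporter could export them."""

    def __init__(self):
        self.metrics = {}

    def register(self, group, name, metric):
        self.metrics[(group, name)] = metric
        return metric


registry = ToyRegistry()
processed = registry.register("my-task", "messages-processed", Counter())
buffered = registry.register("my-task", "buffered-messages", Gauge())
latency = registry.register("my-task", "process-ms", Timer())

processed.inc()
processed.inc(2)
buffered.set(17)
latency.update(4)
latency.update(6)
```

A reporter would walk the registry periodically and publish each value to its destination.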

Security
Samza provides no security of its own.
All security is implemented in the stream system, or in the environment
in which Samza containers run.

Links
https://www.infoq.com/presentations/samza-linkedin
http://es.slideshare.net/martinkleppmann/samza-at-linkedin-taking-stream-processing-to-the-next-level