distributed stream processing
@humbertostreb
Samza overview
An open-source distributed stream processing framework created by LinkedIn
- sub-second latency
- handles large amounts of state
- fault tolerance
- no messages are ever lost (at-least-once delivery)
- partitioned and distributed at every level
- processor isolation
- pluggable
Architecture
Streaming: Kafka
Execution: YARN
Processing: Samza
Kafka
- The stream may be sharded into one or more partitions.
- Each partition is independent from the others, and is replicated
across multiple machines.
- Each partition consists of a sequence of messages in a fixed order.
- Each message has an offset, which indicates its position in that
sequence.
- A Samza job can start consuming the sequence of messages from
any starting offset.
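The partition and offset model above can be sketched with a toy in-memory log. This is only an illustration, not Kafka's client API: the `PartitionedLog` class, its method names, and the key-hashing rule used to pick a partition are all assumptions made for the example.

```python
from zlib import crc32


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only stream."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Assumption: hash the key so messages with the same key share a partition.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        # The offset is the message's position in its partition.
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, start_offset=0):
        # Yield (offset, message) pairs in the partition's fixed order.
        log = self.partitions[partition]
        for offset in range(start_offset, len(log)):
            yield offset, log[offset]


log = PartitionedLog(num_partitions=2)
for i in range(5):
    log.append("user-1", f"event-{i}")
partition, last_offset = log.append("user-1", "event-5")

# A consumer can start from any offset, here offset 3.
replayed = list(log.read(partition, start_offset=3))
```

Because each partition keeps its own order and offsets, a consumer only needs to remember one number per partition to resume where it stopped.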
YARN
- ResourceManager
- NodeManager
- ApplicationMaster
Streams
A stream is composed of immutable messages of a similar type or
category
- when a job consumes more than one stream, the next message is chosen
round-robin by default, but this can be overridden
- streams can be prioritised through configuration
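A minimal sketch of round-robin choosing between input streams. The `round_robin` function and the stream names are hypothetical; Samza's own chooser is pluggable and works on incoming message envelopes rather than on plain lists.

```python
from collections import deque


def round_robin(streams):
    """Yield (stream_name, message), taking one message per stream in turn."""
    queue = deque((name, iter(messages)) for name, messages in streams.items())
    while queue:
        name, iterator = queue.popleft()
        try:
            message = next(iterator)
        except StopIteration:
            continue  # this stream is drained; drop it from the rotation
        yield name, message
        queue.append((name, iterator))


chosen = list(round_robin({"clicks": ["c0", "c1", "c2"], "views": ["v0"]}))
```

A prioritised chooser would differ only in which stream it picks next, for example by always draining a high-priority stream first.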
Job
A job is code that performs a logical
transformation on a set of input
streams to append output messages to
a set of output streams.
Partitions
Each stream is broken into one or
more partitions. Each partition in
the stream is a totally ordered
sequence of messages.
Task
A job is scaled by breaking it into
multiple tasks. The task is the unit of
parallelism of the job, just as the
partition is to the stream. Each task
consumes data from one partition for
each of the job’s input streams.
Containers
Containers are the unit of physical
parallelism, and a container is
essentially a Unix process (or Linux
cgroup). Each container runs one or
more tasks.
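The relationship between partitions, tasks and containers can be sketched as two small mappings. The function names and the round-robin placement of tasks onto containers are assumptions for illustration; the real grouping policy is configurable.

```python
def build_tasks(input_streams, num_partitions):
    # One task per partition number; each task reads that partition
    # of every input stream.
    return {
        f"task-{p}": [(stream, p) for stream in input_streams]
        for p in range(num_partitions)
    }


def assign_to_containers(task_names, num_containers):
    # Assumed policy: spread tasks over containers round-robin.
    containers = [[] for _ in range(num_containers)]
    for index, task in enumerate(sorted(task_names)):
        containers[index % num_containers].append(task)
    return containers


tasks = build_tasks(["clicks", "views"], num_partitions=4)
containers = assign_to_containers(tasks, num_containers=2)
```

The number of tasks is fixed by the partition count, while the number of containers can be changed to give the same tasks more or fewer processes.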
SamzaContainer startup steps
1 - Get last checkpointed offset for each input stream partition
2 - Create a “reader” thread for every input stream partition
3 - Start metrics reporters to report metrics
4 - Start a checkpoint timer to save your task’s input stream offsets
every so often
5 - Start a window timer to trigger your task’s window method, if it is
defined
6 - Instantiate and initialize your StreamTask once for each input
stream partition
7 - Start an event loop that takes messages from the input stream reader
threads, and gives them to your StreamTasks
8 - Notify lifecycle listeners during each one of these steps
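A simplified sketch of steps 1, 4, 5, 6 and 7, assuming a single-threaded loop and count-based "timers" so that the behaviour is deterministic. `CountingTask`, `run_container` and their parameters are hypothetical names, reader threads, metrics and lifecycle listeners are left out, and the sketch stops when the input is drained whereas a real event loop keeps running.

```python
class CountingTask:
    """Hypothetical task: counts messages and reports the count on window()."""

    def __init__(self):
        self.seen = 0
        self.window_reports = []

    def process(self, message):
        self.seen += 1

    def window(self):
        self.window_reports.append(self.seen)


def run_container(partitions, checkpoints, window_every=2, checkpoint_every=3):
    # Step 1: resume after the last checkpointed offset of each partition.
    next_offsets = {p: checkpoints.get(p, -1) + 1 for p in partitions}
    # Step 6: one task instance per input stream partition.
    tasks = {p: CountingTask() for p in partitions}
    handled = 0
    # Step 7: event loop, ending here once every partition is drained.
    while any(next_offsets[p] < len(partitions[p]) for p in partitions):
        for p, messages in partitions.items():
            offset = next_offsets[p]
            if offset >= len(messages):
                continue
            tasks[p].process(messages[offset])
            next_offsets[p] = offset + 1
            handled += 1
            if handled % window_every == 0:      # step 5, by count not time
                for task in tasks.values():
                    task.window()
            if handled % checkpoint_every == 0:  # step 4, by count not time
                checkpoints.update({q: next_offsets[q] - 1 for q in partitions})
    return tasks, checkpoints


partitions = {0: ["a", "b", "c"], 1: ["x", "y"]}
tasks, checkpoints = run_container(partitions, checkpoints={0: 0})
```

In this run the last message of partition 1 is processed after the final checkpoint, so a restart would process it again; that is the at-least-once behaviour that periodic checkpointing implies.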
Checkpointing
Samza writes checkpoints to a separate Kafka topic called
__samza_checkpoint_<job-name>_<job-id>
State Management
- fast: state is kept in a local database
- fault tolerant: a local store’s writes are
sent to a replicated changelog, combined
with checkpointing
- out-of-the-box support for RocksDB
(key-value)
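The changelog idea can be sketched with a dictionary standing in for the local database and a list standing in for the replicated changelog stream. `ChangelogStore` is a hypothetical class for illustration, not Samza's key-value store API.

```python
class ChangelogStore:
    """Toy key-value store that mirrors every write to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a replicated stream
        self.data = {}              # stands in for the local database

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.changelog.append((key, None))  # None marks a deletion

    def get(self, key):
        return self.data.get(key)

    @classmethod
    def restore(cls, changelog):
        # Rebuild local state by replaying the changelog from the start,
        # without appending the replayed writes again.
        store = cls(changelog)
        for key, value in changelog:
            if value is None:
                store.data.pop(key, None)
            else:
                store.data[key] = value
        return store


changelog = []
store = ChangelogStore(changelog)
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)
store.delete("b")

# Simulate losing the machine: only the changelog survives.
restored = ChangelogStore.restore(changelog)
```

Reads and writes stay local and fast, and the changelog is only read in full when a task has to be rebuilt on another machine.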
Event Loop
- synchronous tasks run on a single thread by default, but this can be
configured
- asynchronous tasks are always invoked on a single thread, while
callbacks can be triggered from a different thread.
Samza will make sure that checkpointing is automatically performed
only after the async calls have completed.
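One way to picture "checkpoint only after the async calls have completed" is to track in-flight offsets and checkpoint just below the oldest one still pending. This is a sketch of the idea, not Samza's internal mechanism: `AsyncOffsetTracker` is a hypothetical name, it assumes offsets are dispatched in increasing order, and it is not thread-safe.

```python
class AsyncOffsetTracker:
    """Tracks which offset is safe to checkpoint when callbacks finish out of order."""

    def __init__(self):
        self.in_flight = set()
        self.highest_dispatched = -1

    def dispatch(self, offset):
        # The message was handed to the task; its callback is still pending.
        self.in_flight.add(offset)
        self.highest_dispatched = max(self.highest_dispatched, offset)

    def complete(self, offset):
        # Called when the callback for this message fires.
        self.in_flight.discard(offset)

    def safe_checkpoint(self):
        # Everything below the oldest in-flight offset has completed.
        if self.in_flight:
            return min(self.in_flight) - 1
        return self.highest_dispatched


tracker = AsyncOffsetTracker()
for offset in (0, 1, 2):
    tracker.dispatch(offset)

tracker.complete(2)
after_first = tracker.safe_checkpoint()   # offsets 0 and 1 are still pending
tracker.complete(0)
after_second = tracker.safe_checkpoint()  # offset 1 is still pending
tracker.complete(1)
after_third = tracker.safe_checkpoint()   # nothing is pending
```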
Metrics
Samza has its own library to expose metrics, with counters, gauges and
timers.
Metrics can be exposed via JMX, a Kafka topic, and so on.
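The three metric types can be sketched with a small registry. `MetricsRegistry` and its methods are hypothetical and do not mirror Samza's metrics classes; a reporter would periodically publish something like `snapshot()` to JMX or to a Kafka topic.

```python
import time


class MetricsRegistry:
    """Toy registry with counters, gauges and timers."""

    def __init__(self):
        self.counters = {}
        self.gauges = {}
        self.timers = {}

    def inc(self, name, amount=1):
        # Counter: a value that only accumulates.
        self.counters[name] = self.counters.get(name, 0) + amount

    def set_gauge(self, name, value):
        # Gauge: the latest observed value.
        self.gauges[name] = value

    def time(self, name, func, *args):
        # Timer: record how long each call to func takes, in seconds.
        start = time.perf_counter()
        try:
            return func(*args)
        finally:
            self.timers.setdefault(name, []).append(time.perf_counter() - start)

    def snapshot(self):
        return {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timer_samples": {k: len(v) for k, v in self.timers.items()},
        }


metrics = MetricsRegistry()
metrics.inc("messages-processed")
metrics.inc("messages-processed")
metrics.set_gauge("queue-size", 7)
total = metrics.time("sum-duration", sum, [1, 2, 3])
report = metrics.snapshot()
```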
Security
Samza provides no security.
All security is implemented in the stream system, or in the environment
in which Samza containers run.
Links
https://www.infoq.com/presentations/samza-linkedin
http://es.slideshare.net/martinkleppmann/samza-at-linkedin-taking-stream-processing-to-the-next-level
Thanks

Apache samza
