Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017

1
Unified Processing at Scale with Apache Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC

2
About Me
● Apache Samza PMC member
● LinkedIn 3 years
● 8 years performance & infra development
● Passionate about scale
● Long walks on the peaks

3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch and Streaming
Future

4
Agenda
Future

5
About
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
 Scale
 Pluggability
 Kafka integration

6
Traditional Stream Processing
● Low latency
● One message at a time
● Checkpointing, state, durability
● All I/O with high-performance message brokers

7
Stateful Processing
[Diagram labels: Task 0, State 0, Changelog Stream (partition 0), Checkpoint, Stream Processor, Output Streams, Input Streams (partition 0)]
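The relationship between local state and the changelog stream can be illustrated with a minimal sketch. It assumes an in-memory list standing in for one changelog partition; `ChangelogBackedStore` and `Write` are hypothetical names for illustration and are not Samza's store implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Samza's KeyValueStore implementation.
final class ChangelogBackedStore {
    // One recorded write, as it might appear in a changelog partition.
    static final class Write {
        final String key;
        final long value;

        Write(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<Write> changelog;                   // stand-in for the changelog stream
    private final Map<String, Long> local = new HashMap<>(); // fast local state

    ChangelogBackedStore(List<Write> changelog) {
        this.changelog = changelog;
        // Restore local state by replaying every write from the start of the log.
        for (Write w : changelog) {
            local.put(w.key, w.value);
        }
    }

    void put(String key, long value) {
        local.put(key, value);                 // serve reads locally
        changelog.add(new Write(key, value));  // record the write for recovery
    }

    long get(String key) {
        return local.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        List<Write> log = new ArrayList<>();
        ChangelogBackedStore store = new ChangelogBackedStore(log);
        store.put("member-1", 1L);
        store.put("member-1", 2L);

        // Simulate a restart: a fresh instance rebuilds its state from the log.
        ChangelogBackedStore restored = new ChangelogBackedStore(log);
        System.out.println(restored.get("member-1"));
    }
}
```

The design point is that reads never leave the task, while the log makes the state recoverable on another machine.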

9
Typical Flow - Two Stages Minimum
[Diagram: PageViewEvent -> re-partition -> PageViewEventByMemberId -> window -> map -> sendTo -> PageViewEventPerMemberStream]
[Stage 1: PageViewRepartitionTask. Stage 2: PageViewByMemberIdCounterTask.]
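The reason for the first stage is that a per-member count is only correct if every event for a member reaches the same task. A rough sketch of key-based re-partitioning follows; the hash-modulo rule and the `RepartitionSketch` class are assumptions for illustration, not necessarily the partitioner Samza or Kafka uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of re-partitioning by key.
final class RepartitionSketch {
    // Equal keys always map to the same partition index.
    static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    // Route each event (represented here only by its member id) to a partition
    // of the intermediate stream.
    static List<List<String>> repartition(List<String> memberIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : memberIds) {
            partitions.get(partitionFor(id, numPartitions)).add(id);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("m1", "m2", "m1", "m3", "m2", "m1");
        List<List<String>> out = repartition(events, 4);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("partition " + i + ": " + out.get(i));
        }
    }
}
```

After this step, the second stage can count per member using only its local partition.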

10
Agenda
Future

11
Stream Processing Ecosystem – The Dream
[Diagram labels: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]

12
Stream Processing Ecosystem - Reality
[Diagram labels: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]

13
Expansion of Stream Processing at LinkedIn
● Influx of new applications
 10 -> over 200
● New use cases
 Batch and Streaming
 Remote I/O
 Composable API
● Incoming applications have different expectations
● Let’s take a look at two
Services

14
Agenda
Future

15
Online Service + Stream Processing
Requirements:
● Deployment model
 Cluster environment not suitable
● Remote I/O
 Dependencies on other services
 I/O latency stalls single threaded processor
 Container parallelism - too much overhead
Services

16
Embedded Samza
● ZooKeeper-based JobCoordinator
 Uses ZooKeeper for leader election
 Leader assigns work to the processors
[Diagram: three App Instances, each containing a Stream Processor with a Samza Container and a Job Coordinator, connected to ZooKeeper. One Job Coordinator is marked * (Leader).]
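The two responsibilities named above, electing a leader and having it assign work, can be sketched without ZooKeeper. In this sketch, "lowest processor id wins" is a stand-in for the real election and round-robin is an assumed assignment policy; `WorkAssignmentSketch` and its methods are hypothetical and do not reflect Samza's coordinator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; the election rule and assignment policy are assumptions.
final class WorkAssignmentSketch {
    // Stand-in for leader election: the smallest processor id becomes leader.
    static String electLeader(List<String> processorIds) {
        return Collections.min(processorIds);
    }

    // The leader spreads task ids 0..numTasks-1 across processors round-robin.
    static Map<String, List<Integer>> assign(List<String> processorIds, int numTasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted); // deterministic order regardless of join order
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String p : sorted) {
            assignment.put(p, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            assignment.get(sorted.get(task % sorted.size())).add(task);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> processors = Arrays.asList("p2", "p1", "p3");
        System.out.println("leader: " + electLeader(processors));
        System.out.println(assign(processors, 7));
    }
}
```

When a processor joins or leaves, the leader would recompute the assignment with the new membership list.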

17
Asynchronous Event Loop
[Diagram labels: Stream Processor; Event Loop (single thread); 1 : Task; n : Task; Restful Services; Java NIO, Netty]
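A minimal sketch of the idea: one loop thread issues non-blocking calls and keeps going, while a cap on in-flight messages bounds concurrency. The `remoteLookup` method is a hypothetical stand-in for an asynchronous client, and `AsyncEventLoopSketch` is illustrative only; it is not Samza's run loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Hypothetical illustration of an event loop with bounded in-flight requests.
final class AsyncEventLoopSketch {
    // Stand-in for a non-blocking remote call; a real client would use async I/O.
    static CompletableFuture<String> remoteLookup(String message, Executor ioPool) {
        return CompletableFuture.supplyAsync(() -> message.toUpperCase(), ioPool);
    }

    // maxConcurrency must be at least 1. Results arrive in completion order.
    static List<String> run(List<String> messages, int maxConcurrency) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);
        Semaphore inFlight = new Semaphore(maxConcurrency);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(messages.size());
        try {
            for (String message : messages) {
                inFlight.acquire(); // the loop waits only when the cap is reached
                remoteLookup(message, ioPool).whenComplete((value, error) -> {
                    if (error == null) {
                        results.add(value);
                    }
                    inFlight.release(); // the callback frees a slot for the loop
                    done.countDown();
                });
            }
            done.await(); // wait for the remaining callbacks
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing", e);
        } finally {
            ioPool.shutdown();
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "b", "c", "d"), 2));
    }
}
```

Because completion order is not guaranteed, callers that need ordering have to track it separately, which is what motivates the checkpointing slide that follows.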

18
Checkpointing
● Sync – Barrier
● Async - Watermark
[Timeline diagram: a time axis with points t1, t2, t3, tc, t4; "checkpoint" marked at tc; labels "callback 1 complete" through "callback 4 complete"]
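One way to read the watermark option: when callbacks finish out of order, the safe checkpoint is the highest offset below which every message has completed. The sketch below assumes contiguous integer offsets; `WatermarkSketch` is a hypothetical class, and the completion order in `main` is an example, not a reading of the slide's timeline.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; assumes offsets are contiguous integers.
final class WatermarkSketch {
    private long watermark; // highest offset with no incomplete predecessor
    private final Set<Long> completedAhead = new HashSet<>();

    WatermarkSketch(long lastCheckpointedOffset) {
        this.watermark = lastCheckpointedOffset;
    }

    // Called when the callback for the given offset completes, in any order.
    synchronized void complete(long offset) {
        completedAhead.add(offset);
        // Advance across any contiguous run of completed offsets.
        while (completedAhead.remove(watermark + 1)) {
            watermark++;
        }
    }

    // The offset that is safe to checkpoint right now.
    synchronized long checkpointOffset() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(0);
        w.complete(3); // message 3 finishes first; 1 and 2 are still in flight
        System.out.println(w.checkpointOffset());
        w.complete(1);
        System.out.println(w.checkpointOffset());
        w.complete(2); // closes the gap, so the watermark jumps past 3
        System.out.println(w.checkpointOffset());
    }
}
```

A barrier, by contrast, would stop dispatching new messages and wait for all in-flight callbacks before checkpointing.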

19
Performance for Remote I/O
[Chart labels: Baseline; Thread pool size = 10, Max concurrency = 1; Thread pool size = 10, Max concurrency = 3; Sync I/O with Multithreading; Single thread]

20
Case Study – Notification Scheduler
[Diagram: a Processor with three inputs (input1, input2, input3) carrying User Chat Event, User Action Event, and Connection Activity Event, and one output. Components: Aggregation Engine, Channel Selection, State store, Member profile database, Restful Services.]
① Local Data Access
② Remote Database Lookup
③ Remote Service Call

21
Agenda
Future

22
Offline Jobs
Requirements:
● Performance and low latency
● Resource hungry
 Finite jobs can hog resources
 Infinite jobs need to be better citizens
● Composable API
● Same app in batch and streaming
 Best of both worlds
● HDFS I/O

23
Low Level Logic
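import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// PageViewEvent and PageViewPerMemberIdCounterEvent are application types not shown on the slide.
public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
  private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
  private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
  private Long windowSize;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) throws Exception {
    this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
        context.getStore("windowed-counter-store");
    this.windowSize = config.getLong("task.window.ms");
  }

  // Called on the window timer: emit the current counter for each member.
  @Override
  public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
    getWindowCounterEvent().forEach(counter ->
        collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
  }

  // Called once per input message: update the counter for the event's member.
  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
      TaskCoordinator coordinator) throws Exception {
    PageViewEvent pve = (PageViewEvent) envelope.getMessage();
    countPageViewEvent(pve);
  }
  // getWindowCounterEvent() and countPageViewEvent(...) are helper methods not shown on the slide.
}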

24
High Level Logic
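import java.time.Duration;
import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.operators.StreamGraph;
import org.apache.samza.operators.windows.Windows;

// PageViewEvent and MyOutputType are application types not shown on the slide.
public class RepartitionAndCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream<PageViewEvent> pve =
        graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
    OutputStream<String, MyOutputType, MyOutputType> mpv =
        graph.getOutputStream("memberPageViews", m -> m.memberId, m -> m);
    pve
        .partitionBy(m -> m.memberId)
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0,
            (m, c) -> c + 1))
        .map(MyOutputType::new)
        .sendTo(mpv);
  }
}
[Callout: Built-in transform functions]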
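To make the window step concrete, here is a plain-Java sketch of what a keyed five-minute tumbling count computes: start each key at 0 and add 1 per message, per window. It assumes each event carries its own timestamp, which is a simplification for illustration; `TumblingWindowCountSketch` and `Event` are hypothetical and not part of the Samza API.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a keyed tumbling-window count.
final class TumblingWindowCountSketch {
    static final class Event {
        final String memberId;
        final long timestampMs;

        Event(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    // Returns window start (ms) -> member id -> number of events in that window.
    static Map<Long, Map<String, Integer>> count(List<Event> events, Duration windowSize) {
        long sizeMs = windowSize.toMillis();
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (Event e : events) {
            // Tumbling windows do not overlap: each timestamp falls in exactly one.
            long windowStart = e.timestampMs - Math.floorMod(e.timestampMs, sizeMs);
            windows.computeIfAbsent(windowStart, w -> new TreeMap<>())
                   .merge(e.memberId, 1, Integer::sum); // same fold as (m, c) -> c + 1
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a", 0L),
            new Event("a", 1_000L),
            new Event("b", 299_999L),
            new Event("a", 300_000L));
        System.out.println(count(events, Duration.ofMinutes(5)));
    }
}
```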

25
High Level API - Composable Operators
Operator     Description
filter       select a subset of messages from the stream
map          map one input message to an output message
flatMap      map one input message to 0 or more output messages
merge        union all inputs into a single output stream
partitionBy  re-partition the input messages based on a specific field
sendTo       send the result to an output stream
sink         send the result to an external system (e.g. external DB)
window       window aggregation on the input stream
join         join messages from two input streams
[Operator groups on the slide: Stateless Functions, I/O Functions, Stateful Functions]

26
Batch AND Streaming
Streaming config
streams.pageViewEvent.system=kafka
streams.pageViewEvent.physical.name=PageViewEvent
streams.memberPageViews.system=kafka
streams.memberPageViews.physical.name=MemberPageViews
Batch config
streams.pageViewEvent.system=hdfs
streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
streams.memberPageViews.system=hdfs
streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
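The point of the two configs is that application code refers only to logical stream ids such as pageViewEvent, and configuration binds each id to a system and a physical name. A small sketch of that lookup using java.util.Properties follows; `StreamBindingSketch` is a hypothetical helper that only mirrors the key layout shown on the slide and is not how Samza resolves streams internally.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration of binding a logical stream id through config.
final class StreamBindingSketch {
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    static String system(Properties config, String streamId) {
        return config.getProperty("streams." + streamId + ".system");
    }

    static String physicalName(Properties config, String streamId) {
        return config.getProperty("streams." + streamId + ".physical.name");
    }

    public static void main(String[] args) {
        Properties streaming = parse(
            "streams.pageViewEvent.system=kafka\n"
            + "streams.pageViewEvent.physical.name=PageViewEvent\n");
        Properties batch = parse(
            "streams.pageViewEvent.system=hdfs\n"
            + "streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/\n");

        // The same logical id resolves differently depending on the config supplied.
        for (Properties config : new Properties[] {streaming, batch}) {
            System.out.println(system(config, "pageViewEvent")
                + " / " + physicalName(config, "pageViewEvent"));
        }
    }
}
```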

27
Case Study - Unified Metrics with Samza
[Diagram: a UMP analyst authors a Pig script; a “Compile” step generates fluent code + runtime config; the result is deployed]

28
Performance - HDFS
● Profile count, group by country
● 500 files
● 250GB

29
Agenda
Future

30
What’s Next?
● SQL
 Prototyped 2015
 Now getting full time attention
● High Level API extensions
 Better config, I/O, windowing, and more
● Beam Runner
 Samza performance with Beam API
● Table support

31
Questions
Contact:
● Email: dev@samza.apache.org
● Social: http://twitter.com/jakemaes
Links:
● http://samza.apache.org
● http://github.com/apache/samza
● https://engineering.linkedin.com/blog
