Unified Stream Processing at Scale with Apache Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
Unified Stream Processing at Scale with Apache Samza - BDS2017


The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
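The "Batch <-> Streaming" slide in the transcript shows the same logical stream ids (`pageViewEvent`, `memberPageViews`) bound to Kafka in one config and to HDFS in another. As a minimal sketch of that idea, the Java below resolves a logical stream id to a system and physical name from `streams.<id>.system` and `streams.<id>.physical.name` keys. `StreamBinding` and `resolve` are hypothetical names for illustration and are not part of Samza; only the config keys and values come from the slide.

```java
import java.util.Properties;

/** Hypothetical helper for illustration; not a Samza class. */
final class StreamBinding {
    final String system;
    final String physicalName;

    StreamBinding(String system, String physicalName) {
        this.system = system;
        this.physicalName = physicalName;
    }

    /** Looks up the two per-stream keys used on the slide for the given logical stream id. */
    static StreamBinding resolve(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String name = config.getProperty("streams." + streamId + ".physical.name");
        if (system == null || name == null) {
            throw new IllegalArgumentException("No binding for stream " + streamId);
        }
        return new StreamBinding(system.trim(), name.trim());
    }

    public static void main(String[] args) {
        // Streaming binding, using the values shown on the slide.
        Properties streaming = new Properties();
        streaming.setProperty("streams.pageViewEvent.system", "kafka");
        streaming.setProperty("streams.pageViewEvent.physical.name", "PageViewEvent");

        // Batch binding for the same logical stream id.
        Properties batch = new Properties();
        batch.setProperty("streams.pageViewEvent.system", "hdfs");
        batch.setProperty("streams.pageViewEvent.physical.name", "hdfs://mydbsnapshot/PageViewEvent/");

        for (Properties config : new Properties[] {streaming, batch}) {
            StreamBinding b = resolve(config, "pageViewEvent");
            System.out.println(b.system + " -> " + b.physicalName);
        }
    }
}
```

The application code only ever names `pageViewEvent`; which system backs it is decided by whichever config is loaded.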

Published in: Engineering

  1. Unified Stream Processing at Scale with Apache Samza
     Jake Maes
     Staff SW Engineer at LinkedIn
     Apache Samza PMC
  2. Agenda
     Intro to Stream Processing
     Stream Processing Ecosystem at LinkedIn
     Use Case: Pre-Existing Online Service
     Use Case: Batch <-> Streaming
     Future
  3. Agenda
     Intro to Stream Processing
     Stream Processing Ecosystem at LinkedIn
     Use Case: Pre-Existing Online Service
     Use Case: Batch <-> Streaming
     Future
  4. About
     ● Stream processing framework
     ● Production at LinkedIn since 2014
     ● Apache top level project since 2014
     ● 16 Committers
     ● 74 Contributors
     ● Known for
       ▪ Scale
       ▪ Managed local state
       ▪ Pluggability
       ▪ Kafka integration
  5. Traditional Stream Processing
     ● Low latency
     ● One message at a time
     ● Checkpointing, durable state
     ● All I/O with high-performance message brokers
  6. Partitioned Processing
     Diagram labels: Input Streams (partition 0); Task0; State0; Changelog Stream (partition 0); Checkpoint Stream; Processor; Output Streams
  7. Agenda
     Intro to Stream Processing
     Stream Processing Ecosystem at LinkedIn
     Use Case: Pre-Existing Online Service
     Use Case: Batch <-> Streaming
     Future
  8. Stream Processing Use Cases at LinkedIn
     ● Anti abuse
     ● Derived data
     ● Search Indexing
     ● Geographic filtering
     ● A/B testing infrastructure
     ● Many many more…
  9. Stream Processing Ecosystem – The Dream
     Diagram labels: Applications and Services; Samza; Kafka; Storage; External Streams; Storage & Serving; Brooklin
  10. Stream Processing Ecosystem - Reality
      Diagram labels: Applications and Services; Samza; Kafka; Storage; External Streams; Storage & Serving; Brooklin
  11. Expansion of Stream Processing at LinkedIn
      ● Influx of applications
        ▪ 10 -> 200+ over 3 years
        ▪ 13K containers processing 260B events/day
      ● Migrations of existing applications
        ▪ Online services
        ▪ Offline jobs
      ● Incoming applications have different expectations
  12. Agenda
      Intro to Stream Processing
      Stream Processing Ecosystem at LinkedIn
      Use Case: Pre-Existing Online Service
      Use Case: Batch <-> Streaming
      Future
  13. Case Study – Notification Scheduler
      Diagram labels: Processor; User Chat Event; User Action Event; Connection Activity Event; input1, input2, input3; Restful Services; Member profile database; Aggregation Engine; Channel Selection; State store; output
      ① Local Data Access
      ② Remote Database Lookup
      ③ Remote Service Call
  14. Online Service + Stream Processing
      Why use stream processor?
      ● Richer framework than Kafka clients
      Requirements:
      ● Deployment model
        ▪ Cluster (YARN) environment not suitable
      ● Remote I/O
        ▪ Dependencies on other services
        ▪ I/O latency stalls single threaded processor
        ▪ Container parallelism - too much overhead
  15. Embedded Samza
      ● ZooKeeper-based JobCoordinator
        ▪ Uses ZooKeeper for leader election
        ▪ Leader assigns work to the processors
      Diagram labels: ZooKeeper; three App Instances, each containing a Stream Processor, Samza Container, and Job Coordinator (* marks the leader)
  16. Asynchronous Event Loop
      Stream Processor
      Event Loop
        ▪ Single thread
        ▪ 1 : Task
        ▪ n : Task
      Restful Services
      Java NIO, Netty
  17. Checkpointing
      ● Sync – Barrier
      ● Async - Watermark
      Timeline labels: t1, t2, t3, tc, t4 on the time axis; checkpoint; callback1 complete; callback2 complete; callback3 complete; callback4 complete
  18. Performance for Remote I/O
      Chart labels: Baseline; Thread pool size = 10, Max concurrency = 1; Thread pool size = 10, Max concurrency = 3; Sync I/O with Multithreading; Single thread
  19. Agenda
      Intro to Stream Processing
      Stream Processing Ecosystem at LinkedIn
      Use Case: Pre-Existing Online Service
      Use Case: Batch <-> Streaming
      Future
  20. Case Study - Unified Metrics with Samza
      Diagram labels: UMP; Analyst; Author; Pig Script; “Compile”; Generate Fluent Code + Runtime Config; Deploy
  21. Offline Jobs
      Why use stream processor?
      ● Lower latency
      Requirements:
      ● HDFS I/O
      ● Same app in batch and streaming
        ▪ Best of both worlds
      ● Composable API
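  22. Low Level Logic

      import org.apache.samza.config.Config;
      import org.apache.samza.storage.kv.KeyValueStore;
      import org.apache.samza.system.IncomingMessageEnvelope;
      import org.apache.samza.system.OutgoingMessageEnvelope;
      import org.apache.samza.system.SystemStream;
      import org.apache.samza.task.InitableTask;
      import org.apache.samza.task.MessageCollector;
      import org.apache.samza.task.StreamTask;
      import org.apache.samza.task.TaskContext;
      import org.apache.samza.task.TaskCoordinator;
      import org.apache.samza.task.WindowableTask;

      public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
        private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
        private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
        private Long windowSize;

        @Override
        public void init(Config config, TaskContext context) throws Exception {
          // Look up the local key-value store and the window length from the job config.
          this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
              context.getStore("windowed-counter-store");
          this.windowSize = config.getLong("task.window.ms");
        }

        @Override
        public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
          // Emit the accumulated per-member counters, keyed by member id.
          getWindowCounterEvent().forEach(counter ->
              collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
        }

        @Override
        public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
            TaskCoordinator coordinator) throws Exception {
          PageViewEvent pve = (PageViewEvent) envelope.getMessage();
          countPageViewEvent(pve);
        }

        // getWindowCounterEvent() and countPageViewEvent() are not shown on the slide.
      }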
  23. High Level Logic

      public class RepartitionAndCounterExample implements StreamApplication {
        @Override
        public void init(StreamGraph graph, Config config) {
          MessageStream<PageViewEvent> pve =
              graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
          OutputStream<String, MyOutputType, MyOutputType> mpv =
              graph.getOutputStream("memberPageViews", m -> m.memberId, m -> m);

          pve
              .partitionBy(m -> m.memberId)
              .window(Windows.keyedTumblingWindow(
                  m -> m.memberId, Duration.ofMinutes(5), () -> 0, (m, c) -> c + 1))
              .map(MyOutputType::new)
              .sendTo(mpv);
        }
      }

      Callout on the slide: Built-in transform functions
  24. Batch <-> Streaming

      Streaming config
        streams.pageViewEvent.system=kafka
        streams.pageViewEvent.physical.name=PageViewEvent
        streams.memberPageViews.system=kafka
        streams.memberPageViews.physical.name=MemberPageViews

      Batch config
        streams.pageViewEvent.system=hdfs
        streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
        streams.memberPageViews.system=hdfs
        streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
  25. Performance - HDFS
      ● Profile count, group by country
      ● 500 files
      ● 250GB
  26. Agenda
      Intro to Stream Processing
      Stream Processing Ecosystem at LinkedIn
      Use Case: Pre-Existing Service
      Use Case: Batch <-> Streaming
      Future
  27. What’s Next?
      ● SQL
        ▪ Prototyped 2015
        ▪ Now getting full time attention
      ● High Level API extensions
        ▪ Better config, I/O, windowing, and more
      ● Beam Runner
        ▪ Samza performance with Beam API
      ● Table support
  28. Thank You
      Contact:
      ● Email dev@samza.apache.org
      ● Social http://twitter.com/jakemaes
      Links:
      ● http://samza.apache.org
      ● http://github.com/apache/samza
      ● https://engineering.linkedin.com/blog
  29. Bonus Slides
  30. High Level API - Composable Operators
      filter       select a subset of messages from the stream
      map          map one input message to an output message
      flatMap      map one input message to 0 or more output messages
      merge        union all inputs into a single output stream
      partitionBy  re-partition the input messages based on a specific field
      sendTo       send the result to an output stream
      sink         send the result to an external system (e.g. external DB)
      window       window aggregation on the input stream
      join         join messages from two input streams
      Group labels on the slide: Stateless Functions; I/O Functions; Stateful Functions
  31. Co-Partitioned Streams
  32. Typical Flow - Two Stages Minimum
      Operators: Repartition, window, map, sendTo
      Diagram labels: PageViewEvent; PageViewEventByMemberId; PageViewEventPerMemberStream; PageViewRepartitionTask; PageViewByMemberIdCounterTask
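
The repartition, window, map, sendTo flow on slides 23 and 32 counts page views per member in five-minute tumbling windows. The plain-Java sketch below illustrates only the counting semantics, with no Samza dependency. `TumblingCounter` and its methods are hypothetical names, and the sketch assumes each event carries an explicit non-negative timestamp in milliseconds, which the slides do not specify.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of a keyed tumbling-window count; not a Samza class. */
final class TumblingCounter {
    private final long windowMs;
    private final Map<String, Integer> counts = new HashMap<>();

    TumblingCounter(Duration windowSize) {
        if (windowSize.isZero() || windowSize.isNegative()) {
            throw new IllegalArgumentException("window size must be positive");
        }
        this.windowMs = windowSize.toMillis();
    }

    /** Start of the fixed, non-overlapping window that contains the timestamp. */
    long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % windowMs);
    }

    /** Adds one page view for the member to the window containing the timestamp. */
    void add(String memberId, long timestampMs) {
        counts.merge(memberId + "@" + windowStart(timestampMs), 1, Integer::sum);
    }

    /** Count for a member in the window starting at windowStartMs, or 0 if none. */
    int count(String memberId, long windowStartMs) {
        return counts.getOrDefault(memberId + "@" + windowStartMs, 0);
    }

    public static void main(String[] args) {
        TumblingCounter counter = new TumblingCounter(Duration.ofMinutes(5));
        counter.add("member-1", 1_000L);    // first window
        counter.add("member-1", 299_999L);  // still the first window
        counter.add("member-1", 300_000L);  // second window starts here
        System.out.println(counter.count("member-1", 0L));
        System.out.println(counter.count("member-1", 300_000L));
    }
}
```

Keying the count by member id only gives correct totals when all events for a member reach the same task, which is why the flow repartitions by `memberId` before windowing.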

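Slide 17 contrasts barrier-style (sync) and watermark-style (async) checkpointing, with callbacks completing out of order around the checkpoint time tc. One way to read the async case is that a checkpoint may only cover offsets below the oldest message whose callback is still pending. The sketch below illustrates that reading. It is not Samza's implementation; `WatermarkTracker` and its method names are hypothetical, and it assumes offsets are dispatched in increasing order.

```java
import java.util.TreeSet;

/** Hypothetical illustration of a checkpoint watermark; not Samza's implementation. */
final class WatermarkTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private long highestDispatched = -1L;

    /** Records that the message at this offset was handed to an async callback. */
    synchronized void dispatched(long offset) {
        if (offset <= highestDispatched) {
            throw new IllegalArgumentException("offsets must be dispatched in increasing order");
        }
        highestDispatched = offset;
        inFlight.add(offset);
    }

    /** Records that the callback for this offset has completed. */
    synchronized void completed(long offset) {
        inFlight.remove(offset);
    }

    /**
     * Highest offset that is safe to checkpoint: everything dispatched so far if
     * nothing is pending, otherwise the offset just before the oldest pending one.
     * Returns -1 when nothing has been dispatched.
     */
    synchronized long watermark() {
        return inFlight.isEmpty() ? highestDispatched : inFlight.first() - 1;
    }

    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker();
        tracker.dispatched(1);
        tracker.dispatched(2);
        tracker.dispatched(3);
        tracker.completed(3); // callback 3 finishes first; 1 and 2 are still pending
        System.out.println(tracker.watermark());
        tracker.completed(1);
        tracker.completed(2);
        System.out.println(tracker.watermark());
    }
}
```

A barrier-style checkpoint would instead wait for every pending callback before recording progress, which is simpler but stalls the event loop on the slowest call.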