Unified Stream Processing at Scale with Apache Samza - BDS2017

Jacob Maes, Senior Software Engineer at LinkedIn
1
Unified Stream Processing at Scale with Apache Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
2
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch <-> Streaming
Future
3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch <-> Streaming
Future
4
About
● Stream processing framework
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
  ○ Scale
  ○ Managed local state
  ○ Pluggability
  ○ Kafka integration
5
Traditional Stream Processing
● Low latency
● One message at a time
● Checkpointing, durable state
● All I/O with high-performance message brokers
6
Partitioned Processing
[Diagram: Input Streams (partition 0) feed Task 0 inside the Stream Processor. Task 0 holds State 0, which is backed by a Changelog Stream (partition 0). The diagram also shows a Checkpoint and the Output Streams.]
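A minimal sketch of the partition-to-task grouping on this slide, assuming one task per partition number so that task i consumes partition i of every input stream. The class name, method name, and the "stream-partition" label format are hypothetical; this is an illustration, not Samza's own grouper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the Samza API.
class PartitionGrouping {
    /** Groups partition i of every input stream under task i. */
    static Map<Integer, List<String>> groupByPartition(Map<String, Integer> partitionCounts) {
        Map<Integer, List<String>> tasks = new TreeMap<>();
        // Iterate streams in name order so the result is deterministic.
        for (Map.Entry<String, Integer> stream : new TreeMap<>(partitionCounts).entrySet()) {
            for (int p = 0; p < stream.getValue(); p++) {
                tasks.computeIfAbsent(p, k -> new ArrayList<>())
                     .add(stream.getKey() + "-" + p);
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        Map<String, Integer> inputs = new TreeMap<>();
        inputs.put("PageViewEvent", 2);
        inputs.put("ProfileUpdate", 2); // hypothetical second input stream
        System.out.println(groupByPartition(inputs));
    }
}
```

Each task then owns its own slice of local state, which is why the changelog on the slide is partitioned the same way as the input.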
7
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch <-> Streaming
Future
8
Stream Processing Use Cases at LinkedIn
● Anti abuse
● Derived data
● Search Indexing
● Geographic filtering
● A/B testing infrastructure
● Many many more…
9
Stream Processing Ecosystem – The Dream
[Diagram components: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]
10
Stream Processing Ecosystem - Reality
[Diagram components: Applications and Services, Samza, Kafka, Storage, External Streams, Storage & Serving, Brooklin]
11
Expansion of Stream Processing at LinkedIn
● Influx of applications
  ○ 10 -> 200+ over 3 years
  ○ 13K containers processing 260B events/day
● Migrations of existing applications
  ○ Online services
  ○ Offline jobs
● Incoming applications have different expectations
12
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch <-> Streaming
Future
13
Case Study – Notification Scheduler
[Diagram: three inputs (input1, input2, input3) carry the User Chat Event, User Action Event, and Connection Activity Event into the Processor, which contains an Aggregation Engine, Channel Selection, and a State store, and produces an output. Other components shown: Restful Services and the Member profile database.]
① Local Data Access
② Remote Database Lookup
③ Remote Service Call
14
Online Service + Stream Processing
Why use a stream processor?
● Richer framework than Kafka clients
Requirements:
● Deployment model
  ○ Cluster (YARN) environment not suitable
● Remote I/O
  ○ Dependencies on other services
  ○ I/O latency stalls a single-threaded processor
  ○ Container parallelism: too much overhead
15
Embedded Samza
● ZooKeeper-based JobCoordinator
  ○ Uses ZooKeeper for leader election
  ○ Leader assigns work to the processors
[Diagram: three App Instances, each embedding a Stream Processor made of a Samza Container and a Job Coordinator, all connected to ZooKeeper. One Job Coordinator is marked * as the leader.]
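The coordination pattern on this slide can be sketched in memory. The sketch below assumes that the processor with the smallest ID wins the election (a stand-in for whatever ordering ZooKeeper provides) and that the leader spreads tasks round-robin; both choices, and all names, are hypothetical, and no ZooKeeper client is involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model; a real coordinator would talk to ZooKeeper.
class LeaderAssignment {
    /** The processor with the smallest ID acts as leader (stand-in for an election). */
    static String electLeader(List<String> processorIds) {
        return Collections.min(processorIds);
    }

    /** The leader spreads task names over the live processors round-robin. */
    static Map<String, List<String>> assign(List<String> processorIds, List<String> tasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted);
        Map<String, List<String>> plan = new TreeMap<>();
        for (String id : sorted) {
            plan.put(id, new ArrayList<>());
        }
        for (int i = 0; i < tasks.size(); i++) {
            plan.get(sorted.get(i % sorted.size())).add(tasks.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<String> processors = List.of("p2", "p1", "p3");
        System.out.println("leader = " + electLeader(processors));
        System.out.println(assign(processors, List.of("task0", "task1", "task2", "task3")));
    }
}
```

When an instance joins or leaves, rerunning `assign` with the new processor list yields the new plan; how and when that rebalance is triggered is outside this sketch.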
16
Asynchronous Event Loop
● Stream Processor event loop
  ○ Single thread
  ○ 1 : Task
  ○ n : Task
[Diagram labels: Restful Services; Java NIO, Netty]
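A minimal sketch of the idea behind an asynchronous event loop, assuming the remote call returns a `CompletableFuture` and that a fixed limit caps how many calls may be in flight at once. The class, the `run` method, and the semaphore-based limit are assumptions made for illustration; they are not Samza's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical illustration of a single dispatch thread with bounded in-flight calls.
class AsyncLoop {
    static <M, R> List<R> run(List<M> messages, int maxConcurrency,
                              Function<M, CompletableFuture<R>> remoteCall) {
        Semaphore inFlight = new Semaphore(maxConcurrency);
        List<R> results = new CopyOnWriteArrayList<>();
        for (M message : messages) {
            // The loop thread only waits when maxConcurrency calls are outstanding.
            inFlight.acquireUninterruptibly();
            remoteCall.apply(message).whenComplete((result, error) -> {
                if (error == null) {
                    results.add(result);
                }
                inFlight.release();
            });
        }
        // Drain: proceed once every outstanding callback has released its permit.
        inFlight.acquireUninterruptibly(maxConcurrency);
        return results;
    }

    public static void main(String[] args) {
        List<Integer> out = run(Arrays.asList(1, 2, 3), 2,
                m -> CompletableFuture.supplyAsync(() -> m * 2));
        System.out.println(out); // completion order may vary between runs
    }
}
```

With `maxConcurrency = 1` the loop behaves like a synchronous processor; raising it lets slow remote calls overlap without adding threads to the loop itself.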
17
Checkpointing
● Sync – Barrier
● Async – Watermark
[Timeline diagram: a time axis with t1, t2, t3, tc (checkpoint), and t4, plus completion markers for callback1, callback2, callback3, and callback4.]
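One way to read the "Async - Watermark" bullet: when callbacks can finish out of order, the only safe checkpoint is the highest offset below which everything has completed. The sketch below implements that rule under the assumption that offsets are consecutive integers; the class and method names are hypothetical and this is not Samza's checkpoint code.

```java
import java.util.TreeSet;

// Hypothetical illustration of a low-watermark tracker for async callbacks.
class CheckpointWatermark {
    private final TreeSet<Long> completed = new TreeSet<>();
    private long watermark; // highest offset such that it and all earlier offsets completed

    CheckpointWatermark(long firstOffset) {
        this.watermark = firstOffset - 1;
    }

    /** Records a completed callback and advances the watermark over any contiguous run. */
    synchronized long complete(long offset) {
        completed.add(offset);
        while (completed.remove(watermark + 1)) {
            watermark++;
        }
        return watermark;
    }

    /** The offset that can be checkpointed without skipping unfinished work. */
    synchronized long safeCheckpoint() {
        return watermark;
    }

    public static void main(String[] args) {
        CheckpointWatermark w = new CheckpointWatermark(1);
        w.complete(3); // offset 3 finishes first; nothing can be checkpointed yet
        w.complete(1);
        System.out.println(w.safeCheckpoint());
    }
}
```

A barrier-style (sync) checkpoint would instead stop dispatching and wait for every outstanding callback before recording the offset.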
18
Performance for Remote I/O
[Chart: compares three configurations: Baseline; Thread pool size = 10 with Max concurrency = 1; Thread pool size = 10 with Max concurrency = 3. Axis or group labels: Single thread, Sync I/O with Multithreading.]
19
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch <-> Streaming
Future
20
Case Study - Unified Metrics with Samza
[Diagram labels: UMP; Analyst; Author; Pig Script; “Compile”; Generate Fluent Code + Runtime Config; Deploy]
21
Offline Jobs
Why use a stream processor?
● Lower latency
Requirements:
● HDFS I/O
● Same app in batch and streaming
  ○ Best of both worlds
● Composable API
22
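Low Level Logic
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;
import org.apache.samza.task.WindowableTask;

// Counts page views per member and emits the counters once per window.
// PageViewEvent, PageViewPerMemberIdCounterEvent, getWindowCounterEvent() and
// countPageViewEvent() are application code that the slide does not show.
public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask {
    private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews");
    private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters;
    private Long windowSize;

    @Override
    public void init(Config config, TaskContext context) throws Exception {
        this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>)
            context.getStore("windowed-counter-store");
        this.windowSize = config.getLong("task.window.ms");
    }

    @Override
    public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception {
        getWindowCounterEvent().forEach(counter ->
            collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter)));
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) throws Exception {
        PageViewEvent pve = (PageViewEvent) envelope.getMessage();
        countPageViewEvent(pve);
    }
}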
23
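High Level Logic
import java.time.Duration;
import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.operators.StreamGraph;
import org.apache.samza.operators.windows.Windows;

// PageViewEvent and MyOutputType are application classes that the slide does not show.
public class RepartitionAndCounterExample implements StreamApplication {
    @Override
    public void init(StreamGraph graph, Config config) {
        MessageStream<PageViewEvent> pve =
            graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
        OutputStream<String, MyOutputType, MyOutputType> mpv = graph
            .getOutputStream("memberPageViews", m -> m.memberId, m -> m);
        // Slide callout: the chained calls below are built-in transform functions.
        pve
            .partitionBy(m -> m.memberId)
            .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0,
                (m, c) -> c + 1))
            .map(MyOutputType::new)
            .sendTo(mpv);
    }
}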
24
Batch <-> Streaming
Streaming config
streams.pageViewEvent.system=kafka
streams.pageViewEvent.physical.name=PageViewEvent
streams.memberPageViews.system=kafka
streams.memberPageViews.physical.name=MemberPageViews
Batch config
streams.pageViewEvent.system=hdfs
streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/
streams.memberPageViews.system=hdfs
streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews
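The point of the two configs is that application code refers only to logical stream IDs such as `pageViewEvent`, while configuration decides which system and physical name back them. The sketch below shows that lookup with plain `java.util.Properties`; the `StreamResolver` class and its "system:physicalName" output format are hypothetical and only mirror the key layout shown on the slide.

```java
import java.util.Properties;

// Hypothetical illustration of resolving a logical stream ID from config.
class StreamResolver {
    /** Returns "system:physicalName" for a logical stream ID. */
    static String describe(Properties config, String streamId) {
        String system = config.getProperty("streams." + streamId + ".system");
        String physical = config.getProperty("streams." + streamId + ".physical.name");
        return system + ":" + physical;
    }

    public static void main(String[] args) {
        Properties streaming = new Properties();
        streaming.setProperty("streams.pageViewEvent.system", "kafka");
        streaming.setProperty("streams.pageViewEvent.physical.name", "PageViewEvent");

        Properties batch = new Properties();
        batch.setProperty("streams.pageViewEvent.system", "hdfs");
        batch.setProperty("streams.pageViewEvent.physical.name", "hdfs://mydbsnapshot/PageViewEvent/");

        // Same logical ID, different physical source depending on the config in use.
        System.out.println(describe(streaming, "pageViewEvent"));
        System.out.println(describe(batch, "pageViewEvent"));
    }
}
```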
25
Performance - HDFS
● Profile count, group by country
● 500 files
● 250GB
26
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch <-> Streaming
Future
27
What’s Next?
● SQL
  ○ Prototyped 2015
  ○ Now getting full-time attention
● High Level API extensions
  ○ Better config, I/O, windowing, and more
● Beam Runner
  ○ Samza performance with Beam API
● Table support
28
Thank You
Contact:
● Email dev@samza.apache.org
● Social http://twitter.com/jakemaes
Links:
● http://samza.apache.org
● http://github.com/apache/samza
● https://engineering.linkedin.com/blog
29
Bonus Slides
30
High Level API - Composable Operators
Stateless Functions
  filter       select a subset of messages from the stream
  map          map one input message to an output message
  flatMap      map one input message to 0 or more output messages
  merge        union all inputs into a single output stream
  partitionBy  re-partition the input messages based on a specific field
I/O Functions
  sendTo       send the result to an output stream
  sink         send the result to an external system (e.g. external DB)
Stateful Functions
  window       window aggregation on the input stream
  join         join messages from two input streams
31
Co-Partitioned Streams
32
Typical Flow - Two Stages Minimum
[Diagram: PageViewEvent -> PageViewRepartitionTask (re-partition) -> PageViewEventByMemberId -> PageViewByMemberIdCounterTask (window, map, sendTo) -> PageViewEventPerMemberStream]
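The two stages can be approximated in plain Java to show what each one computes. The sketch below assumes hash partitioning on the member ID for the first stage and fixed-size tumbling windows keyed by window start for the second, with non-negative millisecond timestamps; the class, its methods, and the "windowStart/memberId" key format are hypothetical, and nothing here uses the Samza API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of repartition-by-key followed by a tumbling-window count.
class TumblingCount {
    /** Stage 1: picks the partition for a key so all events for a member land together. */
    static int partitionFor(String memberId, int numPartitions) {
        return Math.floorMod(memberId.hashCode(), numPartitions);
    }

    /** Stage 2: counts events per (window start, member); arrays are parallel. */
    static Map<String, Integer> count(long[] timestampsMs, String[] memberIds, long windowMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long windowStart = (timestampsMs[i] / windowMs) * windowMs;
            counts.merge(windowStart + "/" + memberIds[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        long[] ts = {0L, 1_000L, 300_000L};
        String[] members = {"alice", "alice", "alice"};
        System.out.println("partition = " + partitionFor("alice", 4));
        System.out.println(count(ts, members, fiveMinutes));
    }
}
```

The repartition stage matters because a per-member count is only correct when every event for that member reaches the same task.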

Unified Stream Processing at Scale with Apache Samza - BDS2017

  • 1. 1 Unified Stream Processing at Scale with Apache Samza Jake Maes Staff SW Engineer at LinkedIn Apache Samza PMC
  • 2. 2 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Online Service Use Case: Batch  Streaming Future
  • 3. 3 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Online Service Use Case: Batch  Streaming Future
  • 4. 4 About ● Stream processing framework ● Production at LinkedIn since 2014 ● Apache top level project since 2014 ● 16 Committers ● 74 Contributors ● Known for  Scale  Managed local state  Pluggability  Kafka integration
  • 5. 5 ● Low latency ● One message at a time ● Checkpointing, durable state ● All I/O with high-performance message brokers Traditional Stream Processing
  • 6. 6 Partitioned Processing TaskTask0 State0 Changelog Stream (partition 0) Checkpoint Stream Processor Output StreamsInput Streams (partition 0)
  • 7. 7 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Online Service Use Case: Batch  Streaming Future
  • 8. 8 ● Anti abuse ● Derived data ● Search Indexing ● Geographic filtering ● A/B testing infrastructure ● Many many more… Stream Processing Use Cases at LinkedIn
  • 9. 9 Stream Processing Ecosystem – The Dream Applications and Services Samza Kafka Storage External Streams Storage & Serving Brooklin
  • 10. 10 Stream Processing Ecosystem - Reality Applications and Services Samza Kafka Storage External Streams Storage & Serving Brooklin
  • 11. 11 Expansion of Stream Processing at LinkedIn ● Influx of applications  10 -> 200+ over 3 years  13K containers processing 260B events/day ● Migrations of existing applications  Online services  Offline jobs ● Incoming applications have different expectations Services
  • 12. 12 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Online Service Use Case: Batch  Streaming Future
  • 13. 13 Case Study – Notification Scheduler Processor User Chat Event User Action Event Connection Activity Event Restful Services Member profile database Aggregation Engine Channel Selection State store input1 input2 input3 ① Local Data Access ② Remote Database Lookup ③ Remote Service Calloutput
  • 14. 14 Online Service + Stream Processing Why use stream processor? ● Richer framework than Kafka clients Requirements: ● Deployment model  Cluster (YARN) environment not suitable ● Remote I/O  Dependencies on other services  I/O latency stalls single threaded processor  Container parallelism - too much overhead Services
  • 15. 15 App Instance Embedded Samza ● Zookeeper-based JobCoordinator  Uses Zookeeper for leader election  Leader assigns work to the processors ZooKeeper Stream Processor Samza Container Job Coordinator* App Instance Stream Processor Samza Container Job Coordinator App Instance Stream Processor Samza Container Job Coordinator * Leader
  • 16. 16 Asynchronous Event Loop Stream Processor Event Loop  Single thread  1 : Task  n : Task Restful Services Java NIO, Netty
  • 17. 17 Checkpointing ● Sync – Barrier ● Async - Watermark t1 t2 t3 tc t4 checkpoint callback3 complete time callback1 complete callback2 complete callback4 complete
  • 18. 18 Performance for Remote I/O Baseline Thread pool size = 10 Max concurrency = 1 Thread pool size = 10 Max concurrency = 3 Sync I/O with MultithreadingSingle thread
  • 19. 19 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Online Service Use Case: Batch  Streaming Future
  • 20. 20 Case Study - Unified Metrics with Samza UMP Analyst Pig Script “Compile”Author Generate Fluent Code + Runtime Config Deploy+ +
  • 21. 21 Offline Jobs Why use stream processor? ● Lower latency Requirements: ● HDFS I/O ● Same app in batch and streaming  Best of both worlds ● Composable API
  • 22. 22 Low Level Logic public class PageViewByMemberIdCounterTask implements InitableTask, StreamTask, WindowableTask { private final SystemStream pageViewCounter = new SystemStream("kafka", "MemberPageViews"); private KeyValueStore<String, PageViewPerMemberIdCounterEvent> windowedCounters; private Long windowSize; @Override public void init(Config config, TaskContext context) throws Exception { this.windowedCounters = (KeyValueStore<String, PageViewPerMemberIdCounterEvent>) context.getStore("windowed-counter-store"); this.windowSize = config.getLong("task.window.ms"); } @Override public void window(MessageCollector collector, TaskCoordinator coordinator) throws Exception { getWindowCounterEvent().forEach(counter -> collector.send(new OutgoingMessageEnvelope(pageViewCounter, counter.memberId, counter))); } @Override public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) throws Exception { PageViewEvent pve = (PageViewEvent) envelope.getMessage(); countPageViewEvent(pve); } }
  • 23. 23 High Level Logic public class RepartitionAndCounterExample implements StreamApplication { @Override public void init(StreamGraph graph, Config config) { MessageStream<PageViewEvent> pve = graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m); OutputStream<String, MyOutputType, MyOutputType> mpv = graph .getOutputStream("memberPageViews", m -> m.memberId, m -> m); pve .partitionBy(m -> m.memberId) .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0, (m, c) -> c + 1)) .map(MyOutputType::new) .sendTo(mpv); } } Built-in transform functions
  • 24. 24 Batch <-> Streaming streams.pageViewEvent.system=kafka streams.pageViewEvent.physical.name=PageViewEvent streams.memberPageViews.system= kafka streams.memberPageViews.physical.name=MemberPageViews streams.pageViewEvent.system=hdfs streams.pageViewEvent.physical.name=hdfs://mydbsnapshot/PageViewEvent/ streams.memberPageViews.system=hdfs streams.memberPageViews.physical.name=hdfs://myoutputdb/MemberPageViews Streaming config Batch config
  • 25. 25 Performance - HDFS ● Profile count, group by country ● 500 files ● 250GB
  • 26. 26 Agenda Intro to Stream Processing Stream Processing Ecosystem at LinkedIn Use Case: Pre-Existing Service Use Case: Batch  Streaming Future
  • 27. 27 What’s Next? ● SQL  Prototyped 2015  Now getting full time attention ● High Level API extensions  Better config, I/O, windowing, and more ● Beam Runner  Samza performance with Beam API ● Table support
  • 28. 28 Thank You Contact: ● Email dev@samza.apache.org ● Social http://twitter.com/jakemaes Links: ● http://samza.apache.org ● http://github.com/apache/samza ● https://engineering.linkedin.com/blog
  • 30. 30 High Level API - Composable Operators filter select a subset of messages from the stream map map one input message to an output message flatMap map one input message to 0 or more output messages merge union all inputs into a single output stream partitionBy re-partition the input messages based on a specific field sendTo send the result to an output stream sink send the result to an external system (e.g. external DB) window window aggregation on the input stream join join messages from two input streams Stateless Functions I/O Functions Stateful Functions
  • 32. 32 Typical Flow - Two Stages Minimum. Stage 1 (PageViewRepartitionTask): PageViewEvent -> re-partition -> PageViewEventByMemberId. Stage 2 (PageViewByMemberIdCounterTask): PageViewEventByMemberId -> window -> map -> sendTo -> PageViewEventPerMemberStream
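
The two-stage flow on slide 32 (re-partition by member, then window and count) can be mimicked in plain Java to see what each stage computes. This is only a minimal sketch under stated assumptions, not Samza code: `TwoStageFlowSketch`, `PageView`, `repartition`, and `countPerWindow` are hypothetical names, partitions are modeled as in-memory lists, timestamps are assumed to be non-negative epoch milliseconds, and the tumbling window is approximated by bucketing timestamps into fixed intervals.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TwoStageFlowSketch {
    // Hypothetical event type: which member viewed a page, and when.
    static final class PageView {
        final String memberId;
        final long timestampMs; // assumed non-negative epoch milliseconds

        PageView(String memberId, long timestampMs) {
            this.memberId = memberId;
            this.timestampMs = timestampMs;
        }
    }

    // Stage 1 (re-partition): route each event to partition hash(memberId) mod N,
    // so that all events of one member end up in the same partition.
    static List<List<PageView>> repartition(List<PageView> events, int partitionCount) {
        List<List<PageView>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (PageView e : events) {
            int p = Math.floorMod(e.memberId.hashCode(), partitionCount);
            partitions.get(p).add(e);
        }
        return partitions;
    }

    // Stage 2 (window + count): within one partition, count events per member
    // per tumbling window. Result keys have the form "<memberId>@<windowStartMs>".
    static Map<String, Integer> countPerWindow(List<PageView> partition, Duration window) {
        long sizeMs = window.toMillis();
        Map<String, Integer> counts = new TreeMap<>();
        for (PageView e : partition) {
            long windowStart = (e.timestampMs / sizeMs) * sizeMs;
            counts.merge(e.memberId + "@" + windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> events = List.of(
            new PageView("alice", 0L),
            new PageView("bob", 1_000L),
            new PageView("alice", 60_000L),
            new PageView("alice", 300_000L)); // falls into the second 5-minute window
        for (List<PageView> partition : repartition(events, 2)) {
            System.out.println(countPerWindow(partition, Duration.ofMinutes(5)));
        }
    }
}
```

Because stage 1 sends every event of a member to one partition, stage 2 can keep its counts in purely local state, which is the reason the repartition stage exists in the first place.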

Editor's Notes

  1. This talk is an evolution story of stream processing at LinkedIn. A few years ago, services, batch, and stream processing were isolated; now stream processing is used everywhere. The talk focuses on LinkedIn, but it should apply if your organization is looking to adopt or expand its usage of stream processing.
  2. The title of the talk uses the word “Unified”: a stream processing framework that can be used by itself, embedded in an online service, or deployed in both streaming and batch environments seamlessly.
  3. Latency spectrum. Samza-Kafka interaction is optimized. Performance: most processors can handle 10K msg/s per container. Yes, even with state! We have seen trivial processors like a repartitioner handle as much as 50K msg/s per container, and have run benchmarks showing 1.2M msg/s on a single host.
  4. Under the hood. Partitions: data parallelism; could be files on HDFS, partitions in Kafka, etc. Tasks: logical unit of execution; isolated local state. Processor: computational parallelism (coarse grained, 1 JVM); single-threaded event loop. Work assignment: input partitions are assigned to task instances; a particular task instance usually processes 1 partition from each input (topic); a task instance will often write to all output partitions. Checkpoints are used to track progress. The changelog provides state durability.
  5. So, how does this fit into the broader ecosystem? Stream processing is the center of the world. On the left is storage: data at rest. Brooklin is the stream ingestion and normalization layer: CDC plus ingestion from other streams. Events come into Kafka from Brooklin or from apps and services, are processed by Samza, and go back out to Kafka, where they are ingested by other storage and serving components. This is a common pattern, optimized for streams (everything is a stream). Realistic? No. Because a stream processor is optimized for interacting with streams, it makes sense to pursue an architecture which provides access to all the necessary data sources and sinks as streams.
  6. In reality, streaming applications often need to interface with a number of other systems. Why? It is too expensive to replicate everything into Kafka [Kafka Connect]. A processor was written offline but, for latency purposes, needs to also run in streaming mode. Some datasources are shared with other services that need random access, which is easier to provide from a database or serving layer, and we don’t want multiple sources of truth. Some systems don’t have the ability to ingest from a stream (either because it wasn’t created, or because they just wouldn’t be able to do it fast enough). Sometimes, for security purposes, it’s better for an application to interact directly with another. Where does this come from? Over the remainder of this talk, I’ll describe how stream processing has changed at LinkedIn and dig into 2 sources of the evolving requirements and how we adapted to them.
  7. As I mentioned earlier, at LinkedIn we’ve been using Samza in production for over 3 years. In that time it has grown from 10 to over 200 applications. We now have over 1300 app instances, with an average of 10 containers each, handling over 260B events per day (conservative numbers). These applications are not all new stream processors; many of them are migrations of existing applications that can be divided into 2 main categories: Preexis
  8. Why use a stream processor? Abstractions for input and output streams, checkpointing, durable state, etc. Existing services often don’t fit with the YARN deployment model: they may already have dedicated hardware they want to use, or may require a more static host assignment, e.g. if they’re exposing a RESTful endpoint. They also tend to depend on other services: datasources with only RESTful or client APIs (not streaming). Remote I/O introduces huge latency into the single-threaded event loop. Workaround 1: users would manage their own thread pool and use manual commit in window(). Workaround 2: users would just use a massive number of containers to get more parallelism.
  9. Metrics used to be run daily with Pig scripts. Now the same script can be compiled into a Samza processor that runs both online and offline, with no need to rewrite for the new platform. Real-time metrics: from a single definition (e.g. a Pig script), generate batch and real-time flows. Flip a config!
  10. How to associate the keys from multiple streams? Co-partitioning. Each input has the same partition key, the same partition count, and the same partition algorithm (usually hash + modulo).
  11. The result of the co-partitioning requirement: most stateful jobs include a repartition stage, which re-keys or reshuffles the inputs to achieve co-partitioning. It is often implemented as separate processors that are deployed at the same time.
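
The co-partitioning requirement from notes 10 and 11 can be illustrated with a small plain-Java sketch. Everything here is an assumption made for illustration and not Samza API: `CoPartitioningSketch`, `Layout`, `isCoPartitioned`, and `partitionFor` are hypothetical names, member IDs are modeled as integers, and "hash + modulo" is taken to mean `Math.floorMod(hash, partitionCount)`.

```java
final class CoPartitioningSketch {
    // Hypothetical description of how one input stream is partitioned.
    static final class Layout {
        final String partitionKey; // field used as the key, e.g. "memberId"
        final int partitionCount;  // number of partitions in the stream
        final String algorithm;    // label for the partitioning function

        Layout(String partitionKey, int partitionCount, String algorithm) {
            this.partitionKey = partitionKey;
            this.partitionCount = partitionCount;
            this.algorithm = algorithm;
        }
    }

    // Two inputs are co-partitioned only if key, count, and algorithm all match.
    static boolean isCoPartitioned(Layout a, Layout b) {
        return a.partitionKey.equals(b.partitionKey)
            && a.partitionCount == b.partitionCount
            && a.algorithm.equals(b.algorithm);
    }

    // Hash + modulo. floorMod keeps the result in [0, partitionCount)
    // even when the hash is negative.
    static int partitionFor(int memberId, int partitionCount) {
        return Math.floorMod(Integer.hashCode(memberId), partitionCount);
    }

    public static void main(String[] args) {
        Layout pageViews = new Layout("memberId", 8, "hash-modulo");
        Layout profiles = new Layout("memberId", 8, "hash-modulo");
        Layout clicks = new Layout("memberId", 4, "hash-modulo");

        System.out.println(isCoPartitioned(pageViews, profiles)); // true
        System.out.println(isCoPartitioned(pageViews, clicks));   // false

        // Member 13 lands on partition 5 of an 8-partition stream but on
        // partition 1 of a 4-partition stream, so a task that reads the same
        // partition number from both inputs would not see both records.
        System.out.println(partitionFor(13, 8)); // 5
        System.out.println(partitionFor(13, 4)); // 1
    }
}
```

When the check fails, as for the hypothetical `clicks` layout, a repartition stage that rewrites the stream with a matching key, count, and algorithm restores the alignment.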