Samza Demo @scale 2017

Apache Kafka
• 2.1 Trillion messages ingested per day
• 0.5 PB in, 2 PB out per day (compressed)
• 16 million msg/sec peaks
Apache Samza
• Over 500 applications running in production
• With 10,000+ containers
• Applications with several TB of local state
Scale of Event Processing at LinkedIn
Best in Class Support for
Stateful Stream Processing
• Incremental checkpointing for large state and
fast recovery.
• Local state that works seamlessly across
upgrades and failures.
• Async Processing for efficient remote I/O
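The local-state bullets above can be pictured with a small sketch. This is not Samza's implementation: `ChangelogBackedStore`, `Change`, and `restore` are hypothetical names, and the changelog is modeled as an in-memory list rather than a durable stream. The idea shown is only that every write to local state is also appended to a changelog, so a replacement instance can rebuild the same state by replaying it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a Samza class.
final class ChangelogBackedStore {
    // One changelog entry: a key and its new value (null value means "deleted").
    static final class Change {
        final String key;
        final Long value;
        Change(String key, Long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>();
    private final List<Change> changelog;

    ChangelogBackedStore(List<Change> changelog) {
        this.changelog = changelog;
    }

    // Every write goes to the local map and is appended to the changelog.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    void delete(String key) {
        local.remove(key);
        changelog.add(new Change(key, null));
    }

    // Reads are served from local state only.
    Long get(String key) {
        return local.get(key);
    }

    // Rebuild local state after a failure by replaying the changelog in order.
    static ChangelogBackedStore restore(List<Change> changelog) {
        ChangelogBackedStore store = new ChangelogBackedStore(new ArrayList<>(changelog));
        for (Change c : changelog) {
            if (c.value == null) {
                store.local.remove(c.key);
            } else {
                store.local.put(c.key, c.value);
            }
        }
        return store;
    }
}
```

Because reads never leave the process, the cost of durability in this sketch is one append per write, and recovery time grows with the length of the log that has to be replayed.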
Hardened at Internet Scale
• In use at LinkedIn, Uber, Netflix, Intuit,
Metamarkets, TripAdvisor, VMware, Optimizely,
Redfin, etc.
• Processing events from Kafka, Kinesis, EventHub,
HDFS, ZeroMQ, DynamoDB Streams, MongoDB,
Databus, Brooklin, etc.
Why Apache Samza?
Unified API For Stream and Batch
Processing
• Process data in streams or in Hadoop without any
code changes.
Run as a Service or a Library
• Write once, run anywhere.
• Deploy in a managed cluster, or embed as a
library in another application.
Stream (data in motion) Processing
• Click Stream Processing, Interactive User Feeds
• Security, Fraud Detection
• Application Monitoring
• Internet of Things
• Ads, Gaming, Trading etc.
Security
Multi-Stage Dataflow Example
Dataflow: Page View (in stream) -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member (out stream)
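public class PageViewCountApplication implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    // Logical stream ids; the job configuration binds them to physical streams.
    MessageStream<PageViewEvent> pageViewEvents = graph.getInputStream("pageViewStream");
    MessageStream pageViewPerMember = graph.getOutputStream("pageViewPerMemberStream");

    pageViewEvents
        // Repartition so that all events for one member reach the same task.
        .partitionBy(m -> m.memberId)
        // Count events per member in 5-minute tumbling windows.
        // initialValue is not defined on the slide; it stands for the starting count.
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))
        // Convert each window result into the output message type.
        .map(MyStreamOutput::new)
        .sendTo(pageViewPerMember);
  }
}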
Built-in transform functions: partitionBy, window, map, sendTo
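The "Repartition by member id" stage can be pictured as routing each event to a partition chosen from its key. The sketch below is an assumption for illustration: the slides do not say how the partition is chosen, so a plain hash-modulo scheme is used, and `KeyPartitioner`, `partitionFor`, and `repartition` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the hash-modulo scheme is an assumption.
final class KeyPartitioner {
    // Map a key to a partition in [0, numPartitions).
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Group member ids by partition, so equal ids always land together.
    static List<List<String>> repartition(List<String> memberIds, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String id : memberIds) {
            partitions.get(partitionFor(id, numPartitions)).add(id);
        }
        return partitions;
    }
}
```

The property that matters for the window stage is that every event for a given member ends up in the same partition, so a per-member count can be kept in one place.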
Stream Application in Batch
Application logic: Count number of ‘Page Views’ for each member in a 5 minute
window and send the counts to ‘Page View Per Member’
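The application logic above, counting page views per member in 5-minute tumbling windows, can be sketched without any framework. This is a simplified model, not Samza's windowing code: `TumblingWindowCounter` is a hypothetical name, and timestamps are passed in explicitly so the sketch is deterministic (the slides do not state whether the window uses event time or processing time).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Samza's window implementation.
final class TumblingWindowCounter {
    private final long windowMillis;
    // window start (epoch millis) -> member id -> page-view count
    private final Map<Long, Map<String, Long>> counts = new TreeMap<>();

    TumblingWindowCounter(Duration windowSize) {
        this.windowMillis = windowSize.toMillis();
    }

    // Assign the event to the fixed, non-overlapping window containing its timestamp.
    void onPageView(String memberId, long timestampMillis) {
        long windowStart = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
        counts.computeIfAbsent(windowStart, w -> new TreeMap<>())
              .merge(memberId, 1L, Long::sum);
    }

    // Count for one member in the window that starts at windowStart (0 if none).
    long count(long windowStart, String memberId) {
        return counts.getOrDefault(windowStart, Collections.<String, Long>emptyMap())
                     .getOrDefault(memberId, 0L);
    }
}
```

Each event belongs to exactly one window, which is what distinguishes a tumbling window from a sliding one.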
Dataflow: Page View (in stream) -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member (out stream)
HDFS
PageView: hdfs://mydbsnapshot/PageViewFiles/
PageViewPerMember: hdfs://myoutputdb/PageViewPerMemberFiles
Zero code changes
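One way to read "zero code changes" is that the application refers only to logical stream ids, and configuration decides which physical source each id points to. The sketch below illustrates that indirection. It rests on assumptions: the `streams.<id>.samza.system` and `streams.<id>.samza.physical.name` key pattern follows Samza's stream configuration convention but should be checked against the configuration reference for the version in use, and `StreamBindings`, `physicalName`, and `batchConfig` are hypothetical helpers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; key names are assumptions to verify.
final class StreamBindings {
    // Resolve a logical stream id to its physical name, falling back to the id itself.
    static String physicalName(Map<String, String> config, String streamId) {
        return config.getOrDefault("streams." + streamId + ".samza.physical.name", streamId);
    }

    // Batch bindings using the HDFS paths shown on the slide.
    static Map<String, String> batchConfig() {
        Map<String, String> c = new HashMap<>();
        c.put("streams.pageViewStream.samza.system", "hdfs");
        c.put("streams.pageViewStream.samza.physical.name",
              "hdfs://mydbsnapshot/PageViewFiles/");
        c.put("streams.pageViewPerMemberStream.samza.system", "hdfs");
        c.put("streams.pageViewPerMemberStream.samza.physical.name",
              "hdfs://myoutputdb/PageViewPerMemberFiles");
        return c;
    }
}
```

With a different configuration map, the same ids `pageViewStream` and `pageViewPerMemberStream` could resolve to streaming sources instead, while the application class stays untouched.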
Stream Processing as a Library
Dataflow: Page View -> Repartition by member id -> Window -> Map -> SendTo -> Page View per Member
Launch Stream Processor
public static void main(String[] args) {
  // Load the job configuration from command-line options.
  CommandLine cmdLine = new CommandLine();
  OptionSet options = cmdLine.parser().parse(args);
  Config config = cmdLine.loadConfig(options);

  // Run the same application class embedded in this process.
  LocalApplicationRunner runner = new LocalApplicationRunner(config);
  PageViewCountApplication app = new PageViewCountApplication();
  runner.run(app);
  runner.waitForFinish();
}
job.coordinator.factory=org.apache.samza.zk.ZkJobCoordinatorFactory
job.coordinator.zk.connect=my-zk.server:2191
Zero code changes
Data Ingestion at LinkedIn (architecture diagram). Components shown: Clients (browser, devices, ...), Services Tier, Ingestion, Apache Kafka, Brooklin, Espresso, Oracle, AWS Kinesis, Azure EventHub, Real Time Processing (Apache Samza).
Backup
Local State -- Throughput
• Remote state is 30-150x worse than local state
• On disk with caching is comparable with in memory
• Changelog adds minimal overhead
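The "on disk with caching" result suggests a read-through cache in front of a slower store. The sketch below shows one common way to build such a cache in Java, using an access-ordered `LinkedHashMap` for least-recently-used eviction. It is an illustration under assumptions, not the cache used in the benchmark: `CachedStore` is a hypothetical name, and the slow store is modeled as a plain function.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; not the cache measured on the slide.
final class CachedStore<K, V> {
    private final Function<K, V> backingRead; // stands in for a disk or remote read
    private final Map<K, V> cache;
    private int backingReads = 0;

    CachedStore(Function<K, V> backingRead, int capacity) {
        this.backingRead = backingRead;
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Serve from the cache when possible; otherwise read through and remember the result.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = backingRead.apply(key);
            backingReads++;
            cache.put(key, value);
        }
        return value;
    }

    // Number of reads that had to go to the slower store.
    int backingReads() {
        return backingReads;
    }
}
```

When the working set fits in the cache, most reads never touch the slower store, which is consistent with the slide's observation that cached on-disk state performs close to in-memory state.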
Failure Recovery
• ~ constant overhead with Host Affinity
• Parallel recovery: equal recovery time irrespective of # failed containers
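Host affinity can be pictured as a placement preference: a restarted container goes back to the host that already holds its local state, so the state does not have to be rebuilt from scratch. The sketch below is a simplified model and not Samza's scheduler. `HostAffinityPlacer` and `place` are hypothetical names, the fallback to the first available host is an arbitrary choice, and at least one host is assumed to be available.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not Samza's container allocator.
final class HostAffinityPlacer {
    // Returns container id -> chosen host.
    // Assumes availableHosts is non-empty.
    static Map<String, String> place(Map<String, String> lastHostByContainer,
                                     Set<String> availableHosts) {
        TreeSet<String> hosts = new TreeSet<>(availableHosts); // deterministic fallback order
        Map<String, String> placement = new HashMap<>();
        for (Map.Entry<String, String> e : lastHostByContainer.entrySet()) {
            String preferred = e.getValue();
            // Prefer the host that still has this container's local state.
            String chosen = hosts.contains(preferred) ? preferred : hosts.first();
            placement.put(e.getKey(), chosen);
        }
        return placement;
    }
}
```

Containers that return to their previous host can reuse state already on disk, while those placed elsewhere would need a full restore.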
Samza HDFS Benchmark
• Profile count, group-by country
• 500 files
• 250GB input
