SlideShare a Scribd company logo
1 of 70
Download to read offline
Processing over a trillion events a day
CASE STUDIES IN SCALING STREAM PROCESSING AT LINKEDIN
Processing over a trillion events a day
CASE STUDIES IN SCALING STREAM PROCESSING AT LINKEDIN
​ Jagadish Venkatraman
​ Sr. Software Engineer, LinkedIn
​ Apache Samza committer
​ jagadish@apache.org
Today’s
Agenda
1 Stream processing scenarios
2 Hard problems in stream processing
3 Case Study 1: LinkedIn’s communications platform
4 Case Study 2: Activity tracking in the News feed
Data processing latencies
Response latency
Synchronous Later. Possibly much later.
0 millis
Data processing latencies
Response latency
Milliseconds to minutes
Synchronous Later. Possibly much later.
0 millis
Some stream processing scenarios at LinkedIn
Real-time DDoS
protection for
members
Security
Some stream processing scenarios at LinkedIn
Real-time DDoS
protection for
members
Security
Notifications to
members
Notifications
Some stream processing scenarios at LinkedIn
Real-time DDoS
protection for
members
Security
Notifications to
members
Notifications
Real-time topic
tagging of articles
News
classification
Some stream processing scenarios at LinkedIn
Real-time DDoS
protection for
members
Security
Notifications to
members
Notifications
Real-time topic
tagging of articles
News
classification
Real-time site-speed
profiling by facets
Performance
monitoring
Some stream processing scenarios at LinkedIn
Real-time DDoS
protection for
members
Security
Analysis of service
calls
Call-graph
computation
Notifications to
members
Notifications
Real-time topic
tagging of articles
News
classification
Real-time site-speed
profiling by facets
Performance
monitoring
Some stream processing scenarios at LinkedIn
Tracking ads that
were clicked
Ad relevance
Tracking ads that
were clicked
Ad relevance
Aggregated, real-
time counts by
dimensions
Aggregates
Some stream processing scenarios at LinkedIn
Tracking ads that
were clicked
Ad relevance
Aggregated, real-
time counts by
dimensions
Aggregates
Tracking session
duration 
View-port
tracking
Some stream processing scenarios at LinkedIn
Tracking ads that
were clicked
Ad relevance
Aggregated, real-
time counts by
dimensions
Aggregates
Tracking session
duration 
View-port
tracking
Standardizing titles,
gender, education
Profile
standardization
Some stream processing scenarios at LinkedIn
….
A N D M A N Y M O R E S C E N A R I O S I N P R O D U C T I O N …
200+
Applications
A N D M A N Y M O R E S C E N A R I O S I N P R O D U C T I O N …
200+
Applications
Powered By
Samza
overview
•  A distributed stream processing framework
•  2012: Development at LinkedIn
•  Used at Netflix, Uber, Tripadvisor and several large
companies.
•  2013: Apache incubator
•  2015 (Jan): Apache Top Level Project
Today’s
agenda
1 Stream processing scenarios
2 Hard problems in stream processing
3 Case Study 1: LinkedIn’s communications platform
4 Case Study 2: Activity tracking in the news feed
Hard problems in stream processing
•  Partitioned streams
•  Distribute processing
among all workers
Scaling processing
•  Hardware failures are
inevitable
•  Efficient check-pointing
•  Instant recovery
State management
•  Need primitives for
supporting remote I/O
High performance
remote I/O
•  Running on a multi-tenant
cluster
•  Running as an embedded
library
•  Unified API for batch and
real-time data

Heterogeneous
deployment models
Partitioned processing model
DB change
capture
Kafka
Hadoop
Kafka
Hadoop
Serving stores
Elastic
Search
Samza application
Partitioned processing model
DB change
capture
Hadoop
Task-1
Task-2
Task-3
Container-1
Container-2
Kafka
Heartbeat
Samza master
•  Each task processes one or more
partitions
•  The Samza master assigns partitions
to tasks and monitors container
liveness
•  Scaling by increasing # of containers
Samza application
Hard problems in stream processing
•  Partitioned streams
•  Distribute processing
among all workers
Partitioned
processing
•  Hardware failures are
inevitable
•  Instant recovery
•  Efficient Check-pointing
State Management
•  Provides primitives for
supporting efficient
remote I/O.
Efficient remote data
access
•  Samza supports running
on a multi-tenant cluster
•  Samza can also run as a
light-weight embedded
library
•  Unified API for batch and
streaming
Heterogeneous
deployment models
•  Stateless versus Stateful operations
Use cases of Local state
•  Temporary data storage
•  Adjunct data lookups
Advantages
•  Richer access patterns, faster than remote state

Why local state matters?
•  Querying the remote DB from your
app versus local embedded DB access
•  Samza provides an embedded fault-
tolerant, persistent database.
Samza
Samza
Architecture for Temporary data
VERSUS
•  Querying the remote DB from your
app versus change capture from
database.
•  Turn remote I/O to local I/O
•  Bootstrap streams
Samza
Samza
Architecture for Adjunct lookups
VERSUS
•  Querying the remote DB from your
app versus change capture from
database.
•  Bootstrap and partition the remote
DB from the stream
Samza
Samza
Architecture for Adjunct lookups
VERSUS
100X
Faster
1.1M
TPS
Task-1
Container-1
Container-2
Task-2
•  Changelog
1.  Fault tolerant, replicated into Kafka.
2.  Ability to catch up from arbitrary
point in the log. 
3.  State restore from Kafka during host
failures
Key optimizations for scaling local state
Speed thrills but can also KILL.
Task-1
Container-1
Container-2
Heartbeat
Samza master
Task-2
Durable task– host mapping
•  Incremental state check-pointing
1.  Re-use on-disk state snapshot (host
affinity)
2.  Write on-disk file on host at
checkpoint
3.  Catch-up on only delta from the Kafka
change-log
Key optimizations for scaling local state
Task-1
Container-1
Container-2
Heartbeat
Samza master
Task-2
Durable task– host mapping
•  Incremental state check-pointing
1.  Re-use on-disk state snapshot (host
affinity)
2.  Write on-disk file on host at
checkpoint
3.  Catch-up on only delta from the Kafka
change-log
Key optimizations for scaling local state
60X
Faster
0
Downtime
than full checkpointing
 during restarts and recovery
1.2TB
State
Task-1
Container-1
Container-2
Heartbeat
Samza master
Task-2
Durable task– host mapping

Key optimizations for scaling local state
Drawbacks of local state
1.  Size is bounded
2.  Some data is not partitionable
3.  Repartitioning the stream messes up
local state
4.  Not useful for serving results
Hard problems in stream processing
•  Partitioned streams
•  Distribute processing
among all workers
Scaling processing
•  Hardware failures are
inevitable
•  Efficient check-pointing
•  Instant recovery
State management
•  Need primitives for
supporting remote I/O
High performance
remote I/O
•  Running on a multi-tenant
cluster
•  Running as an embedded
library
•  Unified API for batch and
real-time data

Heterogeneous
deployment models
Why remote I/O matters?
1.  Writing to a remote data store (eg: CouchDB) for serving
2.  Some data is available only in the remote store or through REST
3.  Invoking down-stream services from your processor
Scaling remote I/O
1.  Parallelism need not be tied to
number of partitions

2.  Complexity of programming model
3.  Application has to worry about
synchronization and check-pointing
Hard problems

Samza
Scaling remote I/O
1.  Parallelism need not be tied to
number of partitions

2.  Complexity of programming model
3.  Application has to worry about
synchronization and check-pointing



1.  Support out-of-order processing
within a partition

2.  Simple callback-based async API.
Built-in support for both multi-
threading and async processing
3.  Samza handles checkpointing
internally
Hard problems

How we solved them?
Samza
Samza
Performance results
•  PageViewEvent topic - 10 partitions
•  For each event, remote lookup of the
member’s inbox information (latency: 1ms to
200ms)
•  Single node Yarn cluster, 1 container (1 CPU
core), 1GB memory.
Experiment setup
Samza
Performance results
10X
Faster
With In-order processing
40X
Faster
Out-of-order processing
•  PageViewEvent topic - 10 partitions
•  For each event, remote lookup of the
member’s inbox information (latency: 1ms to
200ms)
•  Single node Yarn cluster, 1 container (1 CPU
core), 1GB memory.
At our scale
this is HUGE!
Experiment setup
Hard problems in stream processing
•  Partitioned streams
•  Distribute processing
among all workers
Scaling processing
•  Hardware failures are
inevitable
•  Efficient checkpointing
•  Instant recovery
State management
•  Need primitives for
supporting remote I/O
High performance
remote I/O
•  Running on a multi-tenant
cluster
•  Running as an embedded
library
•  Unified API for batch and
real-time data


Heterogeneous
deployment models
Samza - Write once, Run anywhere
public interface StreamTask {
void process(IncomingMessageEnvelope envelope,
MessageCollector collector,
TaskCoordinator coordinator) {
// process message
}
}

public class MyApp implements StreamApplication {
void init(StreamGraph streamGraph,
Config config) {
MessageStream<PageView> pageViews = ..;
pageViews.filter(myKeyFn)
.map(pageView -> new ProjectedPageView())
.window(Windows.keyedTumblingWindow(keyFn,
Duration.ofSeconds(30))
.sink(outputTopic);
}
}
NEW
Heterogeneity – Different deployment models
•  Uses Yarn for coordination, liveness monitoring
•  Better resource sharing
•  Scale by increasing the number of containers
Samza on multi-tenant clusters Samza as a light-weight library
•  Purely embedded library: No Yarn dependency 
•  Use zoo-keeper for coordination, liveness and
partition distribution
•  Seamless auto-scale by increasing the number
of instances
EXACT SAME
CODE
StreamApplication app = new MyApp();
ApplicationRunner localRunner =
ApplicationRunner.getLocalRunner(config);
localRunner.runApplication(app);
Heterogeneity – Support streaming and batch jobs
•  Supports re-processing and experimentation
•  Process data on hadoop instead of copying over
to Kafka ($$)
•  Job automatically shuts-down when end-of-
stream is reached
Samza on Hadoop Samza on streaming

•  World-class support for streaming inputs
EXACT SAME
CODE
Today’s
agenda
1 Stream processing scenarios
2 Hard problems in stream processing
3 Case Study 1: LinkedIn’s communications platform
4 Case Study 2: Activity tracking in the news feed
5 Conclusion
Craft a clear, consistent conversation between our
members and their professional world
AT C G O A L :
Features
Channel selection
•  “Don’t blast me on all channels”
•  Route through Inmail, email or
notification in app
Features
Channel selection
 Aggregation / Capping
•  “Don’t blast me on all channels”
•  Route through Inmail, email or
notification in app
•  “Don’t flood me, Consolidate
if you have too much to say!”
•  “Here’s a weekly summary of
who invited you to connect”
Features
Channel selection
 Aggregation / Capping
 Delivery time optimization
•  “Don’t blast me on all channels”
•  Route through Inmail, email or
notification in app
•  “Don’t flood me, Consolidate
if you have too much to say!”
•  “Here’s a weekly summary of
who invited you to connect”
•  “Send me when I will engage and
don’t buzz me at 2AM”
Features
Channel selection
 Aggregation / Capping
 Delivery time optimization
•  “Don’t blast me on all channels”
•  Route through Inmail, email or
notification in app
•  “Don’t flood me, Consolidate
if you have too much to say!”
•  “Here’s a weekly summary of
who invited you to connect”
•  “Send me when I will engage and
don’t buzz me at 2AM”
Filter
•  “Filter out spam, duplicates, stuff I
have seen or know about”
Why Apache Samza?
1.  Highly Scalable and distributed 
2.  Multiple sources of email 
3.  Fault tolerant state (notifications
to be scheduled later)
4.  Efficient remote calls to several
services
1.  Samza partitions streams and provides
fault-tolerant processing
2.  Pluggable connector API (Kafka, Hadoop,
change capture)
3.  Instant recovery and incremental
checkpointing
4.  Async I/O support
Requirements
 How Samza fits in?
Why Apache Samza?
1.  Highly Scalable and distributed 
2.  Multiple stream joins
3.  Fault tolerant state (notifications
to be scheduled later)
4.  Efficient remote calls to several
services
Requirements
ATC Architecture
DB change capture of
User Profiles
Task-1
Task-2
Task-3
Container-1
Container-2
ATC
Relevance Scores
from Offline
processing
User events / Feed
Activity/ Network
Activity
Notifications
Service
Email Service
Social graph
Inside each ATC Task
Scheduler
Channel
selector
Aggregator
Delivery
time
optimizer
Filter
Today’s
agenda
1 Stream Processing Scenarios
2 Hard Problems in Stream Processing
3 Case Study 1: LinkedIn’s communications platform
4 Case Study 2: Activity tracking in the news feed
Activity tracking in News feed
HOW WE USE SAMZA TO IMPROVE YOUR NEWS FEED..
Power relevant, fresh content for the LinkedIn Feed
A C T I V I T Y T R A C K I N G : G O A L
Server-side tracking event
Feed Server
Server-side tracking event
Feed Server
Server-side tracking event
Feed Server
{
“paginationId”:
“feedUpdates”: [{
“updateUrn”: “update1”
“trackingId”:
“position”:
“creationTime”:
“numLikes”:
“numComments”:
“comments”: [
{“commentId”: }
]
}, {
“updateUrn”:“update2”
“trackingId”:
..
}

Rich payload
Client-side tracking event
Client-side tracking event
•  Track user activity in the app
•  Fire timer during a scroll or a view-
port change
Client-side tracking event
“feedImpression”: {
“urn”:
“trackingId”:
“duration”:
“height”:
}, {
“urn”:
“trackingId”:
} ,]

•  Light pay load
•  Bandwidth friendly
•  Battery friendly
Join client-side event with server-side event
“FEED_IMPRESSION”: {
“trackingId”: “abc”
“duration”: “100”
“height”: “50”
}, {
“trackingId”:
} ,]

“FEED_SERVED”: {
“paginationId”:
“feedUpdates”: [{
“updateUrn”: “update1”
“trackingId”: “abc”
“position”:
“creationTime”:
“numLikes”:
“numComments”:
“comments”: [
{“commentId”: }
]
}, {
“updateUrn”:“update2”
“trackingId”: “def”
“creationTime”:
“FEED_IMPRESSION”: {
“trackingId”: “abc”
“duration”: “100”
“height”: “50”
}, {
“trackingId”:
} ,]

“FEED_SERVED”: {
“paginationId”:
“feedUpdates”: [{
“updateUrn”: “update1”
“trackingId”: “abc”
“position”:
“creationTime”:
“numLikes”:
“numComments”:
“comments”: [
{“commentId”: }
]
}, {
“updateUrn”:“update2”
“trackingId”: “def”
“creationTime”:

“JOINED_EVENT”:{
“paginationId”:
“feedUpdates”: [{
“updateUrn”: “update1”
“trackingId”: “abc”
“duration”: “100”
“height”: “50”
“position”:
“creationTime”:
“numLikes”:
“numComments”:
“comments”: [
{“commentId”: }
]
}
}

Samza Joins client-side event with server-side event
Architecture
Task-2
Container-2
CLIENT EVENT
Feed Server
SERVER EVENT
FEED JOINER
Task-1
Container-1
To Downstream 
Ranking Systems
1.2
Billion
Processed per-day
90
Containers
Distributed, Partitioned
Recap
1 Stream processing scenarios
2 Hard problems in stream processing
3 Case Study 1: LinkedIn’s communications platform
4 Case Study 2: Activity tracking in the news feed
Recap
SUMMARY OF WHAT WE LEARNT IN THIS TALK
Key differentiators for Apache Samza
Key differentiators for Apache Samza
•  Stream Processing both as a multi-tenant service and a light-weight
embedded library
•  No micro batching (first class streaming)
•  Unified processing of batch and streaming data
•  Efficient remote calls I/O using built-in async mode
•  World-class support for scalable local state
•  Incremental check-pointing
•  Instant restore with zero down-time
•  Low level API and Stream based high level API (DSL)
Coming soon
•  Event time, and out of order arrival handling
•  Beam API integration
•  SQL on streams
Powered by
S T A B L E , M A T U R E
S T R E A M P R O C E S S I N G
F R A M E W O R K 
200+
Applications at
LinkedIn
+
Q & A
http://samza.apache.org

More Related Content

What's hot

MariaDB High Availability Webinar
MariaDB High Availability WebinarMariaDB High Availability Webinar
MariaDB High Availability WebinarMariaDB plc
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
MariaDB High Availability
MariaDB High AvailabilityMariaDB High Availability
MariaDB High AvailabilityMariaDB plc
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NETDavid Giard
 
Why Distributed Databases?
Why Distributed Databases?Why Distributed Databases?
Why Distributed Databases?Sargun Dhillon
 
Continuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data ManagementContinuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data Managementguest2e11e8
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Actor-based concurrency in a modern Java Enterprise
Actor-based concurrency in a modern Java EnterpriseActor-based concurrency in a modern Java Enterprise
Actor-based concurrency in a modern Java EnterpriseAlexander Lukyanchikov
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupJeff Holoman
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centersMariaDB plc
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowTodd Palino
 
AKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs
 
Riding the Stream Processing Wave (Strange loop 2019)
Riding the Stream Processing Wave (Strange loop 2019)Riding the Stream Processing Wave (Strange loop 2019)
Riding the Stream Processing Wave (Strange loop 2019)Samarth Shetty
 
VMworld 2014: Virtualize Active Directory, the Right Way!
VMworld 2014: Virtualize Active Directory, the Right Way!VMworld 2014: Virtualize Active Directory, the Right Way!
VMworld 2014: Virtualize Active Directory, the Right Way!VMworld
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is uselessAdrian Cockcroft
 
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Microsoft Technet France
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoopLen Bass
 

What's hot (19)

MariaDB High Availability Webinar
MariaDB High Availability WebinarMariaDB High Availability Webinar
MariaDB High Availability Webinar
 
Five steps perform_2013
Five steps perform_2013Five steps perform_2013
Five steps perform_2013
 
Akka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive PlatformAkka 2.4 plus commercial features in Typesafe Reactive Platform
Akka 2.4 plus commercial features in Typesafe Reactive Platform
 
MariaDB High Availability
MariaDB High AvailabilityMariaDB High Availability
MariaDB High Availability
 
Scaling habits of ASP.NET
Scaling habits of ASP.NETScaling habits of ASP.NET
Scaling habits of ASP.NET
 
Why Distributed Databases?
Why Distributed Databases?Why Distributed Databases?
Why Distributed Databases?
 
Intro to Databases
Intro to DatabasesIntro to Databases
Intro to Databases
 
Continuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data ManagementContinuent Tungsten - Scalable Saa S Data Management
Continuent Tungsten - Scalable Saa S Data Management
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Actor-based concurrency in a modern Java Enterprise
Actor-based concurrency in a modern Java EnterpriseActor-based concurrency in a modern Java Enterprise
Actor-based concurrency in a modern Java Enterprise
 
Kafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User GroupKafka Reliability Guarantees ATL Kafka User Group
Kafka Reliability Guarantees ATL Kafka User Group
 
Running MariaDB in multiple data centers
Running MariaDB in multiple data centersRunning MariaDB in multiple data centers
Running MariaDB in multiple data centers
 
URP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to KnowURP? Excuse You! The Three Kafka Metrics You Need to Know
URP? Excuse You! The Three Kafka Metrics You Need to Know
 
AKUDA Labs: Pulsar
AKUDA Labs: PulsarAKUDA Labs: Pulsar
AKUDA Labs: Pulsar
 
Riding the Stream Processing Wave (Strange loop 2019)
Riding the Stream Processing Wave (Strange loop 2019)Riding the Stream Processing Wave (Strange loop 2019)
Riding the Stream Processing Wave (Strange loop 2019)
 
VMworld 2014: Virtualize Active Directory, the Right Way!
VMworld 2014: Virtualize Active Directory, the Right Way!VMworld 2014: Virtualize Active Directory, the Right Way!
VMworld 2014: Virtualize Active Directory, the Right Way!
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
 
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
Exchange 2013 Haute disponibilité et tolérance aux sinistres (Session 1/2 pre...
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoop
 

Similar to ApacheCon BigData - What it takes to process a trillion events a day?

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Productionconfluent
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesJosef Adersberger
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesQAware GmbH
 
Reactive programming
Reactive programmingReactive programming
Reactive programmingSUDIP GHOSH
 
Lambda-less stream processing - linked in
Lambda-less stream processing - linked inLambda-less stream processing - linked in
Lambda-less stream processing - linked inYi Pan
 
Samza tech talk_2015 - strata
Samza tech talk_2015 - strataSamza tech talk_2015 - strata
Samza tech talk_2015 - strataYi Pan
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analyticsamesar0
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexApache Apex
 

Similar to ApacheCon BigData - What it takes to process a trillion events a day? (20)

Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Streaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in ProductionStreaming in Practice - Putting Apache Kafka in Production
Streaming in Practice - Putting Apache Kafka in Production
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Patterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to KubernetesPatterns and Pains of Migrating Legacy Applications to Kubernetes
Patterns and Pains of Migrating Legacy Applications to Kubernetes
 
Reactive programming
Reactive programmingReactive programming
Reactive programming
 
Lambda-less stream processing - linked in
Lambda-less stream processing - linked inLambda-less stream processing - linked in
Lambda-less stream processing - linked in
 
Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn Lambda-less Stream Processing @Scale in LinkedIn
Lambda-less Stream Processing @Scale in LinkedIn
 
Samza tech talk_2015 - strata
Samza tech talk_2015 - strataSamza tech talk_2015 - strata
Samza tech talk_2015 - strata
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Ingestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
 

Recently uploaded

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 

Recently uploaded (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 

ApacheCon BigData - What it takes to process a trillion events a day?

  • 1. Processing over a trillion events a day CASE STUDIES IN SCALING STREAM PROCESSING AT LINKEDIN
  • 2. Processing over a trillion events a day CASE STUDIES IN SCALING STREAM PROCESSING AT LINKEDIN ​ Jagadish Venkatraman ​ Sr. Software Engineer, LinkedIn ​ Apache Samza committer ​ jagadish@apache.org
  • 3. Today’s Agenda 1 Stream processing scenarios 2 Hard problems in stream processing 3 Case Study 1: LinkedIn’s communications platform 4 Case Study 2: Activity tracking in the News feed
  • 4. Data processing latencies Response latency Synchronous Later. Possibly much later. 0 millis
  • 5. Data processing latencies Response latency Milliseconds to minutes Synchronous Later. Possibly much later. 0 millis
  • 6. Some stream processing scenarios at LinkedIn Real-time DDoS protection for members Security
  • 7. Some stream processing scenarios at LinkedIn Real-time DDoS protection for members Security Notifications to members Notifications
  • 8. Some stream processing scenarios at LinkedIn Real-time DDoS protection for members Security Notifications to members Notifications Real-time topic tagging of articles News classification
  • 9. Some stream processing scenarios at LinkedIn Real-time DDoS protection for members Security Notifications to members Notifications Real-time topic tagging of articles News classification Real-time site-speed profiling by facets Performance monitoring
  • 10. Some stream processing scenarios at LinkedIn Real-time DDoS protection for members Security Analysis of service calls Call-graph computation Notifications to members Notifications Real-time topic tagging of articles News classification Real-time site-speed profiling by facets Performance monitoring
  • 11. Some stream processing scenarios at LinkedIn Tracking ads that were clicked Ad relevance
  • 12. Tracking ads that were clicked Ad relevance Aggregated, real- time counts by dimensions Aggregates Some stream processing scenarios at LinkedIn
  • 13. Tracking ads that were clicked Ad relevance Aggregated, real- time counts by dimensions Aggregates Tracking session duration View-port tracking Some stream processing scenarios at LinkedIn
  • 14. Tracking ads that were clicked Ad relevance Aggregated, real- time counts by dimensions Aggregates Tracking session duration View-port tracking Standardizing titles, gender, education Profile standardization Some stream processing scenarios at LinkedIn ….
  • 15. A N D M A N Y M O R E S C E N A R I O S I N P R O D U C T I O N … 200+ Applications
  • 16. A N D M A N Y M O R E S C E N A R I O S I N P R O D U C T I O N … 200+ Applications Powered By
  • 17. Samza overview •  A distributed stream processing framework •  2012: Development at LinkedIn •  Used at Netflix, Uber, Tripadvisor and several large companies. •  2013: Apache incubator •  2015 (Jan): Apache Top Level Project
  • 18. Today’s agenda 1 Stream processing scenarios 2 Hard problems in stream processing 3 Case Study 1: LinkedIn’s communications platform 4 Case Study 2: Activity tracking in the news feed
  • 19. Hard problems in stream processing •  Partitioned streams •  Distribute processing among all workers Scaling processing •  Hardware failures are inevitable •  Efficient check-pointing •  Instant recovery State management •  Need primitives for supporting remote I/O High performance remote I/O •  Running on a multi-tenant cluster •  Running as an embedded library •  Unified API for batch and real-time data Heterogeneous deployment models
  • 20. Partitioned processing model DB change capture Kafka Hadoop Kafka Hadoop Serving stores Elastic Search Samza application
  • 21. Partitioned processing model DB change capture Hadoop Task-1 Task-2 Task-3 Container-1 Container-2 Kafka Heartbeat Samza master •  Each task processes one or more partitions •  The Samza master assigns partitions to tasks and monitors container liveness •  Scaling by increasing # of containers Samza application
  • 22. Hard problems in stream processing •  Partitioned streams •  Distribute processing among all workers Partitioned processing •  Hardware failures are inevitable •  Instant recovery •  Efficient Check-pointing State Management •  Provides primitives for supporting efficient remote I/O. Efficient remote data access •  Samza supports running on a multi-tenant cluster •  Samza can also run as a light-weight embedded library •  Unified API for batch and streaming Heterogeneous deployment models
  • 23. •  Stateless versus Stateful operations Use cases of Local state •  Temporary data storage •  Adjunct data lookups Advantages •  Richer access patterns, faster than remote state Why local state matters?
  • 24. •  Querying the remote DB from your app versus local embedded DB access •  Samza provides an embedded fault- tolerant, persistent database. Samza Samza Architecture for Temporary data VERSUS
  • 25. •  Querying the remote DB from your app versus change capture from database. •  Turn remote I/O to local I/O •  Bootstrap streams Samza Samza Architecture for Adjunct lookups VERSUS
  • 26. •  Querying the remote DB from your app versus change capture from database. •  Bootstrap and partition the remote DB from the stream Samza Samza Architecture for Adjunct lookups VERSUS 100X Faster 1.1M TPS
  • 27. Task-1 Container-1 Container-2 Task-2 •  Changelog 1.  Fault tolerant, replicated into Kafka. 2.  Ability to catch up from arbitrary point in the log. 3.  State restore from Kafka during host failures Key optimizations for scaling local state
  • 28. Speed thrills but can also KILL.
  • 29. Task-1 Container-1 Container-2 Heartbeat Samza master Task-2 Durable task– host mapping •  Incremental state check-pointing 1.  Re-use on-disk state snapshot (host affinity) 2.  Write on-disk file on host at checkpoint 3.  Catch-up on only delta from the Kafka change-log Key optimizations for scaling local state
  • 30. Task-1 Container-1 Container-2 Heartbeat Samza master Task-2 Durable task– host mapping •  Incremental state check-pointing 1.  Re-use on-disk state snapshot (host affinity) 2.  Write on-disk file on host at checkpoint 3.  Catch-up on only delta from the Kafka change-log Key optimizations for scaling local state 60X Faster 0 Downtime than full checkpointing during restarts and recovery 1.2TB State
  • 31. Task-1 Container-1 Container-2 Heartbeat Samza master Task-2 Durable task– host mapping Key optimizations for scaling local state Drawbacks of local state 1.  Size is bounded 2.  Some data is not partitionable 3.  Repartitioning the stream messes up local state 4.  Not useful for serving results
  • 32. Hard problems in stream processing •  Partitioned streams •  Distribute processing among all workers Scaling processing •  Hardware failures are inevitable •  Efficient check-pointing •  Instant recovery State management •  Need primitives for supporting remote I/O High performance remote I/O •  Running on a multi-tenant cluster •  Running as an embedded library •  Unified API for batch and real-time data Heterogeneous deployment models
  • 33. Why remote I/O matters? 1.  Writing to a remote data store (eg: CouchDB) for serving 2.  Some data is available only in the remote store or through REST 3.  Invoking down-stream services from your processor
  • 34. Scaling remote I/O 1.  Parallelism need not be tied to number of partitions 2.  Complexity of programming model 3.  Application has to worry about synchronization and check-pointing Hard problems Samza
  • 35. Scaling remote I/O 1.  Parallelism need not be tied to number of partitions 2.  Complexity of programming model 3.  Application has to worry about synchronization and check-pointing 1.  Support out-of-order processing within a partition 2.  Simple callback-based async API. Built-in support for both multi- threading and async processing 3.  Samza handles checkpointing internally Hard problems How we solved them? Samza
  • 36. Samza Performance results •  PageViewEvent topic - 10 partitions •  For each event, remote lookup of the member’s inbox information (latency: 1ms to 200ms) •  Single node Yarn cluster, 1 container (1 CPU core), 1GB memory. Experiment setup
  • 37. Samza Performance results 10X Faster With In-order processing 40X Faster Out-of-order processing •  PageViewEvent topic - 10 partitions •  For each event, remote lookup of the member’s inbox information (latency: 1ms to 200ms) •  Single node Yarn cluster, 1 container (1 CPU core), 1GB memory. At our scale this is HUGE! Experiment setup
  • 38. Hard problems in stream processing •  Partitioned streams •  Distribute processing among all workers Scaling processing •  Hardware failures are inevitable •  Efficient checkpointing •  Instant recovery State management •  Need primitives for supporting remote I/O High performance remote I/O •  Running on a multi-tenant cluster •  Running as an embedded library •  Unified API for batch and real-time data Heterogeneous deployment models
  • 39. Samza - Write once, Run anywhere public interface StreamTask { void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) { // process message } } public class MyApp implements StreamApplication { void init(StreamGraph streamGraph, Config config) { MessageStream<PageView> pageViews = ..; pageViews.filter(myKeyFn) .map(pageView -> new ProjectedPageView()) .window(Windows.keyedTumblingWindow(keyFn, Duration.ofSeconds(30)) .sink(outputTopic); } } NEW
  • 40. Heterogeneity – Different deployment models •  Uses Yarn for coordination, liveness monitoring •  Better resource sharing •  Scale by increasing the number of containers Samza on multi-tenant clusters Samza as a light-weight library •  Purely embedded library: No Yarn dependency •  Use zoo-keeper for coordination, liveness and partition distribution •  Seamless auto-scale by increasing the number of instances EXACT SAME CODE StreamApplication app = new MyApp(); ApplicationRunner localRunner = ApplicationRunner.getLocalRunner(config); localRunner.runApplication(app);
  • 41. Heterogeneity – Support streaming and batch jobs •  Supports re-processing and experimentation •  Process data on hadoop instead of copying over to Kafka ($$) •  Job automatically shuts-down when end-of- stream is reached Samza on Hadoop Samza on streaming •  World-class support for streaming inputs EXACT SAME CODE
  • 42. Today’s agenda 1 Stream processing scenarios 2 Hard problems in stream processing 3 Case Study 1: LinkedIn’s communications platform 4 Case Study 2: Activity tracking in the news feed 5 Conclusion
  • 43. Craft a clear, consistent conversation between our members and their professional world AT C G O A L :
  • 44. Features Channel selection •  “Don’t blast me on all channels” •  Route through Inmail, email or notification in app
  • 45. Features Channel selection Aggregation / Capping •  “Don’t blast me on all channels” •  Route through Inmail, email or notification in app •  “Don’t flood me, Consolidate if you have too much to say!” •  “Here’s a weekly summary of who invited you to connect”
  • 46. Features Channel selection Aggregation / Capping Delivery time optimization •  “Don’t blast me on all channels” •  Route through Inmail, email or notification in app •  “Don’t flood me, Consolidate if you have too much to say!” •  “Here’s a weekly summary of who invited you to connect” •  “Send me when I will engage and don’t buzz me at 2AM”
  • 47. Features Channel selection Aggregation / Capping Delivery time optimization •  “Don’t blast me on all channels” •  Route through Inmail, email or notification in app •  “Don’t flood me, Consolidate if you have too much to say!” •  “Here’s a weekly summary of who invited you to connect” •  “Send me when I will engage and don’t buzz me at 2AM” Filter •  “Filter out spam, duplicates, stuff I have seen or know about”
  • 48. Why Apache Samza? 1.  Highly Scalable and distributed 2.  Multiple sources of email 3.  Fault tolerant state (notifications to be scheduled later) 4.  Efficient remote calls to several services 1.  Samza partitions streams and provides fault-tolerant processing 2.  Pluggable connector API (Kafka, Hadoop, change capture) 3.  Instant recovery and incremental checkpointing 4.  Async I/O support Requirements How Samza fits in?
  • 49. Why Apache Samza? 1.  Highly Scalable and distributed 2.  Multiple stream joins 3.  Fault tolerant state (notifications to be scheduled later) 4.  Efficient remote calls to several services Requirements
  • 50. ATC Architecture DB change capture of User Profiles Task-1 Task-2 Task-3 Container-1 Container-2 ATC Relevance Scores from Offline processing User events / Feed Activity/ Network Activity Notifications Service Email Service Social graph
  • 51. Inside each ATC Task Scheduler Channel selector Aggregator Delivery time optimizer Filter
  • 52. Today’s agenda 1 Stream Processing Scenarios 2 Hard Problems in Stream Processing 3 Case Study 1: LinkedIn’s communications platform 4 Case Study 2: Activity tracking in the news feed
  • 53. Activity tracking in News feed HOW WE USE SAMZA TO IMPROVE YOUR NEWS FEED..
  • 54. Power relevant, fresh content for the LinkedIn Feed A C T I V I T Y T R A C K I N G : G O A L
  • 57. Server-side tracking event Feed Server { “paginationId”: “feedUpdates”: [{ “updateUrn”: “update1” “trackingId”: “position”: “creationTime”: “numLikes”: “numComments”: “comments”: [ {“commentId”: } ] }, { “updateUrn”:“update2” “trackingId”: .. } Rich payload
  • 59. Client-side tracking event •  Track user activity in the app •  Fire timer during a scroll or a view- port change
  • 60. Client-side tracking event “feedImpression”: { “urn”: “trackingId”: “duration”: “height”: }, { “urn”: “trackingId”: } ,] •  Light pay load •  Bandwidth friendly •  Battery friendly
  • 61. Join client-side event with server-side event “FEED_IMPRESSION”: { “trackingId”: “abc” “duration”: “100” “height”: “50” }, { “trackingId”: } ,] “FEED_SERVED”: { “paginationId”: “feedUpdates”: [{ “updateUrn”: “update1” “trackingId”: “abc” “position”: “creationTime”: “numLikes”: “numComments”: “comments”: [ {“commentId”: } ] }, { “updateUrn”:“update2” “trackingId”: “def” “creationTime”:
  • 62. “FEED_IMPRESSION”: { “trackingId”: “abc” “duration”: “100” “height”: “50” }, { “trackingId”: } ,] “FEED_SERVED”: { “paginationId”: “feedUpdates”: [{ “updateUrn”: “update1” “trackingId”: “abc” “position”: “creationTime”: “numLikes”: “numComments”: “comments”: [ {“commentId”: } ] }, { “updateUrn”:“update2” “trackingId”: “def” “creationTime”: “JOINED_EVENT”:{ “paginationId”: “feedUpdates”: [{ “updateUrn”: “update1” “trackingId”: “abc” “duration”: “100” “height”: “50” “position”: “creationTime”: “numLikes”: “numComments”: “comments”: [ {“commentId”: } ] } } Samza Joins client-side event with server-side event
  • 63. Architecture Task-2 Container-2 CLIENT EVENT Feed Server SERVER EVENT FEED JOINER Task-1 Container-1 To Downstream Ranking Systems 1.2 Billion Processed per-day 90 Containers Distributed, Partitioned
  • 64. Recap 1 Stream processing scenarios 2 Hard problems in stream processing 3 Case Study 1: LinkedIn’s communications platform 4 Case Study 2: Activity tracking in the news feed
  • 65. Recap SUMMARY OF WHAT WE LEARNT IN THIS TALK
  • 66. Key differentiators for Apache Samza
  • 67. Key differentiators for Apache Samza •  Stream Processing both as a multi-tenant service and a light-weight embedded library •  No micro batching (first class streaming) •  Unified processing of batch and streaming data •  Efficient remote calls I/O using built-in async mode •  World-class support for scalable local state •  Incremental check-pointing •  Instant restore with zero down-time •  Low level API and Stream based high level API (DSL)
  • 68. Coming soon •  Event time, and out of order arrival handling •  Beam API integration •  SQL on streams
  • 69. Powered by S T A B L E , M A T U R E S T R E A M P R O C E S S I N G F R A M E W O R K 200+ Applications at LinkedIn