Harvesting the Power of Samza in News Feed
Providing fresh and relevant content to hundreds of millions of members
A Few Things Mentioned Here
Prerequisites
1 Samza
2 RocksDB (a key-value store)
3 SerDe (Serializer/Deserializer)
4 Kafka (a distributed messaging system)
5 Java
The Challenge
Relevant content is a great way to stay informed about your
professional interests; fresh, relevant content is even better!
How do we keep track of what
hundreds of millions of members
viewed on their News Feeds?
Tracking
News Feed is the Landing Page for Most Members
Scale
Source: investors.linkedin.com | 1 as of quarter end | 2 monthly average during the quarter
Client-Side Tracking
• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic
Server-Side Tracking
• Events that have more data about served feeds
• Rich payload
• Events end up in a Kafka topic
Improving Member Experience Using Samza
(Overview)
1 Join input streams
A stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream
2 Purge stale events
A custom TTL mechanism reaps stale events every n seconds
3 Consume output stream
Convert the rich data about impressions into machine learning features used for ranking items in the News Feed
1 Join
Overview
[Diagram: Client Events → Process Client Events; Server Events → Process Server Events; both processors → Output Events]
Client-Side Events Processor Overview
ID in server-side events store?
• Yes: match events, then output to Kafka
• No: store (ID, const.)
Client-Side Events Processor: Optimizations
• Initial capacity of matches map (event, matched IDs) is determined by a metric (GC-friendly)
• Initial capacity of value set is equal to |IDs|
• An empty byte array is used as a dummy value for IDs to store in RocksDB (passes through the NOP byte array SerDe), so the store acts as a set
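The dummy-value trick in the last bullet can be sketched as follows. This is a minimal illustration, not the job's code: the class name IdSetSketch is hypothetical, and a HashMap stands in for the RocksDB-backed store.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: a key-value store used as a set by pairing every key with one shared empty value. */
class IdSetSketch {
    // One shared empty array: nothing to allocate per entry and nothing for a SerDe to encode.
    private static final byte[] PRESENT = new byte[0];

    // Assumption: an in-memory map stands in for the RocksDB store.
    private final Map<String, byte[]> store = new HashMap<>();

    /** Records that the ID was seen. */
    void add(String id) {
        store.put(id, PRESENT);
    }

    /** An ID is in the set exactly when the store holds any value for it. */
    boolean contains(String id) {
        return store.get(id) != null;
    }

    /** Forgets the ID. */
    void remove(String id) {
        store.remove(id);
    }
}
```

Only the keys carry information here, so the value can be as small as the store allows.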
Server-Side Events Processor Overview
ID in client-side events store?
• Yes: match events, then output to Kafka
• No: store (ID, event)
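The two processor flows mirror each other, and together they form the join. The sketch below shows that symmetry under loud assumptions: FeedJoinSketch and its members are hypothetical names, events are plain strings, in-memory collections stand in for the RocksDB stores, and a list stands in for the output Kafka stream. Whether matched entries are removed right away is not stated in the slides; this sketch leaves them in place.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a two-sided stream-stream join keyed by ID. */
class FeedJoinSketch {
    // Assumption: the client-side store only needs to remember which IDs were seen.
    private final Set<String> clientIds = new HashSet<>();
    // Assumption: the server-side store maps each ID to the rich event that carried it.
    private final Map<String, String> serverEvents = new HashMap<>();
    // Stand-in for the output Kafka stream.
    final List<String> output = new ArrayList<>();

    /** Client-side event: emit a match if the server side already arrived, else store the ID. */
    void processClient(String id) {
        String serverEvent = serverEvents.get(id);
        if (serverEvent != null) {
            output.add(id + ":" + serverEvent);
        } else {
            clientIds.add(id);
        }
    }

    /** Server-side event: emit a match if the client side already arrived, else store the event. */
    void processServer(String id, String event) {
        if (clientIds.contains(id)) {
            output.add(id + ":" + event);
        } else {
            serverEvents.put(id, event);
        }
    }
}
```

Either arrival order produces exactly one output record per matched ID, which is why each side checks the other side's store before writing to its own.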
Event Anatomy
• Header (shared event data)
• List of payloads (one for each item)
• Each payload has a join key (ID)
[Figure: one event with shared event data (e.g. member ID) and eight payloads, IDs 123, 456, 789, 012, 345, 678, 901, 234]
Server-Side Events Storage
[Figure: one server-side event (shared event data, e.g. member ID, plus payloads with IDs 123, 456, 789, 012, 345, 678, 901, 234); every ID is paired with the same UUID: 1A67343FE…83B]
ManyKeysToOneValueStore<K, V>
Server-Side Events Storage
• Space-efficient
• Insertion is transactional
• Rolling back a transaction is best-effort
• Requires an additional lookup (but it’s worth it)
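One way to read these bullets is a two-level store: every key points to a shared handle, and the handle points to the single stored value. The sketch below follows that reading. It is an assumption about the design, not the actual ManyKeysToOneValueStore code; the class name, the use of a random UUID as the handle, and the in-memory maps are all illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Sketch: many keys share one stored value through an intermediate handle. */
class ManyKeysToOneValueSketch<K, V> {
    // Assumption: in-memory maps stand in for the two underlying key-value stores.
    private final Map<K, UUID> keyToHandle = new HashMap<>();
    private final Map<UUID, V> handleToValue = new HashMap<>();

    /** Stores the value once and points every key at it; undoes partial work if a write fails. */
    void putAll(List<K> keys, V value) {
        UUID handle = UUID.randomUUID();
        try {
            handleToValue.put(handle, value);
            for (K key : keys) {
                keyToHandle.put(key, handle);
            }
        } catch (RuntimeException e) {
            // Best-effort rollback: remove only the entries this call wrote.
            for (K key : keys) {
                keyToHandle.remove(key, handle);
            }
            handleToValue.remove(handle);
            throw e;
        }
    }

    /** Two lookups: key to handle, then handle to value. Returns null for an unknown key. */
    V get(K key) {
        UUID handle = keyToHandle.get(key);
        return handle == null ? null : handleToValue.get(handle);
    }
}
```

The extra lookup in get is the cost of storing the value once instead of once per key.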
Event Matching
Client-Side Event: IDs 789, 012, 345, 678, 901, 234, 123, 456
Server-Side Event A: IDs 111, 456, 906, 678, …, 901, 431, 746
Server-Side Event B: IDs 234, 012, 123, 100, …, 313, 345, 333
Output Event A: IDs 901, 456, 678
Output Event B: IDs 012, 123, 345, 234
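The matching step shown above amounts to grouping a client-side event's IDs by the server-side event that contains them. A minimal sketch, assuming events are identified by plain strings and that the lookup from ID to server-side event is already available as a map (EventMatchingSketch and match are hypothetical names):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: group the IDs of one client-side event by the server-side event they belong to. */
class EventMatchingSketch {
    /**
     * @param clientIds       IDs carried by one client-side event
     * @param idToServerEvent lookup from ID to the server-side event containing it (assumed given)
     * @return one entry per matched server-side event, listing the matched IDs in client order
     */
    static Map<String, List<String>> match(List<String> clientIds,
                                           Map<String, String> idToServerEvent) {
        Map<String, List<String>> matches = new LinkedHashMap<>();
        for (String id : clientIds) {
            String serverEvent = idToServerEvent.get(id);
            if (serverEvent != null) {
                // First match for this server event creates its list; later ones append.
                matches.computeIfAbsent(serverEvent, e -> new ArrayList<>()).add(id);
            }
            // IDs with no server-side counterpart are simply not part of any output event.
        }
        return matches;
    }
}
```

Each map entry corresponds to one output event: the server-side event plus the subset of its IDs that the member actually viewed.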
Key-Value Store Contributions to Samza [SAMZA-647]
• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than repeated get calls
• Added that support to Samza's KeyValueStore
• Perf test results confirm RocksDB's own results (with caching disabled)
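The shape of such a batch read can be sketched with a small interface. BatchStoreSketch below is a hypothetical stand-in, not Samza's KeyValueStore: its default getAll falls back to one get per key, and the comment marks where a RocksDB-backed implementation could issue a single multi-get instead.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: a store interface with a batch read that implementations may override. */
interface BatchStoreSketch<K, V> {
    /** Single-key read; returns null when the key is absent. */
    V get(K key);

    /**
     * Fallback batch read: one get per key. In this sketch, missing keys map to null.
     * Assumption: a RocksDB-backed implementation would override this with one multi-get call.
     */
    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new HashMap<>(keys.size());
        for (K key : keys) {
            result.put(key, get(key));
        }
        return result;
    }
}
```

Callers that look up every ID of an event at once only depend on getAll, so a faster implementation can be swapped in without touching them.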
2 TTL
Custom TTL Mechanism
• Records the timestamp of when an event was stored
• The “death row” store: key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur:
Linear Probing Timestamper
Generate timestamp
• Bucket is free: put(timestamp, ID)
• Bucket is taken: try again while attempts <= max; stop once attempts > max
• TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window
• Keeping it simple and stupid works
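A minimal sketch of a linear probing timestamper follows. Several details are assumptions because the slides do not state them: probing moves forward one millisecond at a time, giving up returns -1, and a TreeMap stands in for the "death row" store. The class and method names are hypothetical.

```java
import java.util.TreeMap;

/** Sketch: resolve timestamp-key collisions by probing the next free millisecond. */
class LinearProbingTimestamperSketch {
    // Assumption: a sorted in-memory map stands in for the "death row" store.
    private final TreeMap<Long, String> deathRow = new TreeMap<>();
    private final int maxAttempts;

    LinearProbingTimestamperSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /**
     * Stores the ID under the first free timestamp at or after the given one.
     * Returns the timestamp actually used, or -1 if all probed buckets were taken
     * (the give-up behavior is an assumption of this sketch).
     */
    long put(long timestamp, String id) {
        for (int attempt = 0; attempt <= maxAttempts; attempt++) {
            long candidate = timestamp + attempt; // assumed probe step: +1 ms per attempt
            if (!deathRow.containsKey(candidate)) {
                deathRow.put(candidate, id);
                return candidate;
            }
        }
        return -1L;
    }
}
```

A probed entry expires a few milliseconds later than its true timestamp, which is harmless when the TTL is minutes or hours and reaping runs on a fixed window.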
Reapers
• Every n seconds:
  • Get death rows (t < now – TTL)
  • For each entry in death row:
    • Remove from core stores
    • Remove from death row
Reapers: Optimizations
• Keys (timestamps) are stored in order
• A range query (0, now – TTL) is much faster than a range scan (testing all values)
• Even though TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)
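Because the keys are ordered timestamps, the reaper only has to visit the stale prefix of the store. The sketch below illustrates this with a TreeMap, whose headMap view plays the role of the range query; ReaperSketch, its fields, and the single core store are hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: reap stale entries with a range query over ordered timestamp keys. */
class ReaperSketch {
    // Assumption: sorted in-memory map stands in for the "death row" store (timestamp -> ID).
    final TreeMap<Long, String> deathRow = new TreeMap<>();
    // Assumption: a single map stands in for the core stores (ID -> event).
    final Map<String, String> coreStore = new HashMap<>();
    private final long ttlMillis;

    ReaperSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Removes every entry with timestamp < now - TTL; returns how many were reaped. */
    int reap(long nowMillis) {
        int reaped = 0;
        // headMap is a live view of keys strictly below the cutoff; newer keys are never visited.
        Iterator<Map.Entry<Long, String>> stale =
                deathRow.headMap(nowMillis - ttlMillis).entrySet().iterator();
        while (stale.hasNext()) {
            String id = stale.next().getValue();
            coreStore.remove(id); // remove from core stores
            stale.remove();       // remove from death row
            reaped++;
        }
        return reaped;
    }
}
```

In a Samza task, a method like reap would be called from the periodic window callback, so the work per call is bounded by the number of stale entries rather than the size of the store.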
Stats
[SAMZA-647] getAll is 23% Faster
RocksDB Get All vs. Get Performance
Timestamp Collision Resolution Metrics
The Most Important Metric
Billions of messages handled by the job every day
Find out more:
©2015 LinkedIn Corporation. All Rights Reserved.
blog.linkedin.com
linkedin.com/in/elgeish
mmahmoud@linkedin.com
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 

Harvesting the Power of Samza in LinkedIn's Feed

  • 1. Harvesting the Power of Samza in News Feed: Providing fresh and relevant content to hundreds of millions of members
  • 2. Prerequisites (a few things mentioned here): (1) Samza; (2) RocksDB (a key-value store); (3) SerDe (Serializer/Deserializer); (4) Kafka (a distributed messaging system); (5) Java
  • 4. Relevant content is a great way to stay informed about your professional interests; fresh relevant content is even better! How do we keep track of what hundreds of millions of members viewed on their News Feeds?
  • 6. Scale: News Feed is the Landing Page for Most Members. Source: investors.linkedin.com (footnote 1: as of quarter end; footnote 2: monthly average during the quarter)
  • 7. Client-Side Tracking: lightweight events that track what the member viewed; tiny payload (bandwidth-friendly); events end up in a Kafka topic
  • 8. Server-Side Tracking: events that have more data about served feeds; rich payload; events end up in a Kafka topic
  • 9. Improving Member Experience Using Samza (Overview): (1) Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream. (2) Purge stale events: a custom TTL mechanism reaps stale events every n seconds. (3) Consume output stream: convert the rich data about impressions into machine learning features used for ranking items in the News Feed.
  • 11. Overview (diagram): Client Events go to Process Client Events and Server Events go to Process Server Events; both processors produce Output Events
  • 12. Client-Side Events Processor Overview (flowchart): ID in server-side events store? Yes: match events and output to Kafka. No: store (ID, const.)
  • 13. Optimizations (Client-Side Events Processor): the initial capacity of the matches map (event, matched IDs) is determined by a metric (GC-friendly); the initial capacity of each value set is equal to |IDs|; an empty byte array is used as a dummy value for IDs stored in RocksDB (it passes through the NOP byte array SerDe), so the store acts as a set
  • 14. Server-Side Events Processor Overview (flowchart): ID in client-side events store? Yes: match events and output to Kafka. No: store (ID, event)
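The two processor flowcharts on slides 12 and 14 can be sketched in Java roughly as follows. This is a minimal in-memory illustration, not the job's actual code: the class name `JoinSketch`, its method names, the string payloads, and the choice to remove an entry as soon as it matches are all assumptions, and plain Java collections stand in for the RocksDB-backed stores and the output Kafka stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch of the two-sided stream-stream join. */
class JoinSketch {
    // Client-side store: join IDs seen on the client first (slide 12: "Store (ID, const.)").
    private final Set<String> clientIds = new HashSet<>();
    // Server-side store: join ID -> rich payload seen on the server first (slide 14: "Store (ID, event)").
    private final Map<String, String> serverPayloads = new HashMap<>();
    // Stand-in for the output Kafka stream.
    final List<String> output = new ArrayList<>();

    /** Client-side processor: match if the server side already stored this ID, else buffer the ID. */
    void onClientId(String id) {
        String payload = serverPayloads.remove(id); // assumption: matched entries are dropped
        if (payload != null) {
            output.add(id + "=" + payload);
        } else {
            clientIds.add(id);
        }
    }

    /** Server-side processor: match if the client side already stored this ID, else buffer the payload. */
    void onServerItem(String id, String payload) {
        if (clientIds.remove(id)) { // assumption: matched entries are dropped
            output.add(id + "=" + payload);
        } else {
            serverPayloads.put(id, payload);
        }
    }
}
```

Whichever side arrives first is buffered, so a match is emitted regardless of arrival order; anything that never matches stays buffered until some purge mechanism removes it.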
  • 15. Event Anatomy: a header (shared event data, e.g. member ID); a list of payloads (one for each item); each payload has a join key (ID), e.g. 123, 456, 789, 012, 345, 678, 901, 234
  • 16. Server-Side Events Storage (diagram): the event (shared event data, e.g. member ID, plus payloads with IDs 123, 456, 789, 012, 345, 678, 901, 234) is shown alongside the UUID 1A67343FE…83B, and each of the eight IDs is shown with that same UUID
  • 17. Server-Side Events Storage with ManyKeysToOneValueStore<K, V>: space-efficient; insertion is transactional; rolling back a transaction is best-effort; requires an additional lookup (but it's worth it)
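One way to picture the many-keys-to-one-value idea from slides 16 and 17 is a pair of maps joined by a generated UUID. The sketch below is an assumption-laden illustration, not the real `ManyKeysToOneValueStore`: the class name `ManyKeysToOneValueSketch`, the `putAll`/`get` methods, and the use of `HashMap` in place of RocksDB are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch: many keys share one stored value through a UUID handle. */
class ManyKeysToOneValueSketch<K, V> {
    private final Map<K, UUID> keyToHandle = new HashMap<>();
    private final Map<UUID, V> handleToValue = new HashMap<>();

    /** Stores the value once and points every key at it; undoes its writes if a write fails. */
    void putAll(List<K> keys, V value) {
        UUID handle = UUID.randomUUID();
        handleToValue.put(handle, value);
        List<K> written = new ArrayList<>();
        try {
            for (K key : keys) {
                keyToHandle.put(key, handle);
                written.add(key);
            }
        } catch (RuntimeException e) {
            // Best-effort rollback: previous mappings of overwritten keys are not restored.
            for (K key : written) {
                keyToHandle.remove(key, handle);
            }
            handleToValue.remove(handle);
            throw e;
        }
    }

    /** Two lookups: key -> handle, then handle -> value. */
    V get(K key) {
        UUID handle = keyToHandle.get(key);
        return handle == null ? null : handleToValue.get(handle);
    }
}
```

The value is stored once no matter how many keys point at it, which is where the space saving comes from; the price is the second lookup in `get`.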
  • 18. Event Matching (example): Client-Side Event IDs: 789, 012, 345, 678, 901, 234, 123, 456. Server-Side Event A IDs: 111, 456, 906, 678, …, 901, 431, 746. Server-Side Event B IDs: 234, 012, 123, 100, …, 313, 345, 333. Output Event A IDs: 901, 456, 678. Output Event B IDs: 012, 123, 345, 234.
  • 19. Contributions to Samza: [SAMZA-647] Key-Value Store. The access pattern is getAll(List<K>); RocksDB supports multiGet, which is faster than get; that support was added to Samza's KeyValueStore; perf test results confirm those of RocksDB (with caching disabled)
  • 21. Custom TTL Mechanism: records the timestamp of when an event was stored; the “death row” store: key is the timestamp and the value is an ID; because the key is a timestamp, collisions occur. Flowchart labels: Generate timestamp; Bucket is taken / Bucket is free; Attempts <= max / Attempts > max; put(timestamp, ID)
  • 22. Linear Probing Timestamper: TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window. Keeping it simple and stupid works.
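The collision handling on slides 21 and 22 can be illustrated with a small linear-probing sketch. Everything named here is hypothetical (`LinearProbingTimestamperSketch`, `put`, `maxAttempts`), a `TreeMap` stands in for the "death row" store, and giving up once the attempts run out is an assumption: the slides do not say what happens on that branch.

```java
import java.util.OptionalLong;
import java.util.TreeMap;

/** Hypothetical sketch: find a free timestamp bucket by probing t, t+1, t+2, ... */
class LinearProbingTimestamperSketch {
    // "Death row" stand-in: timestamp -> ID, kept in key order.
    private final TreeMap<Long, String> deathRow = new TreeMap<>();
    private final int maxAttempts;

    LinearProbingTimestamperSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Returns the bucket actually used, or empty if every probed bucket was taken. */
    OptionalLong put(long nowMillis, String id) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            long candidate = nowMillis + attempt;
            // putIfAbsent returns null only when the bucket was free and is now ours.
            if (deathRow.putIfAbsent(candidate, id) == null) {
                return OptionalLong.of(candidate);
            }
        }
        return OptionalLong.empty(); // assumption: give up after maxAttempts probes
    }

    int size() {
        return deathRow.size();
    }
}
```

Nudging a timestamp forward by a few milliseconds only delays that entry's expiry by the same amount, which fits the slide's point that TTL precision is not critical.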
  • 23. Reapers: every n seconds, get death rows (t < now - TTL); for each entry in death row, remove it from the core stores, then remove it from death row
  • 24. Optimizations (Reapers): keys (timestamps) are stored in order; a range query (0, now - TTL) is much faster than a range scan (testing all values); even though TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)
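The reaper loop from slides 23 and 24 relies on the death-row keys being ordered, so that stale entries form one contiguous range. The following is a minimal sketch under assumptions: `ReaperSketch`, `store`, and `reap` are hypothetical names, a `TreeMap` stands in for the ordered RocksDB store, and a single `HashMap` stands in for the core stores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch: purge everything older than now - TTL with one ordered range query. */
class ReaperSketch {
    // Death row stand-in: timestamp -> ID, ordered by timestamp.
    final NavigableMap<Long, String> deathRow = new TreeMap<>();
    // Core store stand-in: ID -> event.
    final Map<String, String> coreStore = new HashMap<>();

    void store(long timestamp, String id, String event) {
        coreStore.put(id, event);
        deathRow.put(timestamp, id);
    }

    /** Removes entries with timestamp < now - TTL and returns how many were reaped. */
    int reap(long nowMillis, long ttlMillis) {
        // headMap is a live view of the keys strictly below the cutoff; no full scan needed.
        Map<Long, String> stale = deathRow.headMap(nowMillis - ttlMillis, false);
        int reaped = stale.size();
        for (String id : stale.values()) {
            coreStore.remove(id);
        }
        stale.clear(); // clearing the view removes the entries from deathRow itself
        return reaped;
    }
}
```

In a Samza task this kind of method would plausibly be called from the periodic window callback, which is the "every n seconds" trigger the slides mention.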
  • 26. RocksDB Get All vs. Get Performance: [SAMZA-647] getAll is 23% faster
  • 28. The Most Important Metric
  • 29. Billions of messages handled by the job every day
  • 30. Find out more: blog.linkedin.com, linkedin.com/in/elgeish, mmahmoud@linkedin.com. ©2015 LinkedIn Corporation. All Rights Reserved.