LinkedIn's Feed is the entry point for hundreds of millions of members who seek to stay informed about their professional interests. The feed strives to provide members with content that is both relevant and fresh. How does the feed solve this problem at scale, and what role does Samza play in it? Join us to find out.
Harvesting the Power of Samza in LinkedIn's Feed
1. Harvesting the Power of Samza in News Feed
Providing fresh and relevant content to hundreds of millions of members
2. A Few Things Mentioned Here
Prerequisites:
1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java
4. Relevant content is a great way to stay informed about your professional interests; fresh relevant content is even better!
How do we keep track of what hundreds of millions of members viewed on their News Feeds?
6. News Feed is the Landing Page for Most Members
[Scale chart] Source: investors.linkedin.com | 1 as of quarter end | 2 monthly average during the quarter
7. Client-Side Tracking
• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic
8. Server-Side Tracking
• Events that carry more data about served feeds
• Rich payload
• Events end up in a Kafka topic
9. Improving Member Experience Using Samza (Overview)
1. Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream
2. Purge stale events: a custom TTL mechanism reaps stale events every n seconds
3. Consume output stream: convert the rich data about impressions into machine learning features used for ranking items in the News Feed
12. Client-Side Events Processor Overview
For each client-side event: is the ID in the server-side events store?
• Yes: match the events and output to Kafka
• No: store (ID, const.)
13. Optimizations (Client-Side Events Processor)
• The initial capacity of the matches map (event, matched IDs) is determined by a metric (GC-friendly)
• The initial capacity of the value set is equal to |IDs|
• An empty byte array is used as a dummy value for the IDs stored in RocksDB (it passes through the NOP byte array SerDe), so the store acts as a set
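A minimal Java sketch of the two allocation tricks above, with a plain `HashMap` standing in for the RocksDB-backed store (the names `MatchBuffer`, `DUMMY`, and `expectedMatches` are illustrative, not LinkedIn's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the GC-friendly sizing and dummy-value tricks described above.
class MatchBuffer {
    // One shared empty array serves as the dummy value for every ID:
    // no per-entry allocation, and a NOP byte-array SerDe passes it
    // through untouched, so the key-value store behaves like a set.
    static final byte[] DUMMY = new byte[0];

    // Sizing the matches map up front from an observed metric (here a
    // hypothetical expectedMatches figure) avoids rehash-and-copy churn
    // that would otherwise generate garbage.
    static Map<String, byte[]> newMatchesMap(int expectedMatches) {
        return new HashMap<>(expectedMatches);
    }
}
```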
14. Server-Side Events Processor Overview
For each server-side event: is the ID in the client-side events store?
• Yes: match the events and output to Kafka
• No: store (ID, event)
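The two processors above mirror each other. A minimal Java sketch of the join, with plain `HashMap`s standing in for the RocksDB-backed Samza stores and a list standing in for the output Kafka stream (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the stream-stream join; in the real tasks these
// lookups would go through Samza's KeyValueStore API on top of RocksDB.
class JoinProcessor {
    // Server-side events buffered by join ID, waiting for a client-side match.
    private final Map<String, String> serverSideStore = new HashMap<>();
    // Client-side IDs waiting for a server-side match; the dummy value
    // makes the map behave like a set.
    private final Map<String, byte[]> clientSideStore = new HashMap<>();
    private static final byte[] DUMMY = new byte[0];

    final List<String> output = new ArrayList<>(); // stands in for the output Kafka stream

    void processClientSideEvent(String id) {
        String serverSideEvent = serverSideStore.get(id);
        if (serverSideEvent != null) {
            // Match: join the two events and emit to the output stream.
            output.add(id + ":" + serverSideEvent);
        } else {
            // No match yet: buffer the ID so the server-side event can
            // complete the join when it arrives.
            clientSideStore.put(id, DUMMY);
        }
    }

    void processServerSideEvent(String id, String event) {
        if (clientSideStore.containsKey(id)) {
            output.add(id + ":" + event);
            clientSideStore.remove(id);
        } else {
            serverSideStore.put(id, event);
        }
    }
}
```

Whichever side of the join arrives first is buffered in its store; the other side completes the match and emits it.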
15. Event Anatomy
• Header (shared event data, e.g. member ID)
• List of payloads (one for each item)
• Each payload has a join key (ID), e.g. ID: 123, ID: 456, ID: 789, ...
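The event shape above can be sketched in Java as a header plus a list of payloads, each carrying the join key (field and class names are illustrative, not the actual LinkedIn schema):

```java
import java.util.List;

// Sketch of the event anatomy: shared header data plus one payload per
// feed item, where each payload carries the ID used for the join.
class FeedEvent {
    static class Header {
        final long memberId;      // shared event data, e.g. member ID
        Header(long memberId) { this.memberId = memberId; }
    }

    static class Payload {
        final String joinKey;     // the ID used for the stream-stream join
        Payload(String joinKey) { this.joinKey = joinKey; }
    }

    final Header header;
    final List<Payload> payloads; // one payload for each item in the feed
    FeedEvent(Header header, List<Payload> payloads) {
        this.header = header;
        this.payloads = payloads;
    }
}
```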
17. ManyKeysToOneValueStore<K, V> (Server-Side Events Storage)
• Space-efficient
• Insertion is transactional
• Rolling back a transaction is best-effort
• Requires an additional lookup (but it's worth it)
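A minimal sketch of the many-keys-to-one-value idea: every payload ID from one event points to a single shared copy of the rich event through one level of indirection, trading an extra lookup for space. Plain `HashMap`s stand in for RocksDB, and the internal value-ID scheme is an assumption for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: many join keys map to one stored value via an indirection table.
// Space-efficient (the rich event is stored once), at the cost of an
// additional lookup on every read.
class ManyKeysToOneValueStore<K, V> {
    private final Map<K, Long> keyToValueId = new HashMap<>();
    private final Map<Long, V> valueIdToValue = new HashMap<>();
    private long nextValueId = 0;

    // Insert all keys pointing at one shared value. In the real store this
    // is transactional, with best-effort rollback on failure.
    void putAll(Iterable<K> keys, V value) {
        long id = nextValueId++;
        valueIdToValue.put(id, value);
        for (K key : keys) {
            keyToValueId.put(key, id);
        }
    }

    // Two lookups: key -> value ID, then value ID -> value.
    V get(K key) {
        Long id = keyToValueId.get(key);
        return id == null ? null : valueIdToValue.get(id);
    }
}
```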
19. Key-Value Store Contributions to Samza [SAMZA-647]
• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than repeated get calls
• Added that support to Samza's KeyValueStore
• Perf test results confirm RocksDB's own numbers (with caching disabled)
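The shape of that contribution can be sketched as a store interface whose default `getAll` loops over `get`, which a RocksDB-backed implementation can override with a single `multiGet` call. The interface below is a simplified stand-in, not Samza's exact API:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the getAll(List<K>) access pattern.
interface SimpleStore<K, V> {
    V get(K key);

    // Default: N separate lookups. A RocksDB-backed store would override
    // this with one batched multiGet, which benchmarks faster.
    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new LinkedHashMap<>();
        for (K key : keys) {
            result.put(key, get(key));
        }
        return result;
    }
}

// In-memory implementation relying on the default getAll.
class MapStore implements SimpleStore<String, String> {
    private final Map<String, String> map = new HashMap<>();
    public void put(String k, String v) { map.put(k, v); }
    public String get(String k) { return map.get(k); }
}
```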
21. Custom TTL Mechanism
• Records the timestamp of when an event was stored
• The "death row" store: the key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur; they are resolved by probing:
Generate timestamp:
• Bucket is free → put(timestamp, ID)
• Bucket is taken and attempts <= max → generate a new timestamp and retry
• Bucket is taken and attempts > max → put(timestamp, ID) anyway
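The probing flow above can be sketched in Java, with a `HashMap` standing in for the death-row store (the names `MAX_ATTEMPTS` and `deathRow` are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the linear-probing timestamper: when two events land on the
// same millisecond, probe forward until a free bucket is found or the
// attempt budget runs out; after that, just overwrite. An occasional
// overwrite is acceptable because the TTL is not mission-critical.
class LinearProbingTimestamper {
    static final int MAX_ATTEMPTS = 10;
    final Map<Long, String> deathRow = new HashMap<>(); // timestamp -> ID

    void record(String id, long nowMillis) {
        long timestamp = nowMillis;
        int attempts = 0;
        // Probe while the bucket is taken and attempts remain.
        while (deathRow.containsKey(timestamp) && attempts <= MAX_ATTEMPTS) {
            timestamp++;   // next candidate bucket
            attempts++;
        }
        // Either the bucket is free, or we gave up and overwrite.
        deathRow.put(timestamp, id);
    }
}
```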
22. Linear Probing Timestamper
TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window.
Keeping it simple and stupid works.
23. Reapers
Every n seconds:
1. Get death rows (t < now - TTL)
2. For each entry in death row:
• Remove from core stores
• Remove from death row
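A minimal sketch of one reaper pass, using a `TreeMap` to stand in for RocksDB's ordered keys so that `headMap(now - TTL)` plays the role of the range query (the names `coreStore` and `deathRow` are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a reaper pass: find all death-row entries older than the TTL
// with a range query over the ordered timestamps, then delete the matching
// events from the core store and the death row.
class Reaper {
    final TreeMap<Long, String> deathRow = new TreeMap<>(); // timestamp -> ID
    final Map<String, String> coreStore = new HashMap<>();  // ID -> event

    void reap(long nowMillis, long ttlMillis) {
        // Range query: every entry with timestamp < now - TTL is stale.
        Map<Long, String> stale = deathRow.headMap(nowMillis - ttlMillis);
        for (String id : stale.values()) {
            coreStore.remove(id);  // remove from core stores
        }
        stale.clear();             // headMap is a view, so this removes from death row
    }
}
```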
24. Optimizations (Reapers)
• Keys (timestamps) are stored in order
• A range query (0, now - TTL) is much faster than a range scan (testing all values)
• Even though the TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)