LinkedIn's Feed is the entry point for hundreds of millions of members who seek to stay informed about their professional interests. The feed strives to provide members with content that is both relevant and fresh. How does the feed solve this problem at scale, and what role does Samza play in it? Join us to find out.
Harvesting the Power of Samza in LinkedIn's Feed
1. Harvesting the Power of Samza in News Feed
Providing fresh and relevant content to hundreds of millions of members
2. A Few Things Mentioned Here
Prerequisites:
1. Samza
2. RocksDB (a key-value store)
3. SerDe (Serializer/Deserializer)
4. Kafka (a distributed messaging system)
5. Java
4. Relevant content is a great way to stay informed about your professional interests; fresh relevant content is even better!
How do we keep track of what hundreds of millions of members viewed on their News Feeds?
6. News Feed is the Landing Page for Most Members
[Scale chart] Source: investors.linkedin.com | 1 as of quarter end | 2 monthly average during the quarter
7. Client-Side Tracking
• Lightweight events that track what the member viewed
• Tiny payload (bandwidth-friendly)
• Events end up in a Kafka topic
8. Server-Side Tracking
• Events that carry more data about served feeds
• Rich payload
• Events end up in a Kafka topic
9. Improving Member Experience Using Samza (Overview)
1. Join input streams: a stream-stream join task buffers events from both streams; matches are sent to an output Kafka stream
2. Purge stale events: a custom TTL mechanism reaps stale events every n seconds
3. Consume output stream: convert the rich data about impressions into machine learning features used for ranking items in the News Feed
12. Client-Side Events Processor Overview
For each client-side event: is the ID in the server-side events store?
• Yes: match the events and output to Kafka
• No: store (ID, const.)
13. Optimizations (Client-Side Events Processor)
• The initial capacity of the matches map (event, matched IDs) is determined by a metric (GC-friendly)
• The initial capacity of the value set is equal to |IDs|
• An empty byte array is used as a dummy value for the IDs stored in RocksDB (it passes through the NOP byte array SerDe), so the store acts as a set
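A minimal Java sketch of the two allocation tricks above, with a plain `HashMap` standing in for the RocksDB-backed store (the names `MatchBuffer`, `DUMMY`, and `expectedMatches` are illustrative, not LinkedIn's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the GC-friendly sizing and dummy-value tricks described above.
class MatchBuffer {
    // One shared empty array serves as the dummy value for every ID:
    // no per-entry allocation, and a NOP byte-array SerDe passes it
    // through untouched, so the key-value store behaves like a set.
    static final byte[] DUMMY = new byte[0];

    // Sizing the matches map up front from an observed metric (here a
    // hypothetical expectedMatches figure) avoids rehash-and-copy churn
    // that would otherwise generate garbage.
    static Map<String, byte[]> newMatchesMap(int expectedMatches) {
        return new HashMap<>(expectedMatches);
    }
}
```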
14. Server-Side Events Processor Overview
For each server-side event: is the ID in the client-side events store?
• Yes: match the events and output to Kafka
• No: store (ID, event)
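The two processors above mirror each other. A minimal Java sketch of the join, with plain `HashMap`s standing in for the RocksDB-backed Samza stores and a list standing in for the output Kafka stream (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the stream-stream join; in the real tasks these
// lookups would go through Samza's KeyValueStore API on top of RocksDB.
class JoinProcessor {
    // Server-side events buffered by join ID, waiting for a client-side match.
    private final Map<String, String> serverSideStore = new HashMap<>();
    // Client-side IDs waiting for a server-side match; the dummy value
    // makes the map behave like a set.
    private final Map<String, byte[]> clientSideStore = new HashMap<>();
    private static final byte[] DUMMY = new byte[0];

    final List<String> output = new ArrayList<>(); // stands in for the output Kafka stream

    void processClientSideEvent(String id) {
        String serverSideEvent = serverSideStore.get(id);
        if (serverSideEvent != null) {
            // Match: join the two events and emit to the output stream.
            output.add(id + ":" + serverSideEvent);
        } else {
            // No match yet: buffer the ID so the server-side event can
            // complete the join when it arrives.
            clientSideStore.put(id, DUMMY);
        }
    }

    void processServerSideEvent(String id, String event) {
        if (clientSideStore.containsKey(id)) {
            output.add(id + ":" + event);
            clientSideStore.remove(id);
        } else {
            serverSideStore.put(id, event);
        }
    }
}
```

Whichever side of the join arrives first is buffered in its store; the other side completes the match and emits it.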
15. Event Anatomy
• Header (shared event data, e.g. member ID)
• List of payloads (one for each item)
• Each payload has a join key (ID), e.g. ID: 123, ID: 456, ID: 789, ...
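The event shape above can be sketched in Java as a header plus a list of payloads, each carrying the join key (field and class names are illustrative, not the actual LinkedIn schema):

```java
import java.util.List;

// Sketch of the event anatomy: shared header data plus one payload per
// feed item, where each payload carries the ID used for the join.
class FeedEvent {
    static class Header {
        final long memberId;      // shared event data, e.g. member ID
        Header(long memberId) { this.memberId = memberId; }
    }

    static class Payload {
        final String joinKey;     // the ID used for the stream-stream join
        Payload(String joinKey) { this.joinKey = joinKey; }
    }

    final Header header;
    final List<Payload> payloads; // one payload for each item in the feed
    FeedEvent(Header header, List<Payload> payloads) {
        this.header = header;
        this.payloads = payloads;
    }
}
```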
17. ManyKeysToOneValueStore<K, V> (Server-Side Events Storage)
• Space-efficient
• Insertion is transactional
• Rolling back a transaction is best-effort
• Requires an additional lookup (but it's worth it)
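A minimal sketch of the many-keys-to-one-value idea: every payload ID from one event points to a single shared copy of the rich event through one level of indirection, trading an extra lookup for space. Plain `HashMap`s stand in for RocksDB, and the internal value-ID scheme is an assumption for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: many join keys map to one stored value via an indirection table.
// Space-efficient (the rich event is stored once), at the cost of an
// additional lookup on every read.
class ManyKeysToOneValueStore<K, V> {
    private final Map<K, Long> keyToValueId = new HashMap<>();
    private final Map<Long, V> valueIdToValue = new HashMap<>();
    private long nextValueId = 0;

    // Insert all keys pointing at one shared value. In the real store this
    // is transactional, with best-effort rollback on failure.
    void putAll(Iterable<K> keys, V value) {
        long id = nextValueId++;
        valueIdToValue.put(id, value);
        for (K key : keys) {
            keyToValueId.put(key, id);
        }
    }

    // Two lookups: key -> value ID, then value ID -> value.
    V get(K key) {
        Long id = keyToValueId.get(key);
        return id == null ? null : valueIdToValue.get(id);
    }
}
```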
19. Key-Value Store Contributions to Samza [SAMZA-647]
• The access pattern is getAll(List<K>)
• RocksDB supports multiGet, which is faster than repeated get calls
• Added that support to Samza's KeyValueStore
• Perf test results confirm RocksDB's own numbers (with caching disabled)
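The shape of that contribution can be sketched as a store interface whose default `getAll` loops over `get`, which a RocksDB-backed implementation can override with a single `multiGet` call. The interface below is a simplified stand-in, not Samza's exact API:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the getAll(List<K>) access pattern.
interface SimpleStore<K, V> {
    V get(K key);

    // Default: N separate lookups. A RocksDB-backed store would override
    // this with one batched multiGet, which benchmarks faster.
    default Map<K, V> getAll(List<K> keys) {
        Map<K, V> result = new LinkedHashMap<>();
        for (K key : keys) {
            result.put(key, get(key));
        }
        return result;
    }
}

// In-memory implementation relying on the default getAll.
class MapStore implements SimpleStore<String, String> {
    private final Map<String, String> map = new HashMap<>();
    public void put(String k, String v) { map.put(k, v); }
    public String get(String k) { return map.get(k); }
}
```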
21. Custom TTL Mechanism
• Records the timestamp of when an event was stored
• The "death row" store: the key is the timestamp and the value is an ID
• Because the key is a timestamp, collisions occur; they are resolved by probing:
Generate timestamp:
• Bucket is free → put(timestamp, ID)
• Bucket is taken and attempts <= max → generate a new timestamp and retry
• Bucket is taken and attempts > max → put(timestamp, ID) anyway
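The probing flow above can be sketched in Java, with a `HashMap` standing in for the death-row store (the names `MAX_ATTEMPTS` and `deathRow` are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the linear-probing timestamper: when two events land on the
// same millisecond, probe forward until a free bucket is found or the
// attempt budget runs out; after that, just overwrite. An occasional
// overwrite is acceptable because the TTL is not mission-critical.
class LinearProbingTimestamper {
    static final int MAX_ATTEMPTS = 10;
    final Map<Long, String> deathRow = new HashMap<>(); // timestamp -> ID

    void record(String id, long nowMillis) {
        long timestamp = nowMillis;
        int attempts = 0;
        // Probe while the bucket is taken and attempts remain.
        while (deathRow.containsKey(timestamp) && attempts <= MAX_ATTEMPTS) {
            timestamp++;   // next candidate bucket
            attempts++;
        }
        // Either the bucket is free, or we gave up and overwrite.
        deathRow.put(timestamp, id);
    }
}
```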
22. Linear Probing Timestamper
TTL calculation is not mission-critical (currentTimeMillis() is not very precise anyway); events get deleted in the next window.
Keeping it simple and stupid works.
23. Reapers
Every n seconds:
1. Get death rows (t < now - TTL)
2. For each entry in death row:
• Remove from core stores
• Remove from death row
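A minimal sketch of one reaper pass, using a `TreeMap` to stand in for RocksDB's ordered keys so that `headMap(now - TTL)` plays the role of the range query (the names `coreStore` and `deathRow` are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a reaper pass: find all death-row entries older than the TTL
// with a range query over the ordered timestamps, then delete the matching
// events from the core store and the death row.
class Reaper {
    final TreeMap<Long, String> deathRow = new TreeMap<>(); // timestamp -> ID
    final Map<String, String> coreStore = new HashMap<>();  // ID -> event

    void reap(long nowMillis, long ttlMillis) {
        // Range query: every entry with timestamp < now - TTL is stale.
        Map<Long, String> stale = deathRow.headMap(nowMillis - ttlMillis);
        for (String id : stale.values()) {
            coreStore.remove(id);  // remove from core stores
        }
        stale.clear();             // headMap is a view, so this removes from death row
    }
}
```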
24. Optimizations (Reapers)
• Keys (timestamps) are stored in order
• A range query (0, now - TTL) is much faster than a range scan (testing all values)
• Even though the TTL is on the order of minutes/hours, reaping stale events happens every 10 seconds (the window method is blocking)