
Stream Processing at LinkedIn with Apache Samza

Talk given at the Bangalore Apache Kafka Group meetup
https://www.meetup.com/Bangalore-Apache-Kafka-Group/events/251707854/

Published in: Engineering

  1. 1. Stream Processing at LinkedIn with Apache Samza. Abhishek Shivanna, Sr Engineer, Site Reliability, Streams Infrastructure
  2. 2. Today’s agenda 1 Introduction to Stream Processing with Samza 2 Stream Processing at LinkedIn 3 Deep dive – Notifications @ LinkedIn 4 Deep dive – Viewport Tracking @ LinkedIn 5 Q & A
  3. 3. Today’s agenda 1 Introduction to Stream Processing with Samza 2 Stream Processing at LinkedIn 3 Deep dive – Notifications @ LinkedIn 4 Deep dive – Viewport Tracking @ LinkedIn 5 Q & A
  4. 4. Where does stream processing fit in? Processing latency of user interactions, by response latency: Synchronous; Milliseconds to minutes; Hours to days
  5. 5. Serving Stores
  6. 6. Topic A (P1 P2 P3); Container (Task, Task, Task); Topic B (P1 P2 P3)
  7. 7. Topic A (P1 P2 P3); Container (Task), Container (Task), Container (Task); Topic B (P1 P2 P3)
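
The two diagrams above suggest one task per partition number, with the tasks placed either in a single container or spread across several. A minimal sketch of that placement follows, assuming a simple round-robin rule; the class and method names are hypothetical and this is not Samza's actual grouping code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only: one task per partition, tasks spread round-robin over containers. */
final class TaskAssignmentSketch {

    /** Returns one list per container; list c holds the task (= partition) ids placed in container c. */
    static List<List<Integer>> assign(int partitionCount, int containerCount) {
        if (partitionCount < 0 || containerCount <= 0) {
            throw new IllegalArgumentException("need partitionCount >= 0 and containerCount > 0");
        }
        List<List<Integer>> containers = new ArrayList<>();
        for (int c = 0; c < containerCount; c++) {
            containers.add(new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            containers.get(p % containerCount).add(p); // assumed placement rule
        }
        return containers;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, 1)); // prints [[0, 1, 2]]
        System.out.println(assign(3, 3)); // prints [[0], [1], [2]]
    }
}
```

Adding containers changes only where tasks run; the number of tasks stays tied to the number of partitions.
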
  8. 8. State Stores • Temporary data storage • Adjunct data lookups • Rich access patterns, 100x faster than remote lookups (1.1M TPS). Diagram: Task
  9. 9. Task State Store Changelog Optimization: Host Affinity
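
The state store and changelog slides describe a task-local store whose writes are also recorded in a changelog, so the store can be rebuilt after a failure. The sketch below models that idea in plain Java: an in-memory list stands in for the changelog topic and a `HashMap` for the local store. All names are hypothetical. Host affinity, the optimization named on the slide, would try to restart the task where its local store already exists and so avoid most of this replay; the sketch shows only the replay path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only: a local key-value store that mirrors every write to a changelog. */
final class ChangelogStoreSketch {
    private final Map<String, Integer> local = new HashMap<>();
    private final List<Map.Entry<String, Integer>> changelog; // stands in for a changelog topic

    /** Opening the store replays the changelog, which is how state is restored after a failure. */
    ChangelogStoreSketch(List<Map.Entry<String, Integer>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Integer> change : changelog) {
            local.put(change.getKey(), change.getValue());
        }
    }

    /** Writes locally and appends the same change to the changelog. */
    void put(String key, int value) {
        local.put(key, value);
        changelog.add(Map.entry(key, value));
    }

    /** Reads are served from the local copy only. */
    int getOrDefault(String key, int fallback) {
        return local.getOrDefault(key, fallback);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> topic = new ArrayList<>();
        ChangelogStoreSketch beforeFailure = new ChangelogStoreSketch(topic);
        beforeFailure.put("home", 1);
        beforeFailure.put("home", 2);
        ChangelogStoreSketch afterRestart = new ChangelogStoreSketch(topic);
        System.out.println(afterRestart.getOrDefault("home", 0)); // prints 2
    }
}
```
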
  10. 10. Container Threading Model Task 1 Task 2 Task 3
  11. 11. Container Threading Model Task 1 Task 2 Task 3
  12. 12. Container Threading Model Task 1 Task 2 Task 3
  13. 13. Checkpointing Topic A P1 P2 P3 Container Task 1 Task 2 Task 3 Checkpoints Task1P0: 3 Task2P2: 1 Task3P3: 4
  14. 14. Checkpointing Topic A P1 P2 P3 Container Task 1 Task 2 Task 3 Checkpoints Task1P0: 1 Task2P2: 2 Task3P3: 3
  15. 15. Container 1 Task 1 Task 2 Container 2 Task 3 Task 4 Container 3 Task 5 Task 6 Checkpointing
  16. 16. Container 1 Task 1 Task 2 Container 2 Task 3 Task 4 Container 3 Task 5 Task 6 Checkpointing
  17. 17. Container 1 Task 1 Task 2 Container 2 Task 3 Task 4 Container 3 Task 5 Task 6 Container 1 Task 1 Task 2 Checkpointing
  18. 18. Container 1 Task 1 Task 2 Container 2 Task 3 Task 4 Container 3 Task 5 Task 6 Container 1 Task 1 Task 2 Checkpointing
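
The checkpointing slides show per-task offsets being recorded and a container's tasks reappearing after a failure. A minimal sketch of what that implies: a restarted task resumes from its last checkpointed offset, so messages processed after that checkpoint are processed again (at-least-once). The list-backed partition, the `CheckpointStore` class, and the count-based commit interval are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model only: resume a task from its last checkpointed offset. */
final class CheckpointSketch {

    /** Stands in for the checkpoint stream: task name -> next offset to read. */
    static final class CheckpointStore {
        private final Map<String, Integer> nextOffset = new HashMap<>();

        void write(String task, int offset) {
            nextOffset.put(task, offset);
        }

        int read(String task) {
            return nextOffset.getOrDefault(task, 0);
        }
    }

    /**
     * Processes at most maxMessages starting at the task's checkpoint and writes a
     * checkpoint after every commitEvery messages. Returns what this run processed.
     */
    static List<String> run(String task, List<String> partition, CheckpointStore store,
                            int maxMessages, int commitEvery) {
        List<String> processed = new ArrayList<>();
        int offset = store.read(task);
        int sinceCommit = 0;
        while (offset < partition.size() && processed.size() < maxMessages) {
            processed.add(partition.get(offset));
            offset++;
            sinceCommit++;
            if (sinceCommit == commitEvery) {
                store.write(task, offset);
                sinceCommit = 0;
            }
        }
        return processed; // anything after the last write() is replayed by the next run
    }

    public static void main(String[] args) {
        List<String> partition = List.of("m0", "m1", "m2", "m3", "m4");
        CheckpointStore store = new CheckpointStore();
        System.out.println(run("Task1", partition, store, 3, 2));  // prints [m0, m1, m2]
        System.out.println(run("Task1", partition, store, 10, 2)); // prints [m2, m3, m4]
    }
}
```

In the usage above the first run stops after three messages, as if the container had failed; `m2` is processed twice because the last checkpoint was written after `m1`.
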
  19. 19. Event Loop: Choose Message → Pick Task(s) to send → Process → Window → Checkpoint / Flush
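
The event loop slide names the stages a container cycles through. The sketch below is a simplified, single-task reading of that loop: take the next message, process it, then run the window and checkpoint steps at intervals. The `Task` interface is hypothetical, and message counts stand in for whatever timers a real job would use, which keeps the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Illustrative model only: a single-threaded choose / process / window / checkpoint loop. */
final class EventLoopSketch {

    /** Hypothetical task callbacks, loosely mirroring the stages on the slide. */
    interface Task {
        void process(String message);
        void window();
        void commit();
    }

    /** Drains the queue; calls window() every windowEvery messages and commit() every commitEvery. */
    static void run(Queue<String> input, Task task, int windowEvery, int commitEvery) {
        int seen = 0;
        String message;
        while ((message = input.poll()) != null) { // choose message
            task.process(message);                 // process
            seen++;
            if (seen % windowEvery == 0) {
                task.window();                     // window
            }
            if (seen % commitEvery == 0) {
                task.commit();                     // checkpoint / flush
            }
        }
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        Task recorder = new Task() {
            public void process(String message) { trace.add("process:" + message); }
            public void window() { trace.add("window"); }
            public void commit() { trace.add("commit"); }
        };
        run(new ArrayDeque<>(List.of("a", "b", "c", "d")), recorder, 2, 4);
        System.out.println(trace);
    }
}
```
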
  20. 20. Low Level API
        public class HelloWorldTask implements InitableTask, StreamTask, WindowableTask {
          private KeyValueStore<String, Integer> store;

          @Override
          public void init(Config config, TaskContext context) {
            store = (KeyValueStore<String, Integer>) context.getStore("page-key-counts");
          }

          @Override
          public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
            GenericRecord record = (GenericRecord) envelope.getMessage();
            // store.put(record.pageKey.toString(), currentCount + 1);
          }

          @Override
          public void window(MessageCollector collector, TaskCoordinator coordinator) {
            // KeyValueIterator<String, Integer> iterator = store.all();
          }
        }
  21. 21. High Level API
        public class AdServedJoinApp implements StreamApplication {
          @Override
          public void init(StreamGraph streamGraph, Config config) {
            MessageStream<KV<String, GenericData.Record>> adServedEvent = streamGraph.getInputStream("AdServed");
            MessageStream<KV<String, GenericData.Record>> adClickEvent = streamGraph.getInputStream("AdClickEvent");
            OutputStream<KV<String, SamzaApiTestJoinEvent>> outputStream = streamGraph.getOutputStream("TestJoinOutput");
            // Omitted code related to conversion from GenericRecord -> message types
            adServedEvent
                .join(adClickEvent, new AdClickJoinFunction(), new StringSerde(), new JsonSerdeV2<>(AdServed.class), new JsonSerdeV2<>(AdClick.class), Duration.ofMinutes(30), "join")
                .map(joinEvent -> KV.of(joinEvent.adId.toString(), joinEvent))
                .sendTo(outputStream);
          }
        }
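
The join on the High Level API slide pairs ad-served and ad-click events, with a 30-minute duration passed to the operator. The plain-Java sketch below illustrates what such a time-bounded join does conceptually: each side is buffered by key, and a pair is emitted only when both sides exist and their timestamps are within the bound. It does not use the Samza API; the class, the string output format, and the explicit timestamps are assumptions, and expired entries are never evicted here.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative model only: join two keyed event streams when both sides arrive within a time bound. */
final class WindowedJoinSketch {
    private final long ttlMillis;
    private final Map<String, Long> servedAt = new HashMap<>();  // adId -> serve time
    private final Map<String, Long> clickedAt = new HashMap<>(); // adId -> click time

    WindowedJoinSketch(Duration ttl) {
        this.ttlMillis = ttl.toMillis();
    }

    /** Buffers a served event and returns a joined record if a matching click is already buffered. */
    Optional<String> onServed(String adId, long timeMillis) {
        servedAt.put(adId, timeMillis);
        return match(adId);
    }

    /** Buffers a click event and returns a joined record if a matching serve is already buffered. */
    Optional<String> onClick(String adId, long timeMillis) {
        clickedAt.put(adId, timeMillis);
        return match(adId);
    }

    private Optional<String> match(String adId) {
        Long served = servedAt.get(adId);
        Long clicked = clickedAt.get(adId);
        if (served == null || clicked == null) {
            return Optional.empty(); // the other side has not arrived yet
        }
        if (Math.abs(clicked - served) > ttlMillis) {
            return Optional.empty(); // outside the join bound
        }
        return Optional.of(adId + ":served@" + served + ",clicked@" + clicked);
    }

    public static void main(String[] args) {
        WindowedJoinSketch join = new WindowedJoinSketch(Duration.ofMinutes(30));
        join.onServed("ad1", 0L);
        System.out.println(join.onClick("ad1", 60_000L)); // prints Optional[ad1:served@0,clicked@60000]
    }
}
```
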
  22. 22. Samza SQL ./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name as __key__, Name, NewCompany, RegexMatch('.*soft', OldCompany) from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"
  23. 23. Flexible Deployments YARN Standalone/Library
  24. 24. Today’s agenda 1 Introduction to Stream Processing with Samza 2 Stream Processing at LinkedIn 3 Deep dive – Notifications @ LinkedIn 4 Deep dive – Viewport Tracking @ LinkedIn 5 Q & A
  25. 25. Stream Processing Use Cases at LinkedIn • Security: anti-scraping and data theft prevention • Notifications: notifications to members • News Classification: real-time tagging of articles • Call graph: analysis of service calls
  26. 26. Stream Processing Use Cases at LinkedIn • Ad Relevance: tracking ad relevance and click-through rate • Viewport Tracking: tracking session duration • Profile Standardization: standardizing titles, gender, education • Error Tracking: auto-triaging of application errors
  27. 27. [Chart: Number of Jobs per quarter, Q2-17 through Q2-18; y-axis from 0 to 300]
  28. 28. Today’s agenda 1 Introduction to Stream Processing with Samza 2 Stream Processing at LinkedIn 3 Deep dive – Notifications @ LinkedIn 4 Deep dive – Viewport Tracking @ LinkedIn 5 Q & A
  29. 29. Notifications The Problem
  30. 30. Notifications Goal • Handle notification rate – control user experience, engagement, and resulting app uninstalls • Relevance – notifications about things you care about • Channel and time – Email, Push, SMS, etc., with personalized delivery time
  31. 31. ATC (Air Traffic Controller)
  32. 32. Notify Tracking Relevance Member Setting Client Online/Offline Services
  33. 33. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals
  34. 34. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration
  35. 35. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications - Installed devices - Notification preferences
  36. 36. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection
  37. 37. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection - Email/SMS/Push? - Predict click/disable notify rate for channel - Member settings
  38. 38. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation - Group notifications into one payload - Member settings (Weekly digest) - Delay notification based on history
  39. 39. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation Delivery time optimization
  40. 40. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation Delivery time optimization - Best time to send notification - In bed or while commuting etc
  41. 41. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation Delivery time optimization Request Queue
  42. 42. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation Delivery time optimization Request Queue Filter - Dedup - Member interaction complete? - Notification expiry - Rate limit upstream service
  43. 43. Notify Tracking Relevance Member Setting Client Online/Offline Services External Signals Request Decoration Member Notifications Channel Selection Aggregation Delivery time optimization Request Queue Filter
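
The Filter stage in the pipeline above lists dedup, notification expiry, and rate limiting of upstream services. A minimal sketch of such a filter follows, assuming in-memory state and a simple per-service cap instead of a time-windowed rate limit. Every name is hypothetical, and the "member interaction complete" check is left out because the slides give no detail on it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative model only: drop expired, duplicate, or over-quota notifications. */
final class NotificationFilterSketch {
    private final Set<String> seenIds = new HashSet<>();
    private final Map<String, Integer> sentPerSource = new HashMap<>();
    private final int maxPerSource; // assumed cap per upstream service

    NotificationFilterSketch(int maxPerSource) {
        this.maxPerSource = maxPerSource;
    }

    /** Returns true if the notification should be delivered. */
    boolean accept(String notificationId, String sourceService, long expiresAtMillis, long nowMillis) {
        if (nowMillis >= expiresAtMillis) {
            return false; // notification expiry
        }
        if (!seenIds.add(notificationId)) {
            return false; // dedup: this id was already seen
        }
        int sent = sentPerSource.getOrDefault(sourceService, 0);
        if (sent >= maxPerSource) {
            return false; // rate limit the upstream service
        }
        sentPerSource.put(sourceService, sent + 1);
        return true;
    }

    public static void main(String[] args) {
        NotificationFilterSketch filter = new NotificationFilterSketch(2);
        System.out.println(filter.accept("n1", "jobs", 100L, 10L)); // prints true
        System.out.println(filter.accept("n1", "jobs", 100L, 11L)); // prints false
    }
}
```
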
  44. 44. Samza + ATC: Data Locality for co-partitioned topics. NotificationsForMember: P1 MemberID {0-1M}, P2 MemberID {1M-2M}, P3 MemberID {3M-4M}. MemberSettingChange: P1 MemberID {0-1M}, P2 MemberID {1M-2M}, P3 MemberID {3M-4M}. Container: Task1, Task2, Task3
  45. 45. Samza + ATC NotificationsForMember P1 P2 P3 Scalable MemberSettingChange P1 P2 P3 Task3 Task2 Task1 Container
  46. 46. Samza + ATC NotificationsForMember P1 P2 P3 Scalable MemberSettingChange P1 P2 P3 Task3 Task2 Task1 Container Container Container
  47. 47. Samza + ATC • Fault tolerant – all data in stores is backed up with changelogs to Kafka and can be restored on startup after a failure • Topic priority – M2M messages vs Daily Rundown • Async API – remote call throughput • Range query (with RocksDB) – keys with a member ID prefix for disk locality (e.g., all pending notifications for a member ID)
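
The range-query bullet above relies on keys being stored in sorted order, so that all keys sharing a member ID prefix sit next to each other and can be read as one contiguous range. The sketch below illustrates the idea with a `TreeMap` standing in for the sorted store. The `memberId/notificationId` key layout and all names are assumptions, not the layout ATC actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative model only: prefix range scan over a sorted key-value store. */
final class PrefixRangeSketch {
    private final TreeMap<String, String> store = new TreeMap<>(); // sorted keys

    /** Stores a payload under an assumed "memberId/notificationId" key. */
    void put(long memberId, String notificationId, String payload) {
        store.put(memberId + "/" + notificationId, payload);
    }

    /** All pending notifications for one member, read as a single contiguous key range. */
    List<String> pendingFor(long memberId) {
        String from = memberId + "/";             // the separator keeps member 12 apart from member 123
        String to = from + Character.MAX_VALUE;   // upper bound above any key that starts with the prefix
        return new ArrayList<>(store.subMap(from, true, to, false).values());
    }

    public static void main(String[] args) {
        PrefixRangeSketch sketch = new PrefixRangeSketch();
        sketch.put(12, "a", "x");
        sketch.put(123, "b", "y");
        sketch.put(12, "c", "z");
        System.out.println(sketch.pendingFor(12)); // prints [x, z]
    }
}
```
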
  48. 48. Today’s agenda 1 Introduction to Stream Processing with Samza 2 Stream Processing at LinkedIn 3 Deep dive – Notifications @ LinkedIn 4 Deep dive – Viewport Tracking @ LinkedIn 5 Q & A
  49. 49. Viewport Tracking Goal: Power relevant, fresh content for the LinkedIn Feed
  50. 50. Viewport ??
  51. 51. Viewport ??
  52. 52. Feed Server Feed Server
  53. 53. Server-side tracking event (Feed Server): "feedUpdates": [ { "updateUrn": "1", "trackingId": "abc", "position": .., "creationTime": .., "numLikes": .., "numComments": .., "comments": [ { "commentId": .. } ] }, { "updateUrn": "2", "trackingId": "def", .. } ]
  54. 54. Client-side tracking event: "feedImpression": [ { "urn": .., "trackingId": "abc", "durationMs": "5000" }, { "urn": .., "trackingId": "ghi", "durationMs": .. }, … ] • Light payload • Bandwidth and battery friendly
  55. 55. "feedUpdates": [ { "updateUrn": "1", "trackingId": "abc", "position": .., "creationTime": .., "numLikes": .., "numComments": .., "comments": [ { "commentId": .. } ] }, { "updateUrn": "2", "trackingId": "def", .. } ] + "feedImpression": [ { "urn": .., "trackingId": "abc", "durationMs": "5000" }, { "urn": .., "trackingId": "ghi", "durationMs": "6000" }, … ] → "feedJoined": [ { "updateUrn": "1", "trackingId": "abc", "durationMs": "5000", "position": .., "creationTime": .., "numLikes": .., "numComments": .., "comments": [ { "commentId": .. } ] }, { "updateUrn": "3", "trackingId": "ghi", .. } ]
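
The three payloads above show a server-side update and a client-side impression that share a `trackingId` being merged into one joined record. A minimal sketch of that merge follows, assuming the server-side event arrives first and is buffered in a local map. Events are modeled as string maps and all class and method names are hypothetical; a real job would also need to handle the opposite arrival order.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative model only: enrich client impressions with the buffered server-side update. */
final class FeedJoinSketch {
    private final Map<String, Map<String, String>> serverByTrackingId = new HashMap<>();

    /** Buffers a server-side feed update under its trackingId. */
    void onServerUpdate(Map<String, String> update) {
        serverByTrackingId.put(update.get("trackingId"), update);
    }

    /** Returns the joined record, or empty if no server-side update has been seen for this trackingId. */
    Optional<Map<String, String>> onClientImpression(Map<String, String> impression) {
        Map<String, String> update = serverByTrackingId.get(impression.get("trackingId"));
        if (update == null) {
            return Optional.empty();
        }
        Map<String, String> joined = new LinkedHashMap<>(update);
        joined.put("durationMs", impression.get("durationMs")); // add the client-side field
        return Optional.of(joined);
    }

    public static void main(String[] args) {
        FeedJoinSketch join = new FeedJoinSketch();
        join.onServerUpdate(Map.of("updateUrn", "1", "trackingId", "abc"));
        System.out.println(join.onClientImpression(Map.of("trackingId", "abc", "durationMs", "5000")).isPresent()); // prints true
    }
}
```
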
  56. 56. Feed Server (P1 P2 P3) and Client Impression (P1 P2 P3) → Task, Task, Task in Container, Container, Container → to downstream feed ranking systems • 2+ billion events per day • 90 containers • 2G / 1 vCore
  57. 57. Key Differentiators • Stream processing both as a multi-tenant service with a cluster manager and as a lightweight embedded library • First-class streaming support (no micro-batching) • Unified processing of batch and streaming data • First-class support for async processing for efficient remote calls • First-class support for scalable and durable local state • Incremental changelog • Instant restore with zero downtime • Rich expression with a low-level API, a stream-based high-level API (DSL), and SQL
  58. 58. Powered by
  59. 59. https://samza.apache.org
  60. 60. Thank you
