
Scoring at Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members

The Communities AI team at LinkedIn generates follow recommendations from a large set of entities (tens of millions) for each of our 690+ million members.



  1. Scoring at Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members. Abdulla Al Qawasmeh, Engineering Manager, AI; Emilie de Longueau, Sr. Software Engineer, AI
  2. Agenda: Introduction to Follows Relevance at LinkedIn; Offline Scoring Architecture; Scalability Improvements with 2D Hash-Partitioned Join
  3. Introduction to Follows Relevance
  4. Product Placements
  5. Communities AI. Mission: Empower members to form communities around common interests and have active conversations. Discover: follow entities with shared interests. Engage: join conversations happening in communities with shared interests. Contribute: engage with the right communities when creating content. See https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests and https://recsys.acm.org/recsys19/industry-session-1/
  6. Discover: Follow Recommendations at Scale. A large-scale system that recommends entities to follow for every LinkedIn member. Viewers (v): 100s of millions of members. Entities (e): millions, across members, pages, newsletters, groups, hashtags, and events. Key challenge: 100s of trillions of possible (viewer, entity) pairs!
  7. Recommendation Objective: recommend entities that the member finds interesting and engaging. Interesting (form edges): pfollow(v follows e | e recommended to v). Engaging: utility(v engages with e | v follows e). Follow edges (links between v and e) contribute a substantial amount of content and engagement on the Feed.
  8. Problem Formulation: the ranking objective function. PFOLLOW model: binary response; predicts the probability of following the entity given an impression. UTILITY model: continuous response; models the engagement between viewer and entity after the follow edge is formed.
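The deck does not spell out how the two model outputs are combined, but a natural reading of the formulation is an expected-utility ranking score: P(follow) multiplied by the predicted post-follow utility. A minimal sketch under that assumption, with a hypothetical `alpha` exponent as a tuning knob (not from the talk):

```python
def ranking_score(p_follow: float, utility: float, alpha: float = 1.0) -> float:
    """Illustrative expected-engagement ranking score (assumed combination).

    p_follow: P(v follows e | e recommended to v), from the binary PFOLLOW model.
    utility:  predicted engagement given the follow, from the continuous UTILITY model.
    alpha:    hypothetical knob trading follow volume vs. downstream engagement.
    """
    return p_follow * (utility ** alpha)

# Rank a viewer's candidate entities by descending score (toy data).
candidates = [("page_a", 0.30, 2.0), ("hashtag_b", 0.10, 9.0), ("member_c", 0.05, 1.0)]
ranked = sorted(candidates, key=lambda c: ranking_score(c[1], c[2]), reverse=True)
```

Note how a multiplicative objective rewards entities that are both likely to be followed and likely to keep the member engaged afterwards, rather than optimizing either signal alone.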
  9. Offline Scoring Architecture
  10. Active vs. Inactive Members: recommending entities to follow for every one of LinkedIn's 690+ million members. Active members: users who have performed recent actions on LinkedIn. Inactive members: new users, or registered users who have not performed recent actions on LinkedIn.
  11. Active vs. Inactive Members (continued). Active members (high % of client calls): personalized recommendations precomputed offline per member by a heavy Spark pipeline, plus real-time contextual recommendations based on recent activity. Inactive members (low % of client calls): segment-based recommendations precomputed offline per segment (e.g., industry, skills, country) by a lightweight Spark pipeline and fetched online.
  12-13. Scoring Architecture: simplified end-to-end pipeline for active members. Offline (Spark): active-member precomputed recommendations and scores, (viewer, (entity, score)), and context-based precomputed recommendations and scores, (context, (entity, score)), are periodically pushed to key-value stores. Online (Java): on a client request "Get follow recommendations for viewer X", query the active-members store for X. Found (active member): apply final scoring, filtering, and blending. Not found (inactive member): fetch X's contexts and query the store for contextual recommendations. A real-time service supplies recent member activity, e.g. (followed, (entity_X, score)) and (interacted, (entity_Y, score)).
  14. Feature Categories. Viewer features (small number): follow-through-rate (FTR), Feed click-through-rate (CTR), impression counts, interaction counts, segments (industry, country, skills, company...), language(s), ... Entity features (medium number): follow-through-rate (FTR), unfollow-through-rate (UTR), Feed click-through-rate (CTR), impression counts, interaction counts, number of posts, language(s), ... Pair/interaction features (large number): viewer-entity engagement, segment-entity engagement and follows, graph-based features, Browsemap scores of entities already followed by the viewer (blog link), embedding features, and many more.
  15. Joining Features. Viewer features: millions of distinct active members. Entity features: millions of recommendable entities (members, companies, hashtags, newsletters). Pair/interaction features: trillions of possible (viewer, entity) pairs, reduced to 100s of billions by candidate selection, which manages the explosive growth of members and entities. How can we join all of these features together with acceptable performance?
  16. 1st Option: 3-Way Spark Join. The viewer features (GBs), pair/interaction features (TBs), and entity features (GBs) tables are each partitioned; the first .join() is a hash join on the viewerId key (100s of TB of shuffle).
  17. 1st Option: 3-Way Spark Join (continued). The second .join() is a hash join on the entityId key (another 100s of TB of shuffle, very skewed). Drawbacks: two gigantic shuffles, poor runtime performance, and problematic skew.
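In Spark terms, the naive option is roughly pairFeatures.join(viewerFeatures).join(entityFeatures), where each .join() reshuffles the enormous pair table. A pure-Python toy of that data flow (all data and field names are made up for illustration; the comments note what Spark would shuffle):

```python
# Toy stand-in for the naive 3-way join. In Spark, pair_features is TBs,
# so both join steps below become full shuffles of the largest table.
viewer_features = {"v1": {"viewer_ctr": 0.10}, "v2": {"viewer_ctr": 0.20}}
entity_features = {"e1": {"entity_ftr": 0.05}}
pair_features = [("v1", "e1", {"affinity": 0.9}), ("v2", "e1", {"affinity": 0.4})]

# 1st hash join on viewerId (in Spark: shuffles the huge pair table by viewer).
step1 = [(v, e, {**pf, **viewer_features[v]}) for v, e, pf in pair_features]

# 2nd hash join on entityId (another full shuffle, skewed toward popular
# entities, since a few entities appear in a huge fraction of pairs).
joined = [(v, e, {**f, **entity_features[e]}) for v, e, f in step1]
```

The skew problem is visible even in this toy: every pair row references "e1", so in a real shuffle all of that entity's rows land on a single reducer.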
  18. 2nd Option: Partial Scoring with a Linear Model. Each feature table (GBs/TBs) is partially scored first, so the 3-way join is performed on much smaller outputs and becomes manageable. Disadvantages: scoring overhead with intermediate outputs, and the constraint of using a linear model.
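The linear-model constraint exists because a linear score decomposes into independent per-table partial sums that can be computed before any join, so the join only carries one partial score per row instead of wide feature vectors. A sketch of that decomposition (weights and feature names are hypothetical):

```python
def dot(weights: dict, features: dict) -> float:
    """Linear score: weighted sum over the features present in this table."""
    return sum(weights[k] * x for k, x in features.items())

w = {"viewer_ctr": 1.5, "entity_ftr": 2.0, "affinity": 3.0}  # hypothetical model

# Partial scores computed independently on each table, BEFORE joining...
viewer_partial = {"v1": dot(w, {"viewer_ctr": 0.10})}
entity_partial = {"e1": dot(w, {"entity_ftr": 0.05})}
pair_partial = {("v1", "e1"): dot(w, {"affinity": 0.9})}

# ...so the 3-way join only needs to sum three small numbers per row.
score = viewer_partial["v1"] + entity_partial["e1"] + pair_partial[("v1", "e1")]

# A non-linear model (e.g. XGBoost) cannot be split this way, because it
# interacts features across the three tables inside the model itself.
```

This is exactly why moving to XGBoost required eliminating the partial-scoring stage: the full joined feature vector must reach the model intact.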
  19. Scalability Improvements with 2D Hash-Partitioned Join
  20. Goal: avoid huge shuffles. Bottleneck: the large, wide table of pair features plus a skewed entity distribution. Can we join the features together without shuffling the pair features table?
  21. 2D Hash-Partitioned Join: partitioning the 3 feature tables (blog link). Hash-partition the viewer features table into V partitions. Hash-partition the entity features table into E partitions. Partition the pair features table into V * E partitions, using a 2-dimensional custom partition function that allows joining on two keys (member, entity). Choose V and E so that every member and entity partition can be loaded into memory (depending on data size and executor memory).
  22. 2D Hash-Partitioned Join: smart partitioning of the pair features table. With h a custom positive hash function, for a pair (viewer v, entity e): viewer table partition number = h(v) % V; entity table partition number = h(e) % E; pair table partition number = (h(v) % V) * E + (h(e) % E). For each pair partition P, we always have a single corresponding viewer partition, P / E (integer division), and entity partition, P % E. Example: with V = 50, E = 100, h(x) = abs(x), and P = 120: entity table partition = 120 % 100 = 20; viewer table partition = 120 / 100 = 1.
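The partition arithmetic above can be written out directly. A minimal sketch using the slide's example values (V = 50, E = 100, h = abs; function names are mine, not from the talk):

```python
V, E = 50, 100  # number of viewer / entity partitions (slide's example values)

def h(x: int) -> int:
    return abs(x)  # stand-in for the custom positive hash function

def viewer_partition(v: int) -> int:
    return h(v) % V

def entity_partition(e: int) -> int:
    return h(e) % E

def pair_partition(v: int, e: int) -> int:
    # 2-D partition number: encodes both the viewer bucket and the entity bucket.
    return (h(v) % V) * E + (h(e) % E)

# Any pair partition P maps back to exactly one partition of each 1-D table:
P = pair_partition(1, 120)   # (1 % 50) * 100 + (120 % 100) = 120
viewer_side = P // E         # the slide's "P / E" (integer division)
entity_side = P % E
```

Because `P // E` and `P % E` invert the encoding, a mapper holding one pair partition knows exactly which single viewer partition and single entity partition it needs, with no shuffle.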
  23. 2D Hash-Partitioned Join: join algorithm (.mapPartitions()). 1 - Launch a mapper for each pair partition. 2.1 - Load the corresponding entity partition as an in-memory hashmap. 2.2 - Load the corresponding viewer partition (presorted by viewer id) into a stream reader. 3 - For each pair features record, look up the entity features record by entity id, and the viewer features record from the stream reader. 4 - Merge the three feature sets into a joined record. 5 - Features can be scored right away before storing to HDFS!
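A pure-Python sketch of steps 2-4 for a single pair partition (in the real pipeline this body runs inside .mapPartitions() on Spark executors; the data and field names here are toy placeholders). It assumes the pair partition is also sorted by viewer id, so the presorted viewer stream is read exactly once:

```python
# Step 2.1 - the corresponding entity partition, loaded as an in-memory hashmap.
entity_part = {"e1": {"entity_ftr": 0.05}, "e2": {"entity_ftr": 0.07}}

# Step 2.2 - the corresponding viewer partition, presorted by viewer id,
# consumed as a stream (never fully materialized).
viewer_part = iter([("v1", {"viewer_ctr": 0.10}), ("v2", {"viewer_ctr": 0.20})])

# The pair partition for this mapper, sorted by viewer id (toy data).
pair_part = [("v1", "e1", {"affinity": 0.9}), ("v2", "e2", {"affinity": 0.4})]

def merge_partition(pair_part, viewer_stream, entity_map):
    cur_v, cur_vf = next(viewer_stream)
    out = []
    for v, e, pf in pair_part:
        while cur_v != v:                 # step 3: advance stream to this viewer
            cur_v, cur_vf = next(viewer_stream)
        # step 3/4: hashmap lookup for entity features, merge three feature sets.
        out.append((v, e, {**pf, **cur_vf, **entity_map[e]}))
    return out  # step 5: in production, rows can be scored here before HDFS write

joined = merge_partition(pair_part, viewer_part, entity_part)
```

The key property: the huge pair table is never shuffled; only the two small 1-D tables are partitioned and shipped to the mappers that need them.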
  24-27. New Offline Scoring Pipeline (before: partial scoring with TBs/GBs of intermediate outputs; after: a single direct join-and-score job). No shuffle of the pair features table during the join. No intermediate data stored in HDFS (single Spark job). Ability to score using a non-linear model that interacts features (XGBoost).
  28. Gains after adopting the 2D Hash-Partitioned Join. Offline scoring runtime performance: cost-to-serve of the offline scoring pipeline reduced by 5X (in GB-hours); HDFS storage of intermediate outputs reduced by 8X. Relevance: enabled the transition from linear models (logistic and linear regression) to a non-linear model (XGBoost); total follows up by 17%; engagement up by 11%.
  29. Thank you!
  30. Contacts: https://www.linkedin.com/in/emilie-de-longueau/ and https://www.linkedin.com/in/aalqawasmeh/. Credits: the LinkedIn Hadoop team, in particular Fangshi Li, for implementing the algorithm and helping with its adoption in Follows Relevance. Related blogs: LinkedIn Engineering - Communities AI: Building Communities Around Interests; LinkedIn Engineering - Managing Exploding Big Data.
