The "Big Data" Ecosystem at LinkedIn

[This work was presented at SIGMOD'13.]

The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.


  1. The "Big Data" Ecosystem at LinkedIn. Roshan Sumbaly, Jay Kreps, and Sam Shah. SIGMOD 2013, June 2013.
  2. LinkedIn: the professional profile of record. 225M members, 225M member profiles.
  3. Applications
  4. Application examples
     - People You May Know (2 people)
     - Year In Review Email (1 person, 1 month)
     - Skills and Endorsements (2 people)
     - Network Updates Digest (1 person, 3 months)
     - Who's Viewed My Profile (2 people)
     - Collaborative Filtering (1 person)
     - Related Searches (1 person, 3 months)
     - and more...
  5. Skill sets
  6. Rich Hadoop-based ecosystem
  7. "Last mile" problems
     - Ingress: moving data from online to offline systems
     - Workflow management: managing offline processes
     - Egress: moving results from offline to online systems (key/value, streams, OLAP)
  8. Application examples (recap of slide 4)
  9. People You May Know
  10. People You May Know: workflow
      - Inputs: connection stream, impression stream
      - Perform triangle closing for all members (e.g., Ethan is connected to Jacob and Jacob is connected to William, so recommend William to Ethan); see the Pig sketch below
      - Rank candidates by discounting previously shown recommendations
      - Push recommendations to the online service
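The triangle-closing step maps naturally onto a Pig self-join. Below is a minimal sketch, assuming a bidirectional connection relation of (source, dest) pairs like the '$latest_bidirectional_connections' input in the Year In Review script later in the deck; the aliases and the simple common-connection score are illustrative, not LinkedIn's actual job.

    -- Illustrative triangle closing: recommend friends-of-friends,
    -- scored by the number of shared connections.
    conns  = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
    conns2 = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

    -- Two-hop paths: source -> shared connection -> candidate.
    twoHop = FOREACH (JOIN conns BY dest, conns2 BY source)
             GENERATE conns::source AS source, conns2::dest AS candidate;
    twoHopClean = FILTER twoHop BY source != candidate;

    -- Score each (source, candidate) pair by its number of common connections.
    scored = FOREACH (GROUP twoHopClean BY (source, candidate))
             GENERATE FLATTEN(group) AS (source, candidate),
                      COUNT(twoHopClean) AS common_connections;

    -- Drop pairs that are already connected (left anti-join).
    joined = JOIN scored BY (source, candidate) LEFT OUTER, conns BY (source, dest);
    pymk   = FILTER joined BY conns::source IS NULL;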
  11. "Last mile" problems (agenda recap, introducing Ingress)
  12. Ingress: O(n²) data integration complexity
      - Point-to-point pipelines
      - Fragile, delayed, and potentially lossy
      - Non-standardized
  13. Ingress: O(n) data integration, with every producer and consumer connecting once to a central pipeline instead of to every other system
  14. Ingress: Kafka
      - Distributed and elastic: multi-broker system
      - Categorized topics, e.g. "PeopleYouMayKnowTopic", "ConnectionUpdateTopic"
  15. Ingress
      - Standardized schemas: Avro, a central repository, programmatic compatibility checks
      - Audited
      - ETL to Hadoop (see the load sketch below)
      [Diagram: the People You May Know service publishes PeopleYouMayKnowTopic to the Kafka brokers, which are replicated to a second Kafka cluster that feeds Hadoop.]
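Once a topic has been ETL'd into HDFS, it is ordinary Avro data that any Pig job can read. A hypothetical sketch; the path layout and jar registration are illustrative (extra Avro dependency jars are typically needed too), not LinkedIn's exact setup.

    -- Count one day's worth of events from an ETL'd Kafka topic.
    REGISTER piggybank.jar;
    DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

    events = LOAD '/data/PeopleYouMayKnowTopic/2013/06/01' USING AvroStorage();
    counts = FOREACH (GROUP events ALL) GENERATE COUNT(events) AS num_events;
    DUMP counts;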
  16. "Last mile" problems (agenda recap, introducing Workflow management)
  17. People You May Know: workflow (recap)
      - Perform triangle closing for all members
      - Rank by discounting previously shown recommendations
      - Push recommendations to the online service
      - Inputs: connection stream, impression stream
  18. People You May Know: workflow (in reality)
      [Diagram: the actual production workflow, a much larger DAG of jobs.]
  19. Workflow management: Azkaban
      - Dependency management, with historical logs (see the job-file sketch below)
      - Diverse job types: Pig, Hive, Java
      - Scheduling
      - Monitoring
      - Visualization
      - Configuration
      - Retry/restart on failure
      - Resource locking
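Azkaban wires jobs together through per-job property files. A hypothetical pair of job files for two People You May Know steps, assuming the Pig job type from the Azkaban plugins and its pig.script parameter; the file and script names are made up for illustration.

    # triangle-closing.job
    type=pig
    pig.script=triangle_closing.pig

    # push-recommendations.job
    # Runs only after triangle-closing succeeds.
    type=pig
    pig.script=push_recommendations.pig
    dependencies=triangle-closing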
  20. People You May Know: workflow output
      - Same workflow as above: triangle closing, ranking with impression discounting, push to the online service
      - The final push produces entries of the form:
        Member id 1213 => [recommended member id 1734, recommended member id 1523, ..., recommended member id 6332]
  21. "Last mile" problems (agenda recap, introducing Egress)
  22. Egress: key/value
      - Voldemort, based on Amazon's Dynamo
      - Distributed and elastic
      - Horizontally scalable
      - Bulk load pipeline from Hadoop
      - Simple to use (expanded in the sketch below):
        store results into 'url' using KeyValue('member_id')
      [Diagram: Hadoop batch-loads into Voldemort; the People You May Know service calls getRecommendations(member id).]
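Put together with the workflow above, the abstract's "simply a 1-line Pig command" claim looks like this sketch of a script's tail; the input relation and the '$url' placeholder are illustrative.

    -- Hypothetical tail of the People You May Know script: build the
    -- Voldemort store offline and push it to the online cluster.
    recs = LOAD '$pymk_output' USING BinaryJSON;  -- (member_id, recommendations)
    STORE recs INTO '$url' USING KeyValue('member_id');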
  23. People You May Know: summary
      [Diagram: the front end and the People You May Know service publish PeopleYouMayKnowTopic to the Kafka brokers, mirrored into a second Kafka cluster that feeds Hadoop; Hadoop bulk-loads results into Voldemort, which serves the People You May Know service.]
  24. Application examples (recap of slide 4)
  25. Year In Review Email
  26. Year In Review Email: the entire Pig script

      -- Members who changed positions in the target window
      memberPosition = LOAD '$latest_positions' USING BinaryJSON;
      memberWithPositionsChangedLastYear = FOREACH (
          FILTER memberPosition BY ((start_date >= $start_date_low) AND (start_date <= $start_date_high))
      ) GENERATE member_id, start_date, end_date;

      -- Their connections: source receives the email, dest changed jobs
      allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
      allConnectionsWithChange_nondistinct = FOREACH (
          JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest
      ) GENERATE allConnections::source AS source, allConnections::dest AS dest;
      allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

      -- Profile pictures whose privacy settings allow display
      memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
      pictures = FOREACH (
          FILTER memberinfowpics BY ((cropped_picture_id IS NOT NULL) AND
              ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
      ) GENERATE member_id, cropped_picture_id,
                 first_name AS dest_first_name, last_name AS dest_last_name;

      resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
      connectionsWithChangeWithPic = FOREACH resultPic GENERATE
          allConnectionsWithChange::source AS source_id,
          allConnectionsWithChange::dest AS member_id,
          pictures::cropped_picture_id AS pic_id,
          pictures::dest_first_name AS dest_first_name,
          pictures::dest_last_name AS dest_last_name;

      -- Attach the recipient's name, locale, and email address
      joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;
      withName = FOREACH joinResult GENERATE
          connectionsWithChangeWithPic::source_id AS source_id,
          connectionsWithChangeWithPic::member_id AS member_id,
          connectionsWithChangeWithPic::dest_first_name AS first_name,
          connectionsWithChangeWithPic::dest_last_name AS last_name,
          connectionsWithChangeWithPic::pic_id AS pic_id,
          memberinfowpics::first_name AS firstName,
          memberinfowpics::last_name AS lastName,
          memberinfowpics::gmt_offset AS gmt_offset,
          memberinfowpics::email_locale AS email_locale,
          memberinfowpics::email_address AS email_address;

      -- Group job changers by recipient; get the count of results per recipient
      resultGroup = GROUP withName BY (source_id, firstName, lastName,
                                       email_address, email_locale, gmt_offset);
      resultGroupCount = FOREACH resultGroup GENERATE
          group, withName AS toomany, COUNT_STAR(withName) AS num_results;

      -- Keep recipients with more than 2 job changers, capped at 64
      resultGroupPre = FILTER resultGroupCount BY num_results > 2;
      resultGroup = FOREACH resultGroupPre {
          withName = LIMIT toomany 64;
          GENERATE group, withName, num_results;
      }

      x_in_review_pre_out = FOREACH resultGroup GENERATE
          FLATTEN(group) AS (source_id, firstName, lastName,
                             email_address, email_locale, gmt_offset),
          withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
          '2013' AS changeYear:chararray,
          num_results AS num_results;

      -- Assemble the email body and send it via Kafka
      x_in_review = FOREACH x_in_review_pre_out GENERATE
          source_id AS recipientID,
          gmt_offset AS gmtOffset,
          firstName AS first_name,
          lastName AS last_name,
          email_address, email_locale,
          TOTUPLE(changeYear, source_id, firstName, lastName,
                  num_results, jobChanger) AS body;

      rmf $xir;
      STORE x_in_review INTO '$url' USING Kafka();
  27. Year In Review Email: workflow
      - Find users that have changed jobs
      - Join with connections and metadata (pictures)
      - Group by connections of these users
      - Push content to the email service
  28. "Last mile" problems (agenda recap, introducing Streams)
  29. Egress: streams
      - The consuming service acts as a Kafka consumer of a topic, e.g. "EmailContentTopic" (see the sketch below):
        store emails into 'url' using Stream("topic=x")
      [Diagram: Hadoop publishes EmailContentTopic to the Kafka brokers, mirrored out to the email service; the email service's EmailSentTopic flows back into Hadoop the same way.]
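The stream egress path is symmetric to ingress: a script's final STORE publishes to a topic that the online service consumes. A minimal sketch mirroring the slide's one-liner, with the input relation and topic name assumed for illustration.

    -- Hypothetical: publish generated email content for the email service.
    emails = LOAD '$email_content' USING BinaryJSON;
    STORE emails INTO '$url' USING Stream('topic=EmailContentTopic');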
  30. Conclusion
      - Hadoop: simple programming model, rich developer ecosystem
      - Primitives for:
        – Ingress: structured, complete data available; data evolution handled automatically
        – Workflow management: run and operate production processes
        – Egress: a 1-line command for exporting data; horizontally scalable, with little need for capacity planning
      - Empowers data scientists to focus on new product ideas, not infrastructure
  31. Future work: models of computation
      - Graphs
      - Distributed learning: Alternating Direction Method of Multipliers (ADMM), Distributed Conjugate Gradient Descent (DCGD), Distributed L-BFGS, Bayesian Distributed Learning (BDL)
      - Near-line processing
  32. data.linkedin.com