The "Big Data" Ecosystem at LinkedIn
 

[This work was presented at SIGMOD'13.]

The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.


    Presentation Transcript

    • The "Big Data" Ecosystem at LinkedIn SIGMOD 2013 Roshan Sumbaly, Jay Kreps, & Sam Shah June 2013
    • 2 LinkedIn: the professional profile of record – 225M members, 225M member profiles (©2012 LinkedIn Corporation. All Rights Reserved.)
    • 3 Applications
    • 4 Application examples  People You May Know (2 people)  Year In Review Email (1 person, 1 month)  Skills and Endorsements (2 people)  Network Updates Digest (1 person, 3 months)  Who's Viewed My Profile (2 people)  Collaborative Filtering (1 person)  Related Searches (1 person, 3 months)  and more…
    • 5 Skill sets
    • 6 Rich Hadoop-based ecosystem
    • 7 "Last mile" problems  Ingress – Moving data from online to offline system  Workflow management – Managing offline processes  Egress – Moving results from offline to online systems  Key/Value  Streams  OLAP
    • 8 Application examples  People You May Know (2 people)  Year In Review Email (1 person, 1 month)  Skills and Endorsements (2 people)  Network Updates Digest (1 person, 3 months)  Who's Viewed My Profile (2 people)  Collaborative Filtering (1 person)  Related Searches (1 person, 3 months)  and more…
    • 9 People You May Know
    • 10 People You May Know – Workflow: (1) perform triangle closing for all members – e.g., Ethan–Jacob and Jacob–William are connected, so triangle closing suggests connecting Ethan and William; (2) rank by discounting previously shown recommendations; (3) push recommendations to online service (inputs: connection stream, impression stream)
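The two core steps of this workflow can be sketched in pure Python. This is a toy in-memory version of the idea only; the production pipeline runs as Hadoop jobs over the connection and impression streams, and all names here are illustrative:

```python
from collections import defaultdict

def triangle_close(connections):
    """For each member, score friends-of-friends by the number of
    common connections (the 'triangle closing' step)."""
    adj = defaultdict(set)
    for a, b in connections:          # undirected connection pairs
        adj[a].add(b)
        adj[b].add(a)
    scores = defaultdict(lambda: defaultdict(int))
    for member, friends in adj.items():
        for friend in friends:
            for fof in adj[friend]:
                if fof != member and fof not in friends:
                    scores[member][fof] += 1   # one shared connection
    return scores

def rank(scores, impressions, discount=0.5):
    """Rank candidates, discounting ones already shown to the member
    (the impression stream)."""
    ranked = {}
    for member, cands in scores.items():
        shown = impressions.get(member, set())
        adjusted = {c: s * (discount if c in shown else 1.0)
                    for c, s in cands.items()}
        ranked[member] = sorted(adjusted, key=adjusted.get, reverse=True)
    return ranked

# Ethan-Jacob and Jacob-William are connected, so Ethan and William
# close a triangle through Jacob.
conns = [("Ethan", "Jacob"), ("Jacob", "William")]
recs = rank(triangle_close(conns), impressions={})
print(recs["Ethan"])   # ['William']
```

The discount step is what keeps the feature from showing the same candidates over and over: a candidate already impressed on the member keeps a reduced score rather than being dropped outright.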
    • 11 "Last mile" problems  Ingress – Moving data from online to offline system  Workflow management – Managing offline processes  Egress – Moving results from offline to online systems  Key/Value  Streams  OLAP
    • 12 Ingress – O(n²) data integration complexity  Point to point  Fragile, delayed and potentially lossy  Non-standardized
    • 13 Ingress – O(n) data integration
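The complexity claim on these two slides is just counting pipelines: with point-to-point integration every ordered pair of systems needs its own feed, while a central pipeline (Kafka) only needs one producer and one consumer link per system. A quick arithmetic sketch:

```python
def point_to_point(n):
    # every ordered pair of systems needs its own bespoke pipeline
    return n * (n - 1)

def central_bus(n):
    # each system only produces to and consumes from the central bus
    return 2 * n

for n in (10, 50):
    print(n, point_to_point(n), central_bus(n))
# 10 systems: 90 pipelines vs 20; 50 systems: 2450 vs 100
```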
    • 14 Ingress – Kafka  Distributed and elastic – Multi-broker system  Categorized topics – “PeopleYouMayKnowTopic” – “ConnectionUpdateTopic”
    • 15 Ingress  Standardized schemas – Avro – Central repository – Programmatic compatibility  Audited  ETL to Hadoop People you may know service Kafka brokers (dev) Kafka brokers Hadoop PeopleYouMayKnowTopic
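Kafka's categorized topics are what decouple the online producers from the Hadoop ETL consumer. The following is a minimal in-memory stand-in for that topic model, not the real Kafka API; only the topic names come from the slides, everything else is illustrative:

```python
from collections import defaultdict, deque

class Broker:
    """Toy single-node stand-in for a Kafka broker:
    an append-only log per named topic."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, max_messages=100):
        log = self.topics[topic]
        batch = []
        while log and len(batch) < max_messages:
            batch.append(log.popleft())
        return batch

broker = Broker()
# Online services emit events, categorized by topic.
broker.produce("PeopleYouMayKnowTopic", {"viewer": 1213, "candidate": 1734})
broker.produce("ConnectionUpdateTopic", {"source": 1213, "dest": 6332})

# The Hadoop ETL consumes only the topics it cares about.
print(broker.consume("ConnectionUpdateTopic"))
```

Producers never know who reads a topic, which is exactly what collapses the O(n²) wiring problem of the previous slides into O(n).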
    • 16 "Last mile" problems  Ingress – Moving data from online to offline system  Workflow management – Managing offline processes  Egress – Moving results from offline to online systems  Key/Value  Streams  OLAP
    • 17 People You May Know – Workflow: perform triangle closing for all members → rank by discounting previously shown recommendations → push recommendations to online service (inputs: connection stream, impression stream)
    • 18 People You May Know – Workflow (in reality)
    • 19 Workflow Management - Azkaban  Dependency management – Historical logs  Diverse job types – Pig, Hive, Java  Scheduling  Monitoring  Visualization  Configuration  Retry/restart on failure  Resource locking
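Azkaban expresses a workflow as one small property file per job, with dependencies declared explicitly so the scheduler can build the DAG. A hypothetical two-job fragment for the People You May Know flow might look like this (file names, script names, and the exact property keys are illustrative, not LinkedIn's production config):

```
# triangle-closing.job
type=pig
pig.script=triangle_closing.pig

# push-recommendations.job
# runs only after triangle-closing succeeds
type=pig
pig.script=push_to_voldemort.pig
dependencies=triangle-closing
```

Because each job declares what it depends on, Azkaban can retry or restart a failed node without rerunning the whole workflow, which is what makes a 50+ job pipeline like the real PYMK flow operable.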
    • 20 People You May Know – Workflow: perform triangle closing for all members → rank by discounting previously shown recommendations → push recommendations to online service (inputs: connection stream, impression stream). Example output: Member Id 1213 => [Recommended member id 1734, Recommended member id 1523, … Recommended member id 6332]
    • 21 "Last mile" problems  Ingress – Moving data from online to offline system  Workflow management – Managing offline processes  Egress – Moving results from offline to online systems  Key/Value  Streams  OLAP
    • 22 Egress – Key/Value  Voldemort – Based on Amazon's Dynamo  Distributed and elastic  Horizontally scalable  Bulk load pipeline from Hadoop  Simple to use: store results into 'url' using KeyValue('member_id')  [Diagram: Hadoop → (batch load) → Voldemort → getRecommendations(member id) → People You May Know service]
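The bulk-load pipeline works because both the offline build and the online lookup agree on how keys map to nodes: Hadoop can pre-build one store partition per node, and the service routes each get to the owning node. A toy sketch of that routing (Voldemort actually uses Dynamo-style consistent hashing over partitions; a plain mod-hash is shown for brevity, and all names here are illustrative):

```python
import hashlib

NODES = ["voldemort-0", "voldemort-1", "voldemort-2"]

def node_for(key, nodes=NODES):
    """Route a key to a node by hashing its bytes."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Offline: a Hadoop job would pre-build one read-only store file per node.
recommendations = {1213: [1734, 1523, 6332], 4451: [98, 77]}
store = {n: {} for n in NODES}
for member_id, recs in recommendations.items():
    store[node_for(member_id)][member_id] = recs

# Online: a lookup only touches the node that owns the key.
def get_recommendations(member_id):
    return store[node_for(member_id)].get(member_id, [])

print(get_recommendations(1213))  # [1734, 1523, 6332]
```

Since the stores are built offline and swapped in atomically, the online cluster never takes write load from the batch pipeline, which is what makes the "1-line Pig command" deployment safe.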
    • 23 People You May Know - Summary People you may know service Kafka brokers (mirror) Kafka brokers Hadoop PeopleYouMayKnowTopic Voldemort Front end
    • 24 Application examples  People You May Know (2 people)  Year In Review Email (1 person, 1 month)  Skills and Endorsements (2 people)  Network Updates Digest (1 person, 3 months)  Who's Viewed My Profile (2 people)  Collaborative Filtering (1 person)  Related Searches (1 person, 3 months)  and more…
    • 25 Year In Review Email
    • 26 Year In Review Email

    memberPosition = LOAD '$latest_positions' USING BinaryJSON;

    memberWithPositionsChangedLastYear = FOREACH (
        FILTER memberPosition BY ((start_date >= $start_date_low) AND (start_date <= $start_date_high))
    ) GENERATE member_id, start_date, end_date;

    allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

    allConnectionsWithChange_nondistinct = FOREACH (
        JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest
    ) GENERATE allConnections::source AS source, allConnections::dest AS dest;

    allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

    memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;

    pictures = FOREACH (
        FILTER memberinfowpics BY ((cropped_picture_id is not null) AND
            ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
    ) GENERATE member_id, cropped_picture_id, first_name as dest_first_name, last_name as dest_last_name;

    resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;

    connectionsWithChangeWithPic = FOREACH resultPic GENERATE
        allConnectionsWithChange::source AS source_id,
        allConnectionsWithChange::dest AS member_id,
        pictures::cropped_picture_id AS pic_id,
        pictures::dest_first_name AS dest_first_name,
        pictures::dest_last_name AS dest_last_name;

    joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;

    withName = FOREACH joinResult GENERATE
        connectionsWithChangeWithPic::source_id AS source_id,
        connectionsWithChangeWithPic::member_id AS member_id,
        connectionsWithChangeWithPic::dest_first_name as first_name,
        connectionsWithChangeWithPic::dest_last_name as last_name,
        connectionsWithChangeWithPic::pic_id AS pic_id,
        memberinfowpics::first_name AS firstName,
        memberinfowpics::last_name AS lastName,
        memberinfowpics::gmt_offset as gmt_offset,
        memberinfowpics::email_locale as email_locale,
        memberinfowpics::email_address as email_address;

    resultGroup = GROUP withName BY (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

    -- Get the count of results per recipient
    resultGroupCount = FOREACH resultGroup GENERATE group, withName as toomany, COUNT_STAR(withName) as num_results;

    resultGroupPre = FILTER resultGroupCount BY num_results > 2;

    resultGroup = FOREACH resultGroupPre {
        withName = LIMIT toomany 64;
        GENERATE group, withName, num_results;
    }

    x_in_review_pre_out = FOREACH resultGroup GENERATE
        FLATTEN(group) as (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
        withName.(member_id, pic_id, first_name, last_name) as jobChanger,
        '2013' as changeYear:chararray,
        num_results as num_results;

    x_in_review = FOREACH x_in_review_pre_out GENERATE
        source_id as recipientID,
        gmt_offset as gmtOffset,
        firstName as first_name,
        lastName as last_name,
        email_address,
        email_locale,
        TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) as body;

    rmf $xir;

    STORE x_in_review INTO '$url' USING Kafka();
    • 27 Year In Review Email – Workflow: find users that have changed jobs → join with connections and metadata (pictures) → group by connections of these users → push content to email service
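The dense Pig script on the previous slide boils down to these four relational steps. A toy pure-Python rendering of the same logic (illustrative field names and thresholds, not the production schema; the real job also joins in pictures, locales, and email metadata):

```python
def year_in_review(positions, connections, year=(20120101, 20121231),
                   max_per_email=64):
    """positions: {member_id: job start_date as yyyymmdd int};
    connections: undirected (a, b) member-id pairs."""
    lo, hi = year
    # Step 1: find users whose job start date falls in the window.
    changed = {m for m, start in positions.items() if lo <= start <= hi}
    # Steps 2-3: for each connection edge, credit the job changer
    # to the digest of the member on the other end.
    digests = {}
    for a, b in connections:
        for recipient, changer in ((a, b), (b, a)):
            if changer in changed:
                digests.setdefault(recipient, set()).add(changer)
    # Step 4: keep only digests worth sending (num_results > 2 in the
    # Pig) and cap them, like the LIMIT 64 inside the nested FOREACH.
    return {r: sorted(c)[:max_per_email]
            for r, c in digests.items() if len(c) > 2}

positions = {1: 20120301, 2: 20120815, 3: 20121120, 4: 20090101}
connections = [(4, 1), (4, 2), (4, 3), (1, 2)]
print(year_in_review(positions, connections))  # {4: [1, 2, 3]}
```

Member 4 has three connections who changed jobs in the window, so only member 4 clears the "more than two results" bar and gets a digest.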
    • 28 "Last mile" problems  Ingress – Moving data from online to offline system  Workflow management – Managing offline processes  Egress – Moving results from offline to online systems  Key/Value  Streams  OLAP
    • 29 Egress – Streams  Service acts as consumer  "EmailContentTopic"  store emails into 'url' using Stream("topic=x")  [Diagram: Hadoop → Kafka brokers → Kafka brokers (mirror) → Email service; topics: EmailContentTopic, EmailSentTopic]
    • 30 Conclusion  Hadoop: simple programmatic model, rich developer ecosystem  Primitives for – Ingress: structured, complete data available; automatically handles data evolution – Workflow management: run and operate production processes – Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning  Empowers data scientists to focus on new product ideas, not infrastructure
    • Future work: models of computation – graphs, distributed learning, near-line processing  Alternating Direction Method of Multipliers (ADMM)  Distributed Conjugate Gradient Descent (DCGD)  Distributed L-BFGS  Bayesian Distributed Learning (BDL)
    • 32 data.linkedin.com