[This work was presented at SIGMOD'13.]
The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem: easy ingress from and egress to online systems, and the management of workflows as production processes. A key characteristic of our solution is that these distributed-system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digests to descriptive analytical dashboards for our members.
People You May Know – Workflow
Perform triangle closing for all members (input: connection stream)
Rank by discounting previously shown recommendations (input: impression stream)
Push recommendations to online service
[Diagram: Ethan is connected to Jacob, and Jacob is connected to William, so triangle closing suggests connecting Ethan and William.]
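A minimal Pig sketch of the first two steps, assuming hypothetical inputs connections(source, dest) from the connection stream and impressions(source, rec) from the impression stream; all alias and parameter names here are invented for illustration:

-- Sketch only: triangle closing via a self-join of the connection graph.
connections = LOAD '$connections' USING BinaryJSON;   -- (source, dest), hypothetical
impressions = LOAD '$impressions' USING BinaryJSON;   -- (source, rec), hypothetical
second_hop = FOREACH connections GENERATE source, dest;
pairs = JOIN connections BY dest, second_hop BY source;
candidates_raw = FOREACH pairs GENERATE
    connections::source AS source, second_hop::dest AS rec;
candidates = FILTER candidates_raw BY source != rec;
-- Score each candidate pair by its number of common connections.
grouped = GROUP candidates BY (source, rec);
scored = FOREACH grouped GENERATE FLATTEN(group) AS (source, rec),
    COUNT_STAR(candidates) AS common_connections;
-- Discount candidates that were already shown to the member.
with_impressions = JOIN scored BY (source, rec) LEFT OUTER,
    impressions BY (source, rec);
ranked = FOREACH with_impressions GENERATE
    scored::source AS source, scored::rec AS rec,
    (impressions::source is null
        ? (double) scored::common_connections
        : 0.5 * scored::common_connections) AS score;
-- A real implementation would also drop pairs that are already connected.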
Output: a ranked recommendation list keyed by member id, e.g.
Member id 1213 => [ Recommended member id 1734,
                    Recommended member id 1523,
                    …
                    Recommended member id 6332 ]
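In Pig, that per-member list is just a grouped bag (a sketch continuing from the hypothetical ranked relation above):

grouped_recs = GROUP ranked BY source;
results = FOREACH grouped_recs {
    -- order each member's recommendations by score, best first
    ordered = ORDER ranked BY score DESC;
    GENERATE group AS member_id, ordered.rec AS recommendations;
}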
Egress – Key/Value
Voldemort
– Based on Amazon's Dynamo
– Distributed and elastic
– Horizontally scalable
– Bulk load pipeline from Hadoop
– Simple to use:
store results into 'url' using KeyValue('member_id')
[Diagram: Hadoop batch-loads computed results into Voldemort; the People You May Know service reads them with getRecommendations(member id).]
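Put together, egress is the one-liner above at the end of the script; with the output parameter spelled out it might read (the '$voldemort_url' parameter is a placeholder):

STORE results INTO '$voldemort_url' USING KeyValue('member_id');

The named field becomes the Voldemort key; because the store is built offline on Hadoop and bulk-loaded, serving traffic never waits on the batch job.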
People You May Know – Summary
[Diagram: end-to-end flow among the front end, Kafka brokers and their Hadoop-side mirror (PeopleYouMayKnowTopic), Hadoop, Voldemort, and the People You May Know service.]
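On the ingress side, mirrored Kafka topics simply land in HDFS, so a script loads them like any other input (a sketch; the parameter name is a placeholder):

events = LOAD '$mirrored_topic_data' USING BinaryJSON;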
Application examples (parentheses show team size and time to build)
People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who's Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…
Year In Review Email

-- Find members who started a new position in the past year.
memberPosition = LOAD '$latest_positions' USING BinaryJSON;
memberWithPositionsChangedLastYear = FOREACH (
    FILTER memberPosition BY ((start_date >= $start_date_low) AND
                              (start_date <= $start_date_high))
) GENERATE member_id, start_date, end_date;

-- Join job changers against the connection graph to find who should
-- hear about each change.
allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;
allConnectionsWithChange_nondistinct = FOREACH (
    JOIN memberWithPositionsChangedLastYear BY member_id,
         allConnections BY dest
) GENERATE allConnections::source AS source,
           allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

-- Attach profile pictures, honoring member privacy settings.
memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
    FILTER memberinfowpics BY ((cropped_picture_id is not null) AND
                               ((member_picture_privacy == 'N') OR
                                (member_picture_privacy == 'E')))
) GENERATE member_id, cropped_picture_id,
           first_name AS dest_first_name, last_name AS dest_last_name;
resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

-- Attach the recipient's own name, locale, and email address.
joinResult = JOIN connectionsWithChangeWithPic BY source_id,
                  memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name AS first_name,
    connectionsWithChangeWithPic::dest_last_name AS last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName,
    memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset AS gmt_offset,
    memberinfowpics::email_locale AS email_locale,
    memberinfowpics::email_address AS email_address;

-- Group job changers by recipient and get the count of results per recipient.
resultGroup = GROUP withName BY (source_id, firstName, lastName,
                                 email_address, email_locale, gmt_offset);
resultGroupCount = FOREACH resultGroup GENERATE group,
    withName AS toomany, COUNT_STAR(withName) AS num_results;

-- Only email recipients with more than 2 job changes in their network,
-- and cap each email at 64 entries.
resultGroupPre = FILTER resultGroupCount BY num_results > 2;
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

-- Flatten into one record per recipient, with the email body as a tuple.
x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) AS (source_id, firstName, lastName,
                       email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
    '2013' AS changeYear:chararray,
    num_results AS num_results;
x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id AS recipientID, gmt_offset AS gmtOffset,
    firstName AS first_name, lastName AS last_name,
    email_address, email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName,
            num_results, jobChanger) AS body;

-- Remove any previous output, then publish to the email system via Kafka.
rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();
Year In Review Email – Workflow
Find users that have changed jobs
Join with connections and metadata (pictures)
Group by connections of these users
Push content to email service
Egress – Streams
Service acts as consumer of the "EmailContentTopic"
store emails into 'url' using Stream('topic=x')
[Diagram: Hadoop and the email service are connected through Kafka brokers and a mirrored broker cluster; the EmailContentTopic carries generated email content from Hadoop to the service, and the EmailSentTopic flows over the same infrastructure.]
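Concretely, the final STORE of the Year In Review script publishes each record as a message on the email topic; spelled out with the stream storage function shown above, it might read (the topic wiring is an assumption based on the one-liner above):

store x_in_review into '$url' using Stream('topic=EmailContentTopic');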
Conclusion
Hadoop: simple programming model, rich developer ecosystem
Primitives for
– Ingress: structured, complete data available; automatically handles data evolution
– Workflow management: run and operate production processes
– Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning
Empowers data scientists to focus on new product ideas, not infrastructure