4. LinkedIn by the numbers
120,000,000+ users (August 4, 2011)
2+ new user registrations per second
81+ Million monthly unique users* (Comscore)
2 Billion People Searches in 2010
7.1 Billion page views in Q2 2010
2+ Million companies with LinkedIn Company Pages
2+ M LinkedIn Groups
4
20. Analytics Products: Key Ideas
Recommendations
People You May Know, Related Searches, Viewers of
this profile ...
Insight
Profile Stats: Who Viewed My Profile, Skills
Visualization
InMaps
20
21. Analytics Products: Challenges
LinkedIn: largest professional network
120+ million members on LinkedIn
Billions of pageviews
Terabytes of data to process
21
22. Outline
Analytics products at LinkedIn
Deep Dive - Connection Strength
Deep Dive - Related Searches
Fun Facts: Sequencing the Startup DNA
22
23. Connection Strength
Let’s build “Connection Strength”
Systems and Tools we use
Hadoop MapReduce
Managing workflow
Serving data in production
23
24. Connection Strength
How well do people know each other?
Connection Strength
Application: reorder updates to show updates
from close connections
24
27. Connection Strength
How well do
Alice
people know each
other?
Bob Carol
Triangle closing
27
28. Connection Strength
How well do
Alice
people know each
other?
Bob Carol
Triangle closing
Prob(Bob knows Carol) ~ the # of common connections
28
29. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
29
30. Pig Overview
Load: load data, specify format
Store: store data, specify format
Foreach, Generate: Projections, similar to select
Group by: group by column(s)
Join, Filter, Limit, Order, ...
User Defined Functions (UDFs)
30
31. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
31
32. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
32
33. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
33
34. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
34
35. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD `connections` USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
flatten(group) as (source_id, dest_id),
COUNT(pairs) as common_connections;
STORE common_conn INTO `common_conn` USING PigStorage();
35
36. Triangle Closing Example
Alice
Bob Carol
connections = LOAD `connections` USING
1.(A,B),(B,A),(A,C),(C,A) PigStorage();
2.(A,{B,C}),(B,{A}),(C,{A})
3.(A,{B,C}),(A,{C,B})
4.(B,C,1), (C,B,1)
36
37. Triangle Closing Example
Alice
Bob Carol
1.(A,B),(B,A),(A,C),(C,A)
group_conn = GROUP connections BY
2.(A,{B,C}),(B,{A}),(C,{A}) source_id;
3.(A,{B,C}),(A,{C,B})
4.(B,C,1), (C,B,1)
37
38. Triangle Closing Example
Alice
Bob Carol
1.(A,B),(B,A),(A,C),(C,A)
2.(A,{B,C}),(B,{A}),(C,{A})
pairs = FOREACH group_conn GENERATE
3.(A,{B,C}),(A,{C,B}) generatePair(connections.dest_id) as (id1, id2);
4.(B,C,1), (C,B,1)
38
39. Triangle Closing Example
Alice
Bob Carol
1.(A,B),(B,A),(A,C),(C,A)
2.(A,{B,C}),(B,{A}),(C,{A}) common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn
3.(A,{B,C}),(A,{C,B}) GENERATE flatten(group) as (source_id, dest_id),
4.(B,C,1), (C,B,1) COUNT(pairs) as common_connections;
39
48. Related Searches
Let’s build “Related Searches”
Systems and Tools we use
Hadoop MapReduce
Managing workflow
Serving data in production
48
49. Hadoop MapReduce
Large-scale distributed data processing
Map phase: partition the problem in smaller
steps
Reduce phase: aggregate the output of smaller
steps
Allows parallelism and fault-tolerance
49
50. Related Searches
Let’s build “Related Searches”
Systems and Tools we use
Hadoop MapReduce
Managing workflow
Serving data in production
50
51. Our Workflow
Triangle
Age School Overlap
Closing
Union
push-to-prod
51
52. Our Workflow
Triangle
Age School Overlap
Closing
Union
push-to-qa push-to-prod
52
55. Our Workflow
Triangle
Age School Overlap
Closing
Union
push-to-prod
55
56. Our Workflow
Triangle
Age School Overlap
Closing
Union
push-to-prod
56
57. Related Searches
Let’s build “Related Searches”
Systems and Tools we use
Hadoop MapReduce
Managing workflow
Serving data in production
57
58. Voldemort
Large amount of data/Scalable
Quick lookup/low latency
Versioning and Rollback
Fault tolerance through replication
Read only
Offline index building
58
71. Things Covered
Analytics products at LinkedIn
Deep Dive - Connection Strength
Deep Dive - Related Searches
Fun Facts: Sequencing the Startup DNA
71
72. SNA Team
Thanks to SNA Team at LinkedIn
http://sna-projects.com
We are hiring!
72