3. Outline
• About LinkedIn
• Social Recommender Systems at LinkedIn
• Social Graph Analysis
• Virality in Social Recommender Systems
• Scaling Challenges
14. Outline
• Social Recommender Systems at LinkedIn
• LinkedIn Today: Recommend News
• People You May Know and Social Graph Analysis
• Related Searches Recommendation
• Virality in Social Recommender Systems
• Skills Endorsements Suggestions and Social Virality
• Scaling Challenges
15. LinkedIn Today: News Recommendation
• Objective: serve valuable professional news, leading to higher engagement as measured by metrics such as CTR
19. News Recommendations: Revised Algorithm
• Explore/Exploit scheme
• Explore: choose an item at random with a small probability (e.g., 5%)
• Exploit: otherwise (e.g., 95% of the time) choose the item with the highest estimated CTR
• Temporal smoothing: more weight to recent data
• Impression discounting: discount items with repeat views
• Segmented model: segment users in CTR estimation
23. People You May Know
• Accounts for more than 50% of total connections and invitations
• Challenges
• Feature Engineering
• Machine Learning
• Scaling
24. People You May Know: Feature Engineering
• How do people know each other?
[Figure: Alice is connected to both Bob and Carol]
• Triangle closing
• Prob(Bob knows Carol) ~ the # of common connections
28. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage()
    AS (source_id, dest_id);
-- collect each member's connections into one bag
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF (not shown), assumed to return a bag of id pairs
pairs = FOREACH group_conn GENERATE
    FLATTEN(generatePair(connections.dest_id)) AS (id1, id2);
-- each occurrence of (id1, id2) corresponds to one common connection
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
29. People You May Know: Feature Engineering
• Member profiles contain various types of organizations
• Companies, Schools, Groups, ...
• Can we compute edge affinity from this organizational information?
• Useful for many applications:
• Recommending members to connect (link prediction)
• Recommending other entities from the same community (community detection)
30. Organizational Overlap: Feature Engineering
• Insight 1: Connection density increases with organizational time overlap
Hsieh et al., WWW’13
31. Organizational Overlap: Feature Engineering
• Insight 2: Connection density decreases with the size of the organization
33. How does PYMK work?
• Combine features using a Machine Learning model
34. How does diversity affect conversion in PYMK?
• Graph Structural Diversity Study
• Measure the effects of structural diversity in PYMK recommendations
• Conversion: a connection invitation is sent to one of the PYMK recommendations
Huang et al., RecSys RSWeb’13
35. How does diversity affect conversion in PYMK?
• Members in recommendation set mapped to a graph G
• Vertices represent members in the recommendation set
• Edges are the connections between those members on the LinkedIn social graph
• 3 measures of structural diversity
• Number of connected components
• Number of triangles
• Average local node degree
Huang et al., RecSys RSWeb’13
36. Structural Diversity in PYMK
• A connected component
• a maximal subgraph in which every pair of vertices is connected by a path; an isolated vertex is a component on its own
• Number of connected components
• a measure of structural diversity [Ugander et al. 2012]
• A smaller number of components => less structural diversity
• Effect on invitation rate (conversion rate)
• ratio of the number of invitations sent to the size of the recommended set
37. Structural Diversity in PYMK
• Invitation rate increases as the number of components decreases
38. Structural Diversity in PYMK
• Lower structural diversity among the recommendation set results in a higher invitation rate
• Different from Facebook data study [Ugander et al. 2012]
• Use case is slightly different
• Effect of structural diversity on a recommender system depends heavily on the use case
• Don’t generalize structural diversity effects on one recommender system to all
40. Related Searches Recommendation
• Millions of searches every day
• Help users explore and refine their queries
Reda et al., CIKM’12
48. Skill Recommendation
• Predict a skill even if not present in the profile
• Based on likelihood of member having a skill
• Features: company, industry, skills, ...
[Figure: skill recommendation pipeline; labels: Profile, Tokenize, Skills Tagger, Phrases, Skills, Skills Classifier, Profile features, Recommended Skills]
49. Suggested Skill Endorsements
• Binary Classification
• Features
• Company overlap, School overlap, Industrial and functional area similarity, Title similarity, Site interactions, Co-interactions, ...
[Figure: suggestion pipeline; labels: Candidate generation, Classifier, Features (Company, Title, Industry, ...), Suggested Endorsements]
50. Social Recommendation and tagging
Skill Tagging
Skill Recommendation
Suggested Skill Endorsements
54. Scaling Challenges: Related Searches Example
• Kafka: publish-subscribe messaging system
• Hadoop: MapReduce data processing system
• Azkaban: Hadoop workflow management tool
• Voldemort: Key-value store
57. Summary
• Social Recommender Systems at LinkedIn
• LinkedIn Today: Recommend News
• People You May Know and Social Graph Analysis
• Related Searches Recommendation
• Virality in Social Recommender Systems
• Skills Endorsements Suggestions and Social Virality
• Scaling Challenges
59. Acknowledgement
• Thanks to the Data Team at LinkedIn: http://data.linkedin.com
• We are hiring!
• Contact: mtiwari[at]linkedin.com
• Follow: @mitultiwari on Twitter
I am Mitul Tiwari. I work in the Search, Network and Analytics group at LinkedIn and focus on recommendation problems such as People You May Know, related searches, etc.
Here is the outline of the rest of my talk.
First, I will briefly talk about LinkedIn and set some context for recommender systems at LinkedIn.
Then I am going to talk about recommendation systems at LinkedIn,
and also about social graph analysis and virality in social recommender systems, with skills endorsements recommendation as an example.
Finally, I will conclude by addressing the scaling challenges in building large-scale social recommender systems.
LinkedIn is the largest professional network, with more than 259 million members.
And it’s growing fast, with more than 2 new members joining per second.
LinkedIn offers a broad range of product features.
Members can create profiles with their education and employment details.
Members can connect with each other and maintain their professional network on LinkedIn.
Talent solutions help recruiters search for the right candidates.
You can search for jobs on LinkedIn.
Companies can create pages, and members can follow companies.
How do recommender systems fit into LinkedIn’s ecosystem?
LinkedIn’s homepage is powered by recommendation engines: News, Connections, Jobs, Groups, Companies.
Also, relevant Updates and Ads can be viewed as a form of recommendation: recommending updates from your network, and ads.
There is a rich recommender ecosystem at LinkedIn: connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, ...
Next I am going to talk about three recommendation systems at LinkedIn: news, People You May Know, and related search queries,
and talk about virality in social recommender systems by giving an example of skills endorsements suggestions.
LinkedIn Today is a personalized news recommendation product based on your industry and other industries you follow.
The objective here is to serve content that maximizes engagement metrics such as click-through rate (CTR).
User i visits LinkedIn: we have the industry from the profile, other industries the user follows, behavioral features such as which articles the user has clicked, and demographic features such as age, gender, etc.
Article (item) j: features are based on content, such as which industries and skills the article is related to, and the industry of other members who shared or clicked on the article.
(i, j): predict whether the article will be clicked or not.
Which items should we select?
Explore items to gain some click data, and
exploit by showing the highest-CTR item.
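As a rough sketch of the explore/exploit choice just described: the function name, the article ids, the CTR values, and the 5% exploration probability below are all illustrative assumptions, not the production implementation.

```python
import random

def choose_article(ctr_estimates, epsilon=0.05, rng=random):
    """Pick one article id: explore with probability epsilon, else exploit."""
    article_ids = list(ctr_estimates)
    if rng.random() < epsilon:
        # Explore: pick any article uniformly at random to gather click data.
        return rng.choice(article_ids)
    # Exploit: pick the article with the highest estimated CTR.
    return max(article_ids, key=ctr_estimates.get)

# Hypothetical CTR estimates for three articles.
ctr = {"a1": 0.021, "a2": 0.034, "a3": 0.012}
rng = random.Random(0)
picks = [choose_article(ctr, epsilon=0.05, rng=rng) for _ in range(1000)]
share_best = picks.count("a2") / len(picks)
```

With a small epsilon, most impressions go to the current best article while the others still receive enough views for their CTR estimates to be updated.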
That looks pretty straightforward, so what are the challenges in news recommendation?
Click-through rate on articles drops over time, since interest in news articles is ephemeral.
Another challenge is that if a member is not interested in an article, then the member is not going to click on it.
This graph shows the drop in CTR with respect to the number of views by the same member.
Given these challenges, here is a revised algorithm.
First: explore/exploit scheme.
Temporal smoothing: give more weight to more recent data and click information. Old clicks matter less.
Impression discounting: discount items with multiple views and no clicks.
Segmented model: partition users based on their interests, industry, and click behavior, and estimate CTR per segment.
There is an opportunity in modeling the problem as a multi-armed bandit problem, where we have a single slot to show an article and we have to pick the article that maximizes the probability of a click.
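To make temporal smoothing and impression discounting concrete, here is a minimal sketch. The exponential decay, the smoothing prior, and the discount factor are assumptions chosen for illustration; the talk does not specify the formulas actually used.

```python
def smoothed_ctr(daily_stats, decay=0.5, prior_clicks=1.0, prior_views=50.0):
    """Estimate CTR from (clicks, views) per day, oldest day first.

    Older days are down-weighted by `decay` per day (assumed scheme), and a
    small prior keeps the estimate stable for articles with few views.
    """
    clicks = views = 0.0
    for day_clicks, day_views in daily_stats:
        clicks = decay * clicks + day_clicks
        views = decay * views + day_views
    return (clicks + prior_clicks) / (views + prior_views)

def discounted_score(ctr, views_without_click, discount=0.7):
    """Lower an item's score for each time a member saw it without clicking."""
    return ctr * discount ** views_without_click

# Same totals, but the clicks are old in one case and recent in the other.
old_clicks = smoothed_ctr([(30, 100), (0, 100)])
recent_clicks = smoothed_ctr([(0, 100), (30, 100)])
```

The article whose clicks are recent gets the higher estimate, and a member who has already ignored an article several times sees it ranked lower.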
Next I am going to talk about the people recommendation system at LinkedIn, called People You May Know.
People You May Know exposes LinkedIn’s link prediction system, which recommends other members to connect with.
More than 50% of connections at LinkedIn come from People You May Know.
The challenges are in feature engineering, machine learning, and scaling to process 100s of terabytes of data.
How do people know each other?
One good signal is common connections: Bob and Carol are likely to know each other if they share a common connection.
This is known as triangle closing, where Bob, Alice and Carol form a triangle.
Also, as the number of common connections increases, the likelihood of the two people knowing each other increases.
Here is a Pig script to do triangle closing, that is, to find the number of common connections between any pair of members.
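The same idea can be tried on a small in-memory graph. This is a sketch that mirrors the approach (group by member, emit every pair of that member's connections, count the pairs), not a translation of the production job; the function name and the sample edges are made up.

```python
from collections import Counter, defaultdict
from itertools import combinations

def common_connection_counts(edges):
    """Count common connections for every pair that shares at least one.

    edges: iterable of undirected (member, member) connections.
    Returns a Counter keyed by (id1, id2) with id1 < id2.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    counts = Counter()
    for member, friends in neighbors.items():
        # Every pair of this member's connections has `member` in common.
        for pair in combinations(sorted(friends), 2):
            counts[pair] += 1
    return counts

edges = [("alice", "bob"), ("alice", "carol"), ("dave", "bob"), ("dave", "carol")]
counts = common_connection_counts(edges)
```

Here Bob and Carol share two connections (Alice and Dave). Pairs that are already connected are counted too, so a real pipeline would filter those out before recommending.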
Let me talk about another feature, derived from the types of organizations a member belongs to.
P(t): the probability of two people knowing each other, given a time overlap t.
P(t) depends on the time overlap and on properties of the organization. First we fix an organization and vary the time overlap.
p̂(t): the empirical connection density, computed over the C(n, 2) member pairs.
For a company A, this graph shows the connection density, that is, the ratio of the number of connections with a certain time overlap t within Company A to the total number of pairs with time overlap t within Company A.
We observe that connection density increases with time overlap t.
We see similar behavior with many companies, groups, and schools.
This led us to the insight that connection density increases with organizational time overlap.
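The connection density defined above can be computed directly for one organization. In this sketch the year-level granularity of the overlap, the data layout, and the sample members are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def overlap_years(a, b):
    """Years that two (start, end) tenures overlap; 0 if they do not."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def connection_density(tenures, connections):
    """Connection density per time overlap within one organization.

    tenures: {member: (start_year, end_year)}
    connections: set of frozenset({member1, member2})
    Returns {overlap t: connected pairs with overlap t / all pairs with overlap t}.
    """
    total = defaultdict(int)
    connected = defaultdict(int)
    for m1, m2 in combinations(sorted(tenures), 2):
        t = overlap_years(tenures[m1], tenures[m2])
        total[t] += 1
        if frozenset((m1, m2)) in connections:
            connected[t] += 1
    return {t: connected[t] / total[t] for t in total}

# Hypothetical tenures at one company and the connections among its members.
tenures = {"a": (2005, 2010), "b": (2006, 2010), "c": (2009, 2012), "d": (2006, 2010)}
connections = {frozenset(("a", "b")), frozenset(("b", "d")), frozenset(("a", "c"))}
density = connection_density(tenures, connections)
```

In this toy data, pairs with a four-year overlap are connected twice as often as pairs with a one-year overlap, which is the shape of the trend described above.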
We sampled companies of different sizes.
We calculated connection density with respect to company size.
We observed that connection density decreases as the size of the organization increases.
This makes sense, since in a smaller organization people are more likely to know each other.
The empirical connection density values fit our model well.
In large companies, P(t) cannot reach 1 even for large t.
We observe an upper bound mu for the probability.
After feature engineering, with features such as triangle closing and organizational overlap scores for schools and companies, we apply a machine learning model to predict the probability of two people knowing each other.
We also incorporate user feedback, both explicit and implicit, to improve the connection probability estimate.
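The talk does not say which model is used. As one possible illustration, here is how a logistic model could combine such features into a probability; the choice of logistic regression, the feature names, the weights, and the bias are all hypothetical.

```python
import math

def connection_probability(features, weights, bias):
    """Logistic model: squash a weighted sum of features into (0, 1)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; in practice they would be learned from labeled pairs.
weights = {"common_connections": 0.4, "company_overlap": 0.8, "school_overlap": 0.5}
bias = -3.0

strong = connection_probability(
    {"common_connections": 5, "company_overlap": 1.0, "school_overlap": 0.0},
    weights, bias)
weak = connection_probability(
    {"common_connections": 0, "company_overlap": 0.0, "school_overlap": 0.0},
    weights, bias)
```

A pair with several common connections and a shared employer scores well above a pair with no signals, and candidates can then be ranked by this score.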
To study the structural diversity of connections among the recommended set of members in PYMK, we first map the recommended set of members to a graph G,
where vertices represent members in the recommendation set, and edges are the connections between those members on the LinkedIn social graph.
We measure the conversion rate, or invitation rate, from PYMK.
We define 3 measures of structural diversity: the number of connected components, the number of triangles, and the average local node degree.
I will go into connected components as a notion of structural diversity next.
A connected component is defined as a maximal subgraph of the original graph such that any pair of its vertices is connected by a path; an isolated vertex is a component on its own.
The number of connected components can be used as a measure of structural diversity,
where a smaller number of components means less structural diversity.
This measure was also used by Ugander et al. in their study, where they examined the effect of structural diversity on user recruitment.
We aim to measure the effect on invitation rate, or conversion rate, which is defined as the ratio of the number of invitations to connect sent to the size of the recommended set in People You May Know (PYMK).
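The measures above are easy to compute for a small recommendation set. The sketch below assumes that every edge joins two members of the set and reads "average local node degree" as the mean degree within the set; the function names and sample data are made up.

```python
from itertools import combinations

def diversity_measures(members, edges):
    """Return (connected components, triangles, average degree) for graph G."""
    neighbors = {m: set() for m in members}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Count connected components with an iterative depth-first search.
    seen, components = set(), 0
    for start in members:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(neighbors[node] - seen)

    # A triangle is three members that are all connected to each other.
    triangles = sum(
        1 for a, b, c in combinations(members, 3)
        if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]
    )
    avg_degree = sum(len(n) for n in neighbors.values()) / len(members)
    return components, triangles, avg_degree

def invitation_rate(invitations_sent, recommended_set_size):
    """Invitations sent divided by the size of the recommended set."""
    return invitations_sent / recommended_set_size

members = ["m1", "m2", "m3", "m4", "m5"]
edges = [("m1", "m2"), ("m2", "m3"), ("m1", "m3"), ("m4", "m5")]
components, triangles, avg_degree = diversity_measures(members, edges)
rate = invitation_rate(2, len(members))
```

This set of five members forms two components and one triangle; fewer components for the same set size would mean lower structural diversity.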
This figure plots invitation rate against the number of components for different sizes of recommendation set.
Data set: PYMK recommendation sets of different sizes: 2, 3, 4, 5 and 6 in this graph.
In each of these plots, we see that the invitation rate increases as the number of components in the graph decreases.
That is, the invitation rate increases as the recommendation set becomes less structurally diverse.
Next I am going to talk about related searches recommendation.
Every day millions of searches are done on LinkedIn.
1. Users are searching for other members to connect with,
2. recruiters are searching for candidates with certain skills,
3. job seekers are searching for jobs.
[Screenshot of a search results page]
Collaborative filtering (CF): searches done in the same session by the same member.
QRQ: queries that led to clicks on the same results.
Overlapping terms: queries that share terms.
Novel length bias: we found that members tend to click on search query recommendations that are one word longer.
Stepwise union approach: merge the signals in order of which signal results in the highest CTR.
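Two of these pieces can be sketched briefly: the same-session signal and the stepwise union. The function names, the sample sessions, the second suggestion list, and the order in which the signals are merged are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

def co_session_suggestions(sessions):
    """Count how often two queries appear in the same session.

    sessions: list of query lists, one list per member session.
    Returns {query: Counter of other queries seen in the same session}.
    """
    related = defaultdict(Counter)
    for queries in sessions:
        for q1, q2 in permutations(sorted(set(queries)), 2):
            related[q1][q2] += 1
    return related

def stepwise_union(ranked_lists, limit):
    """Fill the suggestion slots from each signal in turn, skipping duplicates.

    ranked_lists: suggestion lists ordered from the best signal to the worst.
    """
    merged = []
    for suggestions in ranked_lists:
        for s in suggestions:
            if s not in merged and len(merged) < limit:
                merged.append(s)
    return merged

sessions = [["hadoop", "pig", "hive"], ["hadoop", "pig"], ["hadoop", "java"]]
related = co_session_suggestions(sessions)
cf = [q for q, _ in related["hadoop"].most_common()]
qrq = ["hive", "mapreduce"]  # hypothetical output of the QRQ signal
suggestions = stepwise_union([cf, qrq], limit=4)
```

Suggestions from the first signal fill the slots first, and later signals only contribute queries that are not already present.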
Practical considerations:
Next I am going to talk about virality in social recommender systems, using skills endorsements suggestions as an example.
On profile pages, you can endorse your connections for a particular skill.
These are the skill endorsements I have received.
Virality
Gamification
How did we get to Skills Endorsements? That’s a long story over years.
How did we build a collection of skills and extract skills from profiles?
What is tagging?
Entity extraction: extract entities like tags, places or skills from free text.
What is standardization?
Deduplication of tags into entities or concepts. Out of the hundreds of thousands of different entities, which ones are skills?
What is inference?
- Predict a skill even though it’s not found in the text. If you have Hibernate, Spring, Java EE on your profile but not Java, we can infer that you know Java with 90% confidence.
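One simple way to picture this inference is a co-occurrence estimate: among profiles that list the evidence skills, what share also lists the target skill? This is an illustrative stand-in, not the skills classifier itself, and the profiles below are made up.

```python
def infer_skill_confidence(profiles, evidence, target):
    """Share of profiles listing all `evidence` skills that also list `target`."""
    evidence = set(evidence)
    with_evidence = [p for p in profiles if evidence <= p]
    if not with_evidence:
        return 0.0
    return sum(target in p for p in with_evidence) / len(with_evidence)

# Hypothetical member profiles, each reduced to a set of skills.
profiles = [
    {"Java", "Spring", "Hibernate"},
    {"Java", "Spring", "Hibernate", "SQL"},
    {"Spring", "Hibernate"},
    {"Python", "SQL"},
]
confidence = infer_skill_confidence(profiles, {"Spring", "Hibernate"}, "Java")
```

With enough profiles, a high share for a skill that is missing from a member's profile is a reason to recommend that skill to the member.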
Now we can prompt your connections to validate your skills and expertise through an endorsement.
This moves more people through the loop faster.
How would you think about this problem? How do you decide which people and skills to show?
Binary classification problem: given a (member, skill) pair, we need to predict whether you will endorse that member for that skill.
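The overall flow can be sketched as candidate generation followed by scoring and ranking. The scoring function here is a hypothetical stand-in for the classifier's predicted probability, and all names and data are invented for the example.

```python
def suggest_endorsements(viewer, connections, skills_by_member, score, top_k=3):
    """Generate (member, skill) candidates and return the top_k by score."""
    # Candidate generation: every skill of every connection of the viewer.
    candidates = [
        (member, skill)
        for member in connections[viewer]
        for skill in skills_by_member.get(member, ())
    ]
    # Scoring: `score` stands in for the binary classifier's probability.
    scored = [(score(viewer, m, s), m, s) for m, s in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(m, s) for _, m, s in scored[:top_k]]

# Hypothetical data and a toy score based only on company overlap.
connections = {"ann": ["bob", "cat"]}
skills_by_member = {"bob": ["Pig", "Hadoop"], "cat": ["Java"]}
company = {"ann": "X", "bob": "Y", "cat": "X"}

def toy_score(viewer, member, skill):
    return 0.8 if company[viewer] == company[member] else 0.3

top = suggest_endorsements("ann", connections, skills_by_member, toy_score, top_k=2)
```

A real scorer would use the full feature set (company and school overlap, title similarity, site interactions, and so on) rather than a single rule.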
Now we have all the pieces…
To reinforce why this works so well:
adoption is limited when you ask for manual entry;
it accelerates when you ask members to confirm, but there is no validation;
social tagging, viral loops, and crowdsourcing provide the biggest win.
You have a skills section -> people may enter their own skills, though they are not validated.
You recommend skills to add -> more people add skills, still not validated.
You provide a viral endorsement system -> on its own, you don’t have the catalyst to get adoption.
You need recommendations as a core piece of this ecosystem.
So we have the data, what are the applications? Why is this important?
“Reid endorsed you for Venture Capital.”
It’s not just the number of endorsements, it’s the nature.
Long standing debate about what skills a data scientist should have
It’s pretty powerful to be able to just ask the skill endorsements data
Next I am going to talk about scaling challenges.
Any deployed large-scale recommendation system has to deal with scaling challenges.
High-level design.
Kafka, Voldemort citations, URL to Azkaban
Here is a production Azkaban Hadoop workflow, which involves dozens of Hadoop jobs and dependencies.
It looks complicated, but it’s trivial to manage such workflows using Azkaban.
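To illustrate what managing job dependencies involves, here is a toy example that orders jobs so that each one runs after the jobs it depends on. It uses Python's standard `graphlib` module (Python 3.9+); it is not Azkaban's configuration format or API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
workflow = {
    "extract_sessions": set(),
    "co_session_counts": {"extract_sessions"},
    "result_click_counts": {"extract_sessions"},
    "merge_signals": {"co_session_counts", "result_click_counts"},
    "build_store": {"merge_signals"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
```

A workflow manager does this ordering for you and adds scheduling, retries, and monitoring on top.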
Here is a diagram that shows how data gets pushed to Voldemort read-only stores.
Data gets processed in Hadoop, applying feature engineering and machine learning algorithms.
The final recommendation set is stored in a Voldemort cluster.
A Hadoop job triggers the Voldemort cluster to fetch the data from HDFS.
Any idea why we don’t push the data from the Hadoop system? It would be easy to launch a denial-of-service attack on your Voldemort cluster.
With that interesting bit of information I conclude my talk.
Talked about ....