
Alexander Panchenko, Dmitry Babaev and Sergey Objedkov - Large-Scale Parallel Matching of Social Network Profiles


AIST Conference 2015 http://aistconf.org/



  1. Large-Scale Parallel Matching of Social Network Profiles
     www.digsolab.ru
     30.03.2015
     Alexander Panchenko (1, 2), Dmitry Babaev (1, 4), Sergey Objedkov (3)
     1 – Digital Society Laboratory, 2 – TU Darmstadt, 3 – HSE, 4 – Tinkoff Bank
  2. Outline
     • The problem
     • The data
     • The method
     • Results
  3. Problem
     • Motivation
       • input: a user profile of one social network
       • output: the profile of the same person in another social network
       • immediate applications in marketing, search, security, etc.
     • Contribution
       • precision of 0.98 and recall of 0.54
       • the method is computationally efficient and easily parallelizable
  4. Related work
     Several researchers have recently tried to tackle this problem:
     • Balduzzi et al. Abusing social networks for automated user profiling. Springer, 2010.
     • Bartunov et al. Joint link-attribute user identity resolution in online social networks. SNA-KDD Workshop at KDD, 2012.
     • P. Jain et al. I seek 'fb.me': Identifying users across multiple online social networks. WWW, 2013.
     • Malhotra et al. Studying user footprints in different online social networks. IEEE Computer Society, 2012.
     • Sironi. Automatic alignment of user identities in heterogeneous social networks. 2012.
     • Veldman. Matching profiles from social network sites. 2009.
     BUT: our experiment is the largest-scale one to date.
  5. Outline
     • The problem
     • The data
     • The method
     • Results
  6. Dataset

                                       VKontakte      Facebook
     Number of users in our dataset    89,561,085     2,903,144
     Number of users in Russia ¹       100,000,000    13,000,000
     User overlap                      88%            29%

     • training set: 92,488 matched FB-VK profiles
     ¹ According to comScore and http://vk.com/about
  7. How can training data be obtained?
     • . . . also valid for the “cheap matching”!
     • Link to FB in VK profile
     • Link to FB and VK in a third network, e.g. LJ or Foursquare
     • Linking by email
     • Linking by phone
  8. Outline
     • The problem
     • The data
     • The method
     • Results
  9. Profile matching algorithm
     1. Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names.
     2. Candidate ranking. The candidates are ranked according to the similarity of their friends.
     3. Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates.
  10. Candidate generation
     • Retrieve FB users with names similar to an input VK profile.
     • Two names are similar if:
       • the first letters are the same
       • the edit distance between names ≤ 2
     • Levenshtein Automata for edit distance of names
     • Use an automatically extracted dictionary of name synonyms:
       • “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc.
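The name-similarity test from the candidate generation slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: it uses a plain dynamic-programming edit distance instead of the Levenshtein automata named on the slide, the `SYNONYMS` dictionary is a hand-written stand-in for the automatically extracted one, and mapping synonyms to a canonical form before comparing is an assumption about how the two rules combine.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Hypothetical synonym dictionary (canonical name -> variants).
SYNONYMS = {"alexander": {"sasha", "sanya", "sanek"}}


def canonical(name: str) -> str:
    """Map a name to its canonical form if it is a known synonym."""
    name = name.lower()
    for canon, variants in SYNONYMS.items():
        if name == canon or name in variants:
            return canon
    return name


def names_similar(a: str, b: str, max_dist: int = 2) -> bool:
    """Same first letter and edit distance <= max_dist, after synonym mapping."""
    a, b = canonical(a), canonical(b)
    if not a or not b:
        return False
    return a[0] == b[0] and levenshtein(a, b) <= max_dist
```

With these definitions, `names_similar("Sasha", "Alexander")` holds through the synonym dictionary, while `names_similar("Maria", "Daria")` fails on the first-letter rule even though the edit distance is 1.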
  11. Candidate ranking
     • The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles.
     • Two friends are considered to be similar if:
       • First two letters of their last names match
       • Similarity $sim_s$ between first/last names is greater than the thresholds α, β:
         $sim_s(s_i, s_j) = 1 - \frac{\mathrm{lev}(s_i, s_j)}{\max(|s_i|, |s_j|)}$
     • Contribution of each friend to similarity $sim_p$ of two profiles $p_{vk}$ and $p_{fb}$ is in inverse proportion to name popularity:
       $sim_p(p_{vk}, p_{fb}) = \sum_{j:\ sim_s(s_i^f, s_j^f) > \alpha \,\wedge\, sim_s(s_i^s, s_j^s) > \beta} \min\left(1, \frac{N}{|s_j^f| \cdot |s_j^s|}\right)$
     • Here $s_i^f$ and $s_i^s$ are first and second names of a VK profile, correspondingly, while $s_j^f$ and $s_j^s$ refer to a FB profile.
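The two similarity measures on the candidate ranking slide can be sketched as below. Several points are assumptions rather than facts from the slides: `name_freq` is a hypothetical callable standing in for the name-frequency term |s|, the constant `n` corresponds to N, whose value the slides do not give, and each FB friend is counted once if at least one VK friend matches it. The defaults for `alpha` and `beta` are the thresholds from the "Results: numbers" slide.

```python
from typing import Callable, Sequence, Tuple

Name = Tuple[str, str]  # (first name, second name)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def sim_s(a: str, b: str) -> float:
    """String similarity: 1 - lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def sim_p(vk_friends: Sequence[Name], fb_friends: Sequence[Name],
          name_freq: Callable[[str], float],
          alpha: float = 0.8, beta: float = 0.6, n: float = 1.0) -> float:
    """Profile similarity: sum of rarity-weighted contributions of matched friends."""
    score = 0.0
    for fb_first, fb_last in fb_friends:
        # Assumption: an FB friend counts once if any VK friend is similar to it.
        matched = any(
            vk_last[:2] == fb_last[:2]
            and sim_s(vk_first, fb_first) > alpha
            and sim_s(vk_last, fb_last) > beta
            for vk_first, vk_last in vk_friends
        )
        if matched:
            # Rare names contribute more; the contribution is capped at 1.
            score += min(1.0, n / (name_freq(fb_first) * name_freq(fb_last)))
    return score
```

The cap `min(1, ...)` keeps a single very rare name from dominating the score, while frequent names such as a common first and last name pair add only a small fraction.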
  12. Best candidate selection
     • FB candidates are ranked according to similarity $sim_p$ to an input profile $p_{vk}$
     • The best candidate $p_{fb}$ should pass two thresholds to match:
       • its score should be higher than the score threshold γ:
         $sim_p(p_{vk}, p_{fb}) > \gamma$
       • it is either the only candidate, or the score ratio between it and the next best candidate $p'_{fb}$ should be higher than the ratio threshold δ:
         $\frac{sim_p(p_{vk}, p_{fb})}{sim_p(p_{vk}, p'_{fb})} > \delta$
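A minimal sketch of the two-threshold selection rule described on the best candidate selection slide. The function name `select_best` and the `(profile id, score)` input format are illustrative choices, not taken from the authors' code. The defaults γ = 3 and δ = 5 come from the "Results: numbers" slide. Treating a runner-up with score 0 as passing the ratio test is an assumption, since the slides do not cover that case.

```python
from typing import Optional, Sequence, Tuple


def select_best(candidates: Sequence[Tuple[str, float]],
                gamma: float = 3.0, delta: float = 5.0) -> Optional[str]:
    """Return the id of the matching FB candidate, or None if no candidate qualifies.

    candidates: (fb_profile_id, sim_p score) pairs for one VK profile.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best_id, best_score = ranked[0]
    # Threshold 1: the absolute score must exceed gamma.
    if best_score <= gamma:
        return None
    # Threshold 2: the only candidate, or clearly ahead of the runner-up.
    if len(ranked) == 1:
        return best_id
    runner_up = ranked[1][1]
    if runner_up <= 0 or best_score / runner_up > delta:
        return best_id
    return None
```

For example, scores of 10.0 and 1.5 give a ratio of about 6.7 and produce a match, whereas 10.0 and 4.0 give a ratio of 2.5 and are rejected as ambiguous. Rejecting ambiguous cases trades recall for precision.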
  13. Outline
     • The problem
     • The data
     • The method
     • Results
  14. Results
     Figure: Precision-recall plot of the matching method. The bold line denotes the best precision at a given recall.
  15. Results: numbers

     First name threshold, α        0.8
     Second name threshold, β       0.6
     Profile score threshold, γ     3
     Profile ratio threshold, δ     5
     Number of matched profiles     644,334 (22%)
     Expected precision             0.98
     Expected recall                0.54
  16. Execution parameters
     • AWS EMR
     • 100 nodes of type m2.xlarge (2 vCPU, 17 GB RAM)
     • 4 hours of execution time
     • Source code: https://github.com/dmitrib/sn-profile-matching
  17. Thank you! Questions?
