
User Identity Linkage: Data Collection, Dataset Biases, Method, Control and Application



Online Social Networks (OSNs) are popular platforms for online users. Users typically register and maintain accounts (user identities) across different OSNs to share a variety of content and stay connected with their friends. Consequently, linking user identities across OSN platforms, referred to as user identity linkage (UIL), becomes a critical problem. Solving this problem enables us to build a more comprehensive view of a user's activities across OSNs, which is highly beneficial for targeted advertisements, recommendations, and many other applications. In the thesis, we propose approaches for analyzing data collection methods, investigating biases in identity linkage datasets, linking user identities across social networks, giving users control over the linkability of their identities, and applying user identity linkage solutions to related problems.

Published in: Engineering


  1. 1. User Identity Linkage: Data Collection, Dataset Biases, Method, Control and Application. Rishabh Kaushal (PhD15008). Committee Members: Prof. Sanjay Jha, Dr. Alessandra Sala, Prof. Anwitaman Datta, Prof. Ponnurangam Kumaraguru (PK), Advisor. PhD Defense Presentation.
  2. 2. Who Am I? Sponsored PhD student, Precog Research Group, IIIT Delhi. Serving as Assistant Professor, IT Dept, IGDTUW. MS by Research from IIIT Hyderabad. Research interest: Social Computing.
  3. 3. Outline of Talk
  4. 4. Identity in Physical World. Figure labels: Student, Teacher, Software Engineer, Father.
  5. 5. Identity in Online World. Identity has three dimensions: profile, content, and network. A user joins multiple social networks. Figure: World of Social Networks (Professional, Personal, News).
  6. 6. Problem: User Identity Linkage (UIL). UIL is the problem of determining whether two input user identities, taken from two different social networks A and B, belong to the same person. (I_a, I_b): Linked User Identity Pair.
  7. 7. Motivation
  8. 8. Motivation
  9. 9. Thesis Statement. “Computational approaches can be proposed for the analysis of data collection methods, investigation of biases in identity linkage datasets, linkage of user identities across social networks, control-ability of user identity linkage, and application of user identity linkage solution to solve extraneous problems.”
  10. 10. Outline of Talk. Accepted at the 12th IEEE International Conference on Social Computing (SocialCom 2019), Xiamen, China.
  11. 11. Data Collection Methods
  12. 12. Social Aggregation (SA). Social aggregation platforms are sites on which users create an account and provide details of their accounts on multiple social networks. Perito et al. → Google profiles, Liu et al. → profiles
  13. 13. Cross Platform Sharing. Cross-platform sharing refers to a user behavior in which a user posts the same content across multiple social networks (Correa et al.).
  14. 14. Self Disclosure. On their profile page, a user discloses their identity on another social network platform (Chen et al.).
  15. 15. Social Network Coverage
  16. 16. Distribution of #Identities per User
  17. 17. Linked Identity Pairs. Only the top-6 social networks, where we got the best coverage, are plotted.
  18. 18. Data Collection - Conclusion. Computational approaches to collect linked user identity pairs can be implemented. Each data collection method depends on a particular user behavior, which is leveraged to collect the linked identities of that user.
  19. 19. Outline of Talk. Accepted at the 35th ACM/SIGAPP Symposium on Applied Computing (SAC 2020), Brno, Czech Republic.
  20. 20. Why study dataset biases? Every data collection approach depends on the typical behaviors of users who maintain identities across multiple social networks. As a consequence, the behavioral biases exhibited by these users are manifested in the resulting user identity linkage datasets.
  21. 21. Scope of our work. We focus on two identity linkage datasets (SD and CPS), derived by leveraging two user behaviors, namely self-disclosure and cross-platform posting, respectively, on Twitter and Instagram. (1) Detection & Impact: Do dataset biases exist? What is their impact on ML models? (2) Quantification: How can the amount of dataset bias be measured?
  22. 22. UIL as Supervised Learning Problem. Negative Class Generation: unlinked user identity pairs, i.e. user identities that do not belong to the same person, are created in two ways: random pairing and similar pairing. 1. Jaccard Similarity on ‘username’ of user identity pair. 2. Edit Distance on ‘display name’ of user identity pair. +ve pair: (rishabhk_, rk.iiit). -ve pairs: (rishab, rk.iiit), (rahul, rk.iiit).
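The two pairwise features named on this slide can be sketched as follows. This is a minimal illustration, not the thesis implementation: the transcript does not say how usernames are tokenized or whether edit distance is normalized, so computing Jaccard similarity over character sets and dividing edit distance by the longer string's length are both assumptions, and all function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in two usernames.
    (Assumption: character sets; the slides do not specify the tokenization.)"""
    sa, sb = set(a.lower()), set(b.lower())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Feature vector for one candidate pair from the slide's example.
features = (jaccard("rishab", "rk.iiit"),
            normalized_edit_distance("rishab", "rk.iiit"))
```

Under this reading, identical display names yield a normalized edit distance of 0.0, which matches the scale on which the slides report ED values.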
  23. 23. Dataset Details
  24. 24. User Behavioral Features: Jaccard Similarity (JS) on usernames. 50% of user identity pairs from SD have a JS value of 0.9, as opposed to only 23% from CPS. (Plot y-axis: proportion of users.)
  25. 25. User Behavioral Features: Edit Distance (ED) on display names. 58% of user identity pairs obtained through SD have display names with an ED of 0.0, as compared to 35% from CPS. (Plot y-axis: proportion of users.)
  26. 26. Impact of biases on model. Experiments were run in two ways: (1) same dataset for train and test, (2) different datasets for train and test. Across all learning algorithms adopted, the precision of models trained and tested on the same dataset is better than that of models trained and tested on different datasets.
  27. 27. Quantification of Bias. We have detected behavioral biases in user identities, characterized them, and measured their impact on identity linkage models. We propose a design that quantifies biases by leveraging a well-established discrimination measurement approach, namely ‘situational testing’.
  28. 28. Situational Testing (ST): Background, Quantification Metric
  29. 29. Applying ST to quantify biases. Data Record: Person → User Identity Pair. Protected Attribute: Gender (male or female) → Data Collection Method (SD or CPS). Class Label: (Selected / Not-Selected) → (Linked / Not-Linked).
  30. 30. Results. RQ: Are both decision classes (linked and unlinked) equally affected by biases? A t-value of 0 means no bias. The probability distributions of t-values are, however, spread over both the positive (t>0) and negative (t<0) sides, which indicates that behavioral biases affect many data records.
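The transcript does not include the metric from the "Quantification Metric" slide, so the following is only a sketch of one common k-nearest-neighbour formulation of situational testing, adapted to the mapping above (protected attribute = data collection method, class label = linked / not linked). The per-record value is assumed to be the difference between the "linked" rates among a record's nearest neighbours from each dataset. The function names and the toy feature vectors are hypothetical.

```python
from math import dist


def knn_linked_rate(query, pool, k):
    """Share of 'linked' (label 1) records among the k nearest neighbours
    of `query` in `pool`; each pool entry is (feature_vector, label).
    Assumes `pool` is non-empty."""
    nearest = sorted(pool, key=lambda rec: dist(query, rec[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


def situational_t(query, own_method_pool, other_method_pool, k=3):
    """Assumed t-value: linked rate among neighbours collected with the
    record's own method minus the rate among neighbours collected with
    the other method. A value of 0 means both neighbourhoods agree."""
    return (knn_linked_rate(query, own_method_pool, k)
            - knn_linked_rate(query, other_method_pool, k))


# Toy data (made up for illustration): features are
# (username Jaccard similarity, normalized display-name edit distance).
sd_pairs = [((0.9, 0.0), 1), ((0.8, 0.1), 1), ((0.7, 0.2), 1)]
cps_pairs = [((0.9, 0.1), 0), ((0.8, 0.0), 1), ((0.6, 0.3), 0)]

t_value = situational_t((0.85, 0.05), sd_pairs, cps_pairs, k=3)
```

In this toy setup, similar-looking pairs are labelled differently depending on the collection method, so the t-value is positive. With identical neighbourhoods it would be 0, the "no bias" case described on the slide.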
  31. 31. Dataset Biases - Conclusion. Behavioral biases exist in identity linkage datasets. They can be detected and quantified. We recommend collecting linked user identities using more than one data collection method. Mitigation of biases in identity datasets remains an open problem.
  32. 32. Outline of Talk. Accepted at the International School & Conference on Network Science (NetSciX 2020), Tokyo, Japan.
  33. 33. Proposed: NeXLink Framework. Can we obtain effective node representations such that the node embeddings of users belonging to Cross-Network Linkages (CNLs) are closer in embedding space than those of other nodes? (Figure: input and output.)
  34. 34. More formally. The goal of the embedding function is to transform each user identity $u_i^X$ and $u_j^Y$ into low-dimensional vectors $z_i^X$ and $z_j^Y$ of size d such that if $u_i^X$ and $u_j^Y$ belong to the same person, their embedding vectors are close in embedding space, and far apart otherwise.
  35. 35. NeXLink Framework. Structural similarities of nodes within their respective networks are preserved. Similarities of nodes across the two networks are preserved based on common friendship relations.
  36. 36. Local Node Embeddings*. The joint probability of $u_i^X$ and $u_k^X$ is expressed in terms of their embedding vectors $z_i^X$ and $z_k^X$ (formula shown on the slide). The empirical probability between $u_i^X$ and $u_k^X$ within the same network is defined by their normalized edge weights (formula shown on the slide). Optimization: minimize the KL-divergence between these two distributions. * LINE algorithm: Tang et al.
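The slide's formulas were images and did not survive in the transcript. For reference, the first-order proximity objective in the LINE paper (Tang et al.) has the form below, written in this talk's notation. That NeXLink uses exactly this first-order variant is an assumption, and $w_{ik}$ (edge weight), $E^X$ (edge set of network X), and $W$ (total edge weight) are symbols introduced here for illustration.

```latex
% Model (joint) probability of an edge between u_i^X and u_k^X
p\left(u_i^X, u_k^X\right) = \frac{1}{1 + \exp\left(-\, z_i^X \cdot z_k^X\right)}

% Empirical probability from normalized edge weights
\hat{p}\left(u_i^X, u_k^X\right) = \frac{w_{ik}}{W},
\qquad W = \sum_{(i,k) \in E^X} w_{ik}

% Minimizing KL(\hat{p} \,\|\, p) and dropping constants gives
O = - \sum_{(i,k) \in E^X} w_{ik} \, \log p\left(u_i^X, u_k^X\right)
```

The last line follows because the entropy of $\hat{p}$ and the factor $1/W$ do not depend on the embeddings, so they can be dropped from the minimization.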
  37. 37. Global Node Embeddings. To construct global node embeddings, we construct a global graph (G) as follows: G(V) = VX + VY, G(E) = CNL + NCNL. Positive Edge Generation (CNL): linked identity pairs belonging to the same person across social networks. Negative Edge Generation (NCNL, Non Cross Network Links): for every node pair ($u_i^X$, $u_j^Y$), we perform a random walk of length t starting at node $u_i^X$ and add ($u_i^X$, $u_k^Y$) to NCNL if $u_k^Y$ appears in the random walk.
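The negative edge generation step can be sketched as a plain random walk over a combined adjacency list. This is a rough illustration under several assumptions that the transcript does not confirm: the walk moves over within-network edges plus the known CNL edges, the walk is uniform, and a visited Y node that is the true counterpart of the start node is skipped. `sample_ncnl` and the toy graph are hypothetical.

```python
import random


def sample_ncnl(adjacency, y_nodes, cnl, walk_length, rng):
    """For each linked pair (u_x, u_y) in `cnl`, walk `walk_length` steps
    from u_x and record (u_x, visited_y) for every Y node visited that is
    not already linked to u_x."""
    linked = set(cnl)
    ncnl = set()
    for u_x, _ in cnl:
        current = u_x
        for _ in range(walk_length):
            neighbours = adjacency.get(current, [])
            if not neighbours:
                break  # dead end: stop this walk early
            current = rng.choice(neighbours)
            if current in y_nodes and (u_x, current) not in linked:
                ncnl.add((u_x, current))
    return ncnl


# Toy combined graph: friendship edges inside X and Y, plus CNL edges.
cnl = [("x1", "y1"), ("x2", "y2")]
adjacency = {
    "x1": ["x2", "y1"], "x2": ["x1", "x3", "y2"], "x3": ["x2"],
    "y1": ["y2", "y3", "x1"], "y2": ["y1", "x2"], "y3": ["y1"],
}
y_nodes = {"y1", "y2", "y3"}

negatives = sample_ncnl(adjacency, y_nodes, cnl, walk_length=5,
                        rng=random.Random(0))
```

Passing a seeded `random.Random` keeps the sampled negatives reproducible between runs.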
  38. 38. Global Node Embeddings. To learn node embeddings, we perform biased walks (node2vec*) guided by the common friends (CF) metric, which determines the transition probability (formula shown on the slide). * node2vec algorithm: Grover et al.
  39. 39. Datasets. We evaluated the NeXLink framework on two datasets. Augmented Dataset: two sub-graphs sampled from a large Facebook friendship network comprising 63,713 nodes and 817,090 edges (Man et al.). Real-world Dataset: Twitter (5,120 users and 130,575 edges) and Instagram (5,313 users and 54,233 edges) with 1,288 common users (Kong et al.).
  40. 40. Evaluation Metric. For a given node $u_i^X$, our goal is to find the node $u_j^Y$ that belongs to the same person. Therefore, we count a hit if $z_j^Y$ is present among the top-k node embeddings, ordered by cosine similarity.
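The hit criterion described on this slide can be written down directly. The sketch below uses plain Python lists for embeddings. The function names, the aggregation into a hit rate, and the toy vectors are illustrative assumptions rather than the evaluation code used in the thesis.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hit_at_k(z_query, true_id, candidates, k):
    """True if the embedding of `true_id` is among the k candidates in
    network Y most similar to the query embedding from network X."""
    ranked = sorted(candidates,
                    key=lambda cid: cosine(z_query, candidates[cid]),
                    reverse=True)
    return true_id in ranked[:k]


def hit_rate_at_k(queries, candidates, k):
    """Fraction of (query_embedding, true_id) pairs that score a hit."""
    hits = sum(hit_at_k(z, true_id, candidates, k) for z, true_id in queries)
    return hits / len(queries)


# Toy 2-d embeddings for three Y-network identities.
candidates = {"y1": (1.0, 0.0), "y2": (0.0, 1.0), "y3": (0.7, 0.7)}
queries = [((0.9, 0.1), "y1"),   # nearest candidate is y1: hit at k=1
           ((0.1, 0.9), "y3")]   # y2 ranks first, y3 second: hit only at k=2

rate_at_1 = hit_rate_at_k(queries, candidates, k=1)
rate_at_2 = hit_rate_at_k(queries, candidates, k=2)
```

Sorting all candidates per query is quadratic overall, which is fine for a sketch. At the scale of the datasets listed above, a batched matrix product or an approximate nearest-neighbour index would be the usual choice.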
  41. 41. Evaluation - Comparison with others. We compare our proposed NeXLink (LINE-node2vec) framework with two other approaches. IONE: Input-Output Network Embedding for the task of network alignment. REGAL: Representation Learning based Graph Alignment.
  42. 42. NeXLink Framework - Conclusion. A node representation learning based approach can be proposed to effectively learn embedding vectors for extracting linked user identities.
  43. 43. Outline of Talk. Accepted at the 9th International Conference on Social Informatics (SocInfo 2017), University of Oxford, UK.
  44. 44. Linkability Nudge. Can we help users control the linkability of their identities across social networks? We design and implement a linkability nudge: gentle interventions that help users make an informed decision. The user sets a range of linkability threshold (score) for each identity pair (dynamic web portal). Whenever user behavior goes beyond the pre-configured range, the user is nudged (web browser extension).
  45. 45. Linkability Nudge Architecture
  46. 46. Linkability Score - Displayed to User
  47. 47. Content Driven Color Nudge
  48. 48. Attribute Driven Notify Nudge
  49. 49. Nudge Evaluation. Controlled lab experiment with a control period and a treatment period. Participants were recruited and asked to perform tasks related to making a post and changing a profile attribute. We observed the impact of the linkability nudge on participants.
  50. 50. Nudge Evaluation. (Plot axes: minutes since the start of the experiment; participants.)
  51. 51. Outline of Talk. Accepted at the 7th International Conference on Mining Intelligence & Knowledge Exploration (MIKE 2019), NIT, Goa.
  52. 52. Clone Detection. Clone: a user identity that looks similar to the victim identity within the same social network.
  53. 53. Why detect clone identities?
  54. 54. Contributions Summary. Performed comparative analysis of data collection methods. Investigated biases in identity linkage datasets. Proposed a node embedding framework for user identity linkage. Helped users control the linkability of their identities across OSNs. Applied the UIL solution to detect clones and flag their behaviors.
  55. 55. Limitations & Future Directions. Data collection is a challenge; other social media platforms such as Goodreads, Strava, etc. need to be explored. We employed situational testing to detect dataset biases; other methods from the algorithmic fairness literature need to be explored. Our NeXLink node embedding framework takes only network information; leveraging content and profile features can be helpful. We performed a controlled lab study; deploying the linkability nudge in field trials remains future work.
  56. 56. Acknowledgements. PhD Advisor: Prof PK. Monitoring Committee: Prof Arun Balaji Buduru, Prof Rajiv Ratn Shah. Co-authors and peers. Members of Precog. My family.
  57. 57. Thanks