
Automated Methods for Identity Resolution across Online Social Networks


Today, more than two hundred Online Social Networks (OSNs) exist, each offering distinct services to its users, such as easier access to news or better business opportunities. To enjoy each distinct service, a user innocuously registers herself on multiple OSNs. For each OSN, she defines her identity with a different set of attributes, genre of content and friends to suit the purpose of using that OSN. Thus, the quality, quantity and veracity of the identity vary with the OSN. This results in dissimilar identities of the same user, scattered across the Internet, with no explicit links directing to one another. These disparate unlinked identities worry various stakeholders. For instance, security practitioners find it difficult to verify attributes across unlinked identities; enterprises fail to create a holistic overview of their customers.

Research that finds and links disconnected identities of a user across OSNs is termed identity resolution. Access to unique and private attributes of a user, like ‘email’, makes the task trivial; in the absence of such attributes, however, identity resolution is challenging. In this dissertation, we leverage intelligent cues and patterns extracted from the partially overlapping lists of public attributes of the compared identities. These patterns emerge due to consistent user behavior, like sharing the same mobile number, content or profile picture across OSNs. Translating these patterns into features, we devise novel heuristic, unsupervised and supervised frameworks to search and link user identities across social networks. The proposed search methods use an exhaustive set of public attributes, looking for consistent behavior patterns, and fetch the correct identity of the searched user in the candidate set for an additional 11% of users. An improvement on the proposed search mechanisms further optimizes time and space complexity. The suggested linking method compares past attribute value sets and correctly connects identities of an additional 48% of users, earlier missed by literature methods that compare only current values. Evaluations on popular OSNs like Twitter, Instagram and Facebook demonstrate the significance and generalizability of the linking method.



  1. Automated Methods for Identity Resolution across Online Social Networks. Paridhi Jain, April 25th, 2016. Prof. Ponnurangam Kumaraguru (Advisor), Prof. Alan Mislove (Northeastern University), Prof. Amitabha Bagchi (IIT-Delhi), Dr. Sachin Lodha (TRDDC).
  2. Online Social Network (OSN): “a platform to build social relations among people who share similar interests, activities, backgrounds or real-life connections.” [Boyd et al.] 209 active OSNs in 2015. cerc.iiitd.ac.in
  3. Coverage of Social Networks
     •  Unique service
     •  At least 200 million users register on OSNs
     •  A user is bound to maintain multiple accounts
  4. Single User on Multiple OSNs! Can we predict the link? Can we find and link disconnected identities of a single user? = Identity Resolution
  5. Why do Identity Resolution?
  6. Enterprises (De-duplicating audience). Tip: create a verified enterprise profile, campaign pages and product pages, and invite users to like / follow the pages. Return on investment? Calculate the social audience.
  7. De-duplicating audience. Social audience = 437,632 + 153,000 + 805,097, or less?
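The "or less?" question can be made concrete with a small sketch. It assumes that accounts are represented as hypothetical `(network, account_id)` pairs and that identity resolution has already produced a list of same-user links; neither representation comes from the slides. The naive sum is only an upper bound, and each resolved link can lower the count of distinct people.

```python
from typing import Dict, Iterable, Tuple

Account = Tuple[str, str]  # (network name, account id): a hypothetical representation


def deduplicated_audience(accounts: Iterable[Account],
                          links: Iterable[Tuple[Account, Account]]) -> int:
    """Count distinct people, given accounts and resolved same-user links."""
    parent: Dict[Account, Account] = {a: a for a in accounts}

    def find(a: Account) -> Account:
        # Follow parent pointers to the group representative (with path halving).
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two groups
    return len({find(a) for a in parent})


# Upper bound from the slide: every follower counted as a different person.
naive_total = 437_632 + 153_000 + 805_097

# Toy data (illustrative only): five accounts, two resolved links.
toy_accounts = [("twitter", "a"), ("facebook", "a1"), ("youtube", "a2"),
                ("twitter", "b"), ("facebook", "c")]
toy_links = [(("twitter", "a"), ("facebook", "a1")),
             (("facebook", "a1"), ("youtube", "a2"))]
toy_unique = deduplicated_audience(toy_accounts, toy_links)
```

In the toy data, three of the five accounts collapse into one person, so the de-duplicated audience is smaller than the account count.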
  8. Security Practitioners (Attribute Aggregation). False Sandy update: “The Twitter account has no real name attached to it. But Buzzfeed contributor found her Tumblr identity and identified the account owner as Shashank ***, a hedge fund analyst and campaign manager.” Source: http://edition.cnn.com/2012/10/31/tech/social-media/sandy-twitter-hoax/
  9. Challenges
     –  Heterogeneous OSNs (personal, professional, opinion, dating) with differing degrees of detail: quality and descriptive personal and professional information; little personal information; descriptive opinions
     –  Attribute Evolution over time: registration with the same information on both OSNs {paridhij, New Delhi}; information evolved on one but not on the other {jainpari, Bangalore}
  10. Thesis Statement: A user’s identities across online social networks can be searched and linked using past and present values of the identifiable and discriminative public attributes.
     –  Comparison using “past and present values” takes advantage of attribute evolution
     –  “Identifiable and public attributes” address the challenge of heterogeneous OSNs
  11. Formulation
     An individual is denoted by $I$ and her identity on a social network $SN_A$ is denoted by $I_A$. The task of identity resolution can be formally defined as follows.
     Problem Definition 1 (Identity Resolution): Given an identity $I_A$ of user $I$ on social network $SN_A$, find her identity $I_B$ on social network $SN_B$ using a search function $S$ and a linking function $L$:
     $$I_B = \arg\max_{1 \le j \le N} L(I_A, I_{B_j}) \quad \text{where } I_{B_j} \in S(I_A)$$
     Observing the two functions involved, the process of identity resolution in online social networks can be divided into two subprocesses: identity search and identity linking. Identity search lists a set of candidate identities on $SN_B$ which are similar to the given known identity $I_A$ according to the search function $S$ and are suspected to belong to user $I$. Such a set of candidate identities is represented as $S(I_A)$ and its size is denoted by $N$. The search function $S$ takes $I_A$’s attribute values, a defined similarity metric $sim_S$ and a search space ($SN_B$ in this scenario) as arguments, and selects all identities ($I_{B_1}, \dots, I_{B_j}, \dots, I_{B_N}$) from the search space for which the similarity $sim_S$ between the candidate’s attribute values and $I_A$’s attribute values is greater than a threshold.
     Search Function S:
     •  Input: an identity, a search space
     •  Output: candidate set
     Linking Function L:
     •  Input: an identity, a candidate set
     •  Output: best matching candidate
     Figure 2.3: Architecture of an identity resolution process (identity search followed by identity linking; * implies comparison of complete identities).
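The two-stage definition (a search function S that thresholds a similarity, followed by a linking function L whose best-scoring candidate is returned) can be sketched as follows. This is a minimal illustration, not the thesis implementation: identities are reduced to username strings, and `char_jaccard` is a toy similarity chosen only for the example and reused for both stages.

```python
from typing import Callable, List, Optional, Sequence


def search(i_a: str, search_space: Sequence[str],
           sim_s: Callable[[str, str], float], theta: float) -> List[str]:
    """Search function S: keep candidates whose similarity to i_a reaches theta."""
    return [c for c in search_space if sim_s(i_a, c) >= theta]


def resolve(i_a: str, search_space: Sequence[str],
            sim_s: Callable[[str, str], float], theta: float,
            link_score: Callable[[str, str], float]) -> Optional[str]:
    """Identity resolution: the candidate from S(i_a) with the highest link score."""
    candidates = search(i_a, search_space, sim_s, theta)
    if not candidates:
        return None  # nothing passed the search threshold
    return max(candidates, key=lambda c: link_score(i_a, c))


def char_jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): Jaccard index over the sets of characters."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


space = ["paridhijain", "r_jain", "riju_"]
candidates = search("paridhij", space, char_jaccard, theta=0.4)
best = resolve("paridhij", space, char_jaccard, 0.4, char_jaccard)
```

Keeping S and L as separate callables mirrors the split on the slide: S can be cheap and recall-oriented, while L can be a more expensive rule or classifier applied only to the candidate set.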
  12. Generic Identity Resolution
     –  IDENTITY SEARCH: extract available & discriminative features → candidate identities
     –  IDENTITY LINKING: pairwise comparisons
  13. My Contributions
     –  Identity Search: novel methods for creating the candidate set by exploiting public and discriminative attributes; identity resolution accuracy increases by 13%
     –  Identity Linking: novel method for effectively linking identities by leveraging attribute history; miss rate reduced by 48%
  14. Identity Search (first stage of the pipeline: extract available & discriminative features to obtain candidate identities). Aim: to retrieve a candidate set containing the identity we search for.
  15. Formulation (Identity Search)
     Problem Definition 2: For a user $I$, given her identity $I_A$ on social network $SN_A$ and a search function $S$, find a set of identities $I_{B_j}$ on social network $SN_B$ such that $sim_S(I_A, I_{B_j}) \ge \theta$, on a defined similarity metric $sim_S$ and an empirically calculated threshold $\theta$:
     $$\{I_{B_1}, \dots, I_{B_j}, \dots, I_{B_N}\} = S(I_A) \quad \text{s.t. } sim_S(I_A, I_{B_j}) \ge \theta$$
     Each identity $I_{B_j}$ in the set is termed a candidate identity and the set a candidate set.
     Search Function S:
     •  Can be computed with partial information
     •  Can be computed with different genres of information (text, image)
  16. State of the art
     –  Only profile attributes (private and public) are used for Identity Search [Motoyama et al., Malhotra et al., Liu et al.]
     –  Limitations of Profile Search:
        –  Restrictive search, owing to non-availability of common attributes across networks [e.g. gender is available on Facebook, but not on Twitter]
        –  Search with limited attributes → large candidate set size → intensive Identity Linking computations
        –  Users may choose different profile attributes → the correct identity is missed in the candidate set
     –  Little research on using content and network attributes to search for candidate identities [consistent user behavior and not profile]
     –  Extensive use of both private and public attributes, which requires user authorization for identity search
  17. Proposed Methods
     Heuristic search on available attributes
     –  Addresses the gap in the literature by using content and network attributes for identity search
     –  Similarity-based rules to find candidate identities matching the given identity
     –  Aims to improve recall
     –  Real-time search
     Unsupervised search on discriminative attributes
     –  Real-time approaches are computationally expensive and time-consuming (search in the complete social network)
     –  Pre-segment the social network
     –  Reduces time complexity from O(n^2) to O(n)
  18. Heuristic Identity Search
     –  Search methods: Profile, Content, Self-mention, Network
     –  Attributes used: name, location, username, mobile no., post, friends, followers
     –  Output: candidate identities
     –  Decision on each candidate: self-identified / returned by more than one search method? (Yes / No)
     –  Linking: syntactic and image search
     Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160 [Honorable Mention Award]
  19. Content Search
     Algorithm 2 Heuristic Search Methods
     procedure Content Search
         I_A ← known identity on SN_A
         S ← {I_A.source, I_A.posts}
         if S[0] ∈ {HootSuite, TwitterFeed, Facebook} then
             posts ← S[1]
             for each m in posts do
                 remove stop-words and non-ASCII characters from m
                 limit m to 75 characters
                 query SN_B API with m and retrieve candidates with similar posts
                 C_xs ← candidates
                 for each c in C_xs do
                     if sim(c.post, m) ≤ 0 then
                         delete c from C_xs
                 add C_xs to C_x
         return C_x
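The content search procedure can be sketched in runnable form as below. Several pieces are assumptions made for illustration: `query_posts` stands in for the SN_B search API (no real API is called), candidates are dictionaries with hypothetical `user` and `post` keys, the stop-word list is a tiny placeholder, and word-level Jaccard overlap stands in for the unspecified `sim` function.

```python
from typing import Callable, Dict, Iterable, List, Set

CROSS_POSTING_SOURCES = {"HootSuite", "TwitterFeed", "Facebook"}
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on"}  # tiny illustrative list


def clean(post: str, limit: int = 75) -> str:
    """Drop non-ASCII characters and stop-words, then truncate to `limit` characters."""
    ascii_only = post.encode("ascii", "ignore").decode("ascii")
    words = [w for w in ascii_only.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)[:limit]


def word_jaccard(a: str, b: str) -> float:
    """Stand-in similarity (an assumption): Jaccard index over lower-cased words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def content_search(source: str, posts: Iterable[str],
                   query_posts: Callable[[str], List[Dict[str, str]]]) -> Set[str]:
    """Return candidate user ids on SN_B whose posts resemble the known user's posts."""
    candidates: Set[str] = set()
    if source not in CROSS_POSTING_SOURCES:
        return candidates  # only posts pushed by a cross-posting source are searched
    for post in posts:
        m = clean(post)
        for cand in query_posts(m):
            # Keep only candidates with non-zero similarity to the query post.
            if word_jaccard(clean(cand["post"]), m) > 0:
                candidates.add(cand["user"])
    return candidates


def fake_api(query: str) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the SN_B search API; ignores the query."""
    return [{"user": "u1", "post": "Hurricane update tonight"},
            {"user": "u2", "post": "cookies recipe"}]


found = content_search("HootSuite", ["The hurricane update is here é"], fake_api)
```

Passing the API as a callable keeps the sketch testable offline and makes the dependency on a real search endpoint explicit.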
  20. Evaluation
     –  Ground truth dataset: 543 users from FriendFeed and SocialGraph
     –  Selection strategy: random selection
     –  Why: to avoid any bias in evaluation. The methods are designed to be generalizable.
     –  Metrics:
        $$\text{Accuracy} = \frac{\text{correctly identified}}{\text{total users}}, \quad \text{Precision} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{retrieved}|}, \quad \text{Recall} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{relevant}|}$$
     Figure 3.1: Architecture of the identity resolution framework using proposed heuristic search methods and linking methods from literature.
     Table 3.2: Evaluation of the identity resolution framework with contribution of each search algorithm in the resolution accuracy. Search methods based on profile (url), content, self-mention and network attributes improve resolution accuracy by 13.1%.
     | Search Algorithm | U_correct | Accuracy |
     |---|---|---|
     | Profile Search (P) | 205 | 37.7% |
     | Content Search (C) | 3 | 0.5% |
     | Self-mention Search (SM) | 31 | 5.7% |
     | Network Search (N) | 1 | 0.2% |
     | Identity Search (P+C+SM+N) | 220 | 40.5% |
     | P (without URL) | 149 | 27.4% |
     | P (with URL) + (C+SM) + N | 149+71 | 27.4% + 13.1% |
     The comparison is with the traditional profile search used in the literature, assuming access to only public profile attributes. The improved profile, content and network search methods raised the accuracy and the recall by 13.1%.
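The three metrics and the headline numbers of Table 3.2 can be checked with a few lines of arithmetic. The function names below are illustrative, and the set-based precision and recall assume that retrieved and relevant identities are represented as Python sets.

```python
from typing import Set

TOTAL_USERS = 543  # size of the ground truth dataset reported on the slide


def accuracy(n_correct: int, n_total: int = TOTAL_USERS) -> float:
    """Fraction of users whose identity was correctly resolved."""
    return n_correct / n_total


def precision(relevant: Set[str], retrieved: Set[str]) -> float:
    """Share of retrieved identities that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: Set[str], retrieved: Set[str]) -> float:
    """Share of relevant identities that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


full_search = round(100 * accuracy(220), 1)       # P + C + SM + N
profile_no_url = round(100 * accuracy(149), 1)    # P (without URL)
gain = round(100 * accuracy(220 - 149), 1)        # the additional 71 users
```

The 71 additional users out of 543 account for the 13.1 percentage point gain over profile search without URL.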
  21. Unsupervised Identity Search (v/s complete search space, v/s available attributes)
     Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15.
  22. Find discriminative features
     –  Indices: Class Majority Index (CMI); Match / No-Match ratio: Encroachment Index (EI); Error Index (type-I/II error)
     –  A feature is discriminative if it has:
        •  Low Encroachment Index
        •  Low Error Index
     –  Example pairs: Match: {paridhi, paridhij}; No-match: {paridhij, parineeta.c}
     –  Sample features:
        •  Username: Jaro Distance, LCS Distance, Levenshtein Distance, Character Bi-gram Jaccard Index, Character Bi-gram Cosine similarity
        •  Name: Jaro Distance, LCS Distance, Character Bi-gram Jaccard Index, Character Bi-gram Cosine similarity
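Two of the sample features, the character bi-gram Jaccard index and the Levenshtein distance, can be computed as in the sketch below and applied to the match and no-match pairs from the slide. The function names are illustrative; the indices used to rank features (CMI, EI, Error Index) are not reproduced here because the slides do not define them.

```python
from typing import Set


def char_bigrams(s: str) -> Set[str]:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard index over character bi-gram sets (1.0 means identical sets)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 0.0


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


match_score = bigram_jaccard("paridhi", "paridhij")          # match pair
no_match_score = bigram_jaccard("paridhij", "parineeta.c")   # no-match pair
```

On these two pairs the match pair scores well above the no-match pair, which is the kind of separation a discriminative feature is expected to show across a labeled dataset.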
  23. Modified Canopy Clustering
     The search algorithm is modified and a concept of ‘sibling’ clusters is introduced. As non-overlapping clustering tends to miss some probable candidates, extending this constrained set with siblings results in higher accuracy. The algorithm is given as Algorithm 6.
     Algorithm 6 Modification to the Canopies
     procedure Mod-Canopies
         U ← set of user-profiles on the network
         T ← threshold
         d(x, y) ← distance measure
         loop
             for each user-profile x in U: create canopy C_x such that for each user-profile y in U, insert y into C_x if d(x, y) < T
             remove all user profiles y added in the previous step from U
         while U is not empty
     The algorithm is similar to canopy clustering and its time complexity is still O(n^2) in the worst case.
     Modifications:
     •  Earlier: overlapping canopies
     •  Overlapping canopies may not reflect similarity with the given user identity
     We create:
     •  Non-overlapping canopies
     Example usernames to be clustered: paridhij, pari.nidhi, paridhijain, ridhi_jain, paritosh_jain, Parineeta.jain, parineeta_joshi, r_jain, Raghav_jain, riju_
     Algorithmic time complexity reduces to O(n)
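The non-overlapping canopy construction can be sketched as follows. This is one reading of Algorithm 6, under the assumption that the first remaining profile acts as the canopy centre and that each profile is removed as soon as it joins a canopy; the integer profiles and absolute-difference distance are toy stand-ins for user profiles and a profile distance.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def non_overlapping_canopies(profiles: Sequence[P],
                             distance: Callable[[P, P], float],
                             threshold: float) -> List[List[P]]:
    """Partition profiles into canopies; each profile belongs to exactly one canopy."""
    remaining = list(profiles)
    canopies: List[List[P]] = []
    while remaining:
        center, rest = remaining[0], remaining[1:]
        canopy = [center]  # the centre always joins its own canopy
        remaining = []
        for p in rest:
            # A profile within the threshold of the centre joins the canopy
            # and is removed from further consideration.
            if distance(center, p) < threshold:
                canopy.append(p)
            else:
                remaining.append(p)
        canopies.append(canopy)
    return canopies


# Toy example: integers stand in for profiles, absolute difference for distance.
toy = non_overlapping_canopies([1, 2, 3, 10, 11, 20], lambda x, y: abs(x - y), 3)
```

Because every profile is assigned once, a later search only has to compare a query against canopies rather than against every profile, which is where the reduction in search cost claimed on the slide would come from.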
  24. Unsupervised search for a candidate set
     We experimented with different values of threshold T to determine the best one. With a very small value, we cannot expand our candidate set since we will not find any sibling clusters, whereas with an extremely large value, the candidate set can be too large, making the algorithm computationally expensive. The empirical threshold T for our dataset is set to 12.
     Algorithm 7 Unsupervised search method
     procedure Modified-Search
         U ← user profile we are looking for
         C ← set of non-overlapping clusters
         T ← threshold
         d(C_x, C_y) ← distance measure
         for each cluster C_x in C:
             compute the distance d(U, C_x)
         select cluster C_m such that d(U, C_m) is the minimum of all distances computed above; this is the most suitable cluster
         L ← list of suitable clusters, initially empty
         for each cluster C_x in C:
             if d(C_m, C_x) < T then
                 if d(U, C_x) < T then
                     append C_x to L
         L holds our list of candidate clusters
     For search, look for:
     •  Sibling canopies
     •  Canopies similar to the most suitable canopy AND similar to the searched user profile
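A runnable sketch of the sibling-cluster search follows. The slides do not say how the distance between a profile and a cluster, or between two clusters, is computed, so this sketch assumes each cluster is represented by its first member (a hypothetical choice) and reuses a single `distance` function for both cases.

```python
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")


def sibling_search(user: P, clusters: Sequence[List[P]],
                   distance: Callable[[P, P], float],
                   threshold: float) -> List[List[P]]:
    """Return clusters close to both the most suitable cluster and the searched user."""
    if not clusters:
        return []
    centers = [c[0] for c in clusters]  # assumption: first member represents the cluster
    # Most suitable cluster: the one whose representative is nearest to the user.
    best = min(range(len(clusters)), key=lambda i: distance(user, centers[i]))
    return [clusters[i] for i in range(len(clusters))
            if distance(centers[best], centers[i]) < threshold
            and distance(user, centers[i]) < threshold]


# Toy example: integers stand in for profiles, absolute difference for distance.
toy_clusters = [[1, 2, 3], [10, 11], [20]]
wide = sibling_search(9, toy_clusters, lambda x, y: abs(x - y), 12)
narrow = sibling_search(9, toy_clusters, lambda x, y: abs(x - y), 9)
```

The two calls illustrate the trade-off described in the text: a larger threshold pulls in more sibling clusters and a larger candidate set, while a smaller one returns only the most suitable cluster.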
  25. Evaluation (Facebook-Twitter). M = Match class; NM = No-Match class.
     | # of Users (M:NM::1:1) | Threshold | Precision (Canopy) | Recall (Canopy) | Precision (MOD-Canopy) | Recall (MOD-Canopy) |
     |---|---|---|---|---|---|
     | 20000 | 0.95 | 0.15 | 0.90 | 0.25 | 0.79 |
     | 20000 | 0.97 | 0.20 | 0.70 | 0.30 | 0.55 |
     | 20000 | 0.98 | 0.24 | 0.62 | 0.33 | 0.69 |
     Increasing the threshold increases precision and degrades recall.
  26. So far… Identity Search: example identities
     •  @darkmatter_, J Marget, St. Anthony School
     •  @holy.james, James Marget, St. Anthony School
     •  @dark.matter, John M, New Delhi
     •  ...
     •  @janemargetkitchen, Mrs. Marget, Cookie Specialist
  27. Identity Linking (second stage of the pipeline: pairwise comparisons over the candidate identities). Aim: to retrieve the best among the candidate set, i.e. the correct identity of the user.
  28. Formulation (Identity Linking)
     Problem Definition 3: Given an identity $I_A$ of user $I$ on social network $SN_A$, a set of candidate identities $Q = S(I_A) = \{I_{B_1}, \dots, I_{B_j}, \dots, I_{B_N}\}$ on social network $SN_B$ and a linking function $L$, locate an identity pair $(I_A, I_{B_j})$ such that $L(I_A, I_{B_j}) = \max\{L(I_A, I_{B_1}), \dots, L(I_A, I_{B_N})\}$. The $I_{B_j}$ with the highest link-score is inferred as $I_B$:
     $$I_B = \arg\max_{1 \le j \le N} L(I_A, I_{B_j}) \quad \text{where } I_{B_j} \in Q$$
     An identity linking method estimates the correspondence between identity $I_A$ and each candidate identity $I_{B_j}$ by calculating a link-score $L(I_A, I_{B_j})$ between their respective attributes, and then ranks the candidate set on the basis of the link-score. The candidate identity $I_{B_j}$ with the highest link-score is concluded to be $I_B$. The function $L$ can be computed for all varieties of data: text, date, image and location. The function can be either a supervised classifier decision boundary or a heuristic rule; in both scenarios, it can be computed with partial or complete information.
     Linking Function L:
     •  Can be a rule or a supervised classifier
     •  Can be computed with partial information
     •  Can be computed with different genres of information (text, image)
  29. State of the art
     –  Methods link identities using:
        –  Profile attributes [Zafarani et al., Perito et al., Malhotra et al., Liu et al.]
        –  Content attributes [Iofciu et al., Liu et al., Goga et al.]
        –  Network attributes [Bartunov et al., Narayanan et al., Labitzke et al.]
        –  Crowd-sourced mechanisms [Shehab et al.]
        –  Search engines [Bilge et al.]
     –  Most methods in the literature assume access to the present (current) attributes of the identities, and compare and match only those.
     –  But current versions of the identities may fail to match due to:
        –  User choice
        –  Attribute Evolution
  30. Why current versions may fail to match
     –  User choice: a private user may consciously choose to de-link her identities across OSNs, hence the current versions display different personalities of the same user
     –  Attribute Evolution: an active user may keep evolving her attributes to suit trends, requirements, or purpose. Thus, the current versions may differ
     [Figure: % of users with 2, 3, 4 or 5 values for each attribute (Username, Name, Descrip., Location, Lang., Zone, ProfilePic)]
  31. Proposed Identity Linking
     –  If the current versions do not match and the user behavior is consistent across OSNs, any of the past versions “may” match.
     –  Pipeline: username sets on SNA and SNB → features and similarities capturing patterns of username creation behavior and patterns of username reuse behavior across OSNs → supervised classification on labeled datasets → prediction
     –  Example username sets: US: {'eenjolrass', 'isabelnevills', 'giuliettacapuleti', 'tobsregbo'}; UC: {'enjoolras', 'isabelnevilles'}; uc: {'isabelnevilles'}
     Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040.
  32. Username Set Collection
     •  Past usernames: an automated tracking system queries a user’s ID via the API to record her changed profile attributes:
        •  her username on the OSN (Twitter username)
        •  her URL attribute, signifying a change to her username on another OSN (Tumblr username in the URL)
           •  Old Twitter URL: abcd_efgh.tumblr.com
           •  New Twitter URL: xyz.tumblr.com
     •  Ground truth: self-identification behavior [cross-referencing one’s OSN accounts]
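The tracking idea can be illustrated with a small sketch that turns periodic profile snapshots into a username history and extracts the Tumblr username from a URL attribute. The snapshot format (dictionaries with a hypothetical `username` key, ordered oldest to newest) is an assumption; no real API is called here.

```python
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlparse


def username_history(snapshots: Iterable[Dict[str, str]]) -> List[str]:
    """Record a username each time it differs from the previously seen one."""
    history: List[str] = []
    for snap in snapshots:  # assumed to be ordered oldest to newest
        name = snap["username"]
        if not history or history[-1] != name:
            history.append(name)
    return history


def tumblr_username(url: str) -> Optional[str]:
    """Extract '<name>' from '<name>.tumblr.com'; None if the URL is not a Tumblr blog."""
    host = urlparse(url if "://" in url else "http://" + url).hostname or ""
    suffix = ".tumblr.com"
    if host.endswith(suffix) and len(host) > len(suffix):
        return host[:-len(suffix)]
    return None


history = username_history([{"username": "a"}, {"username": "a"},
                            {"username": "b"}, {"username": "b"},
                            {"username": "a"}])
old_name = tumblr_username("abcd_efgh.tumblr.com")
new_name = tumblr_username("http://xyz.tumblr.com/")
```

Running both functions over the same snapshot stream would yield two aligned histories, one per OSN, which is the input the linking method needs.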
  33. Example
     –  User ID: 595**942*
     –  Past usernames on Twitter: ["bigeasye_", "reezy11_", "epiceric_", "soulanola", "swampson_", "hebetheeeric", "swampkidd_"]
     –  Past usernames on Tumblr: ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]
  34. Methodology: collected username sets on SNA and SNB → features and similarities capturing patterns of username creation behavior and patterns of username reuse behavior across OSNs → supervised classification on labeled datasets → prediction. Example username sets: US: {'eenjolrass', 'isabelnevills', 'giuliettacapuleti', 'tobsregbo'}; UC: {'enjoolras', 'isabelnevilles'}; uc: {'isabelnevilles'}
  35. Features: Username Set Similarities
     –  Categories: Syntactic, Stylistic, Temporal
     –  Static Creation: similar length; similar choice of characters; similar arrangement of characters
     –  Evolutionary Creation: evolution of length; evolution of choice of characters; evolution of arrangement of characters
     –  Occasional Reuse: common username?; best similarity score; second best similarity score
     –  Frequent Reuse: common username set; temporal ordering?; temporal sync?
     –  Stylistic cues: case; LeetSpeak; emphasizer; prefix / suffix; slang words; bad words; function words; phonetic replacement; grammar
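The occasional-reuse features (common username, best and second best similarity score) could be computed along the lines below. This is a sketch under assumptions: `difflib.SequenceMatcher` stands in for whatever string similarity the work actually uses, and "second best" is read as the second highest pairwise score, which the slides do not spell out.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Dict, Sequence, Union


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (an assumption, not the original metric)."""
    return SequenceMatcher(None, a, b).ratio()


def reuse_features(set_a: Sequence[str],
                   set_b: Sequence[str]) -> Dict[str, Union[bool, float]]:
    """Occasional-reuse features over two username histories."""
    scores = sorted((similarity(a, b) for a, b in product(set_a, set_b)), reverse=True)
    return {
        "common_username": bool(set(set_a) & set(set_b)),
        "best_similarity": scores[0] if scores else 0.0,
        "second_best_similarity": scores[1] if len(scores) > 1 else 0.0,
    }


# Username histories from the example slide.
twitter = ["bigeasye_", "reezy11_", "epiceric_", "soulanola",
           "swampson_", "hebetheeeric", "swampkidd_"]
tumblr = ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]
features = reuse_features(twitter, tumblr)
```

For the example user, two usernames are reused verbatim across the two histories, so both the best and the second best pairwise scores reach the maximum.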
  36. Evaluation of the same pipeline: collected username sets → features and similarities → supervised classification on labeled datasets → prediction
  37. Datasets
     –  Linking profiles: Twitter-Tumblr, Twitter-Facebook, Twitter-Instagram
     –  Past usernames available for both profiles: 18,959 positive pairs, 18,959 negative pairs
     –  Past usernames available only on Twitter but current username available on the other profile: 109,292 positive pairs, 109,292 negative pairs
     | Network pair | Twitter-Tumblr | Twitter-Facebook | Twitter-Instagram | Total users |
     |---|---|---|---|---|
     | History on both | 14,301 | 1,166 | 3,492 | 18,959 |
     | History on source only | 58,285 | 31,076 | 19,931 | 109,292 |
  38. Supervised Classification
     1.  Independent Supervised Framework
     2.  Fusion Supervised Framework
     3.  Cascaded Supervised Framework
         –  Classifier I: current username features [Exact Match, Substring Match]. Positive → same user; negative → Classifier II
         –  Classifier II: username set features [Naive Bayes, SVM, Decision Tree, Random Forest]. Positive → same user; negative → different users
         –  Example: US: {'eenjolrass', 'isabelnevills', 'giuliettacapuleti', 'tobsregbo'}; UC: {'enjoolras', 'isabelnevilles'}; uc: {'isabelnevilles'}; compared as US - UC (or US - uc); current usernames shown: {'tobsregbo', 'isabelnevilles'}
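The cascade can be sketched as two stages: a cheap baseline on current usernames, and a set-level classifier that is consulted only when the baseline says "no match". The set-level classifier below is a toy threshold rule over `difflib.SequenceMatcher` scores, used purely as a placeholder for the trained Naive Bayes or SVM models named on the slide; the cutoff of 0.8 is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def exact_match(a: str, b: str) -> bool:
    """Baseline b1: current usernames are identical."""
    return a == b


def substring_match(a: str, b: str) -> bool:
    """Baseline b2: one current username contains the other."""
    return a in b or b in a


def toy_set_classifier(history_a: Iterable[str], history_b: Iterable[str],
                       cutoff: float = 0.8) -> bool:
    """Placeholder for Classifier II: any pair of past usernames is similar enough."""
    list_b = list(history_b)
    return any(SequenceMatcher(None, a, b).ratio() >= cutoff
               for a in history_a for b in list_b)


def cascaded_link(current_a: str, current_b: str,
                  history_a: Iterable[str], history_b: Iterable[str],
                  baseline: Callable[[str, str], bool] = exact_match,
                  set_classifier: Callable[..., bool] = toy_set_classifier) -> bool:
    """Classifier I on current usernames; fall back to Classifier II on username sets."""
    if baseline(current_a, current_b):
        return True  # positive at stage I: same user
    return bool(set_classifier(history_a, history_b))  # stage II decides


US = ["eenjolrass", "isabelnevills", "giuliettacapuleti", "tobsregbo"]
UC = ["enjoolras", "isabelnevilles"]
linked = cascaded_link("tobsregbo", "isabelnevilles", US, UC)
```

In the example, the current usernames do not match, but a past username on one network closely resembles one on the other, so the second stage links the pair that the baseline alone would miss.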
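  39. Prediction (all values in %)
     | Framework Config. | Accuracy | FNR | FPR |
     |---|---|---|---|
     | Exact Match (b1) | 55.38 | 89.34 | 0.00 |
     | Substring Match (b2) | 60.99 | 78.46 | 0.00 |
     | Independent [b1 → Naive Bayes] | 72.10 | 53.81 | 1.91 |
     | Fusion [b1 → Naive Bayes] | 72.93 | 51.89 | 0.19 |
     | Cascaded [b1 → Naive Bayes] | 73.12 | 48.87 | 3.07 |
     | Cascaded [b1 → SVM [Linear]] | 76.97 | 40.87 | 3.71 |
     | Cascaded [b2 → Naive Bayes] | 73.27 | 48.52 | 3.14 |
     | Cascaded [b2 → SVM [Linear]] | 76.93 | 40.87 | 3.78 |
     FNR falls from 89.34 (Exact Match) to 40.87 (Cascaded with linear SVM), a reduction of 48.47 percentage points.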
  40. So far…
     –  Example usernames: @darkmatter_, @holy.james, @magascus, @hello_kitty, @magascu_
     –  Existing identity linking (current usernames only): different users
     –  Proposed identity linking (with past usernames): same user
     –  Reduction of FNR from 89% to 40%
  41. Proposed methods exploit…
     –  IDENTITY SEARCH (extract available & discriminative features → candidate identities): uses attributes that are shared across OSNs
     –  IDENTITY LINKING (pairwise comparisons): uses Attribute Evolution
  42. Understanding …
     Attribute Evolution
     –  Implies: identities out of sync in time
     –  Identify possible reasons and characteristics
     –  Implications?
     Attribute Sharing
     –  Implies: sharing sensitive information
     –  Identify possible reasons and characteristics
     –  Risks? Privacy implications? Do users care?
  43. Attribute Evolution
     –  Aim: to understand how, why, and what fraction of users have “out-of-sync” identities across OSNs
     –  Tracked about 8.7 million random Twitter users and analyzed in depth 10K users who evolved over time [selectively sampled]
     –  Studied a unique, identifiable public attribute: username
     –  Observations:
        –  20% of users account for 80% of the username changes observed on Twitter
        –  New usernames are distinctly different from the old usernames
        –  A section of these users change for benign reasons like space gain or change of identifiability, while others are suspected of malicious intentions
     –  Implication: due to attribute evolution, a quality dataset of a user’s past identities is available. Instead of a challenge, this becomes an opportunity for our proposed identity linking.
  44. 44. Attribute Sharing
      –  Aim: To understand the reasons for and risks of sharing sensitive, identifiable information about oneself
      –  Collected 2,492 Indian mobile numbers from public posts, bios and names on OSNs like Twitter and Facebook
      –  Observations:
          –  Mobile numbers are pushed across multiple OSNs, intentionally and unintentionally
          –  Publicly shared sensitive information like a mobile number can expose identifiable details (ID, name, family) if collated with external data sources
      –  Implications:
          –  Awareness of the collation risks associated with sharing sensitive information is necessary. Technological solutions should attend to it.
          –  Sharing sensitive information can implicitly resolve identities
  45. 45. Contributions Summary
      –  Methods for identity search that exploit public attributes and user behavior across OSNs
          –  We address the challenge of heterogeneous OSNs by considering only public and universally available attributes
      –  Method for identity linking that leverages user evolution over time
          –  We turn the challenge of attribute evolution to our advantage by comparing both past and current versions of the identities
      –  Observed and characterized user behavior that aids our proposed methods
          –  We add to existing knowledge for the development of our methods as well as future identity resolution methods
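The idea of comparing past as well as current versions of identities can be sketched as a set-overlap check on username histories. This is a simplified illustration under stated assumptions, not the supervised linking framework from the slides: the helpers `normalize`, `links_on_current` and `links_on_history` and the example handles are hypothetical, and each history is assumed to be ordered oldest to newest.

```python
from typing import Iterable, Set


def normalize(username: str) -> str:
    """Strip whitespace and a leading '@', then lowercase."""
    return username.strip().lstrip("@").lower()


def links_on_current(current_a: str, current_b: str) -> bool:
    """Compare only the current usernames of two identities."""
    return normalize(current_a) == normalize(current_b)


def links_on_history(history_a: Iterable[str], history_b: Iterable[str]) -> bool:
    """Link two identities if any past or current username is shared."""
    seen_a: Set[str] = {normalize(u) for u in history_a}
    seen_b: Set[str] = {normalize(u) for u in history_b}
    return bool(seen_a & seen_b)


# Made-up username histories on two OSNs, oldest first.
history_a = ["@old_handle", "@new_handle_a"]
history_b = ["@Old_Handle", "@new_handle_b"]

print(links_on_current(history_a[-1], history_b[-1]))  # current values only
print(links_on_history(history_a, history_b))          # past and current values
```

Here the two identities share no current username but do share a past one, which is the kind of pair that comparing only current values would miss.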
  46. 46. Implications for?
      –  Enterprises can carry out:
          –  Automated audience de-duplication
          –  Automated psychographic segmentation based on aggregated user profiles and inferred attributes
      –  Security practitioners can de-anonymize malicious users
      –  Users
          –  Can better understand their identity leaks and patch them to avoid identity resolution
          –  E.g. "should not share same content", "should not create similar histories of username"
          –  Risks of sharing sensitive information need to be communicated by new Over-the-top (OTT) applications
  47. 47. Limitations and Future Work
      –  Dependency on API
      –  Limiting to only usernames for identity linking
      –  Evaluation on self-identified users
      –  Future work:
          –  Extend to include past versions of identities for better identity search methods
          –  Extend to exploit evolution of multiple attributes in a time-synchronized manner for identity linking
          –  Develop an OTT messenger that highlights possible leaks of sensitive information, privacy and identity to a user
  48. 48. Peer-reviewed Publications (1)
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160
      –  Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15. ACM, New York, NY, USA, 1575-1576. DOI=http://dx.doi.org/10.1145/2808797.2809340
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040
  49. 49. Peer-reviewed Publications (2)
      –  Paridhi Jain and Ponnurangam Kumaraguru. 2016. On the Dynamics of Username Changing Behavior on Twitter. In Proceedings of the 3rd IKDD Conference on Data Science, 2016, CODS ’16. ACM, New York, NY, USA, Article 6, 6 pages. DOI=http://dx.doi.org/10.1145/2888451.2888452
      –  Prachi Jain, Paridhi Jain, and Ponnurangam Kumaraguru. 2013. Call me Maybe: Understanding Nature and Risks of sharing Mobile Numbers on Online Social Networks. In Proceedings of the first ACM Conference on Online social networks, COSN ’13. ACM, New York, NY, USA, 101-106. DOI=http://dx.doi.org/10.1145/2512938.2512959
      –  Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, and Virgilio Almeida. Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks. In Proceedings of the 2011 IEEE 3rd International Conference on Social Computing, SocialCom ’11, pages 477–482, Oct 2011.
  50. 50. Acknowledgments
      •  My advisor ‘PK’
      •  Prof. Anupam Joshi and Prof. Rahul Purandare
      •  Members of Precog@IIITD and CERC@IIITD
      •  Supported by TCS Research Fellowship (2010 – 2016)
      •  Friends, Colleagues and Family: Niharika, Siddhartha, Aditi, Prateek, Anupama, Srishti
  51. 51. Thanks! Paridhi.jain@xerox.com
