Automated Methods for Identity Resolution
across Online Social Networks
Paridhi Jain
April 25th, 2016
Prof. Ponnurangam ...
Online Social Network (OSN)
“a platform to build social relations among people who share similar interests,
activities, backg...
Coverage of Social Networks
•  Unique service
•  At least 200 million users register on OSNs
•  A user is bounded ...
Single User on Multiple OSNs!
cerc.iiitd.ac.in
Can we predict the link?
Can we find and link disconnected identities of...
Why do Identity Resolution?
Enterprises: (De-duplicating audience)
Tip: Create verified enterprise profile, campaign pages, product pages and invi...
De-duplicating audience
Social audience = 437,632 + 153,000 + 805,097 or less??
Security Practitioners (Attribute Aggregation)
“The Twitter account has no real name attached to it. But Buzzfeed contribu...
Challenges
Professional Opinion
Dating
Heterogeneous OSNs
Personal
Degree of Details
Quality and d...
Thesis Statement
A user’s identities across online social networks can be
searched and linked using past and present va...
Formulation
An individual is denoted by $I$ and her identity on a social network $SN_A$ is denoted by $I_A$. The task
o...
Generic Identity Resolution
Extract available & discriminative features
Candidate Identities
IDEN...
My Contributions
–  Identity Search: Novel methods for creating candidate set by exploiting
public and discriminative attribut...
Identity Search
Extract available & discriminative features
Candidate Identities
IDENTITY SEARCH...
Formulation
2.2.1 Identity Search
Problem Definition 2: For a user $I$, given her identity $I_A$ on social ne...
State of the art
–  Only profile attributes (private and public) for Identity Search [Motoyama et al., Malhotra et al.,
Liu...
Heuristic Search on available attributes
–  Addresses the gap of literature by using
content and network identity search....
Heuristic Identity Search
Profile
Content
Self-mention
Network
Syntactic
and Image
Search Linking
If sel...
Content Search
Algorithm 2 Heuristic Search Methods
procedure Content Search
$I_A$ ← known identity on $SN_A$
S ← {$I_A$.source, $I_A$....
Evaluation
Ground Truth Dataset: 543 users from FriendFeed and SocialGraph
Selection Strategy: Random selection
Why: T...
Unsupervised Identity Search
v/s complete search space
v/s available attributes
Niyati Chhaya, Dhwanit Agarwal, Nikaa...
Find discriminative features
Class Majority Index (CMI)
Match
No-Match
Ratio:
Encroachment Index (EI)
Discriminati...
Modified Canopy Clustering
decreases to O(n). The search algorithm is modified and a concept of ‘sibling’ clusters is in...
Unsupervised search for a candidate set
their distance. We experimented with different values of threshold T to determine...
Evaluation
M = Match class; NM = No-Match Class
# of Users (M:NM::1:1)   Threshold   Precision (Canopy)   Recall (Can...
So far…
@darkmatter_
J Marget
St. Anthony School
@holy.james
James Marget
St. Anthony School
...
Identity Linking
Extract available & discriminative features
Candidate Identities
IDENTITY SEARC...
Formulation
little has contributed to address these challenges and drawbacks of profile search.
2.2.2 Identity Linking
P...
State of the art
–  Methods link identities using
–  Profile attributes [Zafarani et al., Perito et al., Malhotra et al., Liu...
User choice
A private user may consciously choose to
de-link her identities across OSNs, hence
current versions display d...
Proposed Identity Linking
–  If current versions do not match and if the user behavior is consistent across OSNs, any
of ...
Username Set Collection
Tumblr username on the URL
Twitter username
•  Past usernames:
•  Automa...
Example
–  User ID: 595**942*
–  Past usernames on Twitter:
–  ["bigeasye_", "reezy11_", "epiceric_", "soulanola", "swam...
Methodology
Supervised
Classification
Feature: 1
Feature: n
Similarity: 1
Similarity: n
Patterns of username creation be...
Features
Username Set Similarities
Syntactic
Static Creation
Similar Length
Similar Choice of Characters
Similar Arrangeme...
Evaluation
Supervised
Classification
Feature: 1
Feature: n
Similarity: 1
Similarity: n
Patterns of username creation beh...
Datasets
–  Linking profiles
–  Twitter – Tumblr
–  Twitter – Facebook
–  Twitter – Instagram
–  Past usernames available ...
Supervised Classification
1.  Independent Supervised Framework
2.  Fusion Supervised Framework
3.  Cascaded Supervis...
Prediction
Framework Config.      Accuracy   FNR     FPR
Exact Match (b1)       55.38      89.34   0.00
Substring Match (b2)   60.99      78.46   0.00
...
So far…
@darkmatter_ @holy.james
@magascus, @hello_kitty
@darkmatter_
@hello_kitty, @magascu_,
@holy...
Extract available & discriminative features
Candidate Identities
IDENTITY SEARCH IDENTITY LINKI...
Attribute Evolution
–  Implies: Out of sync identities in time
–  Identify possible reasons and characteristics
–  Implicatio...
Attribute Evolution
–  Aim: To understand how, why, and what fraction of users have “out-of-sync”
identities across OSNs
–...
Attribute Sharing
–  Aim: To understand the reasons and risks of sharing sensitive identifiable
information about oneself
–...
Contributions Summary
–  Methods for identity search that exploit public attributes and user
behavior across OSNs
–  We ad...
Implications to?
–  Enterprises can carry out:
–  Automated audience de-duplication
–  Automated psychographic segmentatio...
Limitations and Future Work
–  Dependency on API
–  Limiting to only usernames for identity linking
–  Evaluation on self-i...
Peer-reviewed Publications (1)
–  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’:
Identifyin...
Peer-reviewed Publications (2)
–  Paridhi Jain and Ponnurangam Kumaraguru. 2016. On the Dynamics of Username
Changing Beh...
Acknowledgments
•  My advisor ‘PK’
•  Prof. Anupam Joshi and Prof. Rahul Purandare
•  Members of Precog@IIITD and C...
Thanks!
Paridhi.jain@xerox.com
Automated Methods for Identity Resolution across Online Social Networks

Today, more than two hundred Online Social Networks (OSNs) exist, each offering distinct services to its users, such as easier access to news or better business opportunities. To enjoy each distinct service, a user innocuously registers herself on multiple OSNs. For each OSN, she defines her identity with a different set of attributes, genre of content and friends to suit the purpose of using that OSN. Thus, the quality, quantity and veracity of the identity vary with the OSN. This results in dissimilar identities of the same user, scattered across the Internet, with no explicit links directing to one another. These disparate unlinked identities worry various stakeholders. For instance, security practitioners find it difficult to verify attributes across unlinked identities; enterprises fail to create a holistic overview of their customers.

Research that finds and links disconnected identities of a user across OSNs is termed identity resolution. Access to unique and private attributes of a user, like ‘email’, makes the task trivial; however, in the absence of such attributes, identity resolution is challenging. In this dissertation, we leverage intelligent cues and patterns extracted from the partially overlapping lists of public attributes of the compared identities. These patterns emerge due to consistent user behavior like sharing the same mobile number, content or profile picture across OSNs. Translating these patterns into features, we devise novel heuristic, unsupervised and supervised frameworks to search and link user identities across social networks. The proposed search methods use an exhaustive set of public attributes, looking for consistent behavior patterns, and fetch the correct identity of the searched user in the candidate set for an additional 11% of users. An improvement on the proposed search mechanisms further optimizes time and space complexity. The suggested linking method compares past attribute value sets and correctly connects identities of an additional 48% of users, earlier missed by literature methods that compare only current values. Evaluations on popular OSNs like Twitter, Instagram and Facebook demonstrate the significance and generalizability of the linking method.

Published in: Education

Automated Methods for Identity Resolution across Online Social Networks

  1. 1. Automated Methods for Identity Resolution across Online Social Networks Paridhi Jain April 25th, 2016 Prof. Ponnurangam Kumaraguru (Advisor) Prof. Alan Mislove (Northeastern University) Prof. Amitabha Bagchi (IIT-Delhi) Dr. Sachin Lodha (TRDDC)
  2. 2. Online Social Network (OSN) “a platform to build social relations among people who share similar interests, activities, backgrounds or real-life connections.” [Boyd et al.] 209 active OSNs in 2015
  3. 3. Coverage of Social Networks •  Unique Service •  At least 200 million users register on OSNs •  A user is bound to maintain multiple accounts
  4. 4. Single User on Multiple OSNs! Can we predict the link? Can we find and link disconnected identities of a single user? = Identity Resolution
  5. 5. Why do Identity Resolution?
  6. 6. Enterprises: (De-duplicating audience) Tip: Create verified enterprise profile, campaign pages, product pages and invite users to like / follow the pages. Return on investment? Calculate Social Audience
  7. 7. De-duplicating audience Social audience = 437,632 + 153,000 + 805,097 or less??
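The "or less??" on this slide can be made concrete with a small sketch. The sum of the three follower counts is only an upper bound on the audience; once accounts that belong to the same person are linked, the count of distinct people can drop. The function name `social_audience`, the `resolve` callback and the toy handles below are hypothetical and only illustrate the idea.

```python
def social_audience(audiences, resolve):
    """Count distinct people across per-network audiences.

    audiences: dict mapping a network name to a list of account handles.
    resolve:   hypothetical callback (network, handle) -> canonical person id,
               standing in for an identity resolution method.
    """
    people = {resolve(network, handle)
              for network, handles in audiences.items()
              for handle in handles}
    return len(people)


# Upper bound from the slide: every follower is assumed to be a different person.
upper_bound = 437_632 + 153_000 + 805_097

# Toy data (made up): one person follows the brand on both networks.
toy = {
    "twitter": ["paridhij", "jdoe"],
    "facebook": ["paridhi.jain", "asmith"],
}
known_links = {("facebook", "paridhi.jain"): ("twitter", "paridhij")}


def resolve(network, handle):
    # Map a linked account to its canonical identity, otherwise keep it as is.
    return known_links.get((network, handle), (network, handle))


deduplicated = social_audience(toy, resolve)
print(upper_bound, deduplicated)
```

With the toy data, four accounts collapse to three people, which is the kind of correction the slide hints at.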
  8. 8. Security Practitioners (Attribute Aggregation) “The Twitter account has no real name attached to it. But Buzzfeed contributor found her Tumblr identity and identified the account owner as Shashank ***, a hedge fund analyst and campaign manager.” False Sandy Update Source: http://edition.cnn.com/2012/10/31/tech/social-media/sandy-twitter-hoax/
  9. 9. Challenges Professional Opinion Dating Heterogeneous OSNs Personal Degree of Details Quality and descriptive personal and professional information Little personal information Descriptive opinions Attribute Evolution Time Information evolved on one but not on other {jainpari, Bangalore} Registration with same information on both OSNs {paridhij, New Delhi}
  10. 10. Thesis Statement A user’s identities across online social networks can be searched and linked using past and present values of the identifiable and discriminative public attributes. Comparison using “Past and present values” takes advantage of attribute evolution. “Identifiable and public attributes” address the challenge of heterogeneous OSNs.
  11. 11. Formulation An individual is denoted by $I$ and her identity on a social network $SN_A$ is denoted by $I_A$. The task of identity resolution can be formally defined as follows. Problem Definition 1: Identity Resolution: Given an identity $I_A$ of user $I$ on social network $SN_A$, find her identity $I_B$ on social network $SN_B$ using a search function $S$ and a linking function $L$: $I_B = \max_{1 \le j \le N} L(I_A, I_{B_j})$ where $I_{B_j} \in S(I_A)$. Observing the two functions involved, the process of identity resolution in online social networks can be divided into two subprocesses: identity search and identity linking. Identity search lists a set of candidate identities on $SN_B$, which are similar to the given known identity $I_A$ in accordance with the search function $S$ and are suspected to belong to user $I$. Such a set of candidate identities is represented as $S(I_A)$ and its size is denoted by $N$. The search function $S$ takes $I_A$’s attribute value, a defined similarity metric $sim_S$, and a search space ($SN_B$ in this scenario) as arguments, and selects all identities ($I_{B_1} \cdots I_{B_j} \cdots I_{B_N}$) from the search space for which the similarity $sim_S$ between the candidate’s attribute value and $I_A$’s attribute value is greater than a threshold. Search Function S: •  Input: an identity, a search space •  Output: candidate set Linking Function L: •  Input: an identity, a candidate set •  Output: Best matching candidate [Figure 2.3: Architecture of an identity resolution process.]
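The two-step formulation (search function S, then linking function L) can be illustrated with a minimal sketch. It is not the method from the thesis: the similarity measure (`difflib.SequenceMatcher`), the equal-weight link score, the threshold value 0.5 and the toy profiles (names borrowed from the deck's figure) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Assumed similarity metric in [0, 1]; the slides leave sim_S abstract."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(known, search_space, theta):
    """Search function S: keep identities whose username similarity reaches theta."""
    return [c for c in search_space
            if sim(known["username"], c["username"]) >= theta]


def link(known, candidates):
    """Linking function L: return the candidate with the highest link score."""
    if not candidates:
        return None

    def score(c):
        # Hypothetical link score: mean of username and name similarity.
        return (sim(known["username"], c["username"])
                + sim(known["name"], c["name"])) / 2

    return max(candidates, key=score)


def resolve(known, search_space, theta=0.5):
    """Identity resolution = identity search followed by identity linking."""
    return link(known, search(known, search_space, theta))


# Toy identities on the second network (illustrative only).
network_b = [
    {"username": "holy.james", "name": "James Marget"},
    {"username": "dark.matter", "name": "John M"},
    {"username": "janemargetkitchen", "name": "Mrs. Marget"},
]
known_identity = {"username": "darkmatter_", "name": "John Marget"}
match = resolve(known_identity, network_b)
print(match)
```

Keeping `search` cheap and `link` more thorough mirrors the split in the slides: search only narrows the space, and the pairwise comparison is done on the small candidate set.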
  12. 12. Generic Identity Resolution Extract available & discriminative features Candidate Identities IDENTITY SEARCH IDENTITY LINKING Pairwise Comparisons
  13. 13. My Contributions –  Identity Search: Novel methods for creating candidate set by exploiting public and discriminative attributes; increase identity resolution accuracy by 13% –  Identity Linking: Novel method for effectively linking identities by leveraging attribute history; reduction in miss rate by 48%
  14. 14. Identity Search Extract available & discriminative features Candidate Identities IDENTITY SEARCH IDENTITY LINKING Pairwise Comparisons Aim: To retrieve a candidate set containing the identity we search for.
  15. 15. Formulation 2.2.1 Identity Search Problem Definition 2: For a user $I$, given her identity $I_A$ on social network $SN_A$ and a search function $S$, find a set of identities $I_{B_j}$ on social network $SN_B$ such that $sim_S(I_A, I_{B_j}) \ge \theta$, on defined similarity metric $sim_S$ and empirically calculated threshold $\theta$: $\{I_{B_1}, \ldots, I_{B_j}, \ldots, I_{B_N}\} = S(I_A)$ s.t. $sim_S(I_A, I_{B_j}) \ge \theta$. Each identity $I_{B_j}$ in the set is termed a candidate identity and the set a candidate set. Search Function S: •  Can be computed with partial information •  Can be computed with different genre of information (text, image) [Figure 2.3: Architecture of an identity resolution process.]
  16. 16. State of the art –  Only profile attributes (private and public) for Identity Search [Motoyama et al., Malhotra et al., Liu et al.] –  Limitations of Profile Search: –  Restrictive search, owing to non-availability of common attributes across networks. [Gender on Facebook, but not on Twitter] –  Search with limited attributes → Large candidate set size → Intensive Identity Linking computations –  Users may choose different profile attributes → Miss out correct identity in the candidate set –  Little research on using content and network attributes to search for candidate identities [consistent user behavior and not profile] –  Extensive use of both private and public attributes. Need user authorization for identity search
  17. 17. Proposed Methods Heuristic Search on available attributes –  Addresses the gap in the literature by using content and network identity search. –  Similarity-based rules to find candidate identities matching the given identity –  Aim to improve recall –  Real-time search Unsupervised search on discriminative attributes –  Real-time approaches are computationally and time expensive (search in the complete social network) –  Pre-segment the social network –  Reduces time complexity from O(n^2) to O(n)
  18. 18. Heuristic Identity Search Profile Content Self-mention Network Syntactic and Image Search Linking If self-identified / returned by more than one search method No Yes Candidate Identities name, location, username mobile no, post, friends, followers Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160 [Honorable Mention Award]
  19. 19. Content Search Algorithm 2 Heuristic Search Methods
      procedure Content Search
        $I_A$ ← known identity on $SN_A$
        S ← {$I_A$.source, $I_A$.posts}
        if S[0] ∈ {HootSuite, TwitterFeed, Facebook} then
          posts ← S[1]
          for each m in posts do
            remove stop-words and non-ascii characters from m
            limit to 75 characters
            query $SN_B$ API with m and retrieve candidates with similar posts
            Cxs ← candidates
            for each c in Cxs do
              if sim(c.post, m) ≤ 0 then delete c from Cxs
            add Cxs to Cx
        return Cxs
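The content search procedure above can be sketched in Python as follows. This is an illustrative reading of the slide's pseudocode, not the author's implementation: `query_api` is a hypothetical callable standing in for the second network's search API, the stop-word list is a tiny placeholder, and token-level Jaccard overlap is an assumed stand-in for the unspecified `sim`.

```python
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on"}  # placeholder list
AUTO_POST_SOURCES = {"HootSuite", "TwitterFeed", "Facebook"}  # sources named on the slide


def clean(post, limit=75):
    """Drop non-ASCII characters and stop-words, then truncate to `limit` characters."""
    ascii_only = post.encode("ascii", "ignore").decode("ascii")
    words = [w for w in ascii_only.split() if w.lower() not in STOP_WORDS]
    return " ".join(words)[:limit]


def jaccard(a, b):
    """Assumed similarity: Jaccard overlap of lower-cased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def content_search(identity, query_api):
    """Return candidates on the other network whose posts overlap with the known identity's posts."""
    candidates = []
    # Only posts pushed through a cross-posting source are expected to reappear elsewhere.
    if identity["source"] not in AUTO_POST_SOURCES:
        return candidates
    for post in identity["posts"]:
        query = clean(post)
        for c in query_api(query):
            # Keep a candidate only if its post shares some content with the query.
            if jaccard(c["post"], query) > 0 and c not in candidates:
                candidates.append(c)
    return candidates


def fake_api(query):
    """Hypothetical stand-in for the second network's search endpoint."""
    return [
        {"user": "janemargetkitchen", "post": "new cookie recipe today"},
        {"user": "someone_else", "post": "completely unrelated"},
    ]


known = {"source": "HootSuite", "posts": ["Launching the new cookie recipe today"]}
found = content_search(known, fake_api)
print([c["user"] for c in found])
```

Passing the API as a callable keeps the sketch testable offline and avoids assuming any particular network's real endpoint.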
  20. 20. Evaluation Ground Truth Dataset: 543 users from FriendFeed and SocialGraph. Selection Strategy: Random selection. Why: To avoid any bias in evaluation. The methods are produced to be generalizable. $\text{Accuracy} = \frac{\text{correctly identified}}{\text{Total users}}$, $\text{Precision} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{retrieved}|}$, $\text{Recall} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{relevant}|}$. Figure 3.1: Architecture of the identity resolution framework using proposed heuristic search methods and linking methods from literature. Table 3.2: Evaluation of the identity resolution framework with contribution of each search algorithm in the resolution accuracy. Search methods based on profile (url), content, self-mention and network attributes improve resolution accuracy by 13.1%.
      Search Algorithm               U_correct   Accuracy
      Profile Search (P)             205         37.7%
      Content Search (C)             3           0.5%
      Self-mention Search (SM)       31          5.7%
      Network Search (N)             1           0.2%
      Identity Search (P+C+SM+N)     220         40.5%
      P (without URL)                149         27.4%
      P (with URL) + (C+SM) + N      149+71      27.4% + 13.1%
      with the traditional profile search used in the literature, assuming access to only public profile attributes. Traditional profile search method finds candidate identities by search parameters – Improvised profile, content and network search methods successfully improved the accuracy and the recall by 13.1%.
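The three evaluation measures on this slide can be written out as a short sketch, using the counts from Table 3.2 (543 ground-truth users) as a sanity check. The function names are illustrative.

```python
def accuracy(n_correct, n_total):
    """Fraction of users whose identity was correctly resolved."""
    return n_correct / n_total


def precision(relevant, retrieved):
    """Share of retrieved identities that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant, retrieved):
    """Share of relevant identities that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


TOTAL_USERS = 543  # ground-truth dataset size from the slide

combined = round(100 * accuracy(220, TOTAL_USERS), 1)     # Identity Search (P+C+SM+N)
profile_only = round(100 * accuracy(149, TOTAL_USERS), 1)  # P (without URL)
gain = round(100 * accuracy(71, TOTAL_USERS), 1)           # additional users found
print(combined, profile_only, gain)
```

The 13.1% improvement quoted on the slide corresponds to the 71 additional users (220 minus 149) out of 543.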
  21. 21. Unsupervised Identity Search v/s complete search space v/s available attributes Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15.
  22. 22. Find discriminative features Class Majority Index (CMI) Match No-Match Ratio: Encroachment Index (EI) Error Index (type-I/II error) Discriminative if: •  Low Encroachment Index •  Low Error Index Match: {paridhi, paridhij} No-match: {paridhij, parineeta.c} Sample Features: Username Jaro Distance, Username LCS Distance, Username Levenshtein Distance, Username Character Bi-gram Jaccard Index, Username Character Bi-gram Cosine Similarity, Name Jaro Distance, Name LCS Distance, Name Character Bi-gram Jaccard Index, Name Character Bi-gram Cosine Similarity
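Two of the sample features listed above, Levenshtein distance and character bi-gram Jaccard index, can be sketched on the slide's own Match and No-match username pairs. The implementations below are standard textbook versions written for illustration; the slides do not say which library or normalization was used.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def bigrams(s):
    """Set of character bi-grams of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_jaccard(a, b):
    """Jaccard index over character bi-gram sets."""
    ga, gb = bigrams(a), bigrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


match_pair = ("paridhi", "paridhij")          # Match example from the slide
no_match_pair = ("paridhij", "parineeta.c")   # No-match example from the slide

for x, y in (match_pair, no_match_pair):
    print(x, y, levenshtein(x, y), round(bigram_jaccard(x, y), 3))
```

On these pairs the matching usernames are one edit apart and share six of seven bi-grams, while the non-matching pair is seven edits apart and shares three of fourteen, which is the kind of separation a discriminative feature is meant to show.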
  23. 23. Modified Canopy Clustering decreases to O(n). The search algorithm is modified and a concept of ‘sibling’ clusters is introduced. As non-overlapping clustering tends to miss out some probable candidates, extending this constrained set with siblings results in higher accuracy. The algorithm is given as Algorithm 6. Algorithm 6 Modification to the Canopies
      procedure Mod-Canopies
        U ← set of user-profiles on the network
        T ← threshold
        d(x, y) ← distance measure
        for each user-profile x in U:
          create canopy Cx such that for each user-profile y in U, insert y into Cx if d(x, y) < T;
          remove all user profiles y added in the previous step from U.
        loop while U is not empty;
      The algorithm is similar to canopy clustering and its time complexity is still O(n^2) in the worst case. Modifications: •  Earlier overlapping canopies •  Overlapping canopies may not reflect similarity with given user identity We create: •  Non-overlapping canopies Example usernames: paridhij, pari.nidhi, paridhijain, ridhi_jain, paritosh_jain, Parineeta.jain, parineeta_joshi, r_jain, Raghav_jain, riju_ Algorithmic time complexity reduces to O(n) [Figure 2.3: Architecture of an identity resolution process.]
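A minimal sketch of the non-overlapping canopy construction in Algorithm 6 is given below. The choice of the first remaining profile as the canopy center, the `difflib`-based username distance and the threshold 0.4 are assumptions; the slides leave the distance measure and center selection open.

```python
from difflib import SequenceMatcher


def mod_canopies(profiles, distance, threshold):
    """Partition profiles into non-overlapping canopies.

    Each round picks the first remaining profile as center (an assumption),
    collects every remaining profile closer than `threshold`, and removes
    the whole canopy from the pool so that no profile lands in two canopies.
    """
    remaining = list(profiles)
    canopies = []
    while remaining:
        center, rest = remaining[0], remaining[1:]
        canopy = [center] + [p for p in rest if distance(center, p) < threshold]
        canopies.append(canopy)
        remaining = [p for p in rest if distance(center, p) >= threshold]
    return canopies


def username_distance(a, b):
    """Assumed distance: 1 minus difflib's similarity ratio on lower-cased usernames."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


usernames = ["paridhij", "pari.nidhi", "paridhijain", "ridhi_jain", "paritosh_jain",
             "Parineeta.jain", "parineeta_joshi", "r_jain", "Raghav_jain", "riju_"]
clusters = mod_canopies(usernames, username_distance, threshold=0.4)
for cluster in clusters:
    print(cluster)
```

Building the canopies still compares each center with every remaining profile, which matches the slide's remark that construction stays O(n^2) in the worst case; the saving claimed on the slide applies to the later search over the pre-built canopies.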
  24. 24. Unsupervised search for a candidate set their distance. We experimented with different values of threshold T to determine the most optimal one. With a very small value, we will not be able to expand our candidate set since we will not find any sibling clusters, whereas with an extremely large value, the candidate set can be too large, making the algorithm computationally expensive. The empirical threshold T for our dataset is set to 12. Algorithm 7 Unsupervised search method
      1: procedure Modified-Search
      2: U ← User profile we are looking for
      3: C ← set of non-overlapping clusters
      4: T ← threshold
      5: d(Cx, Cy) ← distance measure
      6: for each cluster Cx in C:
      7:   compute the distance d(U, Cx)
      8: select cluster Cm such that d(U, Cm) is minimum of all distances computed above, this is the most suitable cluster;
      9: L ← List of suitable clusters, initially empty
      10: for each cluster Cx in C:
      11:   if d(Cm, Cx) < T then if d(U, Cx) < T then append Cx to L
      12: L holds our list of candidate clusters
      For search, look for: •  Sibling Canopies •  Similar to most suitable canopy AND similar to the searched user profile
  25. 25. Evaluation M = Match class; NM = No-Match Class
      # of Users (M:NM::1:1)   Threshold   Precision (Canopy)   Recall (Canopy)   Precision (MOD-Canopy)   Recall (MOD-Canopy)
      20000                    0.95        0.15                 0.90              0.25                     0.79
      20000                    0.97        0.20                 0.70              0.30                     0.55
      20000                    0.98        0.24                 0.62              0.33                     0.69
      Increasing the threshold increases precision, degrades recall. Facebook-Twitter
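The sibling-canopy lookup of Algorithm 7 can be sketched as below. Representing each cluster by its first member when measuring distances is an assumption made to keep the example small; the slides only state that a distance d between a profile and a cluster, and between two clusters, is available.

```python
def modified_search(user, canopies, distance, threshold):
    """Return the most suitable canopy and the list of candidate canopies.

    A canopy is kept as a candidate if it is close to the most suitable
    canopy (a 'sibling') AND close to the searched user profile.
    Each canopy is represented by its first member (assumption).
    """
    def d_user(canopy):
        return distance(user, canopy[0])

    best = min(canopies, key=d_user)
    candidates = [c for c in canopies
                  if distance(best[0], c[0]) < threshold and d_user(c) < threshold]
    return best, candidates


# Toy numeric profiles with absolute difference as the distance (illustrative only).
toy_canopies = [[1, 2], [4, 5], [20]]
best, siblings = modified_search(3, toy_canopies, lambda a, b: abs(a - b), threshold=4)
print(best, siblings)
```

In the toy run the searched profile is closest to the canopy `[4, 5]`, and `[1, 2]` is added as a sibling because it is within the threshold of both the best canopy and the profile, while `[20]` is skipped without inspecting its members.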
  26. 26. So far… @darkmatter_ J Marget St. Anthony School @holy.james James Marget St. Anthony School @dark.matter John M New Delhi . . . @janemargetkitchen Mrs. Marget Cookie Specialist Identity Search
  27. 27. Identity Linking Extract available & discriminative features Candidate Identities IDENTITY SEARCH IDENTITY LINKING Pairwise Comparisons Aim: To retrieve the best among the candidate set, i.e. the correct identity of the user
  28. 28. Formulation little has contributed to address these challenges and drawbacks of profile search. 2.2.2 Identity Linking Problem Definition 3: Given an identity $I_A$ of user $I$ on social network $SN_A$, a set of candidate identities $Q = S(I_A) = \{I_{B_1}, \ldots, I_{B_j}, \ldots, I_{B_N}\}$ on social network $SN_B$ and a linking function $L$, locate an identity pair $(I_A, I_{B_j})$ such that $L(I_A, I_{B_j}) = \max\{L(I_A, I_{B_1}), \ldots, L(I_A, I_{B_N})\}$. $I_{B_j}$ with highest link-score is inferred as $I_B$: $I_B = \max_{1 \le j \le N} L(I_A, I_{B_j})$ where $I_{B_j} \in Q$. An identity linking method estimates the correspondence between identity $I_A$ and each candidate identity $I_{B_j}$ by calculating a link-score $L(I_A, I_{B_j})$ between their respective attributes and then ranks the candidate set on the basis of link-score. The candidate identity $I_{B_j}$ with the highest link-score is concluded to be $I_B$. The function $L$ can be computed for all variety of data: text, date, image and location. The function can either be a supervised classifier decision boundary or a heuristic rule; in both scenarios, the function can be computed with partial and complete information. Linking Function L: •  Can be a rule or a supervised classifier •  Can be computed with partial information •  Can be computed with different genre of information (text, image) [Figure 2.3: Architecture of an identity resolution process.]
  29. 29. State of the art –  Methods link identities using –  Profile attributes [Zafarani et al., Perito et al., Malhotra et al., Liu et al.] –  Content attributes [Iofciu et al., Liu et al., Goga et al.] –  Network attributes [Bartunov et al., Narayanan et al., Labitzke et al.] –  Crowd sourced mechanisms [Shehab et al.] –  Search Engines [Bilge et al.] –  Most literature methods assume access to, compare and match present (current) attributes of the identities. –  But, current versions of the identities may fail to match due to –  User choice –  Attribute Evolution
  30. 30. Why current versions may fail to match User choice: A private user may consciously choose to de-link her identities across OSNs, hence current versions display different personalities of the same user. Attribute Evolution: An active user may keep on evolving their attributes to suit trends, requirements, or purpose. Thus, the current versions may differ. [Bar chart: % of users (0 to 70) with 2, 3, 4 or 5 values per attribute; attributes: Username, Name, Descrip., Location, Lang., Zone, ProfilePic]
  31. 31. Proposed Identity Linking –  If current versions do not match and if the user behavior is consistent across OSNs, any of the past versions “may” match. Supervised Classification Feature: 1 Feature: n Similarity: 1 Similarity: n Patterns of username creation behavior across OSNs Patterns of username reuse behavior across OSNs . . . . . . Labeled datasets US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’} UC: {‘enjoolras’, ‘isabelnevilles’} uc: {‘isabelnevilles’} SNA SNB Feature: 1 Feature: m Similarity: 1 Similarity: m . . . . . . Username Sets Prediction Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040.
  32. 32. Username Set Collection Tumblr username on the URL Twitter username •  Past usernames: •  Automated Tracking System that queries a user’s ID via API to record her changed profile attributes •  her username on the OSN •  her URL attribute signifying change to her other OSN username •  Old Twitter URL – abcd_efgh.tumblr.com •  New Twitter URL – xyz.tumblr.com •  Ground Truth: •  Self-identification behavior [Cross-referencing one’s OSN accounts]
  33. 33. Example –  User ID: 595**942* –  Past usernames on Twitter: –  ["bigeasye_", "reezy11_", "epiceric_", "soulanola", "swampson_", "hebetheeeric", "swampkidd_"] –  Past usernames on Tumblr: –  ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]
  34. 34. Methodology Supervised Classification Feature: 1 Feature: n Similarity: 1 Similarity: n Patterns of username creation behavior across OSNs Patterns of username reuse behavior across OSNs . . . . . . Labeled datasets US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’} UC: {‘enjoolras’, ‘isabelnevilles’} uc: {‘isabelnevilles’} SNA SNB Feature: 1 Feature: m Similarity: 1 Similarity: m . . . . . . Collected Username Sets Prediction
  35. 35. Features
      Username Set Similarities
      –  Syntactic
         –  Static Creation: Similar Length, Similar Choice of Characters, Similar Arrangement of Characters
         –  Evolutionary Creation: Evolution of Length, Evolution of Choice of Characters, Evolution of Arrangement of Characters
      –  Stylistic: Case, LeetSpeak, Emphasizer, Prefix / Suffix, Slang words, Bad words, Function words, Phonetic Replacement, Grammar
      –  Temporal
         –  Occasional Reuse: Common username?, Best similarity score, Second Best similarity score
         –  Frequent Reuse: Common username set, Temporal ordering?, Temporal sync?
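A few of these features can be sketched on the username histories from the Example slide. The slide names the features but not their formulas, so every metric below is an assumption: a length ratio for "Similar Length", Jaccard overlap of character sets for "Similar Choice of Characters", and `difflib.SequenceMatcher` for "Similar Arrangement of Characters" and for the best and second-best similarity scores.

```python
from difflib import SequenceMatcher
from itertools import product

def length_similarity(a, b):
    """Ratio of the shorter username length to the longer one (assumed metric)."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 1.0

def charset_similarity(a, b):
    """Jaccard overlap of the characters used (assumed metric)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def arrangement_similarity(a, b):
    """Order-sensitive string similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()

def reuse_features(set_a, set_b):
    """Common usernames plus best and second-best pairwise similarity."""
    scores = sorted(
        (arrangement_similarity(a, b) for a, b in product(set_a, set_b)),
        reverse=True,
    )
    return {
        "common_usernames": sorted(set(set_a) & set(set_b)),
        "best_similarity": scores[0] if scores else 0.0,
        "second_best_similarity": scores[1] if len(scores) > 1 else 0.0,
    }

# Username histories taken from the Example slide.
twitter_past = ["bigeasye_", "reezy11_", "epiceric_", "soulanola",
                "swampson_", "hebetheeeric", "swampkidd_"]
tumblr_past = ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]

features = reuse_features(twitter_past, tumblr_past)
```

On this example two usernames are reused verbatim ("bigeasye_" and "hebetheeeric"), so both the best and the second-best scores reach 1.0, while near-misses such as "swampson_" and "swampson" show up in the syntactic measures.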
  36. 36. Evaluation
      [Figure: collected username sets from SNA and SNB (US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’}; UC: {‘enjoolras’, ‘isabelnevilles’}; uc: {‘isabelnevilles’}) → features and similarities capturing patterns of username creation and reuse behavior across OSNs → supervised classification on labeled datasets → prediction.]
  37. 37. Datasets
      –  Linking profiles
         –  Twitter – Tumblr
         –  Twitter – Facebook
         –  Twitter – Instagram
      –  Past usernames available for both profiles:
         –  18,959 positive pairs, 18,959 negative pairs
      –  Past usernames available only on Twitter, but current username available on the other profile:
         –  109,292 positive pairs, 109,292 negative pairs

      Network pair             Twitter-Tumblr   Twitter-Facebook   Twitter-Instagram   Total users
      History on both                  14,301              1,166               3,492        18,959
      History on source only           58,285             31,076              19,931       109,292
  38. 38. Supervised Classification
      1.  Independent Supervised Framework
      2.  Fusion Supervised Framework
      3.  Cascaded Supervised Framework
      [Figure: cascaded framework. Classifier I uses current username features [Exact Match, Substring Match]; Classifier II uses username set features [Naive Bayes, SVM, Decision Tree, Random Forest]. Flow labels: Negative?, Positive?, Same User, Different Users. Example: US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’}; UC: {‘enjoolras’, ‘isabelnevilles’}; uc: {‘isabelnevilles’}; pair {‘tobsregbo’, ‘isabelnevilles’}; comparison US - UC (or US - uc).]
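One reading of the cascaded framework is that a pair rejected by the current-username check is handed to a second classifier that looks at the username sets. The sketch below follows that reading under loud assumptions: the second stage is a hypothetical similarity threshold standing in for a trained classifier (the slide lists Naive Bayes, SVM, Decision Tree, and Random Forest), the threshold value 0.9 is arbitrary, and treating ‘tobsregbo’ and ‘isabelnevilles’ as the two current usernames is an interpretation of the figure.

```python
from difflib import SequenceMatcher

def exact_match(u_a, u_b):
    """Baseline b1: current usernames are identical (case-insensitive)."""
    return u_a.lower() == u_b.lower()

def substring_match(u_a, u_b):
    """Baseline b2: one current username contains the other."""
    a, b = u_a.lower(), u_b.lower()
    return a in b or b in a

def best_set_similarity(set_a, set_b):
    """Highest pairwise string similarity between two username sets."""
    return max(SequenceMatcher(None, a, b).ratio() for a in set_a for b in set_b)

def cascaded_link(current_a, current_b, set_a, set_b,
                  stage_one=exact_match, threshold=0.9):
    """Return True if the two profiles are predicted to belong to one user."""
    if stage_one(current_a, current_b):
        return True  # stage one positive: same user
    # Stage one negative: fall back to the username sets. The threshold rule
    # is a hypothetical stand-in for a trained stage-two classifier.
    return best_set_similarity(set_a, set_b) >= threshold

# Username sets from the slide.
US = ["eenjolrass", "isabelnevills", "giuliettacapuleti", "tobsregbo"]
UC = ["enjoolras", "isabelnevilles"]
uc = ["isabelnevilles"]

history_on_both = cascaded_link("tobsregbo", "isabelnevilles", US, UC)
history_on_source_only = cascaded_link("tobsregbo", "isabelnevilles", US, uc)
```

In both settings the current usernames fail the exact and substring checks, and the link is recovered only because the past username ‘isabelnevills’ is nearly identical to ‘isabelnevilles’.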
  39. 39. Prediction

      Framework config.                 Accuracy     FNR     FPR
      Exact Match (b1)                     55.38   89.34    0.00
      Substring Match (b2)                 60.99   78.46    0.00
      Independent [b1 → Naive Bayes]       72.10   53.81    1.91
      Fusion [b1 → Naive Bayes]            72.93   51.89    0.19
      Cascaded [b1 → Naive Bayes]          73.12   48.87    3.07
      Cascaded [b1 → SVM [Linear]]         76.97   40.87    3.71
      Cascaded [b2 → Naive Bayes]          73.27   48.52    3.14
      Cascaded [b2 → SVM [Linear]]         76.93   40.87    3.78

      FNR falls from 89.34 (Exact Match) to 40.87 (Cascaded with linear SVM), a reduction of 48.47 percentage points.
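The slide does not define its metrics, so the sketch below assumes the standard definitions, reported as percentages: accuracy is the share of pairs labeled correctly, FNR is the share of true same-user pairs that are missed, and FPR is the share of different-user pairs that are wrongly linked. The labels in the usage example are made up for illustration.

```python
def rates(y_true, y_pred):
    """Return (accuracy, FNR, FPR) in percent for binary labels (1 = same user)."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and pred:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one positive and one negative pair")
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
    fnr = 100.0 * fn / (tp + fn)  # missed same-user pairs
    fpr = 100.0 * fp / (fp + tn)  # wrongly linked different-user pairs
    return accuracy, fnr, fpr

# Made-up labels: a strict matcher that links only one of four true pairs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
accuracy, fnr, fpr = rates(y_true, y_pred)
```

A strict rule such as exact matching shows the same shape as the first rows of the table: it never links two different users, so FPR stays at zero, but it misses most true pairs, so FNR is high.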
  40. 40. So far…
      [Figure: two profiles with username sets (@darkmatter_, @holy.james, @magascus, @hello_kitty) and (@darkmatter_, @hello_kitty, @magascu_, @holy.james), contrasting existing identity linking with proposed identity linking; outcome labels: Same user, Different users.]
      Reduction of FNR from 89% to 40%
  41. 41. Proposed methods exploit…
      [Figure: identity resolution pipeline: extract available & discriminative features → IDENTITY SEARCH (candidate identities) → IDENTITY LINKING (pairwise comparisons). Annotations: “Uses attributes that are shared across OSNs”; “Uses Attribute Evolution”.]
  42. 42. Understanding …
      Attribute Evolution
      –  Implies: out-of-sync identities in time
      –  Identify possible reasons and characteristics
      –  Implications?
      Attribute Sharing
      –  Implies: sharing sensitive information
      –  Identify possible reasons and characteristics?
      –  Risks? Privacy implications? Do users care?
  43. 43. Attribute Evolution
      –  Aim: To understand how, why, and what fraction of users have “out-of-sync” identities across OSNs
      –  Tracked about 8.7 million random Twitter users and analyzed in depth 10K users who evolved over time [selectively sampled]
      –  Studied a unique, identifiable public attribute: username
      –  Observations:
         –  20% of users account for 80% of the username changes observed on Twitter
         –  New usernames are distinctly different from the old usernames
         –  Some of these users change usernames for benign reasons such as space gain or a change of identifiability, while others are suspected of malicious intent
      –  Implication: Because of attribute evolution, a quality dataset of a user’s past identities is available. Instead of a challenge, this becomes an opportunity for our proposed identity linking.
  44. 44. Attribute Sharing
      –  Aim: To understand the reasons for and risks of sharing sensitive identifiable information about oneself
      –  Collected 2,492 Indian mobile numbers from public posts, bios, and names on OSNs such as Twitter and Facebook
      –  Observations:
         –  Mobile numbers are pushed across multiple OSNs, intentionally and unintentionally
         –  Publicly shared sensitive information such as a mobile number can expose identifiable details (ID, name, family) if collated with external data sources
      –  Implications:
         –  Awareness of the collation risks associated with sharing sensitive information is necessary. Technological solutions should attend to it.
         –  Sharing sensitive information can implicitly resolve identities
  45. 45. Contributions Summary
      –  Methods for identity search that exploit public attributes and user behavior across OSNs
         –  We address the challenge of heterogeneous OSNs by considering only public and universally available attributes
      –  Method for identity linking that leverages user evolution over time
         –  We turn the challenge of attribute evolution to our advantage by comparing both past and current versions of the identities
      –  Observed and characterized user behavior that aids our proposed methods
         –  We add to existing knowledge for the development of our methods as well as future identity resolution methods
  46. 46. Implications for?
      –  Enterprises can carry out:
         –  Automated audience de-duplication
         –  Automated psychographic segmentation based on aggregated user profiles and inferred attributes
      –  Security practitioners can de-anonymize malicious users
      –  Users
         –  Can better understand their identity leaks and patch them to avoid identity resolution
         –  E.g. “should not share same content”, “should not create similar histories of username”
         –  Risks of sharing sensitive information need to be communicated by new Over-the-top (OTT) applications
  47. 47. Limitations and Future Work
      –  Dependency on API
      –  Limiting to only usernames for identity linking
      –  Evaluation on self-identified users
      –  Future work:
         –  Extend to include past versions of identities for better identity search methods
         –  Extend to exploit evolution of multiple attributes in a time-synchronized manner for identity linking
         –  Develop an OTT messenger that highlights possible leaks of sensitive information, privacy and identity to a user
  48. 48. Peer-reviewed Publications (1)
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160.
      –  Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15. ACM, New York, NY, USA, 1575-1576. DOI=http://dx.doi.org/10.1145/2808797.2809340.
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040.
  49. 49. Peer-reviewed Publications (2)
      –  Paridhi Jain and Ponnurangam Kumaraguru. 2016. On the Dynamics of Username Changing Behavior on Twitter. In Proceedings of the 3rd IKDD Conference on Data Science, 2016, CODS ’16. ACM, New York, NY, USA, Article 6, 6 pages. DOI=http://dx.doi.org/10.1145/2888451.2888452.
      –  Prachi Jain, Paridhi Jain, and Ponnurangam Kumaraguru. 2013. Call me Maybe: Understanding Nature and Risks of sharing Mobile Numbers on Online Social Networks. In Proceedings of the first ACM Conference on Online social networks, COSN ’13. ACM, New York, NY, USA, 101-106. DOI=http://dx.doi.org/10.1145/2512938.2512959.
      –  Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, and Virgilio Almeida. Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks. In Proceedings of the 2011 IEEE 3rd International Conference on Social Computing, SocialCom ’11, pages 477–482, Oct 2011.
  50. 50. Acknowledgments
      •  My advisor ‘PK’
      •  Prof. Anupam Joshi and Prof. Rahul Purandare
      •  Members of Precog@IIITD and CERC@IIITD
      •  Supported by TCS Research Fellowship (2010 – 2016)
      •  Friends, Colleagues and Family: Niharika, Siddhartha, Aditi, Prateek, Anupama, Srishti
  51. 51. Thanks! Paridhi.jain@xerox.com
