Automated Methods for Identity Resolution
across Online Social Networks
Paridhi Jain
April 25th, 2016
Prof. Ponnurangam Kumaraguru (Advisor)
Prof. Alan Mislove (Northeastern University)
Prof. Amitabha Bagchi (IIT-Delhi)
Dr. Sachin Lodha (TRDDC)
Online Social Network (OSN)
“a platform to build social relations among people who share similar interests, activities, backgrounds or real-life connections.” [Boyd et al.]
209 active OSNs in 2015
cerc.iiitd.ac.in
Coverage of Social Networks
•  Unique service
•  At least 200 million users register on OSNs
•  A user is bound to maintain multiple accounts
Single User on Multiple OSNs!
Can we predict the link?
Can we find and link disconnected identities of a single user? = Identity Resolution
Why do Identity Resolution?
Enterprises: (De-duplicating audience)
Tip: Create a verified enterprise profile, campaign pages and product pages, and invite users to like / follow the pages.
Return on investment? Calculate the social audience.
De-duplicating audience
Social audience = 437,632 + 153,000 + 805,097, or less?
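The question on this slide can be made concrete with a small sketch: once accounts on different OSNs are resolved to the same person, the social audience is the size of a union of people, not a sum of follower counts. The account handles and the `resolved` mapping below are hypothetical illustration data, not figures from the study; only the three follower counts in `upper_bound` come from the slide.

```python
# Hypothetical follower lists on three OSNs (illustrative data only).
followers = {
    "facebook": {"fb:jane", "fb:john", "fb:ravi"},
    "twitter": {"tw:jane_m", "tw:ravi_k"},
    "youtube": {"yt:johnm"},
}

# Hypothetical output of identity resolution: account -> person id.
resolved = {
    "fb:jane": "p1", "tw:jane_m": "p1",
    "fb:john": "p2", "yt:johnm": "p2",
    "fb:ravi": "p3", "tw:ravi_k": "p3",
}


def naive_audience(followers):
    """Sum of per-network follower counts (counts a person once per OSN)."""
    return sum(len(accounts) for accounts in followers.values())


def deduplicated_audience(followers, resolved):
    """Number of distinct people; unresolved accounts count as themselves."""
    people = set()
    for accounts in followers.values():
        for account in accounts:
            people.add(resolved.get(account, account))
    return len(people)


naive = naive_audience(followers)
unique = deduplicated_audience(followers, resolved)

# The slide's sum is only an upper bound on the true audience.
upper_bound = 437_632 + 153_000 + 805_097
```

In this toy data the naive count is twice the de-duplicated one, which is the gap identity resolution is meant to expose.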
Security Practitioners (Attribute Aggregation)
“The Twitter account has no real name attached to it. But a Buzzfeed contributor found her Tumblr identity and identified the account owner as Shashank ***, a hedge fund analyst and campaign manager.”
False Sandy Update
Source: http://edition.cnn.com/2012/10/31/tech/social-media/sandy-twitter-hoax/
Challenges
Heterogeneous OSNs
Professional, Opinion, Dating, Personal
Degree of Details
Quality and descriptive personal and professional information
Little personal information, descriptive opinions
Attribute Evolution
Time
Information evolved on one but not on the other
{jainpari, Bangalore}
Registration with same information on both OSNs
{paridhij, New Delhi}
Thesis Statement
	
	
A user’s identities across online social networks can be searched and linked using past and present values of the identifiable and discriminative public attributes.

Comparison using “past and present values” takes advantage of attribute evolution.
“Identifiable and public attributes” address the challenge of heterogeneous OSNs.
Formulation
An individual is denoted by $I$ and her identity on a social network $SN_A$ is denoted by $I_A$. The task of identity resolution can be formally defined as follows.
Problem Definition 1: Identity Resolution: Given an identity $I_A$ of user $I$ on social network $SN_A$, find her identity $I_B$ on social network $SN_B$ using a search function $S$ and a linking function $L$.
$$I_B = \arg\max_{1 \le j \le N} L(I_A, I_{B_j}) \quad \text{where } I_{B_j} \in S(I_A)$$
Observing the two functions involved, the process of identity resolution in online social networks can be divided into two subprocesses: identity search and identity linking. Identity search lists a set of candidate identities on $SN_B$ that are similar to the given known identity $I_A$ according to the search function $S$ and are suspected to belong to user $I$. Such a set of candidate identities is represented as $S(I_A)$ and its size is denoted by $N$. The search function $S$ takes $I_A$'s attribute value, a defined similarity metric $sim_S$, and a search space ($SN_B$ in this scenario) as arguments, and selects all identities ($I_{B_1} \cdots I_{B_j} \cdots I_{B_N}$) from the search space for which the similarity $sim_S$ between the candidate's attribute value and $I_A$'s attribute value is greater than a threshold.
Search Function S:
•  Input: an identity, a search space
•  Output: candidate set
Linking Function L:
•  Input: an identity, a candidate set
•  Output: best matching candidate
Figure 2.3: Architecture of an identity resolution process (Identity Search followed by Identity Linking; * implies comparison of complete identities).
2.2.1 Identity Search
Problem Definition 2: For a user $I$, given her identity $I_A$ on social network $SN_A$ and a search function $S$, find a set of identities $I_{B_j}$ on social network $SN_B$ such that $sim_S(I_A, I_{B_j}) > \theta$, on a defined similarity metric $sim_S$ and an empirically calculated threshold $\theta$.
$$\{I_{B_1}, \ldots, I_{B_j}, \ldots, I_{B_N}\} = S(I_A) \quad \text{s.t. } sim_S(I_A, I_{B_j}) > \theta$$
Each identity $I_{B_j}$ in the set is termed a candidate identity and the set a candidate set.
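The two problem definitions above can be read as a search step followed by an argmax over link scores. The sketch below is a minimal illustration under stated assumptions: the similarity metric (Python's `difflib` ratio), the attribute names `username` and `name`, the averaging link score, and the threshold value are all placeholders chosen here, since the formulation leaves $sim_S$, $L$ and $\theta$ open.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the formulation does not fix a metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(identity_a, network_b, theta):
    """S: candidates on SN_B whose username similarity exceeds theta."""
    return [c for c in network_b
            if sim(identity_a["username"], c["username"]) > theta]


def link_score(identity_a, candidate):
    """L: average similarity over the attributes both identities expose."""
    shared = [k for k in identity_a if k in candidate]
    return sum(sim(identity_a[k], candidate[k]) for k in shared) / len(shared)


def resolve(identity_a, network_b, theta=0.5):
    """I_B = argmax over S(I_A) of L(I_A, I_Bj); None if S(I_A) is empty."""
    candidates = search(identity_a, network_b, theta)
    if not candidates:
        return None
    return max(candidates, key=lambda c: link_score(identity_a, c))


# Toy identities modelled on the examples in the slides.
identity_a = {"username": "dark.matter", "name": "John M"}
network_b = [
    {"username": "darkmatter_", "name": "John Marget"},
    {"username": "holy.james", "name": "James Marget"},
    {"username": "janemargetkitchen", "name": "Mrs. Marget"},
]
best = resolve(identity_a, network_b)
```

Keeping `search` and `link_score` separate mirrors the split into identity search and identity linking: the first can be cheap and recall-oriented, the second more expensive and precision-oriented.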
Generic Identity Resolution
IDENTITY SEARCH: Extract available & discriminative features → Candidate identities
IDENTITY LINKING: Pairwise comparisons
My Contributions
–  Identity Search: Novel methods for creating the candidate set by exploiting public and discriminative attributes; increase identity resolution accuracy by 13%
–  Identity Linking: Novel method for effectively linking identities by leveraging attribute history; reduction in miss rate by 48%
Identity Search
Aim: To retrieve a candidate set containing the identity we search for.
Formulation
Search Function S:
•  Can be computed with partial information
•  Can be computed with different genres of information (text, image)
State of the art
–  Only profile attributes (private and public) for identity search [Motoyama et al., Malhotra et al., Liu et al.]
–  Limitations of profile search:
    –  Restrictive search, owing to non-availability of common attributes across networks [e.g. Gender on Facebook, but not on Twitter]
    –  Search with limited attributes → Large candidate set size → Intensive identity linking computations
    –  Users may choose different profile attributes → Correct identity missed in the candidate set
–  Little research on using content and network attributes to search for candidate identities [consistent user behavior and not profile]
–  Extensive use of both private and public attributes; needs user authorization for identity search
Proposed Methods
Heuristic search on available attributes
–  Addresses the gap in the literature by using content and network attributes for identity search
–  Similarity-based rules to find candidate identities matching the given identity
–  Aims to improve recall
–  Real-time search
Unsupervised search on discriminative attributes
–  Real-time approaches are expensive in computation and time (search in the complete social network)
–  Pre-segment the social network
–  Reduces time complexity from O(n^2) to O(n)
Heuristic Identity Search
Figure: Heuristic identity search framework. Search methods: Profile, Content, Self-mention, Network. Attributes used: name, location, username, mobile no, post, friends, followers. Decision: if self-identified / returned by more than one search method (Yes / No). Linking: Syntactic and Image. Output: Candidate Identities.
Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160 [Honorable Mention Award]
Content Search
Algorithm 2 Heuristic Search Methods
procedure Content Search
    I_A ← known identity on SN_A
    S ← {I_A.source, I_A.posts}
    C_x ← empty set
    if S[0] ∈ {HootSuite, TwitterFeed, Facebook} then
        posts ← S[1]
        for each m in posts do
            remove stop-words and non-ASCII characters from m
            limit m to 75 characters
            query SN_B API with m and retrieve candidates with similar posts
            C_xs ← candidates
            for each c in C_xs do
                if sim(c.post, m) ≤ 0 then
                    delete c from C_xs
            add C_xs to C_x
    return C_x
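The content search procedure can be sketched in Python as follows. This is an illustration under assumptions, not the thesis implementation: `query_snb` is a hypothetical callable standing in for the SN_B search API, the stop-word list is a tiny placeholder, and word-level Jaccard overlap stands in for the unspecified `sim`.

```python
import re

# Tiny assumed stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on"}
# Sources named in the procedure above.
AUTO_SOURCES = {"HootSuite", "TwitterFeed", "Facebook"}


def clean(post: str, limit: int = 75) -> str:
    """Drop non-ASCII characters and stop-words, then truncate to `limit`."""
    ascii_only = post.encode("ascii", "ignore").decode("ascii")
    words = [w for w in re.findall(r"[A-Za-z0-9']+", ascii_only.lower())
             if w not in STOP_WORDS]
    return " ".join(words)[:limit]


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a stand-in for the unspecified sim()."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def content_search(identity_a, query_snb):
    """Return candidate authors on SN_B whose posts overlap with I_A's posts.

    `query_snb(text)` is hypothetical: it returns (author, post) pairs.
    """
    candidates = set()
    if identity_a["source"] not in AUTO_SOURCES:
        return candidates
    for post in identity_a["posts"]:
        m = clean(post)
        for author, other_post in query_snb(m):
            # Keep only candidates with non-zero similarity to the query.
            if jaccard(clean(other_post), m) > 0:
                candidates.add(author)
    return candidates


def fake_api(text):
    """Hypothetical API response used only to exercise the sketch."""
    return [("jane_fb", "Fresh cookies in the oven today"),
            ("bob", "Stock markets fall")]


found = content_search(
    {"source": "HootSuite", "posts": ["Fresh cookies in the oven today!"]},
    fake_api,
)
```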
Evaluation
Ground Truth Dataset: 543 users from FriendFeed and SocialGraph
Selection Strategy: Random selection
Why: To avoid any bias in evaluation. The methods are designed to be generalizable.
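$$\text{Accuracy} = \frac{\text{correctly identified users}}{\text{total users}}$$
$$\text{Precision} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{retrieved}|}$$
$$\text{Recall} = \frac{|P_{relevant} \cap P_{retrieved}|}{|P_{relevant}|}$$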
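The three measures translate directly into set operations. In the short sketch below, the `relevant` and `retrieved` sets are made-up identifiers for illustration; the only figures taken from the slides are the 543 ground-truth users and the 220 users resolved by the combined identity search.

```python
def accuracy(correct: int, total: int) -> float:
    """Fraction of users whose identity was correctly resolved."""
    return correct / total


def precision(relevant: set, retrieved: set) -> float:
    """Share of retrieved identities that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: set, retrieved: set) -> float:
    """Share of relevant identities that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


# Hypothetical identifiers, for illustration only.
relevant = {"u1", "u2", "u3", "u4"}
retrieved = {"u1", "u2", "u9"}

p = precision(relevant, retrieved)
r = recall(relevant, retrieved)

# Combined identity search from the evaluation: 220 of 543 users.
combined_accuracy_pct = round(100 * accuracy(220, 543), 1)
```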
Figure 3.1: Architecture of the identity resolution framework using proposed heuristic search methods and
linking methods from literature.
Table 3.2: Evaluation of the identity resolution framework with the contribution of each search algorithm to the resolution accuracy. Search methods based on profile (url), content, self-mention and network attributes improve resolution accuracy by 13.1%.
Search Algorithm               U_correct   Accuracy
Profile Search (P)             205         37.7%
Content Search (C)             3           0.5%
Self-mention Search (SM)       31          5.7%
Network Search (N)             1           0.2%
Identity Search (P+C+SM+N)     220         40.5%
P (without URL)                149         27.4%
P (with URL) + (C+SM) + N      149+71      27.4% + 13.1%
with the traditional profile search used in the literature, assuming access to only public profile
attributes. Traditional profile search method finds candidate identities by search parameters –
Improved profile, content and network search methods raised accuracy and recall by 13.1 percentage points.
Unsupervised Identity Search
v/s complete search space
v/s available attributes
Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15.
Find discriminative features
Match: {paridhi, paridhij}
No-match: {paridhij, parineeta.c}
Class Majority Index (CMI): Match / No-Match ratio
Encroachment Index (EI)
Error Index: type-I/II error
Discriminative if:
•  Low Encroachment Index
•  Low Error Index
Sample Features
•  Username Jaro Distance
•  Username LCS Distance
•  Username Levenshtein Distance
•  Username Character Bi-gram Jaccard Index
•  Username Character Bi-gram Cosine Similarity
•  Name Jaro Distance
•  Name LCS Distance
•  Name Character Bi-gram Jaccard Index
•  Name Character Bi-gram Cosine Similarity
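Several of the sample features are standard string measures and can be sketched in a few lines. The code below covers Levenshtein distance, longest common subsequence length, and character bi-gram Jaccard and cosine similarity, applied to the Match and No-match pairs from the slide. Jaro distance is left out for brevity, and reading "LCS" as longest common subsequence (not substring) is an assumption made here.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb
                       else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def bigrams(s: str) -> Counter:
    """Multiset of character bi-grams of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_jaccard(a: str, b: str) -> float:
    sa, sb = set(bigrams(a)), set(bigrams(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def bigram_cosine(a: str, b: str) -> float:
    ca, cb = bigrams(a), bigrams(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


match_pair = ("paridhi", "paridhij")
no_match_pair = ("paridhij", "parineeta.c")
match_jaccard = bigram_jaccard(*match_pair)
no_match_jaccard = bigram_jaccard(*no_match_pair)
```

A feature is useful for search when its values separate the two classes, as the bi-gram Jaccard index does for this pair of examples.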
Modified Canopy Clustering
decreases to O(n). The search algorithm is modified and a concept of ‘sibling’ clusters is introduced. As non-overlapping clustering tends to miss out some probable candidates, extending this constrained set with siblings results in higher accuracy. The algorithm is given as Algorithm 6.
Algorithm 6 Modification to the Canopies
procedure Mod-Canopies
    U ← set of user-profiles on the network
    T ← threshold
    d(x, y) ← distance measure
    repeat
        pick a user-profile x from U
        create canopy C_x such that
            for each user-profile y in U,
                insert y into C_x if d(x, y) < T
        remove all user-profiles y added in the previous step from U
    until U is empty
The algorithm is similar to canopy clustering and its time complexity is still O(n^2) in the worst case.
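A minimal Python sketch of the non-overlapping canopy construction is given below. The distance function (1 minus `difflib`'s similarity ratio) and the threshold value used in the example are assumptions made for illustration; the slides do not fix the distance measure at this point.

```python
from difflib import SequenceMatcher


def distance(x: str, y: str) -> float:
    """Assumed distance in [0, 1]: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, x, y).ratio()


def mod_canopies(profiles, threshold):
    """Partition `profiles` into non-overlapping canopies.

    Each pass takes the first remaining profile as the centre, pulls in
    every remaining profile closer than `threshold`, and removes them all
    from the pool, so no profile can land in two canopies.
    """
    remaining = list(profiles)
    canopies = []
    while remaining:
        centre = remaining[0]
        # The centre always joins its own canopy, which guarantees progress.
        canopy = [centre] + [y for y in remaining[1:]
                             if distance(centre, y) < threshold]
        canopies.append(canopy)
        remaining = [y for y in remaining if y not in canopy]
    return canopies


usernames = ["paridhij", "paridhijain", "ridhi_jain", "r_jain", "riju_"]
canopies = mod_canopies(usernames, threshold=0.4)
flat = [u for canopy in canopies for u in canopy]
```

Building the canopies still compares each centre with all remaining profiles, which matches the worst-case O(n^2) noted above; the saving comes at search time, when a query is compared with canopies instead of individual profiles.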
Modifications:
•  Earlier: overlapping canopies
•  Overlapping canopies may not reflect similarity with the given user identity
We create:
•  Non-overlapping canopies
Figure: example canopies over usernames: paridhij, pari.nidhi, paridhijain, ridhi_jain, paritosh_jain, Parineeta.jain, parineeta_joshi, r_jain, Raghav_jain, riju_
Figure: example canopies over usernames: paridhij, pari.nidhi, paridhi, ridhi_jain, paritosh_jain, Parineeta.jain, parineeta_joshi, r_jain, Raghav_jain, riju_
Algorithmic time complexity reduces to O(n)
Unsupervised search for a candidate set
their distance. We experimented with different values of threshold T to determine the optimal one. With a very small value, we would not be able to expand our candidate set since we would not find any sibling clusters, whereas with an extremely large value, the candidate set can be too large, making the algorithm computationally expensive. The empirical threshold T for our dataset is set to 12.
Algorithm 7 Unsupervised search method
procedure Modified-Search
    U ← user profile we are looking for
    C ← set of non-overlapping clusters
    T ← threshold
    d(C_x, C_y) ← distance measure
    for each cluster C_x in C:
        compute the distance d(U, C_x)
    select cluster C_m such that d(U, C_m) is the minimum of all distances computed above; this is the most suitable cluster
    L ← list of suitable clusters, initially empty
    for each cluster C_x in C:
        if d(C_m, C_x) < T then
            if d(U, C_x) < T then
                append C_x to L
    L holds our list of candidate clusters
For search, look for:
•  Sibling canopies
•  Similar to the most suitable canopy AND similar to the searched user profile
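The sibling-canopy search can be sketched as follows. Two assumptions are made for the illustration: a cluster is represented by its first member when measuring distances, and distance is 1 minus `difflib`'s ratio, so thresholds here lie in [0, 1] and are not comparable with the empirical T = 12 reported for the dataset.

```python
from difflib import SequenceMatcher


def distance(x: str, y: str) -> float:
    """Assumed distance in [0, 1]: 1 minus difflib's similarity ratio."""
    return 1.0 - SequenceMatcher(None, x, y).ratio()


def profile_to_cluster(profile, cluster):
    """d(U, C_x): distance to the cluster's first member (assumed centre)."""
    return distance(profile, cluster[0])


def cluster_to_cluster(cx, cy):
    """d(C_x, C_y): distance between the two assumed centres."""
    return distance(cx[0], cy[0])


def modified_search(profile, clusters, threshold):
    """Return clusters close to both the best cluster and the profile."""
    # Most suitable cluster C_m: the one nearest to the searched profile.
    best = min(clusters, key=lambda c: profile_to_cluster(profile, c))
    # Siblings: near C_m AND near the searched profile.
    return [c for c in clusters
            if cluster_to_cluster(best, c) < threshold
            and profile_to_cluster(profile, c) < threshold]


clusters = [["paridhij", "paridhijain"], ["paritosh_jain"], ["riju_"]]
tight = modified_search("paridhi", clusters, threshold=0.1)
loose = modified_search("paridhi", clusters, threshold=0.5)
```

A small threshold returns only the most suitable canopy, while a larger one admits siblings and grows the candidate set, which is the trade-off described for the choice of T.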
Evaluation
Dataset: Facebook-Twitter
M = Match class; NM = No-Match class
# of Users (M:NM::1:1)   Threshold   Precision (Canopy)   Recall (Canopy)   Precision (MOD-Canopy)   Recall (MOD-Canopy)
20000                    0.95        0.15                 0.90              0.25                     0.79
20000                    0.97        0.20                 0.70              0.30                     0.55
20000                    0.98        0.24                 0.62              0.33                     0.69
Increasing the threshold increases precision and degrades recall.
So far…
Figure: Identity Search returns candidate identities: {@darkmatter_, J Marget, St. Anthony School}, {@holy.james, James Marget, St. Anthony School}, {@dark.matter, John M, New Delhi}, ..., {@janemargetkitchen, Mrs. Marget, Cookie Specialist}
Identity Linking
Aim: To retrieve the best among the candidate set, i.e. the correct identity of the user.
Formulation
little has contributed to address these challenges and drawbacks of profile search.
2.2.2 Identity Linking
Problem Definition 3: Given an identity $I_A$ of user $I$ on social network $SN_A$, a set of candidate identities $Q = S(I_A) = \{I_{B_1}, \ldots, I_{B_j}, \ldots, I_{B_N}\}$ on social network $SN_B$ and a linking function $L$, locate an identity pair $(I_A, I_{B_j})$ such that $L(I_A, I_{B_j}) = \max\{L(I_A, I_{B_1}), \ldots, L(I_A, I_{B_N})\}$. The $I_{B_j}$ with the highest link-score is inferred as $I_B$.
$$I_B = \arg\max_{1 \le j \le N} L(I_A, I_{B_j}) \quad \text{where } I_{B_j} \in Q$$
An identity linking method estimates the correspondence between identity $I_A$ and each candidate identity $I_{B_j}$ by calculating a link-score $L(I_A, I_{B_j})$ between their respective attributes and then ranks the candidate set on the basis of link-score. The candidate identity $I_{B_j}$ with the highest link-score is concluded to be $I_B$. The function $L$ can be computed for all varieties of data: text, date, image and location. The function can be either a supervised classifier decision boundary or a heuristic rule; in both scenarios, the function can be computed with partial or complete information.
Linking Function L:
•  Can be a rule or a supervised classifier
•  Can be computed with partial information
•  Can be computed with different genres of information (text, image)
State of the art
–  Methods link identities using
    –  Profile attributes [Zafarani et al., Perito et al., Malhotra et al., Liu et al.]
    –  Content attributes [Iofciu et al., Liu et al., Goga et al.]
    –  Network attributes [Bartunov et al., Narayanan et al., Labitzke et al.]
    –  Crowd-sourced mechanisms [Shehab et al.]
    –  Search engines [Bilge et al.]
–  Most methods in the literature assume access to, compare and match the present (current) attributes of the identities.
–  But current versions of the identities may fail to match due to
    –  User choice
    –  Attribute evolution
Why current versions may fail to match
User choice
A private user may consciously choose to de-link her identities across OSNs, hence current versions display different personalities of the same user.
Attribute Evolution
An active user may keep evolving their attributes to suit trends, requirements, or purpose. Thus, the current versions may differ.
Figure: % of users (y-axis, 0 to 70) with 2, 3, 4 and 5 values for each attribute: Username, Name, Description, Location, Language, Zone, Profile Picture.
Proposed Identity Linking
–  If current versions do not match and the user's behavior is consistent across OSNs, any of the past versions “may” match.
Figure: Username sets from SN_A and SN_B (US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’}; UC: {‘enjoolras’, ‘isabelnevilles’}; uc: {‘isabelnevilles’}) → features and similarities capturing patterns of username creation behavior and username reuse behavior across OSNs → supervised classification trained on labeled datasets → prediction.
Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040
Username Set Collection
Figure: a Twitter profile showing the Twitter username and the Tumblr username in the profile URL.
•  Past usernames:
    •  Automated tracking system that queries a user’s ID via API to record her changed profile attributes
        •  her username on the OSN
        •  her URL attribute, signifying a change to her other OSN username
            •  Old Twitter URL: abcd_efgh.tumblr.com
            •  New Twitter URL: xyz.tumblr.com
•  Ground truth:
    •  Self-identification behavior [cross-referencing one’s OSN accounts]
Example
–  User ID: 595**942*
–  Past usernames on Twitter:
    –  ["bigeasye_", "reezy11_", "epiceric_", "soulanola", "swampson_", "hebetheeeric", "swampkidd_"]
–  Past usernames on Tumblr:
    –  ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]
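The example above already shows the kind of signal a username history carries. The sketch below computes three illustrative reuse measures on these two lists: the usernames common to both histories, the ranked cross-network similarity scores, and whether the common usernames occur in the same order. It assumes the lists are in chronological order, which the slide does not state, and uses `difflib`'s ratio as a stand-in similarity.

```python
from difflib import SequenceMatcher

twitter_past = ["bigeasye_", "reezy11_", "epiceric_", "soulanola",
                "swampson_", "hebetheeeric", "swampkidd_"]
tumblr_past = ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]


def common_usernames(us_a, us_b):
    """Usernames that appear verbatim in both histories."""
    return set(us_a) & set(us_b)


def ranked_similarities(us_a, us_b):
    """All cross-network pair similarities, highest first."""
    scores = [SequenceMatcher(None, a, b).ratio()
              for a in us_a for b in us_b]
    return sorted(scores, reverse=True)


def same_temporal_order(us_a, us_b):
    """True if the common usernames appear in the same order in both lists.

    Only meaningful under the assumption that each list is chronological.
    """
    common = common_usernames(us_a, us_b)
    return ([u for u in us_a if u in common]
            == [u for u in us_b if u in common])


common = common_usernames(twitter_past, tumblr_past)
ranked = ranked_similarities(twitter_past, tumblr_past)
ordered = same_temporal_order(twitter_past, tumblr_past)
```

Even where no username is reused verbatim, near matches such as "swampson_" and "swampson" still score highly, which is why similarity scores are kept alongside exact overlap.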
Methodology
Figure: collected username sets from SN_A and SN_B → features and similarities (patterns of username creation behavior and username reuse behavior across OSNs) → supervised classification on labeled datasets → prediction.
Features
Username Set Similarities
–  Static Creation
    –  Syntactic: Similar Length, Similar Choice of Characters, Similar Arrangement of Characters
    –  Stylistic: Case, LeetSpeak, Emphasizer, Prefix / Suffix, Slang words, Bad words, Function words, Phonetic Replacement, Grammar
–  Evolutionary Creation: Evolution of Length, Evolution of Choice of Characters, Evolution of Arrangement of Characters
–  Occasional Reuse: Common username?, Best similarity score, Second best similarity score
–  Frequent Reuse: Common username set
–  Temporal: Temporal ordering?, Temporal sync?
Evaluation
Datasets
–  Linking profiles
    –  Twitter – Tumblr
    –  Twitter – Facebook
    –  Twitter – Instagram
–  Past usernames available for both profiles:
    –  18,959 positive pairs, 18,959 negative pairs
–  Past usernames available only on Twitter but current username available on the other profile:
    –  109,292 positive pairs, 109,292 negative pairs
Network pair             Twitter-Tumblr   Twitter-Facebook   Twitter-Instagram   Total users
History on both          14,301           1,166              3,492               18,959
History on source only   58,285           31,076             19,931              109,292
Supervised Classification
1.  Independent Supervised Framework
2.  Fusion Supervised Framework
3.  Cascaded Supervised Framework
Figure: Cascaded framework. Classifier I uses current username features [Exact Match, Substring Match]. Pairs it labels positive are declared Same User; pairs it labels negative are passed to Classifier II, which uses username set features [Naive Bayes, SVM, Decision Tree, Random Forest] on US - UC (or US - uc) and labels them Same User (positive) or Different Users (negative). Example: US: {‘eenjolrass’, ‘isabelnevills’, ‘giuliettacapuleti’, ‘tobsregbo’}; UC: {‘enjoolras’, ‘isabelnevilles’}; uc: {‘isabelnevilles’}; current usernames: {‘tobsregbo’, ‘isabelnevilles’}.
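The control flow of the cascaded framework can be sketched as below. Stage one is the exact or substring match on current usernames described on the slide. Stage two in the slides is a trained classifier (Naive Bayes, SVM, Decision Tree or Random Forest) over username set features; the threshold on the best pairwise similarity used here is only a stand-in for that classifier, and the threshold value is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def exact_match(u_a: str, u_b: str) -> bool:
    """Baseline b1: current usernames are identical (case-insensitive)."""
    return u_a.lower() == u_b.lower()


def substring_match(u_a: str, u_b: str) -> bool:
    """Baseline b2: one current username contains the other."""
    a, b = u_a.lower(), u_b.lower()
    return a in b or b in a


def best_set_similarity(set_a, set_b) -> float:
    """Highest pairwise similarity between two username sets."""
    return max((SequenceMatcher(None, a, b).ratio()
                for a in set_a for b in set_b), default=0.0)


def cascaded_link(current_a, current_b, history_a, history_b,
                  stage_one=exact_match, threshold=0.8):
    """Return True for 'same user', False for 'different users'."""
    if stage_one(current_a, current_b):
        return True  # Classifier I positive: no need for stage two
    # Classifier I negative: fall through to the username-set stage
    # (a similarity threshold standing in for a trained classifier).
    return best_set_similarity(history_a, history_b) >= threshold


us = {"eenjolrass", "isabelnevills", "giuliettacapuleti", "tobsregbo"}
uc = {"enjoolras", "isabelnevilles"}
linked = cascaded_link("tobsregbo", "isabelnevilles", us, uc)
```

In this example the current usernames do not match, so a method that looks only at current values would miss the link, while the username histories recover it.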
Prediction
Framework Config.                Accuracy   FNR     FPR
Exact Match (b1)                 55.38      89.34   0.00
Substring Match (b2)             60.99      78.46   0.00
Independent [b1 → Naive Bayes]   72.10      53.81   1.91
Fusion [b1 → Naive Bayes]        72.93      51.89   0.19
Cascaded [b1 → Naive Bayes]      73.12      48.87   3.07
Cascaded [b1 → SVM [Linear]]     76.97      40.87   3.71
Cascaded [b2 → Naive Bayes]      73.27      48.52   3.14
Cascaded [b2 → SVM [Linear]]     76.93      40.87   3.78
Reduction in FNR: 48.47 percentage points (89.34 → 40.87)
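For reference, the reported measures follow from confusion-matrix counts, and the 48.47 figure is the difference between the exact-match baseline FNR and the best cascaded FNR. The counts passed to `rates` below are hypothetical and serve only to show the formulas; the two FNR values are taken from the table.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, false negative rate and false positive rate, in percent."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + fp + tn),
        "fnr": 100 * fn / (fn + tp),  # missed links among true pairs
        "fpr": 100 * fp / (fp + tn),  # wrong links among non-pairs
    }


# Hypothetical balanced counts, for illustration only.
example = rates(tp=60, fn=40, fp=5, tn=95)

# FNR values from the table: Exact Match (b1) and Cascaded [b1 -> SVM].
baseline_fnr, best_fnr = 89.34, 40.87
reduction = round(baseline_fnr - best_fnr, 2)
```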
So far…
Figure: Existing identity linking compares only the current usernames (@darkmatter_ and @holy.james) and predicts different users. Proposed identity linking compares username histories ({@magascus, @hello_kitty, @darkmatter_} and {@hello_kitty, @magascu_, @holy.james}) and predicts same user.
Reduction of FNR from 89% to 40%
Proposed methods exploit…
IDENTITY SEARCH: uses attributes that are shared across OSNs
IDENTITY LINKING: uses attribute evolution
Understanding …
Attribute Evolution
–  Implies: Out-of-sync identities in time
–  Identify possible reasons and characteristics
–  Implications?
Attribute Sharing
–  Implies: sharing sensitive information
–  Identify possible reasons and characteristics?
–  Risks? Privacy implications? Do users care?
Attribute Evolution
–  Aim: To understand how, why, and what fraction of users have “out-of-sync” identities across OSNs
–  Tracked about 8.7 million random Twitter users and analyzed in depth 10K users who evolved over time [selectively sampled]
–  Studied a unique identifiable public attribute: username
–  Observations:
    –  20% of users account for 80% of username changes observed on Twitter
    –  New usernames are distinctly different from the old usernames
    –  A section of these users change for benign reasons like space gain or change of identifiability, while others are suspected of malicious intentions
–  Implication: Due to attribute evolution, a quality dataset of past identities of a user is available. Instead of a challenge, this becomes an opportunity for our proposed identity linking.
Attribute Sharing
–  Aim: To understand the reasons and risks of sharing sensitive identifiable information about oneself
–  Collected 2,492 Indian mobile numbers from public posts, bios and names on OSNs like Twitter and Facebook
–  Observations:
    –  Mobile numbers are pushed across multiple OSNs, intentionally and unintentionally
    –  Publicly shared sensitive information like a mobile number can expose identifiable details (ID, name, family) if collated with external data sources
–  Implications:
    –  Awareness of the collation risks associated with sharing sensitive information is necessary. Technological solutions should attend to it.
    –  Sharing sensitive information can implicitly resolve identities
Contributions Summary
–  Methods for identity search that exploit public attributes and user behavior across OSNs
    –  We address the challenge of heterogeneous OSNs by considering only public and universally available attributes
–  Method for identity linking that leverages user evolution over time
    –  We turn the challenge of attribute evolution to our advantage by comparing both past and current versions of the identities
–  Observed and characterized user behavior that aids our proposed methods
    –  We add to existing knowledge for the development of our methods as well as future identity resolution methods
Implications to?
–  Enterprises can carry out:
    –  Automated audience de-duplication
    –  Automated psychographic segmentation based on aggregated user profiles and inferred attributes
–  Security practitioners can de-anonymize malicious users
–  Users
    –  Can better understand their identity leaks and patch them to avoid identity resolution
    –  E.g. “should not share same content”, “should not create similar histories of username”
    –  Risks of sharing sensitive information need to be communicated by new Over-the-top (OTT) applications
Limitations and Future Work
–  Dependency on API
–  Limited to only usernames for identity linking
–  Evaluation on self-identified users
–  Future work:
    –  Extend to include past versions of identities for better identity search methods
    –  Extend to exploit evolution of multiple attributes in a time-synchronized manner for identity linking
    –  Develop an OTT messenger that highlights possible leaks of sensitive information, privacy and identity to a user
Peer-reviewed Publications (1)
–  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek ‘fb.me’: Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160
–  Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15. ACM, New York, NY, USA, 1575-1576. DOI=http://dx.doi.org/10.1145/2808797.2809340
–  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040
Peer-reviewed Publications (2)
–  Paridhi Jain and Ponnurangam Kumaraguru. 2016. On the Dynamics of Username Changing Behavior on Twitter. In Proceedings of the 3rd IKDD Conference on Data Science, 2016, CODS ’16. ACM, New York, NY, USA, Article 6, 6 pages. DOI=http://dx.doi.org/10.1145/2888451.2888452
–  Prachi Jain, Paridhi Jain, and Ponnurangam Kumaraguru. 2013. Call me Maybe: Understanding Nature and Risks of sharing Mobile Numbers on Online Social Networks. In Proceedings of the first ACM Conference on Online social networks, COSN ’13. ACM, New York, NY, USA, 101-106. DOI=http://dx.doi.org/10.1145/2512938.2512959
–  Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, and Virgilio Almeida. Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks. In Proceedings of the 2011 IEEE 3rd International Conference on Social Computing, SocialCom ’11, pages 477–482, Oct 2011.
Acknowledgments
51
•  My	advisor	‘PK’	
	
•  Prof.	Anupam	Joshi	and	Prof.	Rahul	Purandare	
•  Members	of	Precog@IIITD	and	CERC@IIITD	
•  Supported	by	TCS	Research	Fellowship	(2010	–	2016)	
•  Friends,	Colleagues	and	Family	
Niharika, Siddhartha, Aditi, Prateek, Anupama, Srishti
Thanks!
Paridhi.jain@xerox.com	

Automated Methods for Identity Resolution across Online Social Networks

  • 1. Automated Methods for Identity Resolution across Online Social Networks Paridhi Jain April 25th, 2016 Prof. Ponnurangam Kumaraguru (Advisor) Prof. Alan Mislove (Northeastern University) Prof. Amitabha Bagchi (IIT-Delhi) Dr. Sachin Lodha (TRDDC)
  • 2. Online Social Network (OSN) “a platform to build social relations among people who share similar interests, activities, backgrounds or real-life connections.” [Boyd et al.] 209 active OSNs in 2015
  • 3. Coverage of Social Networks •  Unique Service •  At least 200 million users register on OSNs •  A user is bound to maintain multiple accounts
  • 4. Single User on Multiple OSNs! Can we predict the link? Can we find and link disconnected identities of a single user? = Identity Resolution
  • 5. Why do Identity Resolution?
  • 8. Security Practitioners (Attribute Aggregation) “The Twitter account has no real name attached to it. But Buzzfeed contributor found her Tumblr identity and identified the account owner as Shashank ***, a hedge fund analyst and campaign manager.” False Sandy Update. Source: http://edition.cnn.com/2012/10/31/tech/social-media/sandy-twitter-hoax/
  • 11. Formulation
      An individual is denoted by $I$ and her identity on a social network $SN_A$ is denoted by $I_A$. The task of identity resolution can be formally defined as follows.
      Problem Definition 1 (Identity Resolution): Given an identity $I_A$ of user $I$ on social network $SN_A$, find her identity $I_B$ on social network $SN_B$ using a search function $S$ and a linking function $L$:
      $I_B = \max_{1 \le j \le N} L(I_A, I_{Bj})$ where $I_{Bj} \in S(I_A)$
      Observing the two functions involved, the process of identity resolution in online social networks can be divided into two subprocesses: identity search and identity linking. Identity search lists a set of candidate identities on $SN_B$ which are similar to the given known identity $I_A$ according to the search function $S$ and are suspected to belong to user $I$. Such a set of candidate identities is represented as $S(I_A)$ and its size is denoted by $N$. The search function $S$ takes $I_A$'s attribute value, a defined similarity metric $sim_S$, and a search space ($SN_B$ in this scenario) as arguments, and selects all identities ($I_{B1} \cdots I_{Bj} \cdots I_{BN}$) from the search space for which the similarity $sim_S$ between the candidate's attribute value and $I_A$'s attribute value is greater than a threshold.
      Search Function S:
      •  Input: an identity, a search space
      •  Output: candidate set
      Linking Function L:
      •  Input: an identity, a candidate set
      •  Output: best matching candidate
      Figure 2.3: Architecture of an identity resolution process (Identity Search followed by Identity Linking; example identity: @janemargetkitchen, Mrs. Marget, Cookie Specialist; * implies comparison of complete identities).
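The two-stage formulation above (search for candidates whose similarity clears a threshold, then pick the candidate the linking function scores highest) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary representation of an identity, the `field_similarity` helper, and the use of `difflib.SequenceMatcher` as the similarity metric are all hypothetical choices, not the methods used in the thesis.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Optional

Identity = Dict[str, str]
Similarity = Callable[[Identity, Identity], float]


def field_similarity(field: str) -> Similarity:
    """Build a similarity function over one attribute (hypothetical helper)."""
    def sim(a: Identity, b: Identity) -> float:
        x, y = a.get(field, "").lower(), b.get(field, "").lower()
        if not x or not y:
            return 0.0  # a missing attribute gives no evidence of a match
        return SequenceMatcher(None, x, y).ratio()
    return sim


def search(i_a: Identity, sn_b: List[Identity],
           sim_s: Similarity, theta: float) -> List[Identity]:
    """Search function S: keep identities whose similarity to i_a is >= theta."""
    return [c for c in sn_b if sim_s(i_a, c) >= theta]


def resolve(i_a: Identity, sn_b: List[Identity], sim_s: Similarity,
            link: Similarity, theta: float) -> Optional[Identity]:
    """Identity resolution: the candidate in S(i_a) that maximizes L(i_a, candidate)."""
    candidates = search(i_a, sn_b, sim_s, theta)
    if not candidates:
        return None  # the search stage found nothing to link
    return max(candidates, key=lambda c: link(i_a, c))


if __name__ == "__main__":
    known = {"username": "jainpari", "name": "Paridhi Jain"}
    network_b = [
        {"username": "jainpari", "name": "Paridhi J"},
        {"username": "qqqq", "name": "Zed"},
        {"username": "jainpari", "name": "Paridhi Jain"},
    ]
    best = resolve(known, network_b, field_similarity("username"),
                   field_similarity("name"), theta=0.9)
    print(best)
```

Keeping the search metric and the linking metric as separate arguments mirrors the slide's split between a cheap candidate filter and a more careful pairwise comparison.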
  • 13. My Contributions
      –  Identity Search: Novel methods for creating the candidate set by exploiting public and discriminative attributes; increase identity resolution accuracy by 13%
      –  Identity Linking: Novel method for effectively linking identities by leveraging attribute history; reduction in miss rate by 48%
  • 14. Identity Search
      Pipeline: extract available & discriminative features → IDENTITY SEARCH → candidate identities → IDENTITY LINKING (pairwise comparisons)
      Aim: To retrieve a candidate set containing the identity we search for.
  • 15. Formulation
      2.2.1 Identity Search
      Problem Definition 2: For a user $I$, given her identity $I_A$ on social network $SN_A$ and a search function $S$, find a set of identities $I_{Bj}$ on social network $SN_B$ such that $sim_S(I_A, I_{Bj}) \ge \theta$, on defined similarity metric $sim_S$ and empirically calculated threshold $\theta$:
      $\{I_{B1}, \ldots, I_{Bj}, \ldots, I_{BN}\} = S(I_A)$ s.t. $sim_S(I_A, I_{Bj}) \ge \theta$
      Each identity $I_{Bj}$ in the set is termed a candidate identity and the set a candidate set.
      Search Function S:
      •  Can be computed with partial information
      •  Can be computed with different genres of information (text, image)
  • 16. State of the art
      –  Only profile attributes (private and public) are used for Identity Search [Motoyama et al., Malhotra et al., Liu et al.]
      –  Limitations of profile search:
          –  Restrictive search, owing to non-availability of common attributes across networks [gender on Facebook, but not on Twitter]
          –  Search with limited attributes → large candidate set size → intensive Identity Linking computations
          –  Users may choose different profile attributes → the correct identity is missed in the candidate set
      –  Little research on using content and network attributes to search for candidate identities [consistent user behavior and not profile]
      –  Extensive use of both private and public attributes, which needs user authorization for identity search
  • 17. Proposed Methods
      Heuristic search on available attributes
      –  Addresses the gap in the literature by using content and network identity search
      –  Similarity-based rules to find candidate identities matching the given identity
      –  Aims to improve recall
      –  Real-time search
      Unsupervised search on discriminative attributes
      –  Real-time approaches are expensive in computation and time (they search the complete social network)
      –  Pre-segment the social network
      –  Reduces time complexity from O(n²) to O(n)
  • 18. Heuristic Identity Search
      [Figure: Search methods (Profile, Content, Self-mention, Network) over the attributes name, location, username, mobile no., post, friends and followers produce candidate identities. Decision: "If self-identified / returned by more than one search method" (Yes / No). Linking: Syntactic and Image.]
      Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek 'fb.me': Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160 [Honorable Mention Award]
  • 19. Content Search
      Algorithm 2 Heuristic Search Methods
      procedure Content Search
          I_A ← known identity on SN_A
          S ← {I_A.source, I_A.posts}
          if S[0] ∈ {HootSuite, TwitterFeed, Facebook} then
              posts ← S[1]
              for each m in posts do
                  remove stop-words and non-ASCII characters from m
                  limit m to 75 characters
                  query SN_B API with m and retrieve candidates with similar posts
                  Cxs ← candidates
                  for each c in Cxs do
                      if sim(c.post, m) ≤ 0 then
                          delete c from Cxs
                  add Cxs to Cx
          return Cx
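The content search steps above can be sketched in Python. This is an illustration under assumptions, not the thesis implementation: `search_posts` is a hypothetical callable wrapping the SN_B search API, the stop-word list is a small illustrative subset, and word-level Jaccard stands in for the slide's unspecified `sim`.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}  # illustrative subset
AUTO_SOURCES = {"HootSuite", "TwitterFeed", "Facebook"}  # cross-posting sources named on the slide


def clean_post(text, limit=75):
    """Drop non-ASCII characters and stop-words, then truncate to `limit` characters."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    words = [w for w in re.findall(r"\w+", ascii_only.lower()) if w not in STOP_WORDS]
    return " ".join(words)[:limit]


def jaccard(a, b):
    """Word-level Jaccard similarity; a stand-in for sim() (assumption)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def content_search(source, posts, search_posts):
    """Return candidate ids on SN_B whose posts overlap with the user's posts.

    search_posts(query) is hypothetical: it must return (candidate_id, post_text) pairs.
    """
    candidates = []
    if source not in AUTO_SOURCES:
        return candidates
    for post in posts:
        query = clean_post(post)
        for cand_id, cand_post in search_posts(query):
            # Keep only candidates with non-zero similarity, without duplicates.
            if jaccard(clean_post(cand_post), query) > 0 and cand_id not in candidates:
                candidates.append(cand_id)
    return candidates
```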
  • 20. Evaluation
      Ground truth dataset: 543 users from FriendFeed and SocialGraph
      Selection strategy: random selection
      Why: to avoid any bias in evaluation; the methods are meant to be generalizable.
      Accuracy = (correctly identified users) / (total users)
      Precision = $|P_{relevant} \cap P_{retrieved}| / |P_{retrieved}|$
      Recall = $|P_{relevant} \cap P_{retrieved}| / |P_{relevant}|$
      Figure 3.1: Architecture of the identity resolution framework using proposed heuristic search methods and linking methods from literature.
      Table 3.2: Evaluation of the identity resolution framework with contribution of each search algorithm in the resolution accuracy. Search methods based on profile (url), content, self-mention and network attributes improve resolution accuracy by 13.1%.
          Search Algorithm               U_correct   Accuracy
          Profile Search (P)             205         37.7%
          Content Search (C)             3           0.5%
          Self-mention Search (SM)       31          5.7%
          Network Search (N)             1           0.2%
          Identity Search (P+C+SM+N)     220         40.5%
          P (without URL)                149         27.4%
          P (with URL) + (C+SM) + N      149+71      27.4% + 13.1%
      Baseline: the traditional profile search used in the literature, assuming access to only public profile attributes.
      The improved profile, content and network search methods raised accuracy and recall by 13.1 percentage points (27.4% → 40.5%).
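The three evaluation measures can be written directly from their definitions. A minimal Python sketch, assuming relevant and retrieved identities are given as collections of hashable ids (function names are illustrative):

```python
def accuracy(correct, total):
    """Fraction of users whose identity was resolved correctly."""
    return correct / total


def precision(relevant, retrieved):
    """|relevant ∩ retrieved| / |retrieved|; 0.0 when nothing was retrieved."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant, retrieved):
    """|relevant ∩ retrieved| / |relevant|; 0.0 when nothing is relevant."""
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0
```

For the combined search row of Table 3.2, `accuracy(220, 543)` is about 0.405, which matches the reported 40.5%.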
  • 22. Find discriminative features
      Measures: Class Majority Index (CMI); Match : No-Match ratio; Encroachment Index (EI); Error Index (type-I/II error)
      A feature is discriminative if it has:
      •  Low Encroachment Index
      •  Low Error Index
      Sample features:
      •  Username: Jaro Distance, LCS Distance, Levenshtein Distance, Character Bi-gram Jaccard Index, Character Bi-gram Cosine similarity
      •  Name: Jaro Distance, LCS Distance, Character Bi-gram Jaccard Index, Character Bi-gram Cosine similarity
      Example: Match: {paridhi, paridhij}; No-match: {paridhij, parineeta.c}
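Several of the sample features are standard string measures. The sketch below computes four of them in plain Python; it assumes "LCS" means longest common subsequence (the slide does not say whether subsequence or substring is meant) and leaves out Jaro distance for brevity. Function names are illustrative.

```python
from collections import Counter
from math import sqrt


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence (assumed meaning of LCS)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def bigrams(s):
    """Character bi-grams of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def jaccard_bigrams(a, b):
    """Jaccard index over the sets of character bi-grams."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine_bigrams(a, b):
    """Cosine similarity over character bi-gram count vectors."""
    ca, cb = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0
```

On the slide's example, the match pair (paridhi, paridhij) is one edit apart and shares 6 of 7 bi-grams, while the no-match pair (paridhij, parineeta.c) shares only 3 of 14.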
  • 23. Modified Canopy Clustering
      Modifications:
      •  Earlier: overlapping canopies
      •  Overlapping canopies may not reflect similarity with the given user identity
      We create:
      •  Non-overlapping canopies
      The search algorithm is modified and a concept of ‘sibling’ clusters is introduced. As non-overlapping clustering tends to miss some probable candidates, extending this constrained set with siblings results in higher accuracy. The algorithm is given as Algorithm 6.
      Algorithm 6 Modification to the Canopies
      procedure Mod-Canopies
          U ← set of user-profiles on the network
          T ← threshold
          d(x, y) ← distance measure
          for each user-profile x in U:
              create canopy Cx such that for each user-profile y in U, insert y into Cx if d(x, y) < T
              remove all user-profiles y added in the previous step from U
          loop while U is not empty
      The algorithm is similar to canopy clustering and its time complexity is still O(n²) in the worst case.
      Example usernames to cluster: paridhij, pari.nidhi, paridhijain, ridhi_jain, paritosh_jain, Parineeta.jain, parineeta_joshi, r_jain, Raghav_jain, riju_
      Algorithmic time complexity reduces to O(n)
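Algorithm 6 can be sketched as a loop that repeatedly picks a remaining profile, gathers everything within the threshold, and removes those profiles so that canopies never overlap. This is a minimal Python sketch assuming hashable profiles and a caller-supplied distance function; picking the first remaining profile as the canopy center is an assumption, since the slide does not specify the order.

```python
def mod_canopies(profiles, distance, threshold):
    """Partition profiles into non-overlapping canopies.

    Each canopy is a list whose first element is its center.
    """
    remaining = list(profiles)
    canopies = []
    while remaining:
        center = remaining[0]
        # The center always joins its own canopy, so the loop always makes progress.
        canopy = [center] + [y for y in remaining[1:] if distance(center, y) < threshold]
        canopies.append(canopy)
        members = set(canopy)
        remaining = [y for y in remaining if y not in members]
    return canopies
```

Each profile ends up in exactly one canopy, which is what later makes the per-query search a scan over canopies rather than over all profiles.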
  • 24. Unsupervised search for a candidate set
      We experimented with different values of the threshold T to determine the best one. With a very small value, we cannot expand the candidate set, since no sibling clusters are found; with an extremely large value, the candidate set can become too large, making the algorithm computationally expensive. The empirical threshold T for our dataset is set to 12.
      Algorithm 7 Unsupervised search method
      procedure Modified-Search
          U ← user profile we are looking for
          C ← set of non-overlapping clusters
          T ← threshold
          d(Cx, Cy) ← distance measure
          for each cluster Cx in C:
              compute the distance d(U, Cx)
          select cluster Cm such that d(U, Cm) is the minimum of all distances computed above; this is the most suitable cluster
          L ← list of suitable clusters, initially empty
          for each cluster Cx in C:
              if d(Cm, Cx) < T and d(U, Cx) < T then append Cx to L
          L holds our list of candidate clusters
      For search, look for:
      •  Sibling canopies
      •  Canopies similar to the most suitable canopy AND similar to the searched user profile
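Algorithm 7 can be sketched as two passes over the canopies. The Python sketch below assumes each canopy is a non-empty list whose first element is its center, and that the distance to a canopy is the distance to that center; both are assumptions, since the slide leaves d(U, Cx) and d(Cm, Cx) abstract.

```python
def sibling_search(query, canopies, distance, threshold):
    """Return (most_suitable_canopy, candidate_canopies) for a query profile.

    Requires at least one canopy; every canopy must be non-empty.
    """
    def to_canopy(item, canopy):
        # Assumption: a canopy is represented by its first member (the center).
        return distance(item, canopy[0])

    # Pass 1: the most suitable canopy is the one closest to the query.
    best = min(canopies, key=lambda c: to_canopy(query, c))

    # Pass 2: keep canopies close to both the best canopy and the query (siblings).
    candidates = [
        c for c in canopies
        if to_canopy(best[0], c) < threshold and to_canopy(query, c) < threshold
    ]
    return best, candidates
```

The best canopy is itself a candidate whenever the query lies within the threshold of its center.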
  • 25. Evaluation (Facebook-Twitter)
      M = Match class; NM = No-Match class
          # of Users (M:NM = 1:1)   Threshold   Precision (Canopy)   Recall (Canopy)   Precision (MOD-Canopy)   Recall (MOD-Canopy)
          20000                     0.95        0.15                 0.90              0.25                     0.79
          20000                     0.97        0.20                 0.70              0.30                     0.55
          20000                     0.98        0.24                 0.62              0.33                     0.69
      Increasing the threshold increases precision and degrades recall.
  • 27. Identity Linking
      [Figure: IDENTITY SEARCH → candidate identities → IDENTITY LINKING: extract available and discriminative features, then run pairwise comparisons.]
      Aim: to retrieve the best among the candidate set, i.e. the correct identity of the user
  • 28. Formulation
      2.2.2 Identity Linking. Problem Definition 3: Given an identity $I_A$ of user I on social network $SN_A$, a set of candidate identities $Q = S(I_A) = \{I_{B1}, \dots, I_{Bj}, \dots, I_{BN}\}$ on social network $SN_B$ and a linking function L, locate an identity pair $(I_A, I_{Bj})$ such that $L(I_A, I_{Bj}) = \max\{L(I_A, I_{B1}), \dots, L(I_A, I_{BN})\}$. The $I_{Bj}$ with the highest link-score is inferred as $I_B$:
          $I_B = \max_{1 \le j \le N} L(I_A, I_{Bj})$ where $I_{Bj} \in Q$
      An identity linking method estimates the correspondence between identity $I_A$ and each candidate identity $I_{Bj}$ by calculating a link-score $L(I_A, I_{Bj})$ between their respective attributes, and then ranks the candidate set on the basis of link-score. The candidate identity $I_{Bj}$ with the highest link-score is concluded to be $I_B$. The function L can be computed for all varieties of data: text, date, image and location. The function can be either a supervised classifier decision boundary or a heuristic rule; in both scenarios, it can be computed with partial or complete information.
      Linking Function L:
      •  Can be a rule or a supervised classifier
      •  Can be computed with partial information
      •  Can be computed with different genres of information (text, image)
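Problem Definition 3 amounts to scoring every candidate and taking the top-ranked one. A minimal Python sketch, assuming the link-score function is supplied by the caller; `string_ratio` is only an illustrative stand-in built on `difflib.SequenceMatcher`, not the linking function used in the thesis.

```python
from difflib import SequenceMatcher


def rank_candidates(identity_a, candidates, link_score):
    """Return (candidate, score) pairs sorted by descending link-score."""
    scored = [(c, link_score(identity_a, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def resolve(identity_a, candidates, link_score):
    """Return the candidate with the highest link-score, or None if there are none."""
    ranked = rank_candidates(identity_a, candidates, link_score)
    return ranked[0][0] if ranked else None


def string_ratio(a, b):
    """Illustrative link-score on usernames (assumption, not the thesis L)."""
    return SequenceMatcher(None, a, b).ratio()
```

Because `link_score` is a parameter, the same ranking step works whether L is a heuristic rule or the decision score of a supervised classifier.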
  • 29. State of the art
      –  Methods link identities using:
          –  Profile attributes [Zafarani et al., Perito et al., Malhotra et al., Liu et al.]
          –  Content attributes [Iofciu et al., Liu et al., Goga et al.]
          –  Network attributes [Bartunov et al., Narayanan et al., Labitzke et al.]
          –  Crowd-sourced mechanisms [Shehab et al.]
          –  Search engines [Bilge et al.]
      –  Most methods in the literature assume access to the present (current) attributes of the identities, and compare and match only those.
      –  But current versions of the identities may fail to match due to:
          –  User choice
          –  Attribute evolution
  • 31. Proposed Identity Linking
      –  If current versions do not match and if the user behavior is consistent across OSNs, any of the past versions “may” match.
      [Figure: username sets on SNA and SNB → features and similarities capturing patterns of username creation behavior and of username reuse behavior across OSNs → supervised classification trained on labeled datasets → prediction. Example: US: {'eenjolrass', 'isabelnevills', 'giuliettacapuleti', 'tobsregbo'}; UC: {'enjoolras', 'isabelnevilles'}; uc: {'isabelnevilles'}]
      Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040.
  • 32. Username Set Collection
      [Figure: Tumblr username in the URL attribute of a Twitter profile; Twitter username.]
      •  Past usernames:
          •  Automated tracking system that queries a user’s ID via API to record her changed profile attributes:
              •  her username on the OSN
              •  her URL attribute, signifying a change to her username on the other OSN
                  •  Old Twitter URL: abcd_efgh.tumblr.com
                  •  New Twitter URL: xyz.tumblr.com
      •  Ground truth:
          •  Self-identification behavior [cross-referencing one’s OSN accounts]
  • 33. Example
      –  User ID: 595**942*
      –  Past usernames on Twitter:
          –  ["bigeasye_", "reezy11_", "epiceric_", "soulanola", "swampson_", "hebetheeeric", "swampkidd_"]
      –  Past usernames on Tumblr:
          –  ["bigeasye_", "epiceric17", "swampson", "hebetheeeric"]
  • 34. Methodology
      [Figure, as on slide 31: collected username sets on SNA and SNB → features and similarities (patterns of username creation behavior and of username reuse behavior across OSNs) → supervised classification with labeled datasets → prediction.]
  • 35. Features (Username Set Similarities)
      Syntactic
          Static Creation: Similar Length; Similar Choice of Characters; Similar Arrangement of Characters
          Evolutionary Creation: Evolution of Length; Evolution of Choice of Characters; Evolution of Arrangement of Characters
      Stylistic
          Case; LeetSpeak; Emphasizer; Prefix / Suffix; Slang words; Bad words; Function words; Phonetic Replacement; Grammar
      Temporal
          Occasional Reuse: Common username?; Best similarity score; Second best similarity score
          Frequent Reuse: Common username set; Temporal ordering?; Temporal sync?
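A few of the reuse features can be illustrated on two username histories. The Python sketch below is an interpretation, not the thesis feature extractor: it assumes "common username" means an exact match between the two sets, and it uses `difflib.SequenceMatcher` as a stand-in for the pairwise similarity score.

```python
from difflib import SequenceMatcher
from itertools import product


def reuse_features(set_a, set_b):
    """Compute illustrative username-reuse features for two username histories."""
    common = set(set_a) & set(set_b)
    # Similarity of every cross-network username pair, best first.
    scores = sorted(
        (SequenceMatcher(None, a, b).ratio() for a, b in product(set_a, set_b)),
        reverse=True,
    )
    return {
        "common_username": bool(common),
        "common_username_set": common,
        "best_similarity": scores[0] if scores else 0.0,
        "second_best_similarity": scores[1] if len(scores) > 1 else 0.0,
    }
```

For the user on slide 33, the Twitter and Tumblr histories share "bigeasye_" and "hebetheeeric", so both the best and second-best similarity scores are 1.0 even if the current usernames differ.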
  • 36. Evaluation
      [Figure, as on slide 31: collected username sets on SNA and SNB → features and similarities (patterns of username creation behavior and of username reuse behavior across OSNs) → supervised classification with labeled datasets → prediction.]
  • 37. Datasets
      –  Linking profiles:
          –  Twitter – Tumblr
          –  Twitter – Facebook
          –  Twitter – Instagram
      –  Past usernames available for both profiles:
          –  18,959 positive pairs, 18,959 negative pairs
      –  Past usernames available only on Twitter, but current username available on the other profile:
          –  109,292 positive pairs, 109,292 negative pairs
          Network pair              Twitter-Tumblr   Twitter-Facebook   Twitter-Instagram   Total users
          History on both           14,301           1,166              3,492               18,959
          History on source only    58,285           31,076             19,931              109,292
  • 38. Supervised Classification
      1.  Independent Supervised Framework
      2.  Fusion Supervised Framework
      3.  Cascaded Supervised Framework
          –  Classifier I: current username features [Exact Match, Substring Match]. Positive? → same user. Negative? → Classifier II
          –  Classifier II: username set features [Naive Bayes, SVM, Decision Tree, Random Forest]. Positive? → same user. Negative? → different users
      Example: US: {'eenjolrass', 'isabelnevills', 'giuliettacapuleti', 'tobsregbo'}; UC: {'enjoolras', 'isabelnevilles'}; uc: {'isabelnevilles'}; current usernames {'tobsregbo', 'isabelnevilles'}; compared sets: US - UC (or US - uc)
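The cascaded framework can be sketched as two stages, where the second stage only runs when the first one says "negative". In this Python sketch the function names are illustrative, and `shares_past_username` is a toy placeholder for the trained Classifier II (Naive Bayes, SVM, etc.), which is not reproduced here.

```python
def exact_match(u_a, u_b):
    """Baseline b1: current usernames are identical (case-insensitive)."""
    return u_a.lower() == u_b.lower()


def substring_match(u_a, u_b):
    """Baseline b2: one current username contains the other."""
    a, b = u_a.lower(), u_b.lower()
    return a in b or b in a


def shares_past_username(history_a, history_b):
    """Toy placeholder for Classifier II; a real one would be trained on username-set features."""
    return bool(history_a & history_b)


def cascaded_link(current_a, current_b, history_a, history_b,
                  stage_one=exact_match, stage_two=None):
    """Return True if the two profiles are predicted to belong to the same user."""
    if stage_one(current_a, current_b):
        return True  # Classifier I positive: same user, stage two is skipped.
    if stage_two is None:
        return False
    # Classifier I negative: fall back to the username-set classifier.
    return bool(stage_two(set(history_a), set(history_b)))
```

The design keeps the zero false positive rate of the cheap first stage for easy pairs and spends the heavier classifier only on pairs the baseline rejects.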
  • 39. Prediction
          Framework Config.                  Accuracy   FNR     FPR
          Exact Match (b1)                   55.38      89.34   0.00
          Substring Match (b2)               60.99      78.46   0.00
          Independent [b1 → Naive Bayes]     72.10      53.81   1.91
          Fusion [b1 → Naive Bayes]          72.93      51.89   0.19
          Cascaded [b1 → Naive Bayes]        73.12      48.87   3.07
          Cascaded [b1 → SVM [Linear]]       76.97      40.87   3.71
          Cascaded [b2 → Naive Bayes]        73.27      48.52   3.14
          Cascaded [b2 → SVM [Linear]]       76.93      40.87   3.78
      FNR falls by 48.47 points (89.34 → 40.87).
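The three reported quantities follow from a confusion matrix. The Python sketch below shows the arithmetic on made-up counts (the thesis counts are not on the slide), assuming the table values are percentages; it then reproduces the 48.47-point FNR reduction from the two FNR values that are on the slide.

```python
def rates(tp, fn, fp, tn):
    """Accuracy, false negative rate and false positive rate, in percent.

    Assumes at least one positive and one negative example.
    """
    return {
        "accuracy": 100 * (tp + tn) / (tp + fn + fp + tn),
        "fnr": 100 * fn / (tp + fn),
        "fpr": 100 * fp / (fp + tn),
    }


# FNR values taken from the table: Exact Match (b1) vs. Cascaded [b1 -> SVM [Linear]].
fnr_reduction = round(89.34 - 40.87, 2)
```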
  • 40. So far…
      –  Existing identity linking compares current usernames only: @darkmatter_ vs. @holy.james → different users
      –  Proposed identity linking compares username sets: {@darkmatter_, @magascus, @hello_kitty} vs. {@hello_kitty, @magascu_, @holy.james} → same user
      –  Reduction of FNR from 89% to 40%
  • 42. Understanding …
      Attribute Evolution
      –  Implies: identities out of sync in time
      –  Identify possible reasons and characteristics
      –  Implications?
      Attribute Sharing
      –  Implies: sharing sensitive information
      –  Identify possible reasons and characteristics
      –  Risks? Privacy implications? Do users care?
  • 43. Attribute Evolution
      –  Aim: to understand how, why, and what fraction of users have “out-of-sync” identities across OSNs
      –  Tracked about 8.7 million random Twitter users and analyzed in depth 10K users who evolved over time [selectively sampled]
      –  Studied a unique, identifiable public attribute: username
      –  Observations:
          –  20% of users account for 80% of the username changes observed on Twitter
          –  New usernames are distinctly different from the old usernames
          –  A section of these users change usernames for benign reasons such as space gain or a change of identifiability, while others are suspected of malicious intentions
      –  Implication: due to attribute evolution, a quality dataset of a user's past identities is available. Instead of a challenge, this becomes an opportunity for our proposed identity linking.
  • 44. Attribute Sharing
      –  Aim: to understand the reasons for and risks of sharing sensitive identifiable information about oneself
      –  Collected 2,492 Indian mobile numbers from public posts, bios and names on OSNs such as Twitter and Facebook
      –  Observations:
          –  Mobile numbers are pushed across multiple OSNs, intentionally and unintentionally
          –  Publicly shared sensitive information such as a mobile number can expose identifiable details (ID, name, family) if collated with external data sources
      –  Implications:
          –  Awareness of the collation risks associated with sharing sensitive information is necessary. Technological solutions should attend to it.
          –  Sharing sensitive information can implicitly resolve identities
  • 45. Contributions Summary
      –  Methods for identity search that exploit public attributes and user behavior across OSNs
          –  We address the challenge of heterogeneous OSNs by considering only public and universally available attributes
      –  Method for identity linking that leverages user evolution over time
          –  We exploit the challenge of attribute evolution to our advantage by comparing both past and current versions of the identities
      –  Observed and characterized user behavior that aids our proposed methods
          –  We add to existing knowledge for the development of our methods as well as future identity resolution methods
  • 46. Implications to?
      –  Enterprises can carry out:
          –  Automated audience de-duplication
          –  Automated psychographic segmentation based on aggregated user profiles and inferred attributes
      –  Security practitioners can de-anonymize malicious users
      –  Users:
          –  Can better understand their identity leaks and patch them to avoid identity resolution
          –  E.g. “should not share same content”, “should not create similar histories of username”
          –  Risks of sharing sensitive information need to be communicated by new Over-the-top (OTT) applications
  • 47. Limitations and Future Work
      –  Dependency on API
      –  Limited to only usernames for identity linking
      –  Evaluation on self-identified users
      –  Future work:
          –  Extend to include past versions of identities for better identity search methods
          –  Extend to exploit the evolution of multiple attributes in a time-synchronized manner for identity linking
          –  Develop an OTT messenger that highlights possible leaks of sensitive information, privacy and identity to a user
  • 48. Peer-reviewed Publications (1)
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2013. @I seek 'fb.me': Identifying Users across Multiple Online Social Networks. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13 Companion. ACM, New York, NY, USA, 1259-1268. DOI=http://dx.doi.org/10.1145/2487788.2488160
      –  Niyati Chhaya, Dhwanit Agarwal, Nikaash Puri, Paridhi Jain, Deepak Pai, and Ponnurangam Kumaraguru. 2015. EnTwine: Feature Analysis and Candidate Selection for Social User Identity Aggregation. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’15. ACM, New York, NY, USA, 1575-1576. DOI=http://dx.doi.org/10.1145/2808797.2809340
      –  Paridhi Jain, Ponnurangam Kumaraguru, and Anupam Joshi. 2015. Other Times, Other Values: Leveraging Attribute History to Link User Profiles across Online Social Networks. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15. ACM, New York, NY, USA, 247-255. DOI=http://dx.doi.org/10.1145/2700171.2791040
  • 49. Peer-reviewed Publications (2)
      –  Paridhi Jain and Ponnurangam Kumaraguru. 2016. On the Dynamics of Username Changing Behavior on Twitter. In Proceedings of the 3rd IKDD Conference on Data Science, 2016, CODS ’16. ACM, New York, NY, USA, Article 6, 6 pages. DOI=http://dx.doi.org/10.1145/2888451.2888452
      –  Prachi Jain, Paridhi Jain, and Ponnurangam Kumaraguru. 2013. Call me Maybe: Understanding Nature and Risks of Sharing Mobile Numbers on Online Social Networks. In Proceedings of the first ACM Conference on Online Social Networks, COSN ’13. ACM, New York, NY, USA, 101-106. DOI=http://dx.doi.org/10.1145/2512938.2512959
      –  Paridhi Jain, Tiago Rodrigues, Gabriel Magno, Ponnurangam Kumaraguru, and Virgilio Almeida. 2011. Cross-Pollination of Information in Online Social Media: A Case Study on Popular Social Networks. In Proceedings of the 2011 IEEE 3rd International Conference on Social Computing, SocialCom ’11, pages 477–482, Oct 2011.
  • 50. Acknowledgments
      •  My advisor ‘PK’
      •  Prof. Anupam Joshi and Prof. Rahul Purandare
      •  Members of Precog@IIITD and CERC@IIITD
      •  Supported by TCS Research Fellowship (2010 – 2016)
      •  Friends, colleagues and family: Niharika, Siddhartha, Aditi, Prateek, Anupama, Srishti