RESLVE: Leveraging User Interest to Improve Entity Disambiguation on Short Text

Elizabeth L. Murnane (elm236@cornell.edu)
Bernhard Haslhofer (bernhard.haslhofer@univie.ac.at)
Carl Lagoze (clagoze@umich.edu)
  
A Personalized Approach to Entity Resolution

Background
•  Task Definitions
•  Challenges & Examples
•  Attempted Solutions

Approach
•  Motivations
•  Modeling a Knowledge Context
•  Implementation: The RESLVE System

Evaluation
•  Experiments
•  Results
•  Future Work
  
Social Web
•  10 million pages per day
•  800 million visitors per month
•  7 billion images (twice as many as 4 years ago)
  
Task Definition

Named Entity Recognition (NER)
•  Systematically identifying mentions of entities (e.g., people, places, concepts, ideas)

Named Entity Disambiguation (NED)
•  Resolving the intended meaning of ambiguous entities from multiple candidate meanings
  
Ambiguous Entities
•  "aaahh one more day until finn!!! #cantwait"
•  "office holiday party"
•  "Beetle"
  
"office holiday party" / "office, december 3"

Footage of:
•  A workplace?
•  A TV show?
   •  US version? UK version?
   •  Episode 4?
  
Challenges & Focus
•  Short Length
•  Sparse Lexical Context
•  Noisy
•  Highly personal in nature
  
Limitations of Extant Research

Tweets severely degrade traditional techniques
•  Stanford NER: F1 drops 90% → 46%
•  DBPedia Spotlight & Wikipedia Miner: P@1 < 40%

Recent strategies
•  Crowd-sourcing
   •  Limitation: Dependent on reliable human workers
•  Automated attempts
   •  Limitation: Focus on NER not NED
   •  Limitation: Generalizability beyond Twitter?
  
Challenges & Focus
•  Short Length
•  Sparse Lexical Context
•  Noisy
•  Highly personal in nature
   •  User's past content on the same platform is not a feasible background corpus
  
Task Definition

Named Entity Recognition (NER)
•  Systematically identifying mentions of entities (e.g., people, places, concepts, ideas)

Named Entity Disambiguation (NED)
•  Resolving the intended meaning of ambiguous entities from multiple candidate meanings

Our focus: disambiguating any entity detected in users' text-based utterances on the social Web
  
Exploring a Personalized Solution
•  Individual-centric approach to NED
•  Incorporates external, user-specific semantic data (a "Personal Context")
•  Model personal interests with respect to this information
•  Determine user's likely intended meaning of an ambiguous entity based on similarity between potential meanings and interests

RESLVE: Resolving Entity Sense by LeVeraging Edits
  
Agenda

Background
•  Task Definitions
•  Challenges & Examples
•  Attempted Solutions

Approach
•  Motivations
•  Modeling a Knowledge Context
•  Implementation: The RESLVE System

Evaluation
•  Experiments
•  Results
•  Future Work
  
Underlying Assumptions
•  User has core interests
   •  User is more likely to mention an entity about a topic relevant to personal interests than to mention a topic of non-interest
•  User expresses these interests consistently in content she posts online in multiple communities
•  Can use a semantic knowledge base to formally represent these topics of interest

➢  Bridge user identity between social Web and knowledge base, K
➢  Model interests using K's organizational scheme
➢  Rank entity senses according to relevance to interests
  
Qualitative Analysis: Stable Interests

User's topics of contribution are similar across the Web: same topics, same categories.
•  On average, 52.4% of entities a user mentions on the social Web (e.g., "Java") have at least 1 candidate sense in the same parent category as a Wikipedia article the same user edited (e.g., "Programming language")
•  If extended to just 4 parents up the category hierarchy, this reaches 100%

Example:
•  Ambiguous YouTube post: "office, december 3"
•  Same user's recent Wikipedia edit:
   <item userid="xxxx" user="xxxx" pageid="31841130" title="The Office (U.S. season 8)"/>
  
	
  
Theoretical Motivations
•  Online Contribution:
   •  Users produce online content about a key set of personally interesting topics because it is fulfilling and seen as having a better cost-benefit
   •  (Harper et al., 2007; Lakhani & von Hippel, 2003; Lerner & Tirole, 2000; Ling et al., 2006; Maslow, 1970)
•  Modeling Interests:
   •  It is effective to model these topic interests from lexical features of these text-based contributions
   •  (Chen et al., 2010; Cosley et al., 2007; Pennacchiotti & Popescu, 2011)
  
	
  
Modeling a Knowledge Context
•  Knowledge base, K
•  K = (N, E)
•  2 node types:
   •  Categories
   •  Topics

[Figure: example knowledge graph with category nodes c1–c4, topic nodes t1–t3, and topic descriptions d1–d3]
The Knowledge Graph
•  Category nodes: N_Category ⊂ N
   •  Unique identifier
   •  Semantic relationships with other nodes
•  Topic nodes: N_Topic ⊂ N
   •  Unique identifier
   •  Belongs to one or more categories
   •  Associated with a text-based description
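To make the graph structure concrete, here is a minimal Python sketch of K = (N, E) under the slide's definitions; the class and field names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """Category node: unique identifier plus semantic links to other nodes."""
    id: str
    related: set = field(default_factory=set)     # IDs of semantically related nodes

@dataclass
class Topic:
    """Topic node: unique identifier, its categories, and a text description."""
    id: str
    categories: set = field(default_factory=set)  # belongs to one or more categories
    description: str = ""                         # text-based description

# K = (N, E): N is the union of the two node sets; E is implied by the links above.
k_categories = {"c_tv": Category("c_tv", related={"c_broadcasting"})}
k_topics = {"t_office_us": Topic("t_office_us", {"c_tv"}, "U.S. mockumentary sitcom")}
```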
  
	
  
User Interest Model
•  Editing a description signals interest in the associated topic
•  Topic nodes: all topics whose description the user edited
•  Category nodes: categories reachable in the knowledge graph from those topics
•  Edge weight = inverse of shortest path length (see the sketch below)

[Table: edge weights between the user's topics t1–t3 and categories c1–c4; e.g., t3 has weight 0 to c1 and c2]

•  Same representation for candidates
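A minimal sketch of how such an interest model could be built, assuming the knowledge graph is given as an adjacency map from a node ID to its parent-category IDs; `category_weights` is a hypothetical helper, not the RESLVE API.

```python
from collections import deque

def category_weights(graph, topic_ids):
    """Weight each category reachable from the user's edited topics by
    1 / (shortest-path length). `graph` maps a node ID to the IDs of its
    parent categories. A sketch under that assumption, not RESLVE's code."""
    weights = {}
    for t in topic_ids:
        frontier, dist = deque([(t, 0)]), {t: 0}
        while frontier:
            node, d = frontier.popleft()
            for cat in graph.get(node, ()):
                if cat not in dist:
                    dist[cat] = d + 1
                    # keep the strongest (shortest-path) weight per category
                    weights[cat] = max(weights.get(cat, 0.0), 1.0 / (d + 1))
                    frontier.append((cat, d + 1))
    return weights

# Toy run: topic t1 belongs to c1, whose parent category is c2
print(category_weights({"t1": ["c1"], "c1": ["c2"]}, ["t1"]))  # {'c1': 1.0, 'c2': 0.5}
```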
  
Instantiating the Model
•  Wikipedia
•  DBPedia
•  Freebase
  
Instantiating on Wikipedia
•  Articles and categories effectively represent topics (Syed, 2008)
•  Good coverage of even rare entity concepts (Zesch, 2007)
•  Compatible with NER toolkits
   •  DBPedia Spotlight, Wikipedia Miner
•  Article editing behavior is effective for modeling interests (Cosley, 2007; Lieberman & Lin, 2009; Wattenberg et al., 2007)
  
Article editing signals topic interest

Editing behaviors indicative of user interest:
•  Number of times user edits article: repeatedly editing an article implies greater commitment and interest
•  Article's overall edit activity and total number of editors: generally popular and actively edited articles are less discriminative of individual interest and personal relevance
•  Time period user edits article: long-term interests are stronger than fleeting, short-term interests
•  Type of edit according to revision tag: trivial edits such as vandalism reversion or typo correction are less indicative of interest than thoughtful, effortful edits
•  Complexity, completeness, informativeness of edit according to metrics of Information Quality: the type, substantiveness, and overall quality of care a user gives to an edit indicates concern and interest in the topic
  
Less Meaningful Edits

Ignore irrelevant edits:
•  Articles with fewer than 100 non-stopwords
•  Trivial edits, i.e., typo correction, vandalism reversion
•  List pages, which merely contain widely diverse sets of topics that are not all necessarily personally relevant to the user

Clean article text (sketched below):
•  Stem, tokenize, lowercase; remove stopwords, punctuation, non-printable characters
•  Parse Wiki Markup to remove article maintenance information
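A sketch of the "clean article text" steps named above, in plain Python; stemming and Wiki-markup parsing are omitted (a real pipeline might use NLTK and a wikitext parser).

```python
import string

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # tiny illustrative list

def clean_article_text(raw):
    """Lowercase, strip punctuation and non-printable characters,
    tokenize, and drop stopwords, per the slide's cleaning steps."""
    printable = "".join(ch for ch in raw if ch in string.printable)
    no_punct = printable.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean_article_text("The Office (U.S. season 8) is an American TV series."))
# ['office', 'us', 'season', '8', 'american', 'tv', 'series']
```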
Implementation: The RESLVE System

RESLVE (Resolving Entity Sense by LeVeraging Edits) addresses NED by:
I.   Connecting social Web + Wikipedia editor identity
II.  Modeling topics of interest using article edits
III. Ranking entity candidates by personal relevance

[System diagram: a username bridges the social-Web user to a Wikipedia editor (I); the user's contributed structured documents feed a user interest model (II); a pre-processor plus Wikipedia Miner and DBPedia Spotlight detect entities and candidate meanings ("m") in the user's unstructured short texts; candidates are ranked by personal relevance (III), yielding the top-ranked personally relevant candidates]
  
	
  
Phase 1: Bridging Web Identities
•  Connect identity of social media user with Wikipedia editor
•  Simple string matching (see the sketch below)
   •  (Iofciu, 2011; Perito, 2011)
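A sketch of this matching step under the simplest reading of "string matching" (exact, case-insensitive comparison); the helper name and the case-folding are illustrative assumptions, and real matches were later verified by human workers.

```python
def bridge_identities(social_usernames, wikipedia_usernames):
    """Pair a social-media account with a Wikipedia editor when the
    case-folded usernames are identical (a sketch, not RESLVE's code)."""
    wiki_index = {name.casefold(): name for name in wikipedia_usernames}
    return {s: wiki_index[s.casefold()]
            for s in social_usernames if s.casefold() in wiki_index}

print(bridge_identities(["FilmFan42", "xyz"], ["filmfan42", "SomeEditor"]))
# {'FilmFan42': 'filmfan42'}
```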
  
Phase 2: Representing Users and Entities
•  Models user's topics of interest using the bridged Wiki account's editing history
•  Compares similarity of those topics to the topic associated with a candidate sense
•  Content-based & knowledge-graph based similarity
•  Weighted vectors used to represent user and candidate sense
  
Content-based similarity
•  Bag-of-Words
   •  Titles of articles user edited
   •  Candidate's article title
   •  Words from those articles' pages & category titles
•  TF-IDF weighted
•  User, u: V_content,u
•  Candidate meaning, m: V_content,m

sim_content(u, m) = cossim(V_content,u, V_content,m)
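A self-contained sketch of this similarity, with a toy smoothed TF-IDF and sparse cosine; the weighting details are illustrative, not the paper's exact scheme.

```python
import math
from collections import Counter

def tfidf(tokens, corpus):
    """TF-IDF weights for one bag of words against a corpus of word sets."""
    tf, n = Counter(tokens), len(corpus)
    return {t: c * math.log((1 + n) / (1 + sum(t in d for d in corpus)))
            for t, c in tf.items()}

def cossim(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# User's edited-article words vs. a candidate sense's article words
corpus = [{"office", "television", "series"}, {"office", "workplace", "desk"}]
u = tfidf(["office", "television", "series"], corpus)
m = tfidf(["office", "television"], corpus)
print(round(cossim(u, m), 3))  # ~0.707
```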
  
	
  
Knowledge-context based similarity
•  Vectors of articles' category IDs
•  Weight is distance between the article (topic) and category in the knowledge graph
   •  E.g., "American Television Series" > "Broadcasting"
•  User, u: V_category,u
•  Candidate meaning, m: V_category,m

sim_category(u, m) = cossim(V_category,u, V_category,m)
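Building on the `category_weights` and `cossim` sketches above (hypothetical helpers, not the authors' API), the category-based comparison might look like this:

```python
# Continues the sketches above: build category vectors for the user's
# edited topics and for each candidate sense, then compare with cosine.
graph = {"t_user": ["c_tv"], "m_office_us": ["c_tv"],
         "m_office_room": ["c_furniture"], "c_tv": ["c_broadcasting"]}

v_user = category_weights(graph, ["t_user"])
for m in ["m_office_us", "m_office_room"]:
    v_m = category_weights(graph, [m])
    print(m, round(cossim(v_user, v_m), 3))
# m_office_us scores 1.0 (shared categories); m_office_room scores 0.0
```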
  
	
  
Phase 3: Ranking by Personal Relevance

Output the highest scoring candidate as the intended meaning by measuring:

sim(u, m) = α · sim_content(u, m) + (1 − α) · sim_category(u, m)
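A direct sketch of this ranking step; the two similarity functions are passed in (e.g., the cosine sketches above), and α = 0.5 is an illustrative value, not the paper's tuned setting.

```python
def rank_candidates(user, candidates, sim_content, sim_category, alpha=0.5):
    """Blend the two similarities from the slide and return candidate
    meanings best-first; the top element is the predicted intended sense."""
    def score(m):
        return alpha * sim_content(user, m) + (1 - alpha) * sim_category(user, m)
    return sorted(candidates, key=score, reverse=True)
```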
  	
  
	
  
Pre-processing & preparation modules
  
Agenda

Background
•  Task Definitions
•  Challenges & Examples
•  Attempted Solutions

Approach
•  Motivations
•  Modeling a Knowledge Context
•  Implementation: The RESLVE System

Evaluation
•  Experiments
•  Results
•  Future Work
  
Experiment

Data Sample
•  Twitter: tweets
•  YouTube: video titles, descriptions
•  Flickr: photo tags, titles, descriptions

•  String-matched usernames of posters to Wikipedia accounts
•  Mechanical Turk used to confirm accounts were the same person

For confirmed matches:
•  Collected 100 most recent utterances
•  ID, title, page content, categories of edited articles
  
Experiment

Labeling correct entity meaning
•  1,545 valid ambiguous entities
•  Mechanical Turk Categorization Masters
•  Averaged observed agreement across all coders and items = 0.866
•  Average Fleiss' kappa = 0.803
•  918 unanimously labeled ambiguous entities
  
Dataset Characteristics

Text Length
•  Longest utterances are still shorter than even the shortest texts from NER task corpora like Reuters-21578 and the Brown Corpus

[Chart: text-length distributions for Twitter, YouTube, and Flickr utterances vs. the Reuters and Brown corpora]
High Ambiguity
•  NER services have low confidence
•  Many potential candidates (2 to 163, avg. 5-6, median 4)

[Chart: NER confidence scores for Wikipedia Miner and DBPedia Spotlight]
High Ambiguity
•  91% of utterances contain at least 1 ambiguous entity
•  2/3 of detected entities are ambiguous
•  Almost no entities without at least 2 senses to disambiguate
  
Performance

Metric
•  Precision at rank 1 (P@1)

Methods of comparison
•  Human-annotated gold standard
•  RC: Randomly sorted candidates
•  PF: Prior frequency
•  RU: RESLVE given a random Wikipedia user's interest model
•  DS: DBPedia Spotlight
•  WM: Wikipedia Miner
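For completeness, a minimal sketch of the P@1 computation over the gold-standard labels:

```python
def precision_at_1(ranked_lists, gold_labels):
    """P@1: fraction of entities whose top-ranked candidate
    matches the human-annotated gold label."""
    hits = sum(ranked[0] == gold
               for ranked, gold in zip(ranked_lists, gold_labels))
    return hits / len(gold_labels)

print(precision_at_1([["tv_show", "workplace"], ["beetle_car"]],
                     ["tv_show", "beetle_insect"]))  # 0.5
```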
  	
  
Results (P@1)

          Flickr    Twitter    YouTube
RESLVE    0.63      0.76       0.84
RC        0.21      0.32       0.31
PF        0.74      0.69       0.66
RU        0.51      0.71       0.78
WM        0.78      0.58       0.80
DS        0.53      0.67       0.63
Discussion
•  Best performance on YouTube texts (the longest), due to content-based similarity
•  Outperforms on more personal text (e.g., tweets)
   •  Random user model is less effective
•  Less effective on impersonal text (e.g., photo geo-tags)
   •  High prior frequency, so standard methods suffice
   •  Personally unfamiliar topics, so the user is unlikely to make Wiki edits about them
   •  Stable-interests assumption breaks down here
  
Error Cases
•  Automated messages
   •  "I uploaded a video on @youtube" → 1945 European Films
•  Entities not in the knowledge base
   •  "Peter on the dock"
•  Less prolific contributors
  
Future Work
•  Computability
   •  Wikipedia has 5M articles, 700K categories → vector pruning
•  User identity & modeling interests
  
	
  
	
  
	
  
Bridging User Accounts

           # Usernames    Exist on Wikipedia    Matches are same person
Twitter    479            46.1%                 47%
YouTube    454            19.6%                 48%
Flickr     226            21.7%                 71%
  
Bridging User Accounts
a.  True negative (no identity in knowledge base)
b.  False negative (same person, different usernames)
c.  False positives (string match, but different people)

Possible remedies:
•  Use other knowledge bases besides Wikipedia
•  Model user interest from additional kinds of participation (e.g., page visits, bookmarking, favoriting)
•  Account for interest drift & the time-frame of postings
•  Collaborative filtering techniques to approximate the user's own interests with contributions of social connections ✓
•  Consider more profile attributes than username ✓
  	
  
Summary & Conclusion
•  Social Web texts: short & highly personal
•  User posts about the same topics across communities (but not always)
•  Models user interest as a personal context with respect to a knowledge base's categorical organization scheme
•  Ranking technique compares an entity's potential meanings to the user's interests to determine the intended meaning
•  Language and context independent
•  Promising performance gains
•  Going forward: such a strategy becomes increasingly necessary, feasible, and effective
  
Thank You!

Acknowledgements
•  Claire Cardie, Dan Cosley, Lillian Lee, Sean Allen, Wenceslaus Lee
•  National Science Foundation Graduate Research Fellowship under Grant No. DGE 1144153
•  Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme (PIOF-GA-2009-252206)

Questions?

Elizabeth L. Murnane (elm236@cornell.edu)
Bernhard Haslhofer (bernhard.haslhofer@univie.ac.at)
Carl Lagoze (clagoze@umich.edu)
  
	
  

RESLVE: Leveraging User Interest to Improve Entity Disambiguation on Short Text

  • 1.
    RESLVE:  Leveraging  User  Interest   to  Improve  En6ty  Disambigua6on   on  Short  Text   Elizabeth  L.  Murnane  elm236@cornell.edu   Bernhard  Haslhofer  bernhard.haslhofer@univie.ac.at   Carl  Lagoze  clagoze@umich.edu  
  • 2.
    A  Personalized  Approach  to  Entity  Resolution   Background   •  Task  Defini6ons   •  Challenges  &  Examples   •  ADempted  Solu6ons   Approach   •  Mo6va6ons   •  Modeling  a  Knowledge  Context   •  Implementa6on:  The  RESLVE  System   Evalua2on   •  Experiments   •  Results   •  Future  Work  
  • 3.
    A  Personalized  Approach  to  Entity  Resolution   Background   •  Task  Defini6ons   •  Challenges  &  Examples   •  ADempted  Solu6ons   Approach   •  Mo6va6ons   •  Modeling  a  Knowledge  Context   •  Implementa6on:  The  RESLVE  System   Evalua2on   •  Experiments   •  Results   •  Future  Work  
  • 5.
    Social  Web   10  million     pages  per  day  
  • 6.
    Social  Web   800  million     visitors  per  month  
  • 7.
    Social  Web   7  billion  images   (twice  4  years  ago)  
  • 8.
  • 9.
    Task  Definition Named  En2ty  Recogni2on  (NER)   •  Systema6cally  iden6fying  men6ons  of  en##es   (e.g.,  people,  places,  concepts,  ideas)  
  • 10.
    Task  Definition Named  En2ty  Recogni2on  (NER)   •  Systema6cally  iden6fying  men6ons  of  en##es   (e.g.,  people,  places,  concepts,  ideas)   Named  En2ty  Disambigua2on  (NED)   Resolving  the  intended  meaning  of  ambiguous  en66es   from  mul6ple  candidate  meanings  
  • 11.
    Ambiguous  Entities   aaahh  one  more  day   un,l  finn!!!  #cantwait         office  holiday  party   Beetle  
  • 12.
    Ambiguous  Entities   aaahh  one  more  day   un,l  finn!!!  #cantwait         office  holiday  party   Beetle  
  • 13.
    Ambiguous  Entities   aaahh  one  more  day   un,l  finn!!!  #cantwait         office  holiday  party   Beetle  
  • 14.
    Ambiguous  Entities   aaahh  one  more  day   un,l  finn!!!  #cantwait         office  holiday  party   Beetle  
  • 15.
  • 16.
    office  holiday  party   Footage:   • Workplace?  
  • 17.
    office  holiday  party   Footage:   • Workplace?   • TV  Show?  
  • 18.
    office  holiday  party   Episode  4   Footage:   • Workplace?   • TV  Show?  
  • 19.
    office  holiday  party   Episode  4   Footage:   • Workplace?   • TV  Show?   • US  Version?   • UK  Version?  
  • 20.
    Episode  4   office  holiday  party   office,  december  3   Footage:   • Workplace?   • TV  Show?   • US  Version?   • UK  Version?  
  • 21.
  • 22.
    Challenges  &  Focus   •  Short  Length  
  • 23.
    Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context  
  • 24.
    Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context   •  Noisy  
  • 25.
    Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context   •  Noisy   •  Highly  personal  in  nature  
  • 26.
    Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context   •  Noisy   •  Highly  personal  in  nature  
  • 27.
    Limitations  of  Extant  Research   Tweets  severely  degrade  tradi6onal  techniques    
  • 28.
    Limitations  of  Extant  Research   Tweets  severely  degrade  tradi6onal  techniques   •  Stanford  NER:  F1  drops  90%  à  46%   •  DBPedia  Spotlight  &  Wikipedia  Miner:  P@1  <  40%  
  • 29.
    Limitations  of  Extant  Research   Tweets  severely  degrade  tradi6onal  techniques   •  Stanford  NER:  F1  drops  90%  à  46%   •  DBPedia  Spotlight  &  Wikipedia  Miner:  P@1  <  40%     Recent  strategies  
  • 30.
    Limitations  of  Extant  Research   Tweets  severely  degrade  tradi6onal  techniques   •  Stanford  NER:  F1  drops  90%  à  46%   •  DBPedia  Spotlight  &  Wikipedia  Miner:  P@1  <  40%     Recent  strategies   •  Crowd-­‐sourcing   •  Limita6on:  Dependent  on  reliable  human  workers  
  • 31.
    Tweets  severely  degrade  tradi6onal  techniques   •  Stanford  NER:  F1  drops  90%  à  46%   •  DBPedia  Spotlight  &  Wikipedia  Miner:  P@1  <  40%     Recent  strategies   •  Crowd-­‐sourcing   •  Limita6on:  Dependent  on  reliable  human  workers   •  Automated  aDempts   •  Limita6on:  Focus  on  NER  not  NED   •  Limita6on:  Generalizability  beyond  TwiDer?     Limitations  of  Extant  Research  
  • 32.
    Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context   •  Noisy   •  Highly  personal  in  nature  
  • 33.
    • User’s  past  content  on   same  plaeorm  not  feasible   background  corpus   Challenges  &  Focus   •  Short  Length   •  Sparse  Lexical  Context   •  Noisy   •  Highly  personal  in  nature  
  • 34.
    Task  Definition                 Our  focus:  disambigua2ng  any  en2ty  detected   in  users’  text-­‐based  uNerances  on  social  Web   Named  En2ty  Recogni2on  (NER)   •  Systema6cally  iden6fying  men6ons  of  en##es   (e.g.,  people,  places,  concepts,  ideas)   Named  En2ty  Disambigua2on  (NED)   Resolving  the  intended  meaning  of  ambiguous  en66es   from  mul6ple  candidate  meanings  
  • 35.
    Exploring  a  Personalized  Solution   •  Individual-­‐centric  approach  to  NED  
  • 36.
    Exploring  a  Personalized  Solution   •  Individual-­‐centric  approach  to  NED     •  Incorporates  external,  user-­‐specific  seman6c  data   Personal   Context  
  • 37.
    Exploring  a  Personalized  Solution   •  Individual-­‐centric  approach  to  NED     •  Incorporates  external,  user-­‐specific  seman6c  data   •  Model  personal  interests  with  respect  to  this  informa6on   Personal   Context  
  • 38.
    Exploring  a  Personalized  Solution   •  Individual-­‐centric  approach  to  NED     •  Incorporates  external,  user-­‐specific  seman6c  data   •  Model  personal  interests  with  respect  to  this  informa6on   •  Determine  user’s  likely  intended  meaning  of  ambiguous  en6ty   based  on  similarity  between  poten6al  meanings  and  interests   Personal   Context  
  • 39.
    Exploring  a  Personalized  Solution   •  Individual-­‐centric  approach  to  NED     •  Incorporates  external,  user-­‐specific  seman6c  data   •  Model  personal  interests  with  respect  to  this  informa6on   •  Determine  user’s  likely  intended  meaning  of  ambiguous  en6ty   based  on  similarity  between  poten6al  meanings  and  interests   RESLVE   Resolving  En6ty  Sense  by  LeVeraging  Edits     Personal   Context  
  • 40.
    Background   •  Task  Defini6ons   •  Challenges  &  Examples   •  ADempted  Solu6ons   Approach   •  Mo6va6ons   •  Modeling  a  Knowledge  Context   •  Implementa6on:  The  RESLVE  System   Evalua2on   •  Experiments   •  Results   •  Future  Work   Agenda  
  • 41.
  • 42.
    Underlying  Assumptions   • User  has  core  interests   •  User  more  likely  to  men6on  an  en6ty  about  a  topic  relevant  to  personal   interests  than  men6on  a  topic  of  non-­‐interest     User  expresses  these  interests  consistently  in  content  she  posts   online  in  mul6ple  communi6es   Can  use  a  seman6c  knowledge  base  to  formally  represent  these   topics  of  interest              
  • 43.
    Underlying  Assumptions   • User  has  core  interests   •  User  more  likely  to  men6on  an  en6ty  about  a  topic  relevant  to  personal   interests  than  men6on  a  topic  of  non-­‐interest     •  User  expresses  these  interests  consistently  in  content  she  posts   online  in  mul6ple  communi6es   Can  use  a  seman6c  knowledge  base  to  formally  represent  these   topics  of  interest              
  • 44.
    Underlying  Assumptions   • User  has  core  interests   •  User  more  likely  to  men6on  an  en6ty  about  a  topic  relevant  to  personal   interests  than  men6on  a  topic  of  non-­‐interest     •  User  expresses  these  interests  consistently  in  content  she  posts   online  in  mul6ple  communi6es   •  Can  use  a  seman6c  knowledge  base  to  formally  represent  these   topics  of  interest              
  • 45.
    Underlying  Assumptions   • User  has  core  interests   •  User  more  likely  to  men6on  an  en6ty  about  a  topic  relevant  to  personal   interests  than  men6on  a  topic  of  non-­‐interest     •  User  expresses  these  interests  consistently  in  content  she  posts   online  in  mul6ple  communi6es   •  Can  use  a  seman6c  knowledge  base  to  formally  represent  these   topics  of  interest   Ø  Bridge  user  iden6ty  between  social  Web  and  knowledge  base,  K   Ø  Model  interests  using  K’s  organiza6onal  scheme   Ø  Rank  en6ty  senses  according  to  relevance  to  interests  
  • 46.
  • 47.
    Qualitative  Analysis:  Stable  Interests   User’s  topics  of  contribu6on  similar  across  Web:             On  average,  52.4%  of  en66es  a  user  men6ons  in  social  Web  (e.g.,   “Java”)  have  at  least  1  candidate  sense  in  same  parent  category  of   Wikipedia  ar6cle  same  user  edited  (e.g.,  “Programming  language”)   If  extend  to  just  4  parents  up  category  hierarchy,  get  all  100%    
  • 48.
    Qualitative  Analysis:  Stable  Interests   User’s  topics  of  contribu6on  similar  across  Web:     Same  Topics       On  average,  52.4%  of  en66es  a  user  men6ons  in  social  Web  (e.g.,   “Java”)  have  at  least  1  candidate  sense  in  same  parent  category  of   Wikipedia  ar6cle  same  user  edited  (e.g.,  “Programming  language”)   If  extend  to  just  4  parents  up  category  hierarchy,  get  all  100%           Ambiguous  YouTube  post:     office,  december  3     Same  user’s  recent  Wikipedia  edit:     <item  userid="xxxx"  user="xxxx”   pageid="31841130”  ,tle=     "The  Office  (U.S.  season  8)"/>    
  • 49.
    Qualitative  Analysis:  Stable  Interests   User’s  topics  of  contribu6on  similar  across  Web:     Same  Topics   Same  categories   •  On  average,  52.4%  of  en66es  a  user  men6ons  in  social  Web  (e.g.,   “Java”)  have  at  least  1  candidate  sense  in  same  parent  category  of   Wikipedia  ar6cle  same  user  edited  (e.g.,  “Programming  language”)   •  If  extend  to  just  4  parents  up  category  hierarchy,  get  all  100%           Ambiguous  YouTube  post:     office,  december  3     Same  user’s  recent  Wikipedia  edit:     <item  userid="xxxx"  user="xxxx”   pageid="31841130”  ,tle=     "The  Office  (U.S.  season  8)"/>    
Theoretical Motivations
• Online Contribution:
  • Users produce online content about a key set of personally interesting topics because doing so is fulfilling and is seen as having a better cost-benefit ratio
  • (Harper et al., 2007; Lakhani & von Hippel, 2003; Lerner & Tirole, 2000; Ling et al., 2006; Maslow, 1970)
• Modeling Interests:
  • It is effective to model these topic interests from lexical features of such text-based contributions
  • (Chen et al., 2010; Cosley et al., 2007; Pennacchiotti & Popescu, 2011)
Modeling a Knowledge Context
• Knowledge base, K
• K = (N, E)
• 2 node types:
  • Categories
  • Topics
[Figure: example knowledge graph with category nodes c1–c4, topic nodes t1–t3, and their text descriptions d1–d3]
The Knowledge Graph
• Category nodes: NCategory ⊂ N
  • Unique identifier
  • Semantic relationships with other nodes
• Topic nodes: NTopic ⊂ N
  • Unique identifier
  • Belongs to one or more categories
  • Associated with a text-based description
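To make the two node types concrete, here is a minimal sketch of this structure in Python; the class and field names are illustrative assumptions, not the RESLVE implementation.

```python
from dataclasses import dataclass, field

@dataclass
class CategoryNode:
    node_id: str                                   # unique identifier
    related: set = field(default_factory=set)      # semantic relationships with other nodes

@dataclass
class TopicNode:
    node_id: str                                   # unique identifier
    categories: set = field(default_factory=set)   # belongs to one or more categories
    description: str = ""                          # text-based description of the topic
```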
User Interest Model
• Editing a description signals interest in the associated topic
• Topic nodes: all topics whose description the user edited
• Category nodes: categories reachable in the knowledge graph from those topics
• Edge weight = inverse of shortest path length
[Table: example weight matrix with rows t1–t3 and columns c1–c4; each entry is the inverse shortest-path length from topic to category (1 for a directly linked category, 0 for an unreachable one)]
• Same representation for candidates
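A minimal sketch of that weighting step, assuming unit-length edges and a plain adjacency-dict graph (the function name and example graph are illustrative; restricting the result to category nodes is omitted for brevity):

```python
from collections import deque

def edge_weights(graph, topic):
    """Return {node_id: 1 / shortest-path length} for every node reachable from topic."""
    weights, frontier, depth = {}, deque([topic]), {topic: 0}
    while frontier:
        node = frontier.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in depth:           # first visit = shortest path (BFS)
                depth[neighbor] = depth[node] + 1
                weights[neighbor] = 1.0 / depth[neighbor]
                frontier.append(neighbor)
    return weights

# e.g., with t1 linked to c2 and c2 to c1:
# edge_weights({"t1": ["c2"], "c2": ["c1"]}, "t1") -> {"c2": 1.0, "c1": 0.5}
```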
Instantiating the Model
• Wikipedia
• DBPedia
• Freebase
Instantiating on Wikipedia
• Articles and categories effectively represent topics (Syed, 2008)
• Good coverage of even rare entity concepts (Zesch, 2007)
• Compatible with NER toolkits
  • DBPedia Spotlight, Wikipedia Miner
• Article editing behavior is effective for modeling interests (Cosley, 2007; Lieberman & Lin, 2009; Wattenberg et al., 2007)
Article editing signals topic interest
Editing behaviors indicative of user interest:

Editing Behavior | Intuition
---------------- | ---------
Number of times user edits article | Repeatedly editing an article implies greater commitment and interest
Article's overall edit activity and total number of editors | Generally popular and actively edited articles are less discriminative of individual interest and personal relevance
Time period user edits article | Long-term interests are stronger than fleeting, short-term interests
Type of edit according to revision tag | Trivial edits such as vandalism reversion or typo correction are less indicative of interest than thoughtful, effortful edits
Complexity, completeness, informativeness of edit according to metrics of Information Quality | Type, substantiveness, and overall quality of care a user gives to an edit indicate concern and interest in the topic
Less Meaningful Edits

Ignore irrelevant edits:
• Articles with fewer than 100 non-stopwords
• Trivial edits, i.e., typo correction, vandalism reversion
• List pages that merely contain widely diverse sets of topics, not necessarily indicative of what is personally relevant to the user

Clean article text:
• Stem, tokenize, lowercase; remove stopwords, punctuation, non-printable characters
• Parse wiki markup to remove article maintenance information
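A minimal sketch of the text-cleaning step; the use of NLTK (its tokenizer, stemmer, and stopword list) is an assumption for illustration, not a stated dependency of RESLVE.

```python
import string
from nltk.corpus import stopwords          # assumes NLTK and its data are installed
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_article_text(text):
    """Lowercase, tokenize, strip punctuation, drop stopwords/non-printables, stem."""
    tokens = (t.strip(string.punctuation) for t in word_tokenize(text.lower()))
    return [STEMMER.stem(t) for t in tokens
            if t and t.isprintable() and t not in STOPWORDS]
```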
Implementation: The RESLVE System
RESLVE (Resolving Entity Sense by LeVeraging Edits) addresses NED by:
I. Connecting social Web + Wikipedia editor identity
II. Modeling topics of interest using article edits
III. Ranking entity candidates by personal relevance
[System pipeline figure: a username is bridged to a Wikipedia identity (I) to build the user interest model (II); unstructured short texts pass through a pre-processor to Wikipedia Miner and DBPedia Spotlight, which detect entities and candidate meanings ("m"); candidates are then ranked by personal relevance (III) to output the top-ranked personally relevant candidates]
Phase 1: Bridging Web Identities
• Connect identity of social media user with Wikipedia editor
• Simple string matching
  • Iofciu, 2011; Perito, 2011
[Pipeline figure as above, Phase I]
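A minimal sketch of this bridging step, assuming exact case-insensitive matching (the normalization details and function name are illustrative):

```python
def bridge_identity(social_username, wikipedia_editors):
    """Return the Wikipedia editor whose username string-matches, or None."""
    target = social_username.strip().lower()
    for editor in wikipedia_editors:
        if editor.strip().lower() == target:
            return editor
    return None
```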
Phase 2: Representing Users and Entities
• Models user's topics of interest using the bridged Wiki account's editing history
• Compares similarity of those topics to the topic associated with each candidate sense
• Content-based & knowledge-graph-based similarity
• Weighted vectors represent the user and each candidate sense
[Pipeline figure as above, Phase II]
Content-based similarity
• Bag-of-words
  • Titles of articles user edited
  • Candidate's article title
  • Words from those articles' pages & category titles
  • TF-IDF weighted
• User, u: V_content,u
• Candidate meaning, m: V_content,m

sim_content(u, m) = cossim(V_content,u, V_content,m)
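A minimal sketch of this computation; the choice of scikit-learn is an assumption for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sim_content(user_text, candidate_text):
    """Cosine similarity of TF-IDF bag-of-words vectors for user u and candidate m."""
    # One "document" of edited-article words per side, vectorized in a shared vocabulary.
    vectors = TfidfVectorizer().fit_transform([user_text, candidate_text])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]
```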
Knowledge-context based similarity
• Vectors of articles' category IDs
• Weight is distance between the article (topic) and category in the knowledge graph
  • E.g., "American Television Series" > "Broadcasting"
• User, u: V_category,u
• Candidate meaning, m: V_category,m

sim_category(u, m) = cossim(V_category,u, V_category,m)
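Because the category vectors are sparse, a dictionary-based cosine is a natural fit; a minimal sketch (names illustrative), with weights taken from the inverse-distance interest model above:

```python
import math

def sim_category(user_vec, candidate_vec):
    """Cosine similarity of two sparse {category_id: weight} dictionaries."""
    dot = sum(w * candidate_vec.get(c, 0.0) for c, w in user_vec.items())
    norm_u = math.sqrt(sum(w * w for w in user_vec.values()))
    norm_m = math.sqrt(sum(w * w for w in candidate_vec.values()))
    return dot / (norm_u * norm_m) if norm_u and norm_m else 0.0
```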
Phase 3: Ranking by Personal Relevance
Output the highest-scoring candidate as the intended meaning by measuring:

sim(u, m) = α · sim_content(u, m) + (1 − α) · sim_category(u, m)

[Pipeline figure as above, Phase III]
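Putting the two similarities together, a minimal sketch of the ranking step; the attribute names and the default α = 0.5 are illustrative assumptions (the slides do not fix α):

```python
def rank_candidates(user, candidates, alpha=0.5):
    """Return the candidate meaning m maximizing sim(u, m)."""
    def score(m):
        # Alpha-weighted combination of content-based and category-based similarity.
        return (alpha * sim_content(user.text, m.text)
                + (1 - alpha) * sim_category(user.categories, m.categories))
    return max(candidates, key=score)
```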
Pre-processing & preparation modules
[Pipeline figure as above, with the pre-processor and the Wikipedia Miner / DBPedia Spotlight preparation modules]
Agenda
Background
• Task Definitions
• Challenges & Examples
• Attempted Solutions
Approach
• Motivations
• Modeling a Knowledge Context
• Implementation: The RESLVE System
Evaluation
• Experiments
• Results
• Future Work
Experiment
Data Sample
• Twitter: tweets
• YouTube: video titles, descriptions
• Flickr: photo tags, titles, descriptions
• String-matched usernames of posters to Wikipedia accounts
• Mechanical Turk used to confirm accounts were the same person
For confirmed matches:
• Collected 100 most recent utterances
• ID, title, page content, categories of edited articles
Experiment
Labeling the correct entity meaning
• 1545 valid ambiguous entities
• Mechanical Turk Categorization Masters
• Average observed agreement across all coders and items = 0.866
• Average Fleiss' Kappa = 0.803
• 918 unanimously labeled ambiguous entities
Text Length
The longest utterances are still shorter than even the shortest texts from NER task corpora like Reuters-21578 and the Brown Corpus
[Figure: text-length distributions for Twitter, YouTube, and Flickr posts vs. the Reuters and Brown corpora]
High Ambiguity
• NER services have low confidence
• Many potential candidates (2 to 163; avg. 5–6, median 4)
[Figure: distribution of NER confidence scores for Wikipedia Miner and DBPedia Spotlight]
High Ambiguity
• 91% of utterances contain at least 1 ambiguous entity
• 2/3 of entities detected are ambiguous
• Almost no entities have fewer than 2 senses to disambiguate
Performance
Metric
• Precision at rank 1 (P@1)
Methods of comparison
• Human-annotated gold standard
• RC: Randomly sorted candidates
• PF: Prior frequency
• RU: RESLVE given a random Wikipedia user's interest model
• DS: DBPedia Spotlight
• WM: Wikipedia Miner
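For reference, a minimal sketch of the P@1 computation (variable names illustrative):

```python
def precision_at_1(ranked_lists, gold_labels):
    """Fraction of entities whose top-ranked candidate matches the gold label."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_labels)
               if ranked[0] == gold)
    return hits / len(gold_labels)
```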
Results (P@1)

Method | Flickr | YouTube | Twitter
------ | ------ | ------- | -------
RESLVE | 0.63 | 0.76 | 0.84
RC | 0.21 | 0.32 | 0.31
PF | 0.74 | 0.69 | 0.66
RU | 0.51 | 0.71 | 0.78
WM | 0.78 | 0.58 | 0.80
DS | 0.53 | 0.67 | 0.63
Discussion
• Best performance on YouTube texts (longest) due to content-based similarity
• Outperforms on more personal text (e.g., tweets)
  • Random user model less effective
• Less effective on impersonal text (e.g., photo geo-tags)
  • High prior frequency, so standard methods suffice
  • Personally unfamiliar topics, so users are unlikely to make Wiki edits about them
  • Stable-interests assumption breaks down here
Error Cases
• Automated messages
  • "I uploaded a video on @youtube" → 1945 European Films
• Entities not in knowledge base
  • "Peter on the dock"
• Less prolific contributors
Future Work
• Computability
  • Wikipedia has 5M articles, 700K categories → vector pruning
• User identity & modeling interests
Bridging User Accounts

Platform | # Usernames | Exist on Wikipedia | Matches are same person
-------- | ----------- | ------------------ | -----------------------
Twitter | 479 | 46.1% | 47%
YouTube | 454 | 19.6% | 48%
Flickr | 226 | 21.7% | 71%
Bridging User Accounts
a. True negative (no identity in knowledge base)
b. False negative (same person, different usernames)
c. False positives (string match, but different people)

✓ Collaborative filtering techniques to approximate a user's own interests with the contributions of social connections
✓ Consider more profile attributes than username

• Use other knowledge bases besides Wikipedia
• Model user interest from additional kinds of participation (e.g., page visits, bookmarking, favoriting)
• Interest drift & time-frame of postings
Summary & Conclusion
• Social Web texts: short & highly personal
• Users post about the same topics across communities (but not always)
• Models user interest as a personal context with respect to a knowledge base's categorical organization scheme
• Ranking technique compares an entity's potential meanings to the user's interests to determine the intended meaning
• Language and context independent
• Promising performance gains
• Going forward: such a strategy becomes increasingly necessary, feasible, and effective
Thank You!

Acknowledgements
• Claire Cardie, Dan Cosley, Lillian Lee, Sean Allen, Wenceslaus Lee
• National Science Foundation Graduate Research Fellowship under Grant No. DGE 1144153
• Marie Curie International Outgoing Fellowship within the 7th European Community Framework Programme (PIOF-GA-2009-252206)

Questions?
Elizabeth L. Murnane  elm236@cornell.edu
Bernhard Haslhofer  bernhard.haslhofer@univie.ac.at
Carl Lagoze  clagoze@umich.edu