Entities, Graphs, and Crowdsourcing for better Web Search

Invited talk at the Università della Svizzera Italiana, Lugano, Switzerland in June 2013


  1. 1. Entities, Graphs, and Crowdsourcing for better Web Search. Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland
  2. 2. Gianluca Demartini
     • M.Sc. at the University of Udine, Italy
     • Ph.D. at the University of Hannover, Germany (Entity Retrieval)
     • Worked for UC Berkeley (on crowdsourcing), Yahoo! Research (Spain), and the L3S Research Center (Germany)
     • Post-doc at the eXascale Infolab, University of Fribourg, Switzerland
     • Lecturer for Social Computing in Fribourg
     • Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
     • Research interests: Information Retrieval, Semantic Web, Crowdsourcing
     demartini@exascale.info
  3. 3. Gianluca Demartini
  4. 4. Gianluca Demartini
  5. 5. Web of Data
     • Freebase
       – Acquired by Google in July 2010
       – Knowledge Graph launched in May 2012
     • Schema.org
       – Driven by the major search engine companies
       – Machine-readable annotations of Web pages
     • Linked Open Data
       – 31 billion triples, Sept. 2011
  6. 6. Linked Open Data. Z. Kaoudi and I. Manolescu, ICDE seminar 2013.
  7. 7. I will talk about
     • Entity Linking/Disambiguation
       – on the Web, using crowdsourcing
       – for scientific literature, using graphs
     • Ad-hoc Object Retrieval (Entity Ranking)
       – using IR and graphs
     • Crowdsourced Query Understanding
  8. 8. Disclaimer
     • No efficiency evaluation
       – approaches are not distributed, but designed to scale out
     • No user studies
       – goal: obtain high-quality data
       – only TREC-like evaluation of effectiveness
  9. 9. Entity Linking/Disambiguation
  10. 10. Example: entity linking with RDFa enrichment. Mentions in a news paragraph are linked to http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram (the latter connected via owl:sameAs to jase:Instagram, in a graph that also contains Google and Android).
     HTML:
     <p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
     RDFa enrichment:
     <p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
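The enrichment step shown above is mostly mechanical once a mention has been linked to a URI. Here is a minimal Python sketch, assuming the mention string and its chosen URI are already known; the annotate helper is hypothetical and only mirrors the span/cite markup on the slide.

    def annotate(html_paragraph, mention, uri):
        """Wrap a linked mention in RDFa: a span carrying the entity URI
        and a cite carrying the rdfs:label property, as on the slide."""
        cite = '<cite property="rdfs:label">%s</cite>' % mention
        body = html_paragraph.replace(mention, cite, 1)
        return '<p><span about="%s">%s</span></p>' % (uri, body)

    print(annotate("Facebook is not waiting for its initial public offering "
                   "to make its first big purchase.",
                   "Facebook", "http://dbpedia.org/resource/Facebook"))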
  11. 11. Crowdsourcing
     • Exploit human intelligence to solve
       – tasks that are simple for humans but complex for machines
       – with a large number of humans (the Crowd)
       – small problems: micro-tasks (Amazon MTurk)
     • Examples: Wikipedia, image tagging
     • Incentives: financial, fun, visibility
  12. 12. ZenCrowd
     • Combine both algorithmic and manual linking
     • Automate manual linking via crowdsourcing
     • Dynamically assess human workers with a probabilistic reasoning framework
     (diagram: the Crowd on one side, Machines/Algorithms on the other)
  13. 13. ZenCrowd Architecture (diagram: HTML pages go through entity extractors; algorithmic matchers query an LOD index built from the Linked Open Data cloud; a micro-task manager sends micro matching tasks to workers on a crowdsourcing platform; a probabilistic network and decision engine combine both kinds of evidence and output HTML+RDFa pages)
     Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
  14. 14. Algorithmic Matching
     • Inverted index over LOD entities (DBpedia, Freebase, Geonames, NYT)
     • TF-IDF (IR ranking function)
     • Top-ranked URIs are linked to the entities found in the documents
     • Threshold on the ranking function, or top N
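As a rough illustration of this matcher (not the production code, which indexes roughly 40 million entities), the following Python sketch builds a tiny in-memory inverted index over hypothetical entity descriptions and ranks candidate URIs for a mention with a plain TF-IDF score, keeping the top N.

    import math
    from collections import Counter, defaultdict

    entities = {   # hypothetical toy index: URI -> description text
        "http://dbpedia.org/resource/Facebook": "facebook social network company",
        "http://dbpedia.org/resource/Instagram": "instagram photo sharing application",
        "http://dbpedia.org/resource/Photography": "photo photography images",
    }

    # Build the inverted index and collect document frequencies.
    index = defaultdict(dict)                     # term -> {uri: term frequency}
    for uri, text in entities.items():
        for term, tf in Counter(text.split()).items():
            index[term][uri] = tf
    N = len(entities)

    def match(mention, top_n=2):
        """Rank entity URIs for a textual mention by a simple TF-IDF score."""
        scores = Counter()
        for term in mention.lower().split():
            postings = index.get(term, {})
            idf = math.log(N / (1 + len(postings)))
            for uri, tf in postings.items():
                scores[uri] += tf * idf
        return scores.most_common(top_n)

    print(match("photo sharing application Instagram"))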
  15. 15. Entity Factor Graphs
     • Graph components: workers, links, clicks; prior probabilities; link factors; constraints
     • Probabilistic inference: select all links with posterior probability > τ
     (example factor graph: 2 workers, 6 clicks, 3 candidate links, with worker priors, link priors, observed click variables, link factors, and SameAs / dataset-unicity constraints)
  16. 16. Entity Factor Graphs
     • Training phase: initialize worker priors with k matches on known answers
     • Updating worker priors: use link decisions as new observations and compute new worker probabilities
     • Identify (and discard) unreliable workers
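The exact factor-graph inference is described in the WWW 2012 paper; the Python sketch below only illustrates the underlying loop under simplifying assumptions: each worker has a reliability prior that weights their votes on a candidate link, the weighted votes yield a posterior per link, and accepted decisions are fed back to re-estimate the priors (smoothed as if seeded with a few known answers). All names and numbers are illustrative.

    def link_posterior(votes, reliability, prior=0.5):
        """Combine binary worker votes on one candidate link.
        votes: {worker_id: True/False}; reliability: {worker_id: p_correct}."""
        p_yes, p_no = prior, 1.0 - prior
        for w, vote in votes.items():
            r = reliability.get(w, 0.5)
            p_yes *= r if vote else (1.0 - r)
            p_no *= (1.0 - r) if vote else r
        return p_yes / (p_yes + p_no)

    def update_reliability(worker, decisions, votes_by_link, correct=1, total=2):
        """Re-estimate a worker prior from agreement with accepted decisions,
        smoothed as if initialized with a couple of known training answers."""
        for link, accepted in decisions.items():
            vote = votes_by_link[link].get(worker)
            if vote is None:
                continue
            total += 1
            correct += int(vote == accepted)
        return correct / total

    votes_by_link = {"l1": {"w1": True, "w2": True}, "l2": {"w1": False, "w2": True}}
    reliability = {"w1": 0.8, "w2": 0.55}
    decisions = {l: link_posterior(v, reliability) > 0.5 for l, v in votes_by_link.items()}
    reliability["w2"] = update_reliability("w2", decisions, votes_by_link)
    print(decisions, reliability)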
  17. 17. Experimental Evaluation
     • Datasets: 25 news articles from
       – CNN.com (global news)
       – NYTimes.com (global news)
       – WashingtonPost.com (US local news)
       – TimesOfIndia.indiatimes.com (India news)
       – Swissinfo.com (Swiss local news)
     • 40M entities (Freebase, DBpedia, Geonames, NYT)
  18. 18. Worker Selection (charts: worker precision vs. number of tasks completed, comparing US and IN workers, with the top US worker highlighted; and precision of the top-K workers for K = 1 to 9)
  19. 19. Lessons Learnt
     • Crowdsourcing + probabilistic reasoning works!
     • But
       – different worker communities perform differently
       – many low-quality workers
       – completion time may vary (based on the reward)
     • Need to find the right workers for your task (see the WWW 2013 paper)
  20. 20. ZenCrowd Summary
     • ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
     • Standard crowdsourcing improves 6% over the automatic approach
     • ZenCrowd improves 4%-35% over standard crowdsourcing
     • 14% average improvement over automatic approaches
     • Ongoing work
       – also used for instance matching across datasets
       – 3-way blocking with the crowd
     http://exascale.info/zencrowd/
  21. 21. Entity Disambiguation in Scientific Literature
     • Using a background concept graph (http://scienceWISE.info/)
     Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux, Alexey Boyarsky, and Oleg Ruchayskiy. Ontology-Based Word Sense Disambiguation in the Scientific Domain. In: 35th European Conference on Information Retrieval (ECIR 2013).
  22. 22. Entity Ranking
  23. 23. Ad-hoc Object Retrieval
     • Once entities have been identified, we want to rank them as answers to a query
     • AOR: given the description of an entity, give me back its identifier
       – Input: a query q and a data graph G
       – Output: a ranked list of URIs from G
  24. 24. A Hybrid Approach to AOR (architecture diagram: the keyword query is annotated and expanded using WordNet, third-party search engines, and pseudo-relevance feedback; an inverted index returns intermediate top-k results; an RDF store enriches them through graph traversals over object properties and neighborhood queries over datatype properties; a final ranking function merges the graph-enriched results)
     Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012).
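To make the combination concrete, here is a much-simplified Python sketch of the hybrid idea (not the SIGIR 2012 system): keyword_search stands in for the inverted-index lookup, neighbors stands in for a graph traversal in the RDF store, and entities reachable from the intermediate top-k inherit part of their source's score before the final re-ranking.

    def keyword_search(query, k=10):
        """Placeholder for a BM25/TF-IDF query against the entity index."""
        return {"dbpedia:Forrest_Gump": 3.2, "dbpedia:Tom_Hanks": 1.1}

    def neighbors(uri):
        """Placeholder for a graph traversal over object properties."""
        graph = {"dbpedia:Forrest_Gump": ["dbpedia:Tom_Hanks", "dbpedia:Robert_Zemeckis"]}
        return graph.get(uri, [])

    def hybrid_rank(query, neighbor_weight=0.3, k=10):
        """Re-rank the union of the intermediate top-k and their graph
        neighbors: each neighbor receives a fraction of its source score."""
        base = keyword_search(query, k)
        scores = dict(base)
        for uri, score in base.items():
            for n in neighbors(uri):
                scores[n] = scores.get(n, 0.0) + neighbor_weight * score
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(hybrid_rank("forrest gump movie"))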
  25. 25. AOR Evaluation
     • 1.3 billion RDF triples from the LOD cloud
     • 92 and 50 queries
     • Crowdsourced relevance judgments
     • semsearch.yahoo.com
  26. 26. Evaluation Results (results chart)
  27. 27. Summary
     • AOR = "Given the description of an entity, give me back its identifier"
     • Combines classic IR techniques with a structured database storing graph data
     • Significantly better results (up to +25% MAP over a BM25 baseline)
     • The overhead caused by the graph traversal part is limited
     http://exascale.info/AOR/
  28. 28. CrowdQ: Crowdsourced Query Understanding
  29. 29. birthdate of mayor of capital city of france
  30. 30. capital city of france
  31. 31. mayor of paris
  32. 32. birthdate of Bertrand Delanoë
  33. 33. Motivation
     • Web search engines can answer simple factual queries directly on the result page
     • Users with complex information needs are often unsatisfied
     • Purely automatic techniques are not enough
     • We want to solve it with crowdsourcing!
  34. 34. CrowdQ
     • CrowdQ is the first system that uses crowdsourcing to
       – understand the intended meaning
       – build a structured query template
       – answer the query over Linked Open Data
     Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
  35. 35.
  36. 36. CrowdQ Architecture (diagram: on-line complex query processing with a complex-query classifier, POS + NER tagging, matching against a query template index, structured LOD search, result joining, and answer composition on the SERP, falling back to vertical selection / unstructured search for non-complex queries; off-line complex query decomposition over the query log, with a crowd manager sending tasks to a crowdsourcing platform to generate query templates and answer types over the Linked Open Data cloud)
     Off-line: query template generation with the help of the crowd.
     On-line: query template matching using NLP and search over open data.
  37. 37. Hybrid Human-Machine Pipeline
     Q = birthdate of actors of forrest gump
     • Query annotation: noun, noun, named entity
     • Verification: "Is forrest gump this entity in the query?"
     • Entity relations: "Which is the relation between: actors and forrest gump?" Answer: starring
     • Schema element: starring maps to <dbpedia-owl:starring>
     • Verification: "Is the relation between Indiana Jones - Harrison Ford and Back to the Future - Michael J. Fox of the same type as Forrest Gump - actors?"
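The slide above walks through the pipeline on one query; the toy Python sketch below shows how an annotated keyword query could be turned into verification and relation-elicitation micro-tasks for the crowd. The tagger output and the build_micro_tasks helper are illustrative assumptions, not the CrowdQ implementation.

    query = "birthdate of actors of forrest gump"
    annotations = [("birthdate", "NOUN"), ("actors", "NOUN"),
                   ("forrest gump", "NAMED_ENTITY")]   # assumed tagger output

    def build_micro_tasks(query, annotations):
        """Generate simple question strings: entity verification plus
        pairwise relation elicitation between adjacent annotated terms."""
        tasks = []
        terms = [t for t, _ in annotations]
        for term, tag in annotations:
            if tag == "NAMED_ENTITY":
                tasks.append(f'In the query "{query}", is "{term}" the entity '
                             "you would look up in a knowledge base?")
        for left, right in zip(terms, terms[1:]):
            tasks.append(f'What is the relation between "{left}" and "{right}" '
                         f'in the query "{query}"?')
        return tasks

    for t in build_micro_tasks(query, annotations):
        print("-", t)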
  38. 38. Structured query generation
     Q = birthdate of actors of forrest gump
     SELECT ?y ?x
     WHERE { ?y <dbpedia-owl:birthdate> ?x .
             ?z <dbpedia-owl:starring> ?y .
             ?z <rdfs:label> 'Forrest Gump' }
     (in the reusable template, the 'Forrest Gump' literal generalizes to a MOVIE placeholder)
     Results from BTC09.
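If such a generated query is to be answered over a public endpoint, it can be executed for example with the SPARQLWrapper Python library, as sketched below. The property and prefix names follow the slide (BTC09 data); on the current DBpedia endpoint the schema differs (e.g. dbo:birthDate), so treat this as illustrative rather than a ready-to-run DBpedia query.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?y ?x WHERE {
      ?y dbpedia-owl:birthdate ?x .
      ?z dbpedia-owl:starring ?y .
      ?z rdfs:label "Forrest Gump"@en .
    }
    """)
    sparql.setReturnFormat(JSON)

    # Print each actor URI with the birthdate bound by the template.
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["y"]["value"], row["x"]["value"])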
  39. 39. Conclusions
     • Structured data makes Web search better
     • Exploit the best out of structured and unstructured data (hybrid AOR)
     • The crowd can help in understanding semantics
     • Hybrid human-machine systems (ZenCrowd)
     • Exploit human intelligence at scale (CrowdQ)
     gianlucademartini.net   demartini@exascale.info
