Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Entities, Graphs, and Crowdsourcing for better Web Search


Published on

Invited talk at the Università della Svizzera Italiana, Lugano, Switzerland in June 2013

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Entities, Graphs, and Crowdsourcing for better Web Search

  1. 1. En##es,  Graphs,  and  Crowdsourcing   for  be7er  Web  Search   Gianluca  Demar#ni   eXascale  Infolab   University  of  Fribourg,  Switzerland  
  2. 2. Gianluca  Demar#ni   •  M.Sc.  at  University  of  Udine,  Italy   •  Ph.D.  at  University  of  Hannover,  Germany   –  En#ty  Retrieval   •  Worked  for  UC  Berkeley  (on  Crowdsourcing),  Yahoo!  Research   (Spain),  L3S  Research  Center  (Germany)   •  Post-­‐doc  at  the  eXascale  Infolab,  Uni  Fribourg,  Switzerland.   •  Lecturer  for  Social  Compu,ng  in  Fribourg   •  Tutorial  on  En#ty  Search  at  ECIR  2012,  on  Crowdsourcing  at   ESWC  2013  and  ISWC  2013   •  Research  Interests   –  Informa#on  Retrieval,  Seman#c  Web,  Crowdsourcing   2 Gianluca  Demar#ni  
  3. 3. Gianluca  Demar#ni   3  
  4. 4. Gianluca  Demar#ni   4  
  5. 5. Web  of  Data   •  Freebase   –  Acquired  by  Google  in  July  2010.   –  Knowledge  Graph  launched  in  May  2012.   •   –  Driven  by  major  search  engine  companies   –  Machine-­‐readable  annota#ons  of  Web  pages   •  Linked  Open  Data   –  31  billion  triples,  Sept.  2011   Gianluca  Demar#ni   5  
  6. 6. Linked  Open  Data   Z.  Kaoudi  and  I.  Manolescu,  ICDE  seminar  2013     6  
  7. 7. I  will  talk  about   •  En#ty  Linking/Disambigua#on   – On  the  Web  using  crowdsourcing   – For  scien#fic  literature  using  graphs   •  Ad-­‐hoc  Object  Retrieval  (En#ty  Ranking)   – Using  IR  and  graphs   •  Crowdsourced  Query  Understanding   Gianluca  Demar#ni   7  
  8. 8. Disclaimer   •  No  efficiency  evalua#on   – Approaches  not  distributed   – But  designed  to  scale  out   •  No  user  studies   – Goal:  Obtain  high  quality  data   – Only  TREC-­‐like  evalua#on  on  effec#veness   Gianluca  Demar#ni   8  
  9. 9. En#ty  Linking/Disambigua#on  
  10. 10. Gianluca  Demar#ni   10   h7p://   h7p://   jase:Instagram   owl:sameAs   Google   Android   <p>Facebook  is  not  wai#ng  for  its  ini#al   public  offering  to  make  its  first  big   purchase.</p><p>In  its  largest   acquisi#on  to  date,  the  social  network   has  purchased  Instagram,  the  popular   photo-­‐sharing  applica#on,  for  about  $1   billion  in  cash  and  stock,  the  company   said  Monday.</p>   <p><span  about="h7p:// Facebook"><cite  property=”rdfs:label">Facebook</ cite>  is  not  wai#ng  for  its  ini#al  public  offering  to   make  its  first  big  purchase.</span></p><p><span   about="h7p://">In   its  largest  acquisi#on  to  date,  the  social  network  has   purchased  <cite  property=”rdfs:label">Instagram</ cite>  ,  the  popular  photo-­‐sharing  applica#on,  for   about  $1  billion  in  cash  and  stock,  the  company  said   Monday.</span></p>   RDFa   enrichment   HTML:  
  11. 11. Crowdsourcing   •  Exploit  human  intelligence  to  solve   – Tasks  simple  for  humans,  complex  for  machines   – With  a  large  number  of  humans  (the  Crowd)   – Small  problems:  micro-­‐tasks  (Amazon  MTurk)   •  Examples   – Wikipedia,  Image  tagging   •  Incen#ves   – Financial,  fun,  visibility   Gianluca  Demar#ni   11  
  12. 12. ZenCrowd   •  Combine  both  algorithmic  and  manual  linking   •  Automate  manual  linking  via  crowdsourcing   •  Dynamically  assess  human  workers  with  a   probabilis#c  reasoning  framework   12   Crowd   Algorithms  Machines   Gianluca  Demar#ni  
  13. 13. ZenCrowd  Architecture   Micro Matching Tasks HTML Pages HTML+ RDFa Pages LOD Open Data Cloud Crowdsourcing Platform ZenCrowd Entity Extractors LOD Index Get Entity Input Output Probabilistic Network Decision Engine Micro- TaskManager Workers Decisions Algorithmic Matchers Gianluca  Demar#ni   13   Gianluca  Demar#ni,  Djellel  Eddine  Difallah,  and  Philippe  Cudré-­‐Mauroux.  ZenCrowd:  Leveraging  Probabilis#c   Reasoning  and  Crowdsourcing  Techniques  for  Large-­‐Scale  En#ty  Linking.  In:  21st  Interna#onal  Conference  on   World  Wide  Web  (WWW  2012).  
  14. 14. Algorithmic  Matching   •  Inverted  index  over  LOD  en##es   – DBPedia,  Freebase,  Geonames,  NYT   •  TF-­‐IDF  (IR  ranking  func#on)   •  Top  ranked  URIs  linked  to  en##es  in  docs   •  Threshold  on  the  ranking  func#on  or  top  N   Gianluca  Demar#ni   14  
  15. 15. En#ty  Factor  Graphs   •  Graph  components   – Workers,  links,  clicks   – Prior  probabili#es   – Link  Factors   – Constraints   •  Probabilis#c   Inference   – Select  all  links  with   posterior  prob  >τ   w1 w2 l1 l2 pw1( ) pw2( ) lf1( ) lf2( ) pl1( ) pl2( ) l3 lf3( ) pl3( ) c11 c22 c12 c21 c13 c23 u2-3( )sa1-2( ) 2  workers,  6  clicks,  3  candidate  links   Link  priors   Worker   priors   Observed   variables   Link   factors   SameAs   constraints   Dataset   Unicity   constraints   Gianluca  Demar#ni   15  
  16. 16. En#ty  Factor  Graphs   •  Training  phase   – Ini#alize  worker  priors   – with  k  matches  on  known  answers   •  Upda#ng  worker  Priors   – Use  link  decision  as  new  observa#ons   – Compute  new  worker  probabili#es   •  Iden#fy  (and  discard)  unreliable  workers   Gianluca  Demar#ni   16  
  17. 17. Experimental  Evalua#on   •  Datasets   –  25  news  ar#cles  from   •  (Global  news)   •  (Global  news)   •  Washington-­‐  (US  local  news)   •  (India  news)   •  (Switzerland  local  news)   –  40M  en##es  (Freebase,  DBPedia,  Geonames,  NYT)   Gianluca  Demar#ni   17  
  18. 18. Worker  Selec#on   Gianluca  Demar#ni   18   Top$US$ Worker$ 0$ 0.5$ 1$ 0$ 250$ 500$ Worker&Precision& Number&of&Tasks& US$Workers$ IN$Workers$ 0.6$ 0.62$ 0.64$ 0.66$ 0.68$ 0.7$ 0.72$ 0.74$ 0.76$ 0.78$ 0.8$ 1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$Precision) Top)K)workers)
  19. 19. Lessons  Learnt   •  Crowdsourcing  +  Prob  reasoning  works!   •  But   – Different  worker  communi#es  perform  differently   – Many  low  quality  workers   – Comple#on  #me  may  vary  (based  on  reward)   •  Need  to  find  the  right  workers  for  your  task   (see  WWW13  paper)   Gianluca  Demar#ni   19  
  20. 20. ZenCrowd  Summary   •  ZenCrowd:  Probabilis#c  reasoning  over  automa#c  and   crowdsourcing  methods  for  en#ty  linking   •  Standard  crowdsourcing  improves  6%  over  automa#c   •  4%  -­‐  35%  improvement  over  standard  crowdsourcing   •  14%  average  improvement  over  automa#c  approaches   •  On-­‐going  work:   –  Also  used  for  instance  matching  across  datasets   –  3-­‐way  blocking  with  the  crowd   h7p://   Gianluca  Demar#ni   20  
  21. 21. En#ty  Disambigua#on   in  Scien#fic  Literature   •  Using  a  background  concept  graph   Roman  Prokofyev,  Gianluca  Demar#ni,  Philippe  Cudré-­‐Mauroux,  Alexey  Boyarsky,  and  Oleg  Ruchayskiy.   Ontology-­‐Based  Word  Sense  Disambigua#on  in  the  Scien#fic  Domain.  In:  35th  European  Conference  on   Informa#on  Retrieval  (ECIR  2013).   Gianluca  Demar#ni   21   h7p://  
  22. 22. En#ty  Ranking  
  23. 23. Ad-­‐hoc  Object  Retrieval   •  Once  en##es  have  been  iden#fied…   •  We  want  to  rank  them  as  answer  to  a  query   •  AOR   – Given  the  descrip#on  of  an  en#ty   – give  me  back  its  iden#fier   – Input:  query  q,  data  graph  G   – Output:  ranked  list  of  URIs  from  G   Gianluca  Demar#ni   23  
  24. 24. An  Hybrid  Approach  to  AOR   Alberto  Tonon,  Gianluca  Demar#ni,  and  Philippe  Cudré-­‐Mauroux.  Combining  Inverted  Indices  and  Structured   Search  for  Ad-­‐hoc  Object  Retrieval.  In:  35th  Annual  ACM  SIGIR  Conference  (SIGIR  2012).   index() User Query Annotation and Expansion Inverted Index RDF Store Ranking FunctionsRanking FunctionsRanking Functions query() Entity Search Keyword Query intermediate top-k results Graph-Enriched Results Graph Traversals (queries on object properties) Neighborhoods (queries on datatype properties) Structured Inverted Index WordNet 3rd party search engines Final Ranking Function Pseudo-Relevance Feedback Gianluca  Demar#ni   24  
  25. 25. AOR  Evalua#on   •  1.3  billions  RDF  triples  from  LOD  cloud   •  92  and  50  queries   •  Crowdsourced  relevance  judgments   •   Gianluca  Demar#ni   25  
  26. 26. Evalua#on  Results   Gianluca  Demar#ni   26  
  27. 27. Summary   •  AOR  =  “Given  the  descrip,on  of  an  en,ty,  give   me  back  its  iden,fier”     •  Combining  classic  IR  techniques  +  structured   database  storing  graph  data     •  Significantly  be7er  results  (up  to  +25%  MAP   over  BM25  baseline).     •  Overhead  caused  from  the  graph  traversal   part  is  limited     Gianluca  Demar#ni   27   h7p://  
  28. 28. CrowdQ:  Crowdsourced  Query   Understanding  
  29. 29. birthdate  of  mayor  of  capital  city  of  france   Gianluca  Demar#ni   29  
  30. 30. capital  city  of  france   Gianluca  Demar#ni   30  
  31. 31. mayor  of  paris   Gianluca  Demar#ni   31  
  32. 32. birthdate  of  Bertrand  Delanoë   Gianluca  Demar#ni   32  
  33. 33. Mo#va#on   •  Web  Search  Engines  can  answer  simple  factual   queries  directly  on  the  result  page   •  Users  with  complex  informa#on  needs  are   oyen  unsa#sfied   •  Purely  automa#c  techniques  are  not  enough   •  We  want  to  solve  it  with  Crowdsourcing!   Gianluca  Demar#ni   33  
  34. 34. CrowdQ   •  CrowdQ  is  the  first  system  that  uses   crowdsourcing  to   – Understand  the  intended  meaning   – Build  a  structured  query  template   – Answer  the  query  over  Linked  Open  Data   Gianluca  Demar#ni   34   Gianluca  Demar#ni,  Beth  Trushkowsky,  Tim  Kraska,  and  Michael  Franklin.  CrowdQ:   Crowdsourced  Query  Understanding.  In:  6th  Biennial  Conference  on  Innova#ve  Data  Systems   Research  (CIDR  2013).  
  35. 35. 35  
  36. 36. User Keyword Query On#line'Complex'Query Processing Complex query classifier Crowdsourcing Platform Vetrical selection, Unstructured Search, ... POS + NER tagging Query Template Index Crowd Manager N Y Queries Templ + Answer Types Structured LOD Search Result Joiner Template Generation SERP t1 t2 t3 Off#line'Complex'Query Decomposition Structured Query Query Log query N Answer Composition LOD Open Data Cloud Match with existing query templates CrowdQ  Architecture   36   Off-­‐line:  query  template  genera#on  with  the  help  of  the  crowd   On-­‐line:  query  template  matching  using  NLP  and  search  over  open  data  
  37. 37. Hybrid  Human-­‐Machine  Pipeline   Gianluca  Demar#ni   37   Q=  birthdate  of  actors  of  forrest  gump   Query  annota#on   Noun   Noun   Named  en#ty   Verifica#on   En#ty  Rela#ons   Is  forrest  gump  this  en#ty  in  the  query?   Which  is  the  rela#on  between:  actors  and  forrest  gump   starring   Schema  element   Starring                          <dbpedia-­‐owl:starring>     Verifica#on   Is  the  rela#on  between:   Indiana  Jones  –  Harrison  Ford   Back  to  the  Future  –  Michael  J.  Fox   of  the  same  type  as   Forrest  Gump  -­‐  actors      
  38. 38. Structured  query  genera#on   SELECT  ?y  ?x   WHERE  {  ?y  <dbpedia-­‐owl:birthdate>  ?x  .        ?z  <dbpedia-­‐owl:starring>  ?y  .        ?z  <rdfs:label>  ‘Forrest  Gump’  }   Gianluca  Demar#ni   38   Results  from  BTC09:   Q=  birthdate  of  actors  of  forrest  gump   MOVIE   MOVIE  
  39. 39. Conclusions   •  Structured  Data  make  Web  Search  be7er   •  Exploit  the  best  out  of  structured  and   unstructured  data  (Hybrid  AOR)   •  Crowd  can  help  in  understanding  seman#cs   •  Hybrid  human-­‐machine  systems  (ZenCrowd)   •  Exploit  Human  Intelligence  at  Scale  (CrowdQ) Gianluca  Demar#ni   39