Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Using SKOS Vocabularies for Improving Web Search

2,034 views

Published on

WoLE 2013 talk

Published in: Technology, Education
  • Be the first to comment

Using SKOS Vocabularies for Improving Web Search

  1. 1. Using  SKOS  Vocabularies  for  Improving  Web  Search  Web  of  Linked  En>>es  Workshop  (WoLE  2013)  WWW  2013,  Rio  de  Janeiro,  May  13th  2013    Bernhard  Haslhofer  |  University  of  Vienna  Flávio  Mar>ns  |  Universidade  Nova  de  Lisboa  João  Magalhães  |  Universidade  Nova  de  Lisboa  
  2. 2. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  2  
  3. 3. What  is  SKOS?  •  A  language  for  describing  Web  vocabularies  (taxonomies,  classifica>on  schemes,  thesauri)  •  Builds  on  Linked  Data  principles  – Concepts  have  URIs    – Concepts  are  interlinked  – Vocabularies  are  expressed  in  RDF  3  
  4. 4. Who  is  using  SKOS?  h"p://www.w3.org/2001/sw/wiki/SKOS/Datasets  4  
  5. 5. SKOS  Example  -­‐  UKAT  "Weapons"skos:prefLabelskos:Conceptukat:859"Military Equipment"skos:prefLabelskos:Conceptukat:5060skos:broaderskos:narrowerskos:broaderskos:narrower"Ordnance"skos:altLabel"Armaments"skos:altLabel"Arms"skos:altLabelukat: http://www.ukat.org.uk/thesaurus/concept/skos: http://www.w3.org/2004/02/skos/core# 5  
  6. 6. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  6  
  7. 7. The  big  picture  Retrieval ModelQueryDocumentsAnalysisAnalysisQueryRepresentationDocumentRepresentationScoring ResultsSKOS-­‐based  term  expansion  7  
  8. 8. Label-­‐based  (Query)  Expansion  •  Query:  roman  arms  •  Document:  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   Weapons  8  
  9. 9. Label-­‐based  (Query)  Expansion  roman armsskos:prefLabelweaponsskos:altLabelarmamentsskos:broaderordnanceskos:broadermilitary equipmentTitle   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   Weapons  matches  9  
  10. 10. URI-­‐based  Expansion  •  Query:  roman  arms  •  Document:  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   hbp://www.ukat.org.uk/thesaurus/concept/859  10  
  11. 11. URI-­‐based  Expansion  11  
  12. 12. URI-­‐based  Expansion  matches  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   hbp://www.ukat.org.uk/thesaurus/concept/859  arms,  weapons,  armaments,  ...  Query:  roman  arms  12  
  13. 13. Scoring  •  Apply  regular  text  retrieval  func>ons  •  boostctype:  leverages  explicit  declara>on  of  SKOS  expansion  types  in  term  representa>ons  •  coordq,d:  ensures  that  a  document  with  more  matching  terms  will  score  higher  13  
  14. 14. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  14  
  15. 15. h"ps://github.com/behas/lucene-­‐skos  15  
  16. 16. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  16  
  17. 17. Dataset  1  •  OHSUMED:  – 350K  Pubmed  metadata  records  from  270  journals  – Title,  author,  abstract,  …  – 3  level  relevance  judgments  for  informa>on  needs  •  Medical  Subject  Headings  (MeSH)  in  SKOS  – Maintained  by  US  Na>onal  Library  of  Medicine  – Used  to  index  millions  of  ar>cles  in  PubMed    17  
  18. 18. Dataset  2  •  8,905  MODS  metadata  records  harvested  from  the  Library  of  Congress  (LoC)  •  10  queries  from  the  2009/2010  query  collec>on  •  Binary  relevance  judgments  for  queries    •  Library  of  Congress  Subject  Headings  (LCSH)  in  SKOS  18  
  19. 19. Method  •  Focus  on  label-­‐based  expansion  at  query  >me  •  Normalized  queries  and  SKOS  vocabularies  •  Expanded  each  query  term  by  collec>ng  SKOS  concepts  that  have  that  term  in  any  of  their  labels  (prefLabel,  altLabel,  hiddenLabel)  •  Applied  pre-­‐defined  boost-­‐weights  that  maximized  query  performance  19  
  20. 20. Two  Baselines  •  No  term  expansion  (NoExp)  – queries  are  executed  over  documents  without  SKOS-­‐based  term  expansion.  •  Pseudo  Relevance  Feedback  (PRF):  – Perform  ini>al  search  with  original  query  – Collect  terms  from  k  retrieved  documents  – Resubmit  query  +  collected  terms  20  
  21. 21. OHSUMED  query  expansion  example  Query:  fibromyalgia fibrositis, diagnosis and treatment  Expanded  Query:  (fibromyalgia rheumatism muscular^0.5 diffuse myofascial pain syndrome^0.5fibromyalgia fibromyositis syndrome^0.5 myofascial pain syndrome diffuse^0.5fibromyositis fibromyalgia syndrome^0.5 fibrositis^0.5)  (fibrositis rheumatism muscular^0.5 diffuse myofascial pain syndrome^0.5fibromyalgia fibromyositis syndrome^0.5 myofascial pain syndrome diffuse^0.5fibromyositis fibromyalgia syndrome^0.5 fibromyalgia^0.5)  (diagnosis examinations diagnoses^0.5)  (treatment disease management^0.5 therapy^0.5)  21  
  22. 22. Results  OHSUMED/MESH  LTC TF-IDFP@1 P@3 P@10 nDCG@1 nDCG@3 nDCG@10 PNoExp 0.333 0.302 0.260 0.407 0.376 0.356 0.3PRF 0.322 0.282 0.302 0.379 0.360 0.393 0.3SKOS 0.419 0.366 0.276 0.484 0.429 0.379 0.5Table 1: Precision and nDCG results onquery set provided with each dataset (63 OHSUMED/MESH,10 LoC/LCSH). We focused on two measures, precision atrank n (P@n) for both dataset bundles and nDCG at rank n(nDCG@n) for the OHSUMED/MESH bundle, which pro-vides ordinal relevance judgments. Two retrieval modelswere used for ranking documents: LTC TF-IDF and BM25.4.1 Dataset Characteristicsonlyoutco25 mwithoapartPRFdocumBM25nDCG@10 P@1 P@3 P@10 nDCG@1 nDCG@3 nDCG@100.356 0.381 0.344 0.265 0.450 0.414 0.3740.393 0.377 0.317 0.275 0.443 0.397 0.3690.379 0.500 0.366 0.282 0.548 0.435 0.397CG results on the OHSUMED dataset.22  
  23. 23. Ini>al  Results  LoC/LCSH  .fsh.m-ee-ede.--fPRF might lead to better recall i.e., return more relevantdocuments, but hurt precison at top ranks. On the otherhand, with SKOS-expansion the terms used in expansionare always from a SKOS vocabulary and not from the cor-pus. Therefore, we are already only expanding using rele-vant terms, since the terms’ existence in the vocabulary is agood hint that the term is important.LTC TF-IDFP@1 P@3 P@10NoExp 0.500 0.500 0.370PRF 0.300 0.433 0.430SKOS 0.600 0.533 0.370Table 2: Precision with LTC TF-IDF results on theLibrary of Congress dataset.Table 2 shows the results we obtained from early experi-ments with the Library of Congress Subject Headings. Theresults are similar to the results with OHSUMED. The PRF23  
  24. 24. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Experiments  •  Conclusions  24  
  25. 25. Conclusions  •  Our  experiments  indicated  gains  in  retrieval  effec>veness  compared  to  no-­‐expansion  or  pseudo  relevance  feedback  •  Our  solu>on  can  easily  be  adopted  by  loading  lucene-­‐skos  with  Apache  Lucene  and  Solr  25  
  26. 26. Future  Work  •  More  thorough  evalua>on  using  LoC/LCSH    •  Use  corpora,  queries,  and  vocabularies  from  other  domains  •  Apply  this  technique  for  general  concept-­‐based  and  URI-­‐iden>fied  Web  data  sources  26  
  27. 27. Thanks!  @bhaslhofer,  @flaviomar>ns  hbp://slideshare.net/bhaslhofer    hbps://github.com/behas/lucene-­‐skos  27  

×