Using SKOS Vocabularies for Improving Web Search

1,839 views

Published on

WoLE 2013 talk

Published in: Technology, Education
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,839
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
19
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

Using SKOS Vocabularies for Improving Web Search

  1. 1. Using  SKOS  Vocabularies  for  Improving  Web  Search  Web  of  Linked  En>>es  Workshop  (WoLE  2013)  WWW  2013,  Rio  de  Janeiro,  May  13th  2013    Bernhard  Haslhofer  |  University  of  Vienna  Flávio  Mar>ns  |  Universidade  Nova  de  Lisboa  João  Magalhães  |  Universidade  Nova  de  Lisboa  
  2. 2. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  2  
  3. 3. What  is  SKOS?  •  A  language  for  describing  Web  vocabularies  (taxonomies,  classifica>on  schemes,  thesauri)  •  Builds  on  Linked  Data  principles  – Concepts  have  URIs    – Concepts  are  interlinked  – Vocabularies  are  expressed  in  RDF  3  
  4. 4. Who  is  using  SKOS?  h"p://www.w3.org/2001/sw/wiki/SKOS/Datasets  4  
  5. 5. SKOS  Example  -­‐  UKAT  "Weapons"skos:prefLabelskos:Conceptukat:859"Military Equipment"skos:prefLabelskos:Conceptukat:5060skos:broaderskos:narrowerskos:broaderskos:narrower"Ordnance"skos:altLabel"Armaments"skos:altLabel"Arms"skos:altLabelukat: http://www.ukat.org.uk/thesaurus/concept/skos: http://www.w3.org/2004/02/skos/core# 5  
  6. 6. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  6  
  7. 7. The  big  picture  Retrieval ModelQueryDocumentsAnalysisAnalysisQueryRepresentationDocumentRepresentationScoring ResultsSKOS-­‐based  term  expansion  7  
  8. 8. Label-­‐based  (Query)  Expansion  •  Query:  roman  arms  •  Document:  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   Weapons  8  
  9. 9. Label-­‐based  (Query)  Expansion  roman armsskos:prefLabelweaponsskos:altLabelarmamentsskos:broaderordnanceskos:broadermilitary equipmentTitle   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   Weapons  matches  9  
  10. 10. URI-­‐based  Expansion  •  Query:  roman  arms  •  Document:  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   hbp://www.ukat.org.uk/thesaurus/concept/859  10  
  11. 11. URI-­‐based  Expansion  11  
  12. 12. URI-­‐based  Expansion  matches  Title   Spearhead  Descrip>on   Roman  iron  spearhead.  The  spearhead  was  abached  to  one  end  of  a  wooden  shac.  .  .  Subject   hbp://www.ukat.org.uk/thesaurus/concept/859  arms,  weapons,  armaments,  ...  Query:  roman  arms  12  
  13. 13. Scoring  •  Apply  regular  text  retrieval  func>ons  •  boostctype:  leverages  explicit  declara>on  of  SKOS  expansion  types  in  term  representa>ons  •  coordq,d:  ensures  that  a  document  with  more  matching  terms  will  score  higher  13  
  14. 14. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  14  
  15. 15. h"ps://github.com/behas/lucene-­‐skos  15  
  16. 16. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Evalua>on  •  Conclusions  16  
  17. 17. Dataset  1  •  OHSUMED:  – 350K  Pubmed  metadata  records  from  270  journals  – Title,  author,  abstract,  …  – 3  level  relevance  judgments  for  informa>on  needs  •  Medical  Subject  Headings  (MeSH)  in  SKOS  – Maintained  by  US  Na>onal  Library  of  Medicine  – Used  to  index  millions  of  ar>cles  in  PubMed    17  
  18. 18. Dataset  2  •  8,905  MODS  metadata  records  harvested  from  the  Library  of  Congress  (LoC)  •  10  queries  from  the  2009/2010  query  collec>on  •  Binary  relevance  judgments  for  queries    •  Library  of  Congress  Subject  Headings  (LCSH)  in  SKOS  18  
  19. 19. Method  •  Focus  on  label-­‐based  expansion  at  query  >me  •  Normalized  queries  and  SKOS  vocabularies  •  Expanded  each  query  term  by  collec>ng  SKOS  concepts  that  have  that  term  in  any  of  their  labels  (prefLabel,  altLabel,  hiddenLabel)  •  Applied  pre-­‐defined  boost-­‐weights  that  maximized  query  performance  19  
  20. 20. Two  Baselines  •  No  term  expansion  (NoExp)  – queries  are  executed  over  documents  without  SKOS-­‐based  term  expansion.  •  Pseudo  Relevance  Feedback  (PRF):  – Perform  ini>al  search  with  original  query  – Collect  terms  from  k  retrieved  documents  – Resubmit  query  +  collected  terms  20  
  21. 21. OHSUMED  query  expansion  example  Query:  fibromyalgia fibrositis, diagnosis and treatment  Expanded  Query:  (fibromyalgia rheumatism muscular^0.5 diffuse myofascial pain syndrome^0.5fibromyalgia fibromyositis syndrome^0.5 myofascial pain syndrome diffuse^0.5fibromyositis fibromyalgia syndrome^0.5 fibrositis^0.5)  (fibrositis rheumatism muscular^0.5 diffuse myofascial pain syndrome^0.5fibromyalgia fibromyositis syndrome^0.5 myofascial pain syndrome diffuse^0.5fibromyositis fibromyalgia syndrome^0.5 fibromyalgia^0.5)  (diagnosis examinations diagnoses^0.5)  (treatment disease management^0.5 therapy^0.5)  21  
  22. 22. Results  OHSUMED/MESH  LTC TF-IDFP@1 P@3 P@10 nDCG@1 nDCG@3 nDCG@10 PNoExp 0.333 0.302 0.260 0.407 0.376 0.356 0.3PRF 0.322 0.282 0.302 0.379 0.360 0.393 0.3SKOS 0.419 0.366 0.276 0.484 0.429 0.379 0.5Table 1: Precision and nDCG results onquery set provided with each dataset (63 OHSUMED/MESH,10 LoC/LCSH). We focused on two measures, precision atrank n (P@n) for both dataset bundles and nDCG at rank n(nDCG@n) for the OHSUMED/MESH bundle, which pro-vides ordinal relevance judgments. Two retrieval modelswere used for ranking documents: LTC TF-IDF and BM25.4.1 Dataset Characteristicsonlyoutco25 mwithoapartPRFdocumBM25nDCG@10 P@1 P@3 P@10 nDCG@1 nDCG@3 nDCG@100.356 0.381 0.344 0.265 0.450 0.414 0.3740.393 0.377 0.317 0.275 0.443 0.397 0.3690.379 0.500 0.366 0.282 0.548 0.435 0.397CG results on the OHSUMED dataset.22  
  23. 23. Ini>al  Results  LoC/LCSH  .fsh.m-ee-ede.--fPRF might lead to better recall i.e., return more relevantdocuments, but hurt precison at top ranks. On the otherhand, with SKOS-expansion the terms used in expansionare always from a SKOS vocabulary and not from the cor-pus. Therefore, we are already only expanding using rele-vant terms, since the terms’ existence in the vocabulary is agood hint that the term is important.LTC TF-IDFP@1 P@3 P@10NoExp 0.500 0.500 0.370PRF 0.300 0.433 0.430SKOS 0.600 0.533 0.370Table 2: Precision with LTC TF-IDF results on theLibrary of Congress dataset.Table 2 shows the results we obtained from early experi-ments with the Library of Congress Subject Headings. Theresults are similar to the results with OHSUMED. The PRF23  
  24. 24. Overview  •  A  brief  intro  to  SKOS  •  SKOS-­‐based  term  expansion  •  Implementa>on:  Lucene-­‐SKOS  •  Experiments  •  Conclusions  24  
  25. 25. Conclusions  •  Our  experiments  indicated  gains  in  retrieval  effec>veness  compared  to  no-­‐expansion  or  pseudo  relevance  feedback  •  Our  solu>on  can  easily  be  adopted  by  loading  lucene-­‐skos  with  Apache  Lucene  and  Solr  25  
  26. 26. Future  Work  •  More  thorough  evalua>on  using  LoC/LCSH    •  Use  corpora,  queries,  and  vocabularies  from  other  domains  •  Apply  this  technique  for  general  concept-­‐based  and  URI-­‐iden>fied  Web  data  sources  26  
  27. 27. Thanks!  @bhaslhofer,  @flaviomar>ns  hbp://slideshare.net/bhaslhofer    hbps://github.com/behas/lucene-­‐skos  27  

×