Comparing taxonomies for organising collections of documents presentation
  • 1. Comparing taxonomies for organising collections of documents
    Samuel Fernando, Mark Hall, Eneko Agirre, Aitor Soroa, Paul Clough, Mark Stevenson
    COLING 2012, 14th December 2012, Mumbai, India
  • 2. Introduction
    ● Large collections of diverse data are available online. The PATHS project aims to support user exploration in digital library collections.
    ● A search box is useful, but taxonomies are better suited for exploration and browsing.
    ● We apply taxonomies to organise data from a large digital library collection.
    ● The process is automatic: either map items to an existing taxonomy, or induce a taxonomy from the data.
  • 3. Evaluation data
    ● We use items from Europeana, a large online collection of cultural heritage.
    ● We use the English subset, approx. 550,000 items.
    ● An item typically contains a picture, a title, a description and subject keywords.
    ● Very diverse data comprising artifacts, places and people. Topics include fashion, archaeology, architecture and many other subjects.
    ● Data come from many providers, some of which use taxonomies and some of which don’t, so a unified approach is needed.
  • 4. Example item
    Title: Design Council Slide Collection
    Subject: colour, exhibitions, industrial design
    Description: Display on the theme of colour matching at the Design Centre, London, 1960
  • 5. Manually created taxonomies
    ● We use four existing manually created taxonomies:
      – LCSH (Library of Congress)
      – WordNet domains
      – Wikipedia Taxonomy
      – DBpedia ontology
    ● The taxonomies already exist and are of good quality, but the problem is to map Europeana items into the correct place in the taxonomy.
  • 6. LCSH
    ● A controlled vocabulary maintained by the US Library of Congress for bibliographic records.
    ● Used by libraries to organise collections and also by curators of cultural heritage.
    ● Subject keywords are used to map Europeana items into the appropriate LCSH category nodes.
    industrial design → design → creation (literary, artistic, etc.) → intellect
    +30 more higher level headings
  • 7. WordNet domains
    ● WordNet domains (Bernardo Magnini, LREC 2000) applies a small set of 164 domain labels to the WordNet synsets.
    ● Again we use subject keywords to map Europeana items: first to Yago2 (for proper nouns), then to a synset and finally to a WordNet domain label.
    tourism → social
    color → factotum
    art → humanities
    + 5 more
  • 8. Wikipedia Taxonomy
    ● The Wikipedia category hierarchy, preserving only is-a relations; all others are discarded.
    ● Run Wikipedia Miner over each Europeana item to identify Wikipedia articles in the subject keywords, then map the item to all categories that contain these articles.
    design → visual_arts → criticism
    image_processing → digital_signal_processing → signal_processing
    museology → museums → educational_organizations → organizations
    +35 more
  • 9. DBpedia ontology
    ● A formalised shallow ontology manually created based on Wikipedia (with inference capability).
    ● Again we use Wikipedia Miner to find Wikipedia articles in the subject keywords of each item, and map the item to the categories to which these articles belong.
    musical_work → work
    work
    album → musicalwork → work
  • 10. Automatic data-derived taxonomies
    ● We use two approaches to derive taxonomies automatically from the Europeana data.
      – LDA (Latent Dirichlet Allocation) topic modelling
      – WikiFreq (Wikipedia Frequency hierarchy)
    ● Taxonomies fit the data: no unnecessary nodes to prune.
    ● Mapping from items to concept nodes is implicit during derivation.
  • 11. LDA topic modelling
    ● Latent Dirichlet Allocation (LDA) maps each item to one or more topics.
    ● Each item is a distribution over topics; each topic is a distribution over words.
    ● Item-topic and topic-word distributions are learned using collapsed Gibbs sampling.
    ● Has been used to improve IR results.
    ● Previous work has developed hierarchical LDA, but this is infeasible over our large data set.
  • 12. Hierarchical LDA topics
    ● Run LDA over the corpus to determine item-topic probabilities.
    ● Identify the set of items for each topic. Each item is assigned to its highest probability topic. Each topic is labelled with its highest probability word.
    ● If a topic has fewer than 60 items then stop. Otherwise go back to the first step, using the set of items identified in the previous step as the corpus.
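The recursive procedure on this slide can be sketched as follows. This is a minimal sketch, not the authors' implementation: the LDA step is hidden behind a hypothetical `assign_topics` callable that returns, for each item, the label of its highest probability topic, and `toy_assigner` is only a stand-in so the example runs without a topic-modelling library. The guard that stops when a split yields a single group is an added assumption to guarantee termination.

```python
import os
from collections import defaultdict

MIN_ITEMS = 60  # threshold from the slide: topics with fewer items are not split


def build_hierarchy(items, assign_topics, min_items=MIN_ITEMS):
    """Recursively split `items` into topic nodes.

    `assign_topics` is a hypothetical callable standing in for LDA: it takes
    a list of items and returns {item: label of its most probable topic}.
    """
    node = {"items": list(items), "children": {}}
    if len(node["items"]) < min_items:
        return node  # too few items: stop here
    groups = defaultdict(list)
    for item, label in assign_topics(node["items"]).items():
        groups[label].append(item)
    if len(groups) < 2:
        return node  # assumption: stop if the model cannot split the corpus
    for label, members in groups.items():
        # Each topic's items become the corpus for the next round.
        node["children"][label] = build_hierarchy(members, assign_topics, min_items)
    return node


def toy_assigner(items):
    """Stand-in for LDA: label each item by one character past the shared prefix."""
    prefix = os.path.commonprefix(list(items))
    return {item: item[: len(prefix) + 1] for item in items}


# Tiny corpus with a lowered threshold so the recursion is visible.
tree = build_hierarchy(["ax1", "ax2", "ay1", "b1"], toy_assigner, min_items=3)
```

With a real topic model, `assign_topics` would fit the model on the given items and return the label (highest probability word) of each item's most probable topic.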
  • 13. Hierarchical LDA topics (example)
    Bangle → design → design → design → brooch → collection
  • 14. Wikipedia link frequencies
    ● Novel approach.
    ● Run Wikipedia Miner to find links in all Europeana items, using title, subject and description.
    ● Find frequency counts for each link.
    ● For each item take the set of links found.
    ● Create a taxonomy branch (if not already present) with the links in order of frequency (most frequent first).
    ● Map the item to the least frequent link.
  • 15. Wikipedia link frequencies (cont.)
    ● Large number of concept nodes: limit to 24 children for each node.
    ● Require at least 2 links for each item, filtering out items with little metadata.
    ● Filter out concepts with fewer than 20 items.
    industrial design → design council
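The WikiFreq construction described on the last two slides can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: each link is counted once per item, ties in frequency are broken alphabetically, and the input dictionary stands in for Wikipedia Miner output. The 24-children cap and the removal of concepts with fewer than 20 items are left out, since the slides do not say what happens to the affected items.

```python
from collections import Counter


def build_wikifreq(item_links, min_links=2):
    """Build a frequency-ordered taxonomy from {item: [links found in item]}."""
    # Frequency of each link across the collection (assumption: once per item).
    freq = Counter(link for links in item_links.values() for link in set(links))
    root = {"items": [], "children": {}}
    for item, links in item_links.items():
        unique = set(links)
        if len(unique) < min_links:
            continue  # too little metadata: item is filtered out
        # Most frequent link first; alphabetical tie-break is an assumption.
        path = sorted(unique, key=lambda link: (-freq[link], link))
        node = root
        for link in path:
            # Create the branch only if it is not already present.
            node = node["children"].setdefault(link, {"items": [], "children": {}})
        node["items"].append(item)  # item sits at its least frequent link
    return root


# Hypothetical items and links, for illustration only.
taxonomy = build_wikifreq({
    "i1": ["design", "industrial design"],
    "i2": ["design", "industrial design", "design council"],
    "i3": ["design"],
})
```

Here `i3` is dropped because it has a single link, `i1` lands on "industrial design", and `i2` lands one level deeper on "design council".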
  • 16. Statistics
    Type       Taxonomy    Items    Nodes    Avg. parents  Avg. depth  Top nodes
    Manual     LCSH        99259    285238   1.8           1.97        28901
               DBpedia     178312   273      4.2           2           30
               WikiTax     275359   121359   11.7          1.13        10417
               WN domains  308687   170      7.1           7.1         6
    Automatic  LDA topics  545896   22494    1             7.3         9
               Wiki Freq   66558    502      1             3.39        24
  • 17. Evaluation - cohesion
    ● Intruder detection was originally proposed in (Chang et al., 2009).
    ● A cohesive unit is defined as one in which the items are similar while at the same time different from items in other clusters.
    ● Present 5 items to each annotator: 4 from one concept node, and an intruder item drawn at random from elsewhere in the taxonomy.
    ● The more cohesive the unit, the more obvious the intruder will be.
    ● Crowd-sourcing: 111 annotators, 30 units from each taxonomy.
    ● 1255 answers, an average of 7 annotators for each unit.
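Assembling one intruder-detection unit as described above can be sketched like this. It is a minimal sketch: the node and item names are hypothetical, and excluding candidate intruders that also belong to the target node is an added assumption (the slide only says the intruder comes from elsewhere in the taxonomy).

```python
import random


def make_unit(node_items, target, rng, size=4):
    """Return (shuffled unit, intruder) for the concept node `target`."""
    members = rng.sample(node_items[target], size)
    in_target = set(node_items[target])
    # Candidate intruders: items from other nodes that are not also in the target.
    pool = sorted(
        {item for node, items in node_items.items() if node != target for item in items}
        - in_target
    )
    intruder = rng.choice(pool)
    unit = members + [intruder]
    rng.shuffle(unit)  # hide the intruder's position from the annotator
    return unit, intruder


# Hypothetical concept nodes; "ring" appears in both and cannot be the intruder.
nodes = {
    "jewellery": ["bangle", "brooch", "ring", "necklace", "pendant"],
    "coins": ["penny", "denarius", "ring"],
}
unit, intruder = make_unit(nodes, "jewellery", random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.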
  • 18. Example of a cohesive unit
  • 19. Evaluation - cohesion results
    Type       Taxonomy       Cohesive units  Percentage
    Manual     LCSH           19              63.3
               DBpedia        17              56.7
               Wiki Taxonomy  18              60.0
               WN domains     15              50.0
    Automatic  LDA topics     17              56.7
               Wiki Freq      29              96.7
    Number of cohesive units (out of a possible 30)
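The percentage column follows directly from the counts, since each taxonomy was evaluated on 30 units. A short arithmetic check, using only the figures from the table:

```python
UNITS_PER_TAXONOMY = 30  # units evaluated per taxonomy, from the slide

# Cohesive unit counts copied from the results table.
cohesive_units = {
    "LCSH": 19,
    "DBpedia": 17,
    "Wiki Taxonomy": 18,
    "WN domains": 15,
    "LDA topics": 17,
    "Wiki Freq": 29,
}

# Percentage of cohesive units, rounded to one decimal place as in the table.
percentages = {
    name: round(100 * count / UNITS_PER_TAXONOMY, 1)
    for name, count in cohesive_units.items()
}
```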
  • 20. Evaluation - relation classification
    ● Previous work has typically used a simple boolean question: “is it true that ChildNode is-a ParentNode?”
    ● We ask two questions for each child-parent pair A and B:
      – Are the concepts A and B related?
      – If they are, is A more specific than B, less specific than B, or neither?
    ● Crowd-sourcing: 173 annotators, 40 pairs from each taxonomy, each pair evaluated on average 16 times.
  • 21. Evaluation - example pairs
    Taxonomy       Child (A)             Parent (B)
    LCSH           Work                  Human Behaviour
                   Braid                 Weaving
    DBpedia        Mountain Range        Place
                   Fern                  Plant
    Wiki Taxonomy  Mammals of Africa     Wildlife of Africa
                   Schools in Wiltshire  Schools in England
    WN domains     vehicles              transport
                   mechanics             engineering
    LDA topics     earthenware           dish
                   view                  church
    Wiki Freq      Corrosion             Coin
                   Interior Design       Industrial Design
  • 22. Are A and B related? (% of answers)
    Taxonomy       Yes   No    Don't know
    LCSH           74.2  8.8   17.0
    DBpedia        86.6  11.2  2.2
    Wiki Taxonomy  96.1  1.7   2.3
    WN domains     77.1  14.5  8.4
    LDA topics     30.3  50.3  19.3
    Wiki Freq      47.6  16.5  35.8
  • 23. Which is more specific? (% of answers)
    Taxonomy       A<B   A>B   Neither  Don't know
    LCSH           65.4  8.7   23.4     2.5
    DBpedia        76.2  4.9   18.1     0.7
    Wiki Taxonomy  78.3  4.7   16.0     0.9
    WN domains     63.6  6.3   28.0     2.0
    LDA topics     21.4  14.8  62.1     1.6
    Wiki Freq      30.9  22.6  43.6     2.9
  • 24. Conclusions
    ● Wikipedia Taxonomy is conceptually well organised, even better than LCSH, which has been widely used for organising library collections.
    ● WikiFreq gives very high cohesion for items, although the conceptual relations are not well defined.
    ● Future work continues with different intrinsic and user evaluations.
    ● We also aim to combine Wikipedia Taxonomy and WikiFreq to get the best of both.
  • 25. PATHS project: http://paths-project.eu
    Funded by the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270082. This research was also partially funded by the Ministry of Economy under grant TIN2009-14715-C04-01 (KNOW2 project).
    Questions?