Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)


Published in: Technology, Education

  1. Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach. Presentation to STC 2006. Brad Allen, Founder and CTO, Siderean Software, Inc.
  2. Preface
       • This is not rocket science
       • This is appropriate semantic technology
       • What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
  3. Disclaimer
       • We will be viewing uncontrolled vocabulary from the Web live
       • Sometimes it’s not pretty
       • Please don’t be offended
  4. The problem
       • Associating subject metadata with content and data is an old technique for improving precision and recall in search
       • Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
       • User tagging and the emergence of folksonomies have changed the economics of subject metadata creation, but at the cost of quality
       • Can the two approaches to subject metadata be combined into one that offers the advantages of both while addressing their shortcomings?
  5. Taxonomies
       • A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
       • Advantages
           • Authoritative reference for terms and their relational semantics
           • Can support reasoning and classification
       • Disadvantages
           • Creation requires training and discipline
           • Expensive and slow to track changes in usage
       • Adoption
           • Pervasive for decades throughout the information science and IT communities
  6. Folksonomies
       • A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
       • Advantages
           • The cost of creation can be shared across many untrained users
           • Can track changes in usage in real time
       • Disadvantages
           • Lexical variations (misspellings, inconsistent case or white space)
           • Lack of relational semantics
           • Sense ambiguity
       • Adoption
           • Rapid growth on the Web (, Flickr) and emerging in enterprise pilots (IBM, DKW)
  7. It’s an old story: neats vs. scruffies
       • The taxonomy/thesaurus tradition is solid
       • But user-generated metadata is gold
       • A good solution should leverage aspects of both approaches
  8. Bridging the gap
       • The key ideas
           • User tagging gets tags into the repository as “author keywords”
               • Ingested through RSS feeds with tagged items
           • Tags are related to terms in (separately defined) taxonomies
           • Users can search using one or the other or both
       • Result
           • Folksonomies make taxonomies more responsive
           • Taxonomies make folksonomies more responsible
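The "one or the other or both" search idea on slide 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item ids, tags, taxonomy terms, and the `TAG_TO_TERM` crosswalk are all made up, and the slides do not say how Siderean's product implements search.

```python
# Hypothetical data: item id -> tags assigned by users (the folksonomy side).
ITEM_TAGS = {
    "item1": {"sf", "travel"},
    "item2": {"sanfrancisco"},
    "item3": {"scifi"},
}

# Hypothetical crosswalk: tag -> term in a separately defined taxonomy.
TAG_TO_TERM = {
    "sanfrancisco": "San Francisco",
    "scifi": "Science fiction",
    "travel": "Travel",
}


def search(tags=(), terms=()):
    """Return ids of items that carry every given tag and every given term."""
    hits = []
    for item, item_tags in sorted(ITEM_TAGS.items()):
        # Terms are derived from the item's tags through the crosswalk.
        item_terms = {TAG_TO_TERM[t] for t in item_tags if t in TAG_TO_TERM}
        if set(tags) <= item_tags and set(terms) <= item_terms:
            hits.append(item)
    return hits


by_tag = search(tags=["sf"])                     # folksonomy only
by_term = search(terms=["San Francisco"])        # taxonomy only
by_both = search(tags=["sf"], terms=["Travel"])  # both at once
```

The ambiguous tag "sf" is deliberately left out of the crosswalk here, so it is reachable only as a tag until someone relates it to a term.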
  9. Example from DCMI Conference Thesaurus
  10. Building the bridge with ontologies
       • SKOS
           • Lexical vs. concept-based thesauri
           • Modeling taxonomies in SKOS
               • skos:Concept
               • skos:broader/skos:narrower
               • skos:related
       • Dublin Core (DC)
           • Basic asset metadata for modeling content creation
               • dc:creator
               • dc:dateSubmitted
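As a rough illustration of how skos:broader and skos:narrower links support simple reasoning over a taxonomy, the sketch below stores the links in a plain dictionary. The concept names are invented for the example, and the code assumes each concept has at most one broader concept and that the links contain no cycles.

```python
# skos:broader links for a tiny, made-up taxonomy: narrower -> broader.
BROADER = {
    "Air pollution": "Pollution",
    "Water pollution": "Pollution",
    "Pollution": "Environmental health",
}


def ancestors(concept):
    """Follow skos:broader links upward and return the chain of broader concepts."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain


def narrower(concept):
    """Return the direct skos:narrower concepts (the inverse of skos:broader)."""
    return sorted(c for c, b in BROADER.items() if b == concept)
```

A search for "Pollution" could then be expanded to items indexed with any concept returned by `narrower("Pollution")`.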
  11. Modeling folksonomies in SKOS and DC
       • Represent each tag as a skos:Concept
       • The prefLabel of the concept is the tag
       • The item is skos:subjectOf the concept
       • The concept is skos:inScheme associated with the RSS channel
       • No broader/narrower/related relationships (at least initially)
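The modeling rules on slide 11 can be written out as subject/predicate/object triples. The following is a minimal sketch using plain Python tuples rather than an RDF library. The property names follow the slide's wording (including skos:subjectOf), while the function name, the example URIs, and the way a concept URI is minted from the channel URI are assumptions made for illustration.

```python
def tag_triples(item_uri, tag, channel_uri, creator, date_submitted):
    """Build the triples for one tagged RSS item, following slide 11."""
    # Assumption: the concept URI is derived from the channel URI and the tag.
    concept = f"{channel_uri}#tag-{tag}"
    return [
        (concept, "rdf:type", "skos:Concept"),    # each tag is a concept
        (concept, "skos:prefLabel", tag),         # the tag is the prefLabel
        (concept, "skos:inScheme", channel_uri),  # scheme tied to the RSS channel
        (item_uri, "skos:subjectOf", concept),    # item/concept link, as worded on the slide
        (item_uri, "dc:creator", creator),
        (item_uri, "dc:dateSubmitted", date_submitted),
    ]


triples = tag_triples(
    item_uri="http://example.org/items/42",   # hypothetical item
    tag="semanticweb",
    channel_uri="http://example.org/feed",    # hypothetical RSS channel
    creator="alice",
    date_submitted="2006-03-07",
)
```

No broader, narrower, or related triples are emitted, matching the slide's note that tags start out unrelated.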
  12. Addressing the shortcomings
       • Reduce/eliminate lexical variation
           • Merge variants into a single concept using skos:prefLabel and skos:altLabel
       • Relate tags to terms and other tags
           • Tag the tags with categories
           • Place tags in time and space
               • The dc:dateSubmitted of the item is associated with its tags
               • Geolocation metadata can be added to concepts representing physical locations
           • Tags are related to other tags through shared skos:subjectOf relationships with items
       • Compensate for ambiguous tags with term indexing
           • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
       • Allow users to exploit tags and terms concurrently
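One possible way to merge lexical variants into a single concept is sketched below. It is an assumption-laden illustration, not the method used in the talk: variants are grouped by a simple normalization (lowercase, whitespace removed), the most frequent spelling becomes the skos:prefLabel, and the rest become skos:altLabel values. Misspellings would need something stronger, such as edit distance, which is not shown.

```python
from collections import Counter, defaultdict


def normalize(tag):
    """Collapse case and whitespace differences into one grouping key."""
    return "".join(tag.lower().split())


def merge_variants(tags):
    """Group tag spellings by normalized key and pick preferred/alternate labels."""
    groups = defaultdict(Counter)
    for tag in tags:
        groups[normalize(tag)][tag] += 1

    concepts = {}
    for key, counts in groups.items():
        # Most frequent spelling first; ties are broken alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        concepts[key] = {"prefLabel": ranked[0], "altLabels": ranked[1:]}
    return concepts


merged = merge_variants(
    ["Semantic Web", "semanticweb", "semanticweb", "SemanticWeb", "RDF", "rdf", "rdf"]
)
```

Choosing the most frequent spelling as the preferred label is just one policy; an editor could equally pick the label by hand.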
  13. Social aspects
       • The role of the community of interest and focused collections of edge content
       • A virtuous circle where navigation and tagging continuously improve quality of subject indexing
       • A disruptive impact on the economics of knowledge management
       [Diagram: a Community of Interest connecting Content Producers and Content Consumers; labels: (Indexed) content, Tagged content, Navigation and tagging]
  14. Case studies and demonstrations
       • Environmental Health News
           • RSS item categorization
       •
           • RSS/Atom into SKOS/FOAF/DC
       • BBC Rushes
           • Crosswalks
  15. Case study: Environmental Health News
       • Aggregating content from hundreds of Web pages daily
           • 10^5 Web pages
           • 10^3 originating sites
           • 10^1 editors
           • 10^4 subscribers
       • Adding value at the metadata level to the Web at large for a focused community of interest
           • Policy makers
           • Activists
           • Researchers
  16. Case study:
       • Aggregating feeds from a social bookmarking site
           • 10^5 Web pages
           • 10^4 tags
           • 10^4 contributors
           • 10^4 originating sites
       • Combining user tagging with faceted navigation
           • “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
           • “… the most comprehensive tool for searching the database of”
           • “Siderean’s half-year test makes the narrowness of the service evident.”
  17. Case study: BBC Rushes
       • Joint work with Accenture Technology Labs for TRECVID program
       • BBC Rushes: 49.3 hours of raw video
           • 4 issues of “Summer Holiday” (~ 2 hours)
           • BBC One News (30’) + fragment (~3’)
       • Faceted navigation using both textual and visual features
  18. Future work
       • (Semi)automatic folksonomy/taxonomy crosswalk generation
           • The notion of “relatedness”
               • By co-occurrence
               • By explicit warrant
       • Machine learning for tag sense disambiguation
           • Co-training using content that is simultaneously tagged and indexed
       • Tag spam filtering
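Relatedness "by co-occurrence" can be approximated by counting how often two tags appear on the same item. The sketch below is one simple reading of that idea, with invented items and tags and an arbitrary `min_count` threshold. The slides list this as future work and do not specify a measure, so the raw pair count used here is an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(item_tags):
    """Count, for every unordered pair of tags, the items that carry both."""
    pairs = Counter()
    for tags in item_tags.values():
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


def related(tag, pairs, min_count=2):
    """Return (other_tag, count) for tags sharing at least min_count items with tag."""
    out = []
    for (a, b), n in pairs.items():
        if n >= min_count and tag in (a, b):
            out.append((b if a == tag else a, n))
    # Strongest relationships first; ties are broken alphabetically.
    return sorted(out, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical tagged items.
ITEMS = {
    "i1": ["rdf", "skos", "semanticweb"],
    "i2": ["rdf", "semanticweb"],
    "i3": ["skos", "semanticweb"],
    "i4": ["rdf"],
}
PAIRS = cooccurrence(ITEMS)
```

Pairs that clear the threshold could be proposed to an editor as candidate skos:related links, which keeps a human in the loop for the "semi" part of (semi)automatic crosswalk generation.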
  19. Siderean Software, Inc., 390 North Sepulveda Blvd., Suite 2070, El Segundo, CA 90245-4475 USA. +1 310 647-4266. ballen at siderean dot com