Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)
  • 1. Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach. Presentation to STC 2006. Brad Allen, Founder and CTO, Siderean Software, Inc.
  • 2. Preface
    • This is not rocket science
    • This is appropriate semantic technology
    • What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
  • 3. Disclaimer
    • We will be viewing uncontrolled vocabulary from the Web live
    • Sometimes it’s not pretty
    • Please don’t be offended
  • 4. The problem
    • Associating subject metadata with content and data is an old technique for improving precision and recall in search
    • Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
    • User tagging and the emergence of folksonomies have changed the economics of subject metadata creation but at the cost of quality
    • Can the two approaches to subject metadata be combined so as to keep the advantages of both while addressing their shortcomings?
  • 5. Taxonomies
    • A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
    • Advantages
      • Authoritative reference for terms and their relational semantics
      • Can support reasoning and classification
    • Disadvantages
      • Creation requires training and discipline
      • Expensive and slow to track changes in usage
    • Adoption
      • Pervasive for decades throughout the information science and IT communities
  • 6. Folksonomies
    • A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
    • Advantages
      • The cost of creation can be shared across many untrained users
      • Can track changes in usage in real time
    • Disadvantages
      • Lexical variations (misspellings, inconsistent case or white space)
      • Lack of relational semantics
      • Sense ambiguity
    • Adoption
      • Rapid growth on the Web (, Flickr) and emerging in enterprise pilots (IBM, DKW)
  • 7. It’s an old story: neats vs. scruffies
    • The taxonomy/thesaurus tradition is solid
    • But user-generated metadata is gold
    • A good solution should leverage aspects of both approaches
  • 8. Bridging the gap
    • The key ideas
      • User tagging gets tags into repository as “author keywords”
        • Ingested through RSS feeds with tagged items
      • Tags are related to terms in (separately defined) taxonomies
      • Users can search using one or the other or both
    • Result
      • Folksonomies make taxonomies more responsive
      • Taxonomies make folksonomies more responsible
  • 9. Example from DCMI Conference Thesaurus
  • 10. Building the bridge with ontologies
    • SKOS
      • Lexical vs. concept-based thesauri
      • Modeling taxonomies in SKOS
        • skos:Concept
        • skos:broader/skos:narrower
        • skos:related
    • Dublin Core (DC)
      • Basic asset metadata for modeling content creation
        • dc:creator
        • dc:dateSubmitted
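
The skos:broader/skos:narrower modeling on slide 10 can be illustrated with a minimal sketch. The concept names below are invented for illustration and are not taken from the presentation; plain Python dictionaries stand in for an RDF store.

```python
# Hypothetical mini-taxonomy: each concept maps to its direct skos:broader parents.
# The concept names are illustrative assumptions, not terms from the talk.
BROADER = {
    "jazz": ["music"],
    "music": ["arts"],
    "painting": ["arts"],
    "arts": [],
}


def narrower(concept):
    """Invert skos:broader to list the direct skos:narrower concepts."""
    return sorted(c for c, parents in BROADER.items() if concept in parents)


def broader_transitive(concept):
    """Collect every ancestor reachable by following skos:broader links."""
    seen = []
    stack = list(BROADER.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(BROADER.get(parent, []))
    return seen


print(narrower("arts"))
print(broader_transitive("jazz"))
```

Walking the broader links is what lets a search on a general term also retrieve items indexed with its more specific terms.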
  • 11. Modeling folksonomies in SKOS and DC
    • Represent each tag as a skos:Concept
    • The prefLabel of the concept is the tag
    • The item is skos:subjectOf the concept
    • The concept is skos:inScheme associated with the RSS channel
    • No broader/narrower/related relationships (at least initially)
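
The modeling rules on slide 11 could be sketched roughly as follows. This is an assumption-laden illustration: the function name `tag_triples`, the `#tag-` URI pattern, and the example URLs are hypothetical, and triples are plain tuples rather than entries in a real RDF store. The property names are copied from the slide as written.

```python
def tag_triples(item_url, tags, channel_url):
    """Build (subject, predicate, object) tuples for one tagged RSS item.

    The concept URI pattern (channel URL plus '#tag-<tag>') is a
    hypothetical convention chosen for this sketch.
    """
    triples = []
    for tag in tags:
        concept = f"{channel_url}#tag-{tag}"
        # Each tag is represented as a skos:Concept.
        triples.append((concept, "rdf:type", "skos:Concept"))
        # The prefLabel of the concept is the tag itself.
        triples.append((concept, "skos:prefLabel", tag))
        # The concept belongs to the scheme associated with the RSS channel.
        triples.append((concept, "skos:inScheme", channel_url))
        # The item is linked to the concept (property name as on the slide).
        triples.append((item_url, "skos:subjectOf", concept))
    return triples


for triple in tag_triples(
    "http://example.org/item/1", ["semweb", "tagging"], "http://example.org/feed"
):
    print(triple)
```

No broader, narrower, or related links are emitted, which matches the slide's point that tags start out flat.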
  • 12. Addressing the shortcomings
    • Reduce/eliminate lexical variation
      • Merge variants into a single concept using skos:prefLabel and skos:altLabel
    • Relate tags to terms and other tags
      • Tag the tags with categories
      • Place tags in time and space
        • The dc:dateSubmitted of the item is associated with its tags
        • Geolocation metadata can be added to concepts representing physical locations
      • Tags are related to other tags through shared skos:subjectOf relationships with items
    • Compensate for ambiguous tags with term indexing
      • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
    • Allow users to exploit tags and terms concurrently
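
One way to merge lexical variants into a single concept, as slide 12 suggests, is sketched below. The normalization rule (lowercase, then drop whitespace, hyphens, and underscores) and the choice of the most frequent spelling as the prefLabel are assumptions made for this example; the slides do not specify either, and misspellings are not handled here.

```python
import re
from collections import Counter, defaultdict


def normalize(tag):
    """Fold case and remove whitespace, hyphens, and underscores (assumed rule)."""
    return re.sub(r"[\s_\-]+", "", tag.lower())


def merge_variants(tags):
    """Group tag variants and pick one prefLabel plus altLabels per group."""
    groups = defaultdict(Counter)
    for tag in tags:
        groups[normalize(tag)][tag] += 1
    concepts = {}
    for key, counts in groups.items():
        # Most frequent spelling first; ties are broken alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        concepts[key] = {"prefLabel": ranked[0], "altLabels": ranked[1:]}
    return concepts


observed = ["Semantic Web", "semanticweb", "semantic web", "semanticweb", "rdf"]
for key, concept in merge_variants(observed).items():
    print(key, concept)
```

Keeping the variants as altLabels means a search on any of the original spellings still reaches the merged concept.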
  • 13. Social aspects
    • The role of the community of interest and focused collections of edge content
    • A virtuous circle where navigation and tagging continuously improve quality of subject indexing
    • A disruptive impact on the economics of knowledge management
    [Figure: Community of Interest cycle linking Content Producers and Content Consumers; labels: (Indexed) content, Tagged content, Navigation and tagging]
  • 14. Case studies and demonstrations
    • Environmental Health News
      • RSS item categorization
      • RSS/Atom into SKOS/FOAF/DC
    • BBC Rushes
      • Crosswalks
  • 15. Case study: Environmental Health News
    • Aggregating content from hundreds of Web pages daily
      • 10^5 Web pages
      • 10^3 originating sites
      • 10^1 editors
      • 10^4 subscribers
    • Adding value at the metadata level to the Web at large for a focused community of interest
      • Policy makers
      • Activists
      • Researchers
  • 16. Case study:
    • Aggregating feeds from social bookmarking site
      • 10^5 Web pages
      • 10^4 tags
      • 10^4 contributors
      • 10^4 originating sites
    • Combining user tagging with faceted navigation
      • “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
      • “… the most comprehensive tool for searching the database of”
      • “Siderean’s half-year test makes the narrowness of the service evident.”
  • 17. Case study: BBC rushes
    • Joint work with Accenture Technology Labs for TRECVID program
    • BBC Rushes: 49.3 hours of raw video
        • 4 issues of “Summer Holiday” (~ 2 hours)
        • BBC One News (30’) + fragment (~3’)
    • Faceted navigation using both textual and visual features
  • 18. Future work
    • (Semi)automatic folksonomy/taxonomy crosswalk generation
      • The notion of “relatedness”
        • By cooccurrence
        • By explicit warrant
    • Machine learning for tag sense disambiguation
      • Co-training using content that is simultaneously tagged and indexed
    • Tag spam filtering
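
Relatedness "by cooccurrence" from slide 18 might look like the following minimal sketch. It simply counts how many items each pair of tags shares; the `min_shared` threshold, the function names, and the sample data are all hypothetical, and a real crosswalk generator would likely need a more careful measure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(items):
    """Count, for every pair of tags, how many items carry both."""
    counts = Counter()
    for tags in items.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def related(tag, counts, min_shared=2):
    """Return tags sharing at least min_shared items with the given tag."""
    out = []
    for (a, b), n in counts.items():
        if n >= min_shared and tag in (a, b):
            out.append(b if a == tag else a)
    return sorted(out)


# Invented sample data: item id -> tags attached to it.
items = {
    "i1": ["rdf", "skos", "tagging"],
    "i2": ["rdf", "skos"],
    "i3": ["tagging", "flickr"],
}
counts = cooccurrence(items)
print(related("rdf", counts))
```

Pairs that clear the threshold are candidates for skos:related links, which an editor could then confirm or reject.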
  • 19. Siderean Software, Inc. 390 North Sepulveda Blvd., Suite 2070 El Segundo, CA 90245-4475 USA +1 310 647-4266 ballen at siderean dot com