Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)
Published in: Technology, Education
  • 1. Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach. Presentation to STC 2006. Brad Allen, Founder and CTO, Siderean Software, Inc.
  • 2. Preface
    • This is not rocket science
    • This is appropriate semantic technology
    • What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
  • 3. Disclaimer
    • We will be viewing uncontrolled vocabulary from the Web live
    • Sometimes it’s not pretty
    • Please don’t be offended
  • 4. The problem
    • Associating subject metadata with content and data is an old technique for improving precision and recall in search
    • Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
    • User tagging and the emergence of folksonomies have changed the economics of subject metadata creation but at the cost of quality
    • Can the two approaches to subject metadata be combined to gain the advantages of both while addressing their shortcomings?
  • 5. Taxonomies
    • A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
    • Advantages
      • Authoritative reference for terms and their relational semantics
      • Can support reasoning and classification
    • Disadvantages
      • Creation requires training and discipline
      • Expensive and slow to track changes in usage
    • Adoption
      • Pervasive for decades throughout the information science and IT communities
  • 6. Folksonomies
    • A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
    • Advantages
      • The cost of creation can be shared across many untrained users
      • Can track changes in usage in real-time
    • Disadvantages
      • Lexical variations (misspellings, inconsistent case or white space)
      • Lack of relational semantics
      • Sense ambiguity
    • Adoption
      • Rapid growth on the Web (, Flickr) and emerging in enterprise pilots (IBM, DKW)
  • 7. It’s an old story: neats vs. scruffies
    • The taxonomy/thesaurus tradition is solid
    • But user-generated metadata is gold
    • A good solution should leverage aspects of both approaches
  • 8. Bridging the gap
    • The key ideas
      • User tagging gets tags into repository as “author keywords”
        • Ingested through RSS feeds with tagged items
      • Tags are related to terms in (separately defined) taxonomies
      • Users can search using one or the other or both
    • Result
      • Folksonomies make taxonomies more responsive
      • Taxonomies make folksonomies more responsible
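As a rough illustration of "users can search using one or the other or both", the sketch below keeps user tags and editor-assigned taxonomy terms side by side and joins them through a small tag-to-term crosswalk. The item names, tags, terms, and the `search` helper are all invented for this example; they are assumptions, not part of the system described in the talk.

```python
# All data below is hypothetical and exists only to illustrate the idea.
ITEM_TAGS = {                      # user-supplied tags per item
    "item1": {"smog", "la"},
    "item2": {"airquality"},
    "item3": {"recipes"},
}
ITEM_TERMS = {                     # taxonomy terms assigned by an indexer
    "item2": {"Air pollution"},
}
CROSSWALK = {                      # tag -> taxonomy term
    "smog": "Air pollution",
    "airquality": "Air pollution",
}

def search(tag=None, term=None):
    """Return items matching a tag, a taxonomy term, or both."""
    results = set(ITEM_TAGS) | set(ITEM_TERMS)
    if tag is not None:
        results &= {i for i, tags in ITEM_TAGS.items() if tag in tags}
    if term is not None:
        # An item matches a term if it was indexed with it directly
        # or carries a tag that the crosswalk maps to it.
        via_index = {i for i, terms in ITEM_TERMS.items() if term in terms}
        via_tags = {
            i for i, tags in ITEM_TAGS.items()
            if any(CROSSWALK.get(t) == term for t in tags)
        }
        results &= via_index | via_tags
    return sorted(results)

print(search(term="Air pollution"))            # tagged-only items are found too
print(search(tag="la", term="Air pollution"))  # tag and term combined
```

In this sketch the crosswalk is what lets an item that only carries a user tag surface under the controlled term.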
  • 9. Example from DCMI Conference Thesaurus
  • 10. Building the bridge with ontologies
    • SKOS
      • Lexical vs. concept-based thesauri
      • Modeling taxonomies in SKOS
        • skos:Concept
        • skos:broader/skos:narrower
        • skos:related
    • Dublin Core (DC)
      • Basic asset metadata for modeling content creation
        • dc:creator
        • dc:dateSubmitted
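A minimal sketch of the SKOS modeling listed on slide 10: a tiny taxonomy stored as (subject, predicate, object) triples, with a helper that follows skos:broader links upward. The `ex:` concept identifiers and the `ancestors` helper are hypothetical; plain tuples stand in for a real RDF store.

```python
# Triples as plain tuples; the ex: identifiers are invented for illustration.
TAXONOMY = [
    ("ex:pollution", "rdf:type", "skos:Concept"),
    ("ex:air_pollution", "rdf:type", "skos:Concept"),
    ("ex:smog", "rdf:type", "skos:Concept"),
    ("ex:asthma", "rdf:type", "skos:Concept"),
    ("ex:air_pollution", "skos:broader", "ex:pollution"),
    ("ex:smog", "skos:broader", "ex:air_pollution"),
    ("ex:smog", "skos:related", "ex:asthma"),
]

def ancestors(concept, triples):
    """Return every concept reachable from `concept` by following skos:broader."""
    found = []
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "skos:broader" and o not in found:
                found.append(o)
                frontier.append(o)
    return found

print(ancestors("ex:smog", TAXONOMY))
```

Walking the broader links is what would let a search on a general term also pick up items indexed with its narrower terms.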
  • 11. Modeling folksonomies in SKOS and DC
    • Represent each tag as a skos:Concept
    • The prefLabel of the concept is the tag
    • The item is skos:subjectOf the concept
    • The concept is skos:inScheme a concept scheme associated with the RSS channel
    • No broader/narrower/related relationships (at least initially)
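The mapping on slide 11 can be sketched as a function that turns one tagged feed item into triples. This is an illustration under stated assumptions: the URI patterns for the concept scheme and the tag concepts, the function name `tag_triples`, and the example values are all hypothetical, and the property names are written exactly as they appear on the slides.

```python
from urllib.parse import quote

def tag_triples(item_uri, tags, channel_uri, creator, date_submitted):
    """Build (subject, predicate, object) triples for one tagged feed item."""
    scheme = channel_uri + "/tags"           # hypothetical concept scheme URI
    triples = [
        (item_uri, "dc:creator", creator),
        (item_uri, "dc:dateSubmitted", date_submitted),
    ]
    for tag in tags:
        concept = scheme + "/" + quote(tag)  # hypothetical concept URI pattern
        triples += [
            (concept, "rdf:type", "skos:Concept"),
            (concept, "skos:prefLabel", tag),       # the tag is the preferred label
            (concept, "skos:inScheme", scheme),     # one scheme per RSS channel
            (item_uri, "skos:subjectOf", concept),  # property name as on the slide
        ]
    return triples

triples = tag_triples(
    "http://example.org/item/1",
    ["semweb", "SF"],
    "http://example.org/feed",
    "alice",
    "2006-03-07",
)
for t in triples:
    print(t)
```

No broader, narrower, or related links are emitted, matching the "at least initially" note on the slide.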
  • 12. Addressing the shortcomings
    • Reduce/eliminate lexical variation
      • Merge variants into a single concept using skos:prefLabel and skos:altLabel
    • Relate tags to terms and other tags
      • Tag the tags with categories
      • Place tags in time and space
        • The dc:dateSubmitted of the item is associated with its tags
        • Geolocation metadata can be added to concepts representing physical locations
      • Tags are related to other tags through shared skos:subjectOf relationships with items
    • Compensate for ambiguous tags with term indexing
      • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
    • Allow users to exploit tags and terms concurrently
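One way the first remedy on slide 12 might look in practice: tags that differ only in case, white space, or punctuation are merged into a single concept, with the most used spelling as skos:prefLabel and the rest as skos:altLabel. The normalization rule, the "most used spelling wins" policy, and the sample counts are assumptions made for this sketch; true misspellings would need fuzzier matching than shown here.

```python
from collections import Counter, defaultdict

def normalize(tag):
    """Collapse case, white space, and punctuation so variants share a key."""
    return "".join(ch for ch in tag.lower() if ch.isalnum())

def merge_variants(tag_counts):
    """Group tags by normalized key; the most used spelling becomes prefLabel."""
    groups = defaultdict(Counter)
    for tag, count in tag_counts.items():
        groups[normalize(tag)][tag] += count
    concepts = {}
    for key, variants in groups.items():
        # Sort by descending usage, then alphabetically to break ties.
        ranked = sorted(variants.items(), key=lambda kv: (-kv[1], kv[0]))
        concepts[key] = {
            "skos:prefLabel": ranked[0][0],
            "skos:altLabel": [t for t, _ in ranked[1:]],
        }
    return concepts

# Hypothetical usage counts for a handful of tags.
counts = {"semantic web": 40, "SemanticWeb": 25, "semanticweb": 90, "rdf": 70, "RDF": 12}
merged = merge_variants(counts)
print(merged["semanticweb"])
print(merged["rdf"])
```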
  • 13. Social aspects
    • The role of the community of interest and focused collections of edge content
    • A virtuous circle where navigation and tagging continuously improve quality of subject indexing
    • A disruptive impact on the economics of knowledge management
    [Figure: Community of Interest diagram. Labels: Content Consumers; Content Producers; (Indexed) content; Tagged content; Navigation and tagging (twice)]
  • 14. Case studies and demonstrations
    • Environmental Health News
      • RSS item categorization
      • RSS/Atom into SKOS/FOAF/DC
    • BBC Rushes
      • Crosswalks
  • 15. Case study: Environmental Health News
    • Aggregating content from hundreds of Web pages daily
      • 10^5 Web pages
      • 10^3 originating sites
      • 10^1 editors
      • 10^4 subscribers
    • Adding value at the metadata level to the Web at large for a focused community of interest
      • Policy makers
      • Activists
      • Researchers
  • 16. Case study:
    • Aggregating feeds from social bookmarking site
      • 10^5 Web pages
      • 10^4 tags
      • 10^4 contributors
      • 10^4 originating sites
    • Combining user tagging with faceted navigation
      • “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
      • “… the most comprehensive tool for searching the database of”
      • “Siderean’s half-year test makes the narrowness of the service evident.”
  • 17. Case study: BBC Rushes
    • Joint work with Accenture Technology Labs for the TRECVID program
    • BBC Rushes: 49.3 hours of raw video
      • 4 issues of “Summer Holiday” (~2 hours)
      • BBC One News (30’) + fragment (~3’)
    • Faceted navigation using both textual and visual features
  • 18. Future work
    • (Semi)automatic folksonomy/taxonomy crosswalk generation
      • The notion of “relatedness”
        • By co-occurrence
        • By explicit warrant
    • Machine learning for tag sense disambiguation
      • Co-training using content that is simultaneously tagged and indexed
    • Tag spam filtering
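"Relatedness by co-occurrence" can be sketched as counting how often two tags appear on the same item and treating frequent pairs as crosswalk candidates. The sample items, the `min_count` threshold, and both function names are hypothetical; the talk does not specify a particular measure, so raw pair counts are used here as the simplest stand-in.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in items.values():
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def related(tag, pairs, min_count=2):
    """Tags co-occurring with `tag` at least `min_count` times, most frequent first."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if n >= min_count and tag in (a, b):
            scores[b if a == tag else a] = n
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical tagged items.
ITEMS = {
    "item1": ["rdf", "semweb", "skos"],
    "item2": ["rdf", "semweb"],
    "item3": ["rdf", "skos"],
    "item4": ["semweb", "tagging"],
}
pairs = cooccurrence(ITEMS)
print(related("rdf", pairs))
```

Candidates produced this way would still need review (the "explicit warrant" case) before being recorded as skos:related links.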
  • 19. Siderean Software, Inc., 390 North Sepulveda Blvd., Suite 2070, El Segundo, CA 90245-4475 USA, +1 310 647-4266, ballen at siderean dot com