Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)

  • Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach. Presentation to STC 2006. Brad Allen, Founder and CTO, Siderean Software, Inc.
  • Preface
    • This is not rocket science
    • This is appropriate semantic technology
    • What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
  • Disclaimer
    • We will be viewing uncontrolled vocabulary from the Web live
    • Sometimes it’s not pretty
    • Please don’t be offended
  • The problem
    • Associating subject metadata with content and data is an old technique for improving precision and recall in search
    • Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
    • User tagging and the emergence of folksonomies have changed the economics of subject metadata creation but at the cost of quality
    • Can the two approaches to subject metadata be combined into one that offers the advantages of both while addressing their shortcomings?
  • Taxonomies
    • A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
    • Advantages
      • Authoritative reference for terms and their relational semantics
      • Can support reasoning and classification
    • Disadvantages
      • Creation requires training and discipline
      • Expensive and slow to track changes in usage
    • Adoption
      • Pervasive for decades throughout the information science and IT communities
  • Folksonomies
    • A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
    • Advantages
      • The cost of creation can be shared across many untrained users
      • Can track changes in usage in real-time
    • Disadvantages
      • Lexical variations (misspellings, inconsistent case or white space)
      • Lack of relational semantics
      • Sense ambiguity
    • Adoption
      • Rapid growth on the Web (, Flickr) and emerging in enterprise pilots (IBM, DKW)
  • It’s an old story: neats vs. scruffies
    • The taxonomy/thesaurus tradition is solid
    • But user-generated metadata is gold
    • A good solution should leverage aspects of both approaches
  • Bridging the gap
    • The key ideas
      • User tagging gets tags into the repository as “author keywords”
        • Ingested through RSS feeds with tagged items
      • Tags are related to terms in (separately defined) taxonomies
      • Users can search using one or the other or both
    • Result
      • Folksonomies make taxonomies more responsive
      • Taxonomies make folksonomies more responsible
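The "search using one or the other or both" idea can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the item ids, tags, terms, and the `TAG_TO_TERMS` crosswalk are hypothetical data invented for the example.

```python
# Hypothetical repository contents: user tags and editor-assigned taxonomy terms.
ITEM_TAGS = {
    "item1": {"semweb", "rdf"},
    "item2": {"rdf"},
    "item3": {"tagging"},
}
ITEM_TERMS = {
    "item2": {"Semantic Web"},
}
# Hypothetical crosswalk relating tags to separately defined taxonomy terms.
TAG_TO_TERMS = {
    "semweb": {"Semantic Web"},
    "tagging": {"Social Tagging"},
}


def terms_for(item):
    """Terms assigned directly to the item plus terms reached through its tags."""
    terms = set(ITEM_TERMS.get(item, set()))
    for tag in ITEM_TAGS.get(item, set()):
        terms |= TAG_TO_TERMS.get(tag, set())
    return terms


def search(tag=None, term=None):
    """Return sorted item ids matching the tag, the term, or both when both are given."""
    hits = []
    for item in sorted(set(ITEM_TAGS) | set(ITEM_TERMS)):
        if tag is not None and tag not in ITEM_TAGS.get(item, set()):
            continue
        if term is not None and term not in terms_for(item):
            continue
        hits.append(item)
    return hits
```

With this data, `search(term="Semantic Web")` returns both the item an editor indexed and the item a user tagged `semweb`, which is one way to read "folksonomies make taxonomies more responsive".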
  • Example from DCMI Conference Thesaurus
  • Building the bridge with ontologies
    • SKOS
      • Lexical vs. concept-based thesauri
      • Modeling taxonomies in SKOS
        • skos:Concept
        • skos:broader/skos:narrower
        • skos:related
    • Dublin Core (DC)
      • Basic asset metadata for modeling content creation
        • dc:creator
        • dc:dateSubmitted
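As a rough sketch of what modeling a taxonomy with broader/narrower links buys you, the example below stores only the broader link and derives the rest. The concept names are hypothetical, and each concept is assumed to have a single broader concept to keep the example short (SKOS itself allows several).

```python
# Hypothetical taxonomy: concept -> its broader concept.
BROADER = {
    "RDF": "Semantic Web",
    "SKOS": "Semantic Web",
    "Semantic Web": "Information Science",
}


def narrower(concept):
    """Derive the narrower concepts as the inverse of the broader links."""
    return sorted(c for c, b in BROADER.items() if b == concept)


def ancestors(concept):
    """Follow broader links upward; stop at the top or if a cycle is detected."""
    chain = []
    seen = {concept}
    while concept in BROADER:
        concept = BROADER[concept]
        if concept in seen:
            break
        seen.add(concept)
        chain.append(concept)
    return chain
```

Walking the chain is a simple form of the reasoning a taxonomy can support: an item indexed with "RDF" can also be retrieved under "Semantic Web" or "Information Science".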
  • Modeling folksonomies in SKOS and DC
    • Represent each tag as a skos:Concept
    • The prefLabel of the concept is the tag
    • The item is skos:subjectOf the concept
    • The concept is skos:inScheme associated with the RSS channel
    • No broader/narrower/related relationships (at least initially)
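The modeling rules above can be sketched as plain subject-predicate-object tuples. Property names are copied from the slides as written and should be checked against the SKOS version in use. The function name, the concept URI pattern, and the use of the channel URI as the concept scheme are assumptions made for illustration.

```python
from urllib.parse import quote


def tag_triples(channel_uri, item_uri, tags, creator, date_submitted):
    """Build triples for one tagged RSS item (hypothetical helper)."""
    triples = [
        (item_uri, "dc:creator", creator),
        (item_uri, "dc:dateSubmitted", date_submitted),
    ]
    for tag in tags:
        # Assumed URI pattern: one concept per tag, scoped to the channel.
        concept = f"{channel_uri}#tag-{quote(tag)}"
        triples += [
            (concept, "rdf:type", "skos:Concept"),
            (concept, "skos:prefLabel", tag),  # the tag is the preferred label
            (concept, "skos:inScheme", channel_uri),  # scheme tied to the channel
            (item_uri, "skos:subjectOf", concept),  # as written on the slide
        ]
    return triples
```

No broader, narrower, or related links are emitted, which matches the "at least initially" rule: each tag starts life as an isolated concept.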
  • Addressing the shortcomings
    • Reduce/eliminate lexical variation
      • Merge variants into a single concept using skos:prefLabel and skos:altLabel
    • Relate tags to terms and other tags
      • Tag the tags with categories
      • Place tags in time and space
        • The dc:dateSubmitted of the item is associated with its tags
        • Geolocation metadata can be added to concepts representing physical locations
      • Tags are related to other tags through shared skos:subjectOf relationships with items
    • Compensate for ambiguous tags with term indexing
      • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
    • Allow users to exploit tags and terms concurrently
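One possible way to merge lexical variants into a single concept is sketched below. The normalization rule (ignore case, white space, hyphens, and underscores) and the choice of the most frequent variant as the preferred label are assumptions made for this example. It does not catch misspellings, which would need something like edit-distance matching.

```python
import re
from collections import Counter, defaultdict


def normalize(tag):
    """Collapse case, white space, hyphen, and underscore differences."""
    return re.sub(r"[\s_\-]+", "", tag).casefold()


def merge_variants(tags):
    """Group tag variants; return {key: {"prefLabel": ..., "altLabel": [...]}}."""
    groups = defaultdict(Counter)
    for tag in tags:
        groups[normalize(tag)][tag.strip()] += 1
    concepts = {}
    for key, counts in groups.items():
        # Most frequent variant first; ties are broken alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        concepts[key] = {"prefLabel": ranked[0], "altLabel": ranked[1:]}
    return concepts
```

Keeping the losing variants as alternative labels means a search on any spelling still reaches the merged concept.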
  • Social aspects
    • The role of the community of interest and focused collections of edge content
    • A virtuous circle where navigation and tagging continuously improve quality of subject indexing
    • A disruptive impact on the economics of knowledge management
    [Figure: virtuous-circle diagram. Labels: Community of Interest; Content Producers; Content Consumers; (Indexed) content; Tagged content; Navigation and tagging (shown twice)]
  • Case studies and demonstrations
    • Environmental Health News
      • RSS item categorization
      • RSS/Atom into SKOS/FOAF/DC
    • BBC Rushes
      • Crosswalks
  • Case study: Environmental Health News
    • Aggregating content from hundreds of Web pages daily
      • 10^5 Web pages
      • 10^3 originating sites
      • 10^1 editors
      • 10^4 subscribers
    • Adding value at the metadata level to the Web at large for a focused community of interest
      • Policy makers
      • Activists
      • Researchers
  • Case study:
    • Aggregating feeds from a social bookmarking site
      • 10^5 Web pages
      • 10^4 tags
      • 10^4 contributors
      • 10^4 originating sites
    • Combining user tagging with faceted navigation
      • “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
      • “… the most comprehensive tool for searching the database of”
      • “Siderean’s half-year test makes the narrowness of the service evident.”
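The click-by-click drill-down described in the first quote can be illustrated with a small faceted-navigation sketch. The item records, the facet names (`tag`, `site`), and both helper functions are hypothetical. The product's actual data model is not described in the slides.

```python
from collections import Counter


def facet_counts(items, facet):
    """Count how many of the current items carry each value of a facet."""
    counts = Counter()
    for item in items:
        counts.update(item.get(facet, ()))
    return counts


def drill_down(items, facet, value):
    """Keep only the items that carry the chosen facet value (one click)."""
    return [item for item in items if value in item.get(facet, ())]


# Hypothetical bookmarked items with user tags and originating sites.
ITEMS = [
    {"id": 1, "tag": ["rdf", "semweb"], "site": ["a.example"]},
    {"id": 2, "tag": ["rdf"], "site": ["b.example"]},
    {"id": 3, "tag": ["photo"], "site": ["a.example"]},
]
```

Each click narrows the result set, and the facet counts are recomputed on what remains, so the user only sees choices that lead to at least one item.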
  • Case study: BBC Rushes
    • Joint work with Accenture Technology Labs for TRECVID program
    • BBC Rushes: 49.3 hours of raw video
        • 4 issues of “Summer Holiday” (~ 2 hours)
        • BBC One News (30’) + fragment (~3’)
    • Faceted navigation using both textual and visual features
  • Future work
    • (Semi)automatic folksonomy/taxonomy crosswalk generation
      • The notion of “relatedness”
        • By co-occurrence
        • By explicit warrant
    • Machine learning for tag sense disambiguation
      • Co-training using content that is simultaneously tagged and indexed
    • Tag spam filtering
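Relatedness by co-occurrence could be approximated as below, using items that are both tagged and indexed. This is only a sketch under stated assumptions: the Jaccard overlap measure, the 0.5 threshold, and the sample data are choices made for the example and are not taken from the presentation.

```python
from collections import defaultdict


def suggest_crosswalk(items, threshold=0.5):
    """Suggest (tag, term, score) links from items given as (tags, terms) pairs."""
    tag_items = defaultdict(set)
    term_items = defaultdict(set)
    for idx, (tags, terms) in enumerate(items):
        for tag in tags:
            tag_items[tag].add(idx)
        for term in terms:
            term_items[term].add(idx)
    suggestions = []
    for tag, tagged in tag_items.items():
        for term, indexed in term_items.items():
            # Jaccard overlap: shared items divided by all items carrying either.
            score = len(tagged & indexed) / len(tagged | indexed)
            if score >= threshold:
                suggestions.append((tag, term, round(score, 2)))
    # Strongest suggestions first, then alphabetical for a stable order.
    return sorted(suggestions, key=lambda s: (-s[2], s[0], s[1]))


# Hypothetical items that carry both user tags and taxonomy terms.
SAMPLE = [
    ({"sf", "bridge"}, {"San Francisco"}),
    ({"sf"}, {"San Francisco"}),
    ({"sf", "novel"}, {"Science Fiction"}),
]
```

On the sample data the ambiguous tag "sf" is linked to "San Francisco" but not to "Science Fiction", which hints at why suggestions like these would still need review ("explicit warrant") before entering a crosswalk.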
  • Siderean Software, Inc. 390 North Sepulveda Blvd., Suite 2070 El Segundo, CA 90245-4475 USA +1 310 647-4266 ballen at siderean dot com