Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)









© All Rights Reserved


    Presentation Transcript

    • Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach. Presentation to STC 2006. Brad Allen, Founder and CTO, Siderean Software, Inc.
    • Preface
      • This is not rocket science
      • This is appropriate semantic technology
      • What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
    • Disclaimer
      • We will be viewing uncontrolled vocabulary from the Web live
      • Sometimes it’s not pretty
      • Please don’t be offended
    • The problem
      • Associating subject metadata with content and data is an old technique for improving precision and recall in search
      • Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
      • User tagging and the emergence of folksonomies have changed the economics of subject metadata creation, but at the cost of quality
      • Can the two approaches to subject metadata be combined to yield the advantages of both while addressing their shortcomings?
    • Taxonomies
      • A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
      • Advantages
        • Authoritative reference for terms and their relational semantics
        • Can support reasoning and classification
      • Disadvantages
        • Creation requires training and discipline
        • Expensive and slow to track changes in usage
      • Adoption
        • Pervasive for decades throughout the information science and IT communities
    • Folksonomies
      • A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
      • Advantages
        • The cost of creation can be shared across many untrained users
        • Can track changes in usage in real time
      • Disadvantages
        • Lexical variations (misspellings, inconsistent case or white space)
        • Lack of relational semantics
        • Sense ambiguity
      • Adoption
        • Rapid growth on the Web (del.icio.us, Flickr) and emerging in enterprise pilots (IBM, DKW)
    • It’s an old story: neats vs. scruffies
      • The taxonomy/thesaurus tradition is solid
      • But user-generated metadata is gold
      • A good solution should leverage aspects of both approaches
    • Bridging the gap
      • The key ideas
        • User tagging gets tags into the repository as “author keywords”
          • Ingested through RSS feeds with tagged items
        • Tags are related to terms in (separately defined) taxonomies
        • Users can search using one or the other or both
      • Result
        • Folksonomies make taxonomies more responsive
        • Taxonomies make folksonomies more responsible
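The key ideas above (tags related to separately defined taxonomy terms, with search by either or both) can be sketched in a few lines. This is only an illustration under stated assumptions: the sample items, the tag-to-term crosswalk, and the function names are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

# Hypothetical sample data: item id -> user tags, as ingested from a feed.
ITEM_TAGS = {
    "item1": {"sf", "food"},
    "item2": {"scifi", "books"},
    "item3": {"sanfrancisco"},
}

# Hypothetical crosswalk: folksonomy tag -> taxonomy term.
TAG_TO_TERM = {
    "sf": "San Francisco",
    "sanfrancisco": "San Francisco",
    "scifi": "Science fiction",
}


def build_indexes(item_tags, tag_to_term):
    """Index every item by its raw tags and by the terms those tags map to."""
    by_tag = defaultdict(set)
    by_term = defaultdict(set)
    for item, tags in item_tags.items():
        for tag in tags:
            by_tag[tag].add(item)
            term = tag_to_term.get(tag)
            if term is not None:
                by_term[term].add(item)
    return by_tag, by_term


def search(by_tag, by_term, tag=None, term=None):
    """Return items matching the tag, the term, or both when both are given."""
    results = []
    if tag is not None:
        results.append(by_tag.get(tag, set()))
    if term is not None:
        results.append(by_term.get(term, set()))
    if not results:
        return set()
    return set.intersection(*results)


BY_TAG, BY_TERM = build_indexes(ITEM_TAGS, TAG_TO_TERM)
```

Searching by the term "San Francisco" finds items tagged either "sf" or "sanfrancisco", which is the sense in which the taxonomy adds consistency to the tags, while untouched tags such as "books" remain searchable on their own.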
    • Example from DCMI Conference Thesaurus
    • Building the bridge with ontologies
      • SKOS
        • Lexical vs. concept-based thesauri
        • Modeling taxonomies in SKOS
          • skos:Concept
          • skos:broader/skos:narrower
          • skos:related
      • Dublin Core (DC)
        • Basic asset metadata for modeling content creation
          • dc:creator
          • dc:dateSubmitted
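As a rough illustration of the SKOS modeling listed above, the sketch below holds concepts as plain Python objects linked by broader/narrower relations. The class, the helper names, and the three sample terms are hypothetical; a real deployment would more likely use an RDF store.

```python
from dataclasses import dataclass, field


# eq=False and repr=False avoid infinite recursion, because broader and
# narrower links point at each other.
@dataclass(eq=False, repr=False)
class Concept:
    pref_label: str
    broader: list = field(default_factory=list)   # stands in for skos:broader
    narrower: list = field(default_factory=list)  # stands in for skos:narrower
    related: list = field(default_factory=list)   # stands in for skos:related


def add_broader(child, parent):
    """Record skos:broader and its inverse skos:narrower together."""
    child.broader.append(parent)
    parent.narrower.append(child)


def ancestor_labels(concept):
    """Walk skos:broader links upward and collect the labels encountered."""
    seen = []
    stack = list(concept.broader)
    while stack:
        current = stack.pop()
        if current.pref_label not in seen:
            seen.append(current.pref_label)
            stack.extend(current.broader)
    return seen


# Hypothetical three-level taxonomy.
knowledge_org = Concept("Knowledge organization")
controlled_vocab = Concept("Controlled vocabularies")
thesauri = Concept("Thesauri")
add_broader(controlled_vocab, knowledge_org)
add_broader(thesauri, controlled_vocab)
```

Walking the broader links is the simplest form of the reasoning that explicit relations make possible: an item indexed under "Thesauri" can also be retrieved under either of its ancestors.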
    • Modeling folksonomies in SKOS and DC
      • Represent each tag as a skos:Concept
      • The prefLabel of the concept is the tag
      • The item is skos:subjectOf the concept
      • The concept is skos:inScheme a concept scheme associated with the RSS channel
      • No broader/narrower/related relationships (at least initially)
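The modeling rules above can be read as a small translation from a tagged feed item into subject-predicate-object triples. The sketch below is an assumption-laden illustration: triples are plain tuples, the property names are copied from the slides as strings, and the URI patterns and the example.org addresses are invented for the example.

```python
def item_to_triples(channel_uri, item_uri, tags, creator, date_submitted):
    """Translate one tagged feed item into (subject, predicate, object) tuples.

    The "#scheme" and "#tag-" URI patterns are hypothetical conventions.
    Tags are used verbatim, so tags containing spaces would need escaping.
    """
    scheme = f"{channel_uri}#scheme"
    triples = [
        (item_uri, "dc:creator", creator),
        (item_uri, "dc:dateSubmitted", date_submitted),
    ]
    for tag in tags:
        concept = f"{channel_uri}#tag-{tag}"
        triples += [
            (concept, "rdf:type", "skos:Concept"),    # each tag is a concept
            (concept, "skos:prefLabel", tag),         # the tag is the label
            (concept, "skos:inScheme", scheme),       # scheme tied to channel
            (item_uri, "skos:subjectOf", concept),    # property name as on the slide
        ]
    return triples


TRIPLES = item_to_triples(
    channel_uri="http://example.org/feed",
    item_uri="http://example.org/feed/item1",
    tags=["rdf", "skos"],
    creator="alice",
    date_submitted="2006-03-07",
)
```

No broader, narrower, or related triples are emitted, matching the last point above: the tags start out flat and acquire relations only later.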
    • Addressing the shortcomings
      • Reduce/eliminate lexical variation
        • Merge variants into a single concept using skos:prefLabel and skos:altLabel
      • Relate tags to terms and other tags
        • Tag the tags with categories
        • Place tags in time and space
          • The dc:dateSubmitted of the item is associated with its tags
          • Geolocation metadata can be added to concepts representing physical locations
        • Tags are related to other tags through shared skos:subjectOf relationships with items
      • Compensate for ambiguous tags with term indexing
        • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
      • Allow users to exploit tags and terms concurrently
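One way to picture the variant-merging step is sketched below: tags that differ only in case, white space, underscores, or hyphens are grouped, the most frequent spelling becomes the preferred label, and the rest become alternate labels. The normalization rule, the function names, and the counts are assumptions for illustration; the slides do not say how variants are detected, and this sketch does not catch misspellings.

```python
import re
from collections import Counter, defaultdict


def normalize(tag):
    """Collapse case, white space, underscores, and hyphens into one key."""
    return re.sub(r"[\s_\-]+", "", tag.strip().lower())


def merge_variants(tag_counts):
    """Group variant spellings; the most used one becomes the prefLabel."""
    groups = defaultdict(Counter)
    for tag, count in tag_counts.items():
        groups[normalize(tag)][tag] += count
    concepts = []
    for key in sorted(groups):
        ranked = [tag for tag, _ in groups[key].most_common()]
        concepts.append({"prefLabel": ranked[0], "altLabels": sorted(ranked[1:])})
    return concepts


# Hypothetical usage counts for raw tags.
CONCEPTS = merge_variants(
    {"SemanticWeb": 5, "semantic web": 12, "semantic_web": 3, "folksonomy": 7}
)
```

Here the three spellings of "semantic web" collapse into a single concept, so a search on any of them reaches the same set of items.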
    • Social aspects
      • The role of the community of interest and focused collections of edge content
      • A virtuous circle where navigation and tagging continuously improve quality of subject indexing
      • A disruptive impact on the economics of knowledge management
      [Diagram: Community of Interest. Labels: Content Consumers, Content Producers, (Indexed) content, Tagged content, Navigation and tagging]
    • Case studies and demonstrations
      • Environmental Health News
        • RSS item categorization
      • fac.etio.us
        • RSS/Atom into SKOS/FOAF/DC
      • BBC Rushes
        • Crosswalks
    • Case study: Environmental Health News
      • Aggregating content from hundreds of Web pages daily
        • 10^5 Web pages
        • 10^3 originating sites
        • 10^1 editors
        • 10^4 subscribers
      • Adding value at the metadata level to the Web at large for a focused community of interest
        • Policy makers
        • Activists
        • Researchers
    • Case study: fac.etio.us
      • Aggregating feeds from del.icio.us social bookmarking site
        • 10^5 Web pages
        • 10^4 tags
        • 10^4 contributors
        • 10^4 originating sites
      • Combining user tagging with faceted navigation
        • “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
        • “… the most comprehensive tool for searching the database of del.icio.us.”
        • “Siderean’s half-year test makes the narrowness of the del.icio.us service evident.”
    • Case study: BBC Rushes
      • Joint work with Accenture Technology Labs for TRECVID program
      • BBC Rushes: 49.3 hours of raw video
          • 4 issues of “Summer Holiday” (~ 2 hours)
          • BBC One News (30’) + fragment (~3’)
      • Faceted navigation using both textual and visual features
    • Future work
      • (Semi)automatic folksonomy/taxonomy crosswalk generation
        • The notion of “relatedness”
          • By co-occurrence
          • By explicit warrant
      • Machine learning for tag sense disambiguation
        • Co-training using content that is simultaneously tagged and indexed
      • Tag spam filtering
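Relatedness by co-occurrence, listed above as future work, could be approximated by scoring pairs of tags on the items they share. The sketch below uses Jaccard overlap, which is a choice made for this example and not a measure named in the presentation; the sample data and function name are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tag_relatedness(item_tags):
    """Score tag pairs by Jaccard overlap of the item sets they appear on."""
    items_by_tag = defaultdict(set)
    for item, tags in item_tags.items():
        for tag in tags:
            items_by_tag[tag].add(item)
    scores = {}
    for a, b in combinations(sorted(items_by_tag), 2):
        shared = items_by_tag[a] & items_by_tag[b]
        if shared:  # keep only pairs that co-occur at least once
            union = items_by_tag[a] | items_by_tag[b]
            scores[(a, b)] = len(shared) / len(union)
    return scores


# Hypothetical tagged items.
SCORES = tag_relatedness(
    {
        "item1": {"rdf", "skos"},
        "item2": {"rdf", "skos", "owl"},
        "item3": {"rdf"},
    }
)
```

Pairs scoring above some threshold could be proposed as candidate skos:related links or as crosswalk suggestions for an editor to confirm, which keeps a person in the loop for the "explicit warrant" case.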
    • Siderean Software, Inc., 390 North Sepulveda Blvd., Suite 2070, El Segundo, CA 90245-4475 USA; +1 310 647-4266; http://www.siderean.com; ballen at siderean dot com