Commodity Semantic Search: A Case Study of DiscoverEd

Usage rights: CC Attribution License

  • Good afternoon. My name is Nathan Yergler, and I'm Chief Technology Officer at Creative Commons. This afternoon I'm going to talk about DiscoverEd, a semantically enhanced search engine for education that we've been working on. It's built on commodity hardware and open source tools, and the software can be used for other domains. I'm going to talk about some approaches we tried and rejected, and give you some information on tools you can use to build your own semantic search without investing in your own server farm.

Presentation Transcript

  • Commodity Semantic Search: A Case Study of DiscoverEd Nathan R. Yergler Creative Commons Semantic Technology Conference 24 June 2010
  • share, reuse, and remix— legally
  • Creative Commons provides legal and technical tools that make sharing easy, legal, and scalable.
  • <a href="" rel="license">Attribution 3.0 Unported</a>
  • <rdf:RDF xmlns:cc='' xmlns:foaf='' xmlns:rdf='' xmlns:dc='' xmlns:dcq=''> <cc:License rdf:about=""> <cc:permits rdf:resource=""/> <cc:permits rdf:resource=""/> <cc:permits rdf:resource=""/> <cc:requires rdf:resource=""/> <cc:requires rdf:resource=""/> <cc:legalcode rdf:resource=""/> <dcq:hasVersion>3.0</dcq:hasVersion> <foaf:logo rdf:resource=""/> <foaf:logo rdf:resource=""/> <cc:licenseClass rdf:resource=""/> <dc:creator rdf:resource=""/> </cc:License> </rdf:RDF>
  • CC Rights Expression Language
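
A crawler can pick up the rel="license" marking shown on the earlier slide with very little code. The following is a minimal sketch using only the Python standard library; the class and function names are illustrative, the sample page is made up, and it handles only the simple rel="license" link case, not full RDFa.

```python
from html.parser import HTMLParser


class LicenseLinkParser(HTMLParser):
    """Collect href values of <a>/<link> elements marked rel="license"."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="license nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "license" in rel_tokens and attr_map.get("href"):
            self.licenses.append(attr_map["href"])


def find_licenses(html):
    """Return every license URL asserted in the given HTML string."""
    parser = LicenseLinkParser()
    parser.feed(html)
    return parser.licenses


# Made-up page fragment; the href is an example value chosen for this sketch.
page = (
    '<p>Licensed under <a href="https://creativecommons.org/licenses/by/3.0/" '
    'rel="license">Attribution 3.0 Unported</a></p>'
)
print(find_licenses(page))
```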
  • CC licenses are based on international copyright law
  • There are hundreds of millions of pieces of CC-licensed content on the web
  • OER
    • “Open Educational Resources”
    • Learning materials that are freely available to use, remix, and redistribute.
    • Wide variety of format, content types, audience
    • CC licenses make this content interoperable
  • But how do you find OER you’re looking for?
  • OER Search == CC Search++
    • Similarities
      • No central registry or repository
      • It's up to publishers to label their works
    • And additional interesting problems
      • Differing views on what makes it “Educational”
      • Additional facets – subject, language, etc
  • A Model for OER Search
    • Curators identify educational resources
      • Curators optionally add metadata
      • A Curator may also be the Publisher
      • Or a Curator may add metadata to someone else’s resources
  • A Model for OER Search (2)
    • Ingest resource lists, metadata via RSS/Atom feeds, OAI-PMH
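
As a rough illustration of the feed-ingest step, the sketch below reads an Atom feed and pulls out each entry's URL, title, and category terms. It is not the DiscoverEd ingest code: the function name, the sample feed, and the choice to treat category terms as curator metadata are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def resources_from_atom(feed_xml):
    """Return a (url, title, category_terms) tuple for each Atom entry."""
    root = ET.fromstring(feed_xml)
    resources = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        # Atom treats a link without rel as rel="alternate".
        url = None
        for link in entry.findall(ATOM + "link"):
            if link.get("rel", "alternate") == "alternate":
                url = link.get("href")
                break
        if url is None:
            continue  # nothing to crawl for this entry
        terms = [
            c.get("term")
            for c in entry.findall(ATOM + "category")
            if c.get("term")
        ]
        resources.append((url, title, terms))
    return resources


# Made-up curator feed used only for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example curator feed</title>
  <entry>
    <title>Intro to Cell Biology</title>
    <link href="http://example.org/oer/cell-biology"/>
    <category term="biology"/>
  </entry>
</feed>"""

for url, title, terms in resources_from_atom(SAMPLE_FEED):
    print(url, title, terms)
```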
  • Curators & Feeds
  • Two Prototypes
    • Google CSE
    • Nutch
  • Initial effort: Google CSE
    • Google Custom Search Engine allows you to “create a search engine for a website or a collection of interesting websites.”
      • Define resource patterns for inclusion
      • Optionally include annotations – facets and labels
    • Python scripts to consume resource lists
    • Output XML suitable for Google CSE
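
The scripts themselves are not shown in the slides. A sketch of the output side might look like the following, assuming the CSE annotations layout of an Annotations root containing Annotation elements with an about pattern and nested Label elements. Those element names are written from memory and should be checked against Google's documentation; the function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_annotations(resources):
    """Build annotations XML from (url_pattern, [label, ...]) pairs.

    Element and attribute names are an assumption about the CSE
    annotations format, not taken from the original scripts.
    """
    root = ET.Element("Annotations")
    for pattern, labels in resources:
        annotation = ET.SubElement(root, "Annotation", about=pattern)
        for label in labels:
            ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")


# Hypothetical resource list: one URL pattern with two labels.
print(build_annotations([("example.org/oer/*", ["oer", "biology"])]))
```

Emitting one Annotation per resource is what makes long resource lists unwieldy, which is consistent with the scaling problem described on the next slide.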
  • Scaling with CSE
    • Lists of individual resources did not scale well
    • Labels and Facets worked best with fixed, limited vocabulary
    • License-filtered search unavailable
  • Nutch-based Prototype
    • Nutch + Jena triple store
    • Simple scripts for generating seeds from the store
    • IndexingFilter plugin for injecting metadata into Nutch index
    • QueryFilter plugins for field-specific searches
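
Seed generation can be sketched as follows. A Nutch seed list is a plain text file with one URL per line; here a list of (subject, predicate, object) tuples stands in for the Jena store, and the function names and sample triples are hypothetical.

```python
def seed_urls(triples):
    """Return the distinct http(s) subjects of the triples, in first-seen order."""
    seen = set()
    seeds = []
    for subject, _predicate, _obj in triples:
        if subject.startswith(("http://", "https://")) and subject not in seen:
            seen.add(subject)
            seeds.append(subject)
    return seeds


def write_seed_file(triples, path):
    """Write one seed URL per line, the layout a Nutch seed list uses."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in seed_urls(triples):
            handle.write(url + "\n")


# Stand-in for the triple store: made-up statements about two resources.
TRIPLES = [
    ("http://example.org/oer/cell-biology", "http://purl.org/dc/terms/subject", "biology"),
    ("http://example.org/oer/cell-biology", "http://purl.org/dc/terms/language", "en"),
    ("http://example.org/oer/algebra", "http://purl.org/dc/terms/subject", "math"),
    ("_:blank1", "http://purl.org/dc/terms/subject", "ignored"),
]

print(seed_urls(TRIPLES))
```

Because only resources named by curators end up in the seed list, the crawl stays narrow, which matches the "very directed crawl" noted in the prototype results.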
  • DiscoverEd (Nutch)
  • Prototype Results
    • Curator model allows for very directed crawl
      • Low cost, not very resource intensive
    • Scale
      • Flexibly filter on predicate values
    • Limitations
      • Provenance for curator metadata
      • Predicate filters had to be “hand-crafted”
  • Current DiscoverEd Work
  • “We're Open”
    • Education is our test domain, but the tool can be generally useful
    • Other organizations have expressed interest in using the DiscoverEd software
    • Making code available on Gitorious
  • Provenance
    • Initial work complete on storing provenance for curator metadata
    • Working on integrating this with the query front end now
    • When complete, will allow users to
      • Limit their query to specific curators
      • Exclude curators from their query
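
To make the two filters concrete, here is a minimal sketch that assumes each search result records the set of curators whose metadata mentions it. The data layout, curator identifiers, and exclusion rule (a result is dropped only when every curator behind it is excluded) are assumptions for illustration, not the DiscoverEd design.

```python
def filter_by_curator(results, include=None, exclude=None):
    """Filter results by the curators that supplied metadata for them.

    results: list of dicts with a 'url' and a 'curators' collection.
    include: keep only results backed by at least one of these curators.
    exclude: drop results whose curators are all in this set.
    """
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for result in results:
        curators = set(result["curators"])
        if include and not (curators & include):
            continue
        if curators and curators <= exclude:
            continue
        kept.append(result)
    return kept


# Made-up results and curator ids.
RESULTS = [
    {"url": "http://example.org/a", "curators": {"curator-a"}},
    {"url": "http://example.org/b", "curators": {"curator-a", "curator-b"}},
    {"url": "http://example.org/c", "curators": {"curator-b"}},
]

print([r["url"] for r in filter_by_curator(RESULTS, include=["curator-a"])])
print([r["url"] for r in filter_by_curator(RESULTS, exclude=["curator-a"])])
```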
  • Field Queries
    • Need to map predicate URIs to “human” names
    • Currently map at query time
      • Index using the URI
      • Map specific terms to URIs
      • For example, “tag” to “”
      • Requires a Filter for every predicate
    • Landing work now to map at index time
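
Query-time mapping can be sketched as a lookup from human-friendly field names to predicate URIs. In this example the URI given for "tag" is a placeholder on example.org, the "language" entry uses the Dublin Core terms URI purely as an illustration, and the function name is hypothetical. In the Nutch prototype this logic lived in QueryFilter plugins rather than in Python.

```python
# Human-readable field name -> predicate URI used in the index.
FIELD_TO_PREDICATE = {
    "tag": "http://example.org/ns#tag",  # placeholder URI for illustration
    "language": "http://purl.org/dc/terms/language",
}


def parse_query(query):
    """Split a query into free-text terms and (predicate URI, value) filters."""
    terms, filters = [], []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and value and field in FIELD_TO_PREDICATE:
            filters.append((FIELD_TO_PREDICATE[field], value))
        else:
            # Unknown fields (and plain words) are treated as free text.
            terms.append(token)
    return terms, filters


print(parse_query("cell biology tag:science language:en"))
```

Every new predicate needs an entry in a table like this one, which mirrors the per-predicate filter cost noted on the slide. Mapping at index time moves that lookup out of the query path.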
  • Information for Curators
    • We want publishers/curators to publish more linked data
    • Need a feedback loop to help drive this
    • Working on “dashboard” to see what's indexed, how, etc.
    • Second phase: documentation, tools to help improve their RDFa
  • DiscoverEd Team & Supporters
    • Ahrash Bissell
    • Asheesh Laroia
    • Raphael Krut-Landau
    • Alex Kozak
    • Christine Geith
    • Karen Vignare
    • Hewlett Foundation
    • Bill & Melinda Gates Foundation
    • OSI
    • Michigan State University
    • AgShare
  • Cloud Tools for Semantic Search
    • Yahoo! BOSS
      • Retrieve RDF extracted from pages
    • Google CSE
      • Filter using structured data (Page Maps, RDFa)
      • Customize display using structured data
  • Conclusion
    • Cloud tools can help build simple semantic search quickly
    • Nutch provides a powerful, extensible platform for prototyping search tools
    • DiscoverEd software demonstrates semantic search without large hardware investment
  • [email_address] @nyergler (twitter)