Commodity Semantic Search: A Case Study of DiscoverEd
Nathan R. Yergler, Creative Commons
Semantic Technology Conference, 24 June 2010
Good afternoon. My name is Nathan Yergler, and I'm Chief Technology Officer at Creative Commons. This afternoon I'm going to talk about DiscoverEd, a semantically enhanced search engine for education that we've been working on. It's built on commodity hardware and open source tools, and the software can be used for other domains. I'm going to talk about some approaches we tried and rejected, and give you some information on tools you can use to build your own semantic search without investing in your own server farm.

Commodity Semantic Search: A Case Study of DiscoverEd
Nathan R. Yergler, Creative Commons
Semantic Technology Conference, 24 June 2010

share, reuse, and remix, legally

Creative Commons provides legal and technical tools that make sharing easy, legal, and scalable.

<a href="http://creativecommons.org/licenses/by/3.0/" rel="license">Attribution 3.0 Unported</a>

<rdf:RDF xmlns:cc='http://creativecommons.org/ns#'
         xmlns:foaf='http://xmlns.com/foaf/0.1/'
         xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:dc='http://purl.org/dc/elements/1.1/'
         xmlns:dcq='http://purl.org/dc/terms/'>
  <cc:License rdf:about="http://creativecommons.org/licenses/by/3.0/">
    <cc:permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks"/>
    <cc:permits rdf:resource="http://creativecommons.org/ns#Distribution"/>
    <cc:permits rdf:resource="http://creativecommons.org/ns#Reproduction"/>
    <cc:requires rdf:resource="http://creativecommons.org/ns#Notice"/>
    <cc:requires rdf:resource="http://creativecommons.org/ns#Attribution"/>
    <cc:legalcode rdf:resource="http://creativecommons.org/licenses/by/3.0/legalcode"/>
    <dcq:hasVersion>3.0</dcq:hasVersion>
    <foaf:logo rdf:resource="http://i.creativecommons.org/l/by/3.0/80x15.png"/>
    <foaf:logo rdf:resource="http://i.creativecommons.org/l/by/3.0/88x31.png"/>
    <cc:licenseClass rdf:resource="http://creativecommons.org/license/"/>
    <dc:creator rdf:resource="http://creativecommons.org"/>
  </cc:License>
</rdf:RDF>
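
To make the license description above concrete, here is a minimal sketch of how a tool might read the permissions and requirements out of that RDF/XML. It uses only Python's standard library and a trimmed copy of the markup from the slide; the function names are hypothetical, and a real pipeline would more likely use a proper RDF parser than plain XML parsing.

```python
import xml.etree.ElementTree as ET

CC = "http://creativecommons.org/ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the license description shown on the slide.
LICENSE_RDF = """<rdf:RDF xmlns:cc='http://creativecommons.org/ns#'
         xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <cc:License rdf:about="http://creativecommons.org/licenses/by/3.0/">
    <cc:permits rdf:resource="http://creativecommons.org/ns#DerivativeWorks"/>
    <cc:permits rdf:resource="http://creativecommons.org/ns#Distribution"/>
    <cc:permits rdf:resource="http://creativecommons.org/ns#Reproduction"/>
    <cc:requires rdf:resource="http://creativecommons.org/ns#Notice"/>
    <cc:requires rdf:resource="http://creativecommons.org/ns#Attribution"/>
  </cc:License>
</rdf:RDF>"""


def _local_name(element):
    """Return the fragment after '#' in the element's rdf:resource URI."""
    return element.get(f"{{{RDF}}}resource").split("#")[-1]


def license_terms(rdf_xml):
    """Map each cc:License URI to the terms it permits and requires."""
    root = ET.fromstring(rdf_xml)
    terms = {}
    for lic in root.findall(f"{{{CC}}}License"):
        terms[lic.get(f"{{{RDF}}}about")] = {
            "permits": [_local_name(e) for e in lic.findall(f"{{{CC}}}permits")],
            "requires": [_local_name(e) for e in lic.findall(f"{{{CC}}}requires")],
        }
    return terms


terms = license_terms(LICENSE_RDF)
```

Because the permissions are machine-readable, a search tool can filter on them (for example, only works whose license permits DerivativeWorks) without parsing the legal text.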

CC Rights Expression Language

CC licenses are based on international copyright law

There are hundreds of millions of pieces of CC-licensed content on the web

OER
  • "Open Educational Resources"
  • Learning materials that are freely available to use, remix, and redistribute.
  • Wide variety of format, content types, audience
  • CC licenses make this content interoperable

But how do you find OER you're looking for?

OER Search == CC Search++
  • Similarities
      • No central registry or repository
      • It's up to publishers to label their works
  • And additional interesting problems
      • Differing views on what makes it "Educational"
      • Additional facets: subject, language, etc.

A Model for OER Search
  • Curators identify educational resources
      • Curators optionally add metadata
      • A Curator may also be the Publisher
      • Or a Curator may add metadata to someone else's resources

A Model for OER Search (2)
  • Ingest resource lists, metadata via RSS/Atom feeds, OAI-PMH
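
As an illustration of the ingest step, the sketch below pulls resource URLs and curator-supplied subjects out of an Atom feed using only the standard library. The sample feed, its URLs, and the function name are all made up for the example; the slides do not show DiscoverEd's actual ingest code, and RSS or OAI-PMH sources would need their own parsing.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical curator feed; a real one would be fetched over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example curator feed</title>
  <entry>
    <title>Introduction to Biology</title>
    <link href="http://example.org/courses/bio101"/>
    <category term="biology"/>
  </entry>
  <entry>
    <title>Calculus Notes</title>
    <link href="http://example.org/courses/calc"/>
  </entry>
</feed>"""


def resources_from_atom(feed_xml):
    """Return one dict per feed entry: URL, title, and any category terms."""
    root = ET.fromstring(feed_xml)
    resources = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        if link is None or not link.get("href"):
            continue  # an entry without a link gives us nothing to crawl
        resources.append({
            "url": link.get("href"),
            "title": entry.findtext(ATOM + "title", default=""),
            "subjects": [c.get("term") for c in entry.findall(ATOM + "category")],
        })
    return resources


resources = resources_from_atom(SAMPLE_FEED)
```

Metadata is optional in this model, so the second entry simply comes back with an empty subject list.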

Curators & Feeds

Two Prototypes
  • Google CSE
  • Nutch

Initial effort: Google CSE
  • Google Custom Search Engine allows you to "create a search engine for a website or a collection of interesting websites."
      • Define resource patterns for inclusion
      • Optionally include annotations: facets and labels
  • Python scripts to consume resource lists
  • Output XML suitable for Google CSE
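
A script of the kind described above might look roughly like this. It is only a sketch: the Annotations/Annotation/Label element names reflect my understanding of the CSE annotations file format and should be checked against Google's documentation, and the URL patterns and label name are invented for the example.

```python
import xml.etree.ElementTree as ET


def cse_annotations(url_patterns, label):
    """Build an annotations document that attaches one label to each URL pattern."""
    root = ET.Element("Annotations")
    for pattern in url_patterns:
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        ET.SubElement(annotation, "Label", {"name": label})
    return ET.tostring(root, encoding="unicode")


# Hypothetical patterns and label, standing in for a curator's resource list.
xml_text = cse_annotations(
    ["http://example.org/courses/*", "http://example.edu/oer/*"],
    "_cse_example",
)
print(xml_text)
```

Each resource or site in a curator's list becomes one annotation, which is why long lists of individual resources became unwieldy with this approach.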

Scaling with CSE
  • Lists of individual resources did not scale well
  • Labels and Facets worked best with fixed, limited vocabulary
  • License-filtered search unavailable

Nutch-based Prototype
  • Nutch + Jena triple store
  • Simple scripts for generating seeds from the store
  • IndexingFilter plugin for injecting metadata into Nutch index
  • QueryFilter plugins for field-specific searches
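
The seed-generation step can be pictured with a small sketch. Nutch reads seed URLs from plain text files, one URL per line; everything else here is an assumption made for illustration: the triples are a plain Python list rather than a Jena store, and the function name and sample data are hypothetical.

```python
def seed_urls(triples):
    """Return sorted, de-duplicated http(s) subject URIs suitable for crawling."""
    return sorted(
        {s for s, _p, _o in triples if s.startswith(("http://", "https://"))}
    )


# Stand-in for statements held in the triple store.
TRIPLES = [
    ("http://example.org/courses/bio101", "http://purl.org/dc/terms/subject", "biology"),
    ("http://example.org/courses/bio101", "http://purl.org/dc/terms/language", "en"),
    ("http://example.org/courses/calc", "http://purl.org/dc/terms/subject", "math"),
    ("_:b0", "http://purl.org/dc/terms/subject", "ignored blank node"),
]

# One URL per line, ready to be written to a seed file.
seed_text = "\n".join(seed_urls(TRIPLES)) + "\n"
```

Because the seeds come only from resources curators have identified, the crawl stays narrowly directed.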

DiscoverEd (Nutch)

Prototype Results
  • Curator model allows for very directed crawl
      • Low cost, not very resource intensive
  • Scale
      • Flexibly filter on predicate values
  • Limitations
      • Provenance for curator metadata
      • Predicate filters had to be "hand-crafted"

Current DiscoverEd Work

"We're Open"
  • Education is our test domain, but the tool can be generally useful
  • Other organizations have expressed interest in using the DiscoverEd software
  • Making code available on Gitorious, http://gitorious.org/discovered

Provenance
  • Initial work complete on storing provenance for curator metadata
  • Working on integrating this with the query front end now
  • When complete, will allow users to
      • Limit their query to specific curators
      • Exclude curators from their query
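
One way to picture curator provenance is to store each metadata statement together with the curator who asserted it, then filter on that fourth field. The sketch below is an illustration under that assumption, not DiscoverEd's storage model; the `Statement` type, curator names, and function are hypothetical.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str    # resource URL
    predicate: str  # metadata predicate URI
    value: str      # metadata value
    curator: str    # who asserted this statement (the provenance)


def filter_by_curator(statements, include=None, exclude=()):
    """Keep statements from `include` (all curators if None), minus any in `exclude`."""
    included = None if include is None else set(include)
    excluded = set(exclude)
    return [
        s for s in statements
        if s.curator not in excluded and (included is None or s.curator in included)
    ]


SUBJECT = "http://purl.org/dc/terms/subject"
STATEMENTS = [
    Statement("http://example.org/courses/bio101", SUBJECT, "biology", "curator-a"),
    Statement("http://example.org/courses/bio101", SUBJECT, "science", "curator-b"),
    Statement("http://example.org/courses/calc", SUBJECT, "math", "curator-b"),
]

only_a = filter_by_curator(STATEMENTS, include=["curator-a"])
without_b = filter_by_curator(STATEMENTS, exclude=["curator-b"])
```

The same resource can carry different metadata from different curators, which is why provenance has to be attached to each statement rather than to the resource.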

Field Queries
  • Need to map predicate URIs to "human" names
  • Currently map at query time
      • Index using the URI
      • Map specific terms to URIs
      • For example, "tag" to "http://purl.org/dc/terms/subject"
      • Requires a Filter for every predicate
  • Landing work now to map at index time
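
The query-time mapping can be sketched in a few lines. Only the "tag" mapping comes from the slide; the "language" entry, the query syntax, and the function name are assumptions added for illustration, and the real implementation lives in Nutch QueryFilter plugins rather than in Python.

```python
FIELD_TO_PREDICATE = {
    "tag": "http://purl.org/dc/terms/subject",        # mapping given on the slide
    "language": "http://purl.org/dc/terms/language",  # hypothetical extra mapping
}


def parse_query(query):
    """Split a query into (predicate URI or None, text) clauses.

    Tokens of the form field:value become field clauses when the field is
    known; everything else is treated as a plain search term.
    """
    clauses = []
    for token in query.split():
        field, sep, value = token.partition(":")
        if sep and value and field in FIELD_TO_PREDICATE:
            clauses.append((FIELD_TO_PREDICATE[field], value))
        else:
            clauses.append((None, token))
    return clauses


clauses = parse_query("tag:biology photosynthesis http://example.org")
```

Every searchable predicate needs its own entry here, which mirrors the "Filter for every predicate" cost noted above and motivates mapping at index time instead.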

Information for Curators
  • We want publishers/curators to publish more linked data
  • Need a feedback loop to help drive this
  • Working on "dashboard" to see what's indexed, how, etc.
  • Second phase: documentation, tools to help improve their RDFa

DiscoverEd Team & Supporters
  • Ahrash Bissell
  • Asheesh Laroia
  • Raphael Krut-Landau
  • Alex Kozak
  • Christine Geith
  • Karen Vignare

  • Hewlett Foundation
  • Bill & Melinda Gates Foundation
  • OSI
  • Michigan State University
  • AgShare

Cloud Tools for Semantic Search
  • Yahoo! BOSS
      • Retrieve RDF extracted from pages
  • Google CSE
      • Filter using structured data (PageMaps, RDFa)
      • Customize display using structured data

Conclusion
  • Cloud tools can help build simple semantic search quickly
  • Nutch provides a powerful, extensible platform for prototyping search tools
  • DiscoverEd software demonstrates semantic search without large hardware investment

http://wiki.creativecommons.org/DiscoverEd
[email_address]
@nyergler (identi.ca, twitter)