
Introduction to the Semantic Web


An introduction to the Semantic Web, presented by Stefane Fermigier and Olivier Grisel (Nuxeo) at the Solutions Linux conference, covering Nuxeo's semantic ECM work and Apache tools for Big Linked Data.


  1. Introduction to the Semantic Web. Stefane Fermigier, Olivier Grisel - Nuxeo. Solutions Linux, Paris, May 2011
  2. Agenda • A pragmatic introduction to the Semantic Web • Experience report and demos from Nuxeo • Apache tools for Big Linked Data
  3. Part 1: Introduction to the Semantic Web
  4. Prelude
  5. Source: Mills Davis, “Semantic Social Computing”, Sept. 2007
  6. History
  7. Invented the web in 1989 (yeah!)
  8. Invented the web in 1989 (yeah!) Invented the semantic web in 1994 (duh?)
  9. Historical perspective • From web 1.0: web of sites and pages, aka the World Wide Web • To web 2.0: web of people and participation, aka the Social Web (blogs, RSS, tags, Facebook, Wikipedia, etc.) • To web 3.0: web of data, meaning and connected knowledge, aka the Semantic Web
  10. Semantics & Ontologies
  11. Some examples • FOAF: relationships between people (social networks) • SIOC: relationships between websites, articles, blogs, comments • Rich Snippets: syndicate RDFa content for SEO by Google and Yahoo • GoodRelations: e-commerce (eBay, ...) • rNews: metadata for news agencies (AFP, Reuters, ...)
  12. How is it related to the Web?
  13. The traditional Web • A principle: hypertext • A protocol: HTTP • An identification scheme: URNs/URIs • A language: HTML
  14. “To a computer, then, the web is a flat, boring world devoid of meaning.” - Tim Berners-Lee
  15. “This is a pity, as in fact documents on the web describe real objects and imaginary concepts, and give particular relationships between them.” - Tim Berners-Lee
  16. “Adding semantics to the web involves two things: allowing documents which have information in machine-readable forms, and allowing links to be created with relationship values.” - Tim Berners-Lee
  17. “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” - Tim Berners-Lee
  18. The traditional Web • A principle: hypertext • A protocol: HTTP • An identification scheme: URNs/URIs • A language: HTML
  19. The semantic Web • A principle: hypertext • A protocol: HTTP • An identification scheme: URNs/URIs • A language: RDF (instead of HTML)
  20. The W3C “Layer Cake”
  21. The W3C “Layer Cake”: already standardized
  22. URIs and the Web of Things • URIs (Uniform Resource Identifiers) are used to identify things (also called entities) in the real world • For instance: people, places, events, companies, products, movies, etc.
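Because a URI is just a structured name, the standard library can take one apart; a minimal illustration using a real DBpedia resource URI:

```python
# URIs give every "thing" a global name; urlparse splits one into
# its scheme, authority (host) and path components.
from urllib.parse import urlparse

uri = "http://dbpedia.org/resource/Paris"
parts = urlparse(uri)

scheme = parts.scheme    # "http"
host = parts.netloc      # "dbpedia.org"
path = parts.path        # "/resource/Paris"
```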
  23. The RDF model • RDF is used to describe relationships between objects, identified by their URIs, as Subject - Predicate - Object triples
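The subject-predicate-object model can be sketched with plain tuples; a minimal illustration (the FOAF and DBpedia URIs below are real vocabularies, used here only as examples):

```python
# A minimal sketch of the RDF model: each statement is a
# (subject, predicate, object) triple, with URIs naming things.
triples = [
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://xmlns.com/foaf/0.1/name",
     "Tim Berners-Lee"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://xmlns.com/foaf/0.1/knows",
     "http://dbpedia.org/resource/Vint_Cerf"),
]

# The object of one triple can be the subject of another, which is
# what turns isolated statements into a graph of linked data.
subjects = {s for s, p, o in triples}
```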
  24. Example. Source: web-30-linked-data-quelques-repres-pour-sy-retrouver
  25. RDF serialization • As XML • Others, e.g. N3
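The serialization examples in the original slides were images; as a hand-written illustration (not the output of an RDF library), the same single triple can be rendered in RDF/XML and in N3/Turtle:

```python
# One triple, two serializations. The URIs are real FOAF/DBpedia
# identifiers; the strings below are hand-built for illustration.
subject = "http://dbpedia.org/resource/Paris"
predicate = "http://xmlns.com/foaf/0.1/name"
obj = "Paris"

# RDF/XML: verbose, namespace-based.
rdf_xml = f"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="{subject}">
    <foaf:name>{obj}</foaf:name>
  </rdf:Description>
</rdf:RDF>"""

# N3/Turtle: one line per triple, terminated by a dot.
n3 = f'<{subject}> <{predicate}> "{obj}" .'
```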
  26. SPARQL • Query language for RDF databases • Several implementations • OSS: Apache Jena, Sesame, 4store, Virtuoso, Mulgara, Redland, Open Anzo, ... • Proprietary: 5store, AllegroGraph, RDFStore, Stardog, Dydra, OWLIM, ... • More expressive than SQL; scalability is still an open question
  27. SPARQL sample
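The sample query in the original slide was an image. Conceptually, a SPARQL query such as `SELECT ?name WHERE { ?person foaf:knows ?friend . ?friend foaf:name ?name }` boils down to matching triple patterns against the graph and joining the variable bindings; a toy matcher (data and names invented for the example):

```python
# A caricature of SPARQL evaluation: match triple patterns, where
# terms starting with "?" are variables, then join the bindings.
FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    ("alice", FOAF + "knows", "bob"),
    ("bob", FOAF + "name", "Bob"),
]

def match(pattern, triples):
    """Yield one {variable: value} binding per matching triple."""
    for s, p, o in triples:
        binding = {}
        ok = True
        for term, value in zip(pattern, (s, p, o)):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

# Nested-loop join of the two patterns, as a SPARQL engine might do.
results = []
for b1 in match(("?person", FOAF + "knows", "?friend"), triples):
    for b2 in match((b1["?friend"], FOAF + "name", "?name"), triples):
        results.append(b2["?name"])
```

Real engines (Jena, Sesame, Virtuoso, ...) add indexes, query optimization and the full SPARQL algebra on top of this basic idea.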
  28. Where and how to find this data?
  29. Solution 1: “lift” • One can use HTML scraping and natural language processing (NLP) techniques to extract semantic information from existing content / sites • Generic solutions: OpenCalais, Zemanta, Apache Stanbol • Pro: no need to change existing content • Con: error-prone, needs human checks
  30. Example: DBpedia
  31. Solution 2: export • RDFa and microformats are used to embed semantic information (expressed using the RDF model) into regular web pages • RDFa does it using existing (rel) and additional (about, property, typeof) attributes • Microformats only use usual HTML attributes (class)
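A consumer of such markup walks the HTML and collects the annotated attributes; a minimal sketch with only the standard library (this handles just the `property` attribute, not full RDFa processing):

```python
# Sketch of an RDFa-aware consumer: collect (property, text) pairs
# from elements carrying a "property" attribute.
from html.parser import HTMLParser

class RDFaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_property = None
        self.facts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "property" in attrs:
            self.current_property = attrs["property"]

    def handle_data(self, data):
        # Pair the pending property with the element's text content.
        if self.current_property and data.strip():
            self.facts.append((self.current_property, data.strip()))
            self.current_property = None

html = ('<div about="http://example.org/alice">'
        '<span property="foaf:name">Alice</span></div>')
extractor = RDFaExtractor()
extractor.feed(html)
```

A full RDFa processor would also resolve the `about` and `typeof` attributes into proper subject URIs and types.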
  32. Solution 3: reuse • Linked Open Data: (usually large) data repositories available on the web (for free or not), expressed using the RDF model • Interoperability between these repositories (their ontologies) must be defined
  33. Linked Open Data in 2007. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”
  34. 2008. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”
  35. 2009. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”
  36. 2010. “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.”
  37. Good for enterprise apps too! Diagram source:
  38. Why now?
  39. Key enablers • Open Data and Linked Open Data • Advances in automatic content analysis (linguistics, image processing) and machine learning • Classical logic and classical AI • Computing power (Moore’s law + MapReduce)
  40. The technologies and data are available - let’s put them to use!
  41. Part 2: Nuxeo & Semantic ECM
  42. Nuxeo: an open source ECM vendor • Our focus is Enterprise Content Management • ECM as a platform for content applications • Open source as an efficient development model • Modern architecture for 21st-century business: “lean, mobile, social, interoperable” • A social marketplace in action: innovation driven by a community of customers, partners, and our core developers
  43. Nuxeo ECM - from platform to products. [Diagram: vertical business solutions (correspondence management, contracts management, records management, invoice processing, case management) for construction, media, government and life sciences, built on horizontal packages (document management, digital asset management, structured document management) and a content application framework, all on the Nuxeo Enterprise Platform - a complete set of components covering all aspects of ECM - with Nuxeo Core, a lightweight, scalable, embeddable content repository, as the content infrastructure.]
  44. Major customers
  45. Goals for Semantic ECM • Repurpose existing content better • Improve search and collaboration • Make information more contextual • Extract and use information from content • Leverage Open and Linked Data, contribute • Make ECM users’ content smarter! • Gain efficiency, effectiveness and strategic positioning in the ECM market
  46. Demo
  47. The IKS project • European project under FP7, with 13 partners (6 SMEs) and an 8.5 M€ budget • Goal: create a semantic software “stack” that will be used by CMS vendors to add semantic features to their products • Started in Jan. 2009, will last until Dec. 2012 • First tangible result: Apache Stanbol, already integrated in a Nuxeo plugin
  48. The semantic engine • From unstructured content to knowledge • Language guessing • Topic classification (business, sports, media, ...) • Named entity extraction and linking • Relationship and property extraction
  52. RESTful is Beautiful
  53. Apache Stanbol = semantic engines (Apache OpenNLP) + fast Linked Data local index (Apache Solr) + semantic rule engine (Apache Jena)
  54. [Architecture diagram: a Nuxeo DM addon on the local IT infrastructure (LAN) calls Apache Stanbol, whose engines draw on DBpedia, Freebase, Geonames and LDAP.]
  55. Part 3: Apache tools for processing Big and/or Linked Data
  56. Training statistical models for NER with Wikipedia and DBpedia • Extract sentences with link positions from Wikipedia articles • Use DBpedia to find the type of the target entity (Person, Location, Organization) • Apache Pig scripts to compute the join and format the result as training files for OpenNLP • Apache OpenNLP to build and evaluate the models • Apache Hadoop for distributed processing • Apache Whirr for deployment and management on an Amazon EC2 cluster
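The formatting step above can be sketched in a few lines: given a sentence's tokens, the link spans Wikipedia provides, and the entity type looked up in DBpedia, emit one line in OpenNLP's name-finder training format (`<START:type> ... <END>`). The tokens and types below are illustrative, and the real pipeline does this at scale in Pig on Hadoop:

```python
# Convert a tokenized sentence plus typed link spans into an
# OpenNLP name-finder training line.
def to_opennlp(tokens, spans):
    """spans: list of (start, end, entity_type) over token indices,
    where end is exclusive, as in Python slicing."""
    out = []
    starts = {s: t for s, e, t in spans}
    ends = {e for s, e, t in spans}
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
        if i + 1 in ends:
            out.append("<END>")
    return " ".join(out)

line = to_opennlp(
    ["Tim", "Berners-Lee", "works", "at", "W3C", "."],
    [(0, 2, "person"), (4, 5, "organization")],
)
```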
  61. Training statistical models for topic classification from Wikipedia and DBpedia • Filter the category tree from DBpedia SKOS entries (~500k) • Pig scripts to compute the joins with article abstracts for all categorized Wikipedia articles • Export as a 2.8 GB TSV file to be indexed in Apache Solr • Use the Solr MoreLikeThisHandler to find the top 5 most related Wikipedia categories for any kind of text • Apache Whirr & Hadoop for deployment and management on an Amazon EC2 cluster
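The MoreLikeThis step can be caricatured as term-overlap scoring: rank each category's abstract text by shared terms with the input and keep the top matches. The categories and texts below are made up for the example; Solr's real implementation uses tf-idf-weighted "interesting terms" over the indexed abstracts:

```python
# Toy version of a MoreLikeThis-style category lookup:
# score categories by term overlap with the input text.
from collections import Counter

category_texts = {
    "Sports": "football match team player goal league season",
    "Business": "company market revenue profit shares trade",
    "Media": "newspaper television broadcast press journalist",
}

def top_categories(text, n=2):
    words = set(text.lower().split())
    scores = Counter()
    for cat, abstract in category_texts.items():
        scores[cat] = len(words & set(abstract.split()))
    # Keep only categories that share at least one term.
    return [cat for cat, score in scores.most_common(n) if score > 0]

result = top_categories("The team scored a late goal to win the match")
```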
  62. What’s next? • Integrate the R&D results into Stanbol / Nuxeo • Work on user interfaces / high-level JavaScript toolkits for Linked Data editing, based on backbone.js • Experiment / integrate / refine
  63. Resources