9. Invented the web in 1989 (yeah!)
Invented the Semantic Web in 1994 (duh?)
10. Historical perspective
• From web 1.0: web of sites and pages,
aka the World Wide Web
• To web 2.0: web of people and of
participation, aka the Social Web (Blogs,
RSS, tags, Facebook, Wikipedia, etc.)
• To web 3.0: web of data, of meaning and
connected knowledge, aka the Semantic
Web
18. The traditional Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML
19. “To a computer, then, the web is a flat,
boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
20. “This is a pity, as in fact documents on the
web describe real objects and imaginary
concepts, and give particular relationships
between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
21. “Adding semantics to the web involves two things:
allowing documents which have information in
machine-readable forms, and allowing links to be
created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
22. “The Semantic Web is not a separate Web but an
extension of the current one, in which information
is given well-defined meaning, better enabling
computers and people to work in cooperation.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
23. The traditional Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML
24. The semantic Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML → RDF
27. URIs and the
Web of Things
• URIs (Uniform Resource Identifiers) are
used to identify things (also called
entities) in the real world
• For instance: people, places, events,
companies, products, movies, etc.
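For instance, DBpedia (the RDF version of Wikipedia, discussed later in this deck) assigns a URI to each entity it describes. These two are real identifiers, shown only as examples:

```
http://dbpedia.org/resource/Tim_Berners-Lee   (a person)
http://dbpedia.org/resource/Paris             (a place)
```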
28. The RDF model
• RDF is used to describe relationships between objects, identified by their URIs
• A statement is a triple: Subject → Predicate → Object
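The triple model above can be sketched in a few lines of Python. This is a toy in-memory graph, not a real RDF store; the URIs are genuine FOAF/DBpedia identifiers used purely for illustration:

```python
# A minimal sketch of the RDF data model: each statement is a
# (subject, predicate, object) triple, with resources named by URIs.
# The graph here is just a Python list, not an actual triple store.

triples = [
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://xmlns.com/foaf/0.1/name",
     "Tim Berners-Lee"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://dbpedia.org/ontology/knownFor",
     "http://dbpedia.org/resource/World_Wide_Web"),
]

def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]
```

A real store such as Apache Jena indexes triples for fast lookup; the linear scan above is only meant to show the shape of the data.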
31. SPARQL
• Query language for RDF databases
• Several implementations
• OSS: Apache Jena, Sesame, 4Store, Virtuoso, Mulgara, Redland, Open Anzo...
• Proprietary: 5Store, AllegroGraph, RDFStore, Stardog, Dydra, OWLIM...
• More expressive than SQL; scalability is still an open question
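To illustrate what a SPARQL engine does at its core, here is a toy Python matcher for a single triple pattern: terms starting with "?" are variables, everything else must match exactly. The ex: names are invented, and this is nowhere near a full engine such as Jena's ARQ; it is only a sketch of basic graph-pattern matching:

```python
# Toy illustration of SPARQL basic graph pattern matching.
# Variables ("?..."-prefixed strings) bind to anything; constants
# must match the triple exactly.

def match(pattern, triple):
    """Return variable bindings if `pattern` matches `triple`, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            bindings[p] = t
        elif p != t:
            return None
    return bindings

graph = [
    ("ex:TimBL", "ex:invented", "ex:WWW"),
    ("ex:TimBL", "ex:invented", "ex:SemanticWeb"),
    ("ex:Cerf",  "ex:invented", "ex:TCP_IP"),
]

# Analogue of: SELECT ?what WHERE { ex:TimBL ex:invented ?what }
pattern = ("ex:TimBL", "ex:invented", "?what")
results = []
for t in graph:
    b = match(pattern, t)
    if b is not None:
        results.append(b["?what"])
```

Real engines additionally join multiple patterns, plan the joins, and use indexes; the single-pattern scan above only shows the matching rule.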
34. Solution 1: “Lift”
• One can use HTML scraping and natural language processing (NLP) techniques to extract semantic information from existing content / sites
• Generic solutions: OpenCalais, Zemanta, Apache Stanbol
• Pro: no need to change existing content
• Con: error-prone, needs human checks
36. Solution 2: export
• RDFa and microformats are used to embed semantic information (expressed using the RDF model) into regular web pages
• RDFa does it using existing (rel) and additional (about, property, typeof) attributes
• Microformats only use standard HTML attributes (class)
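A hypothetical RDFa fragment (example.org names, real FOAF vocabulary) showing those attributes in place:

```html
<!-- Hypothetical snippet: the about/typeof/property/rel attributes
     add RDF statements on top of the visible HTML content. -->
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://example.org/people/alice" typeof="foaf:Person">
  <span property="foaf:name">Alice</span> knows
  <a rel="foaf:knows" href="http://example.org/people/bob">Bob</a>.
</div>
```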
37. Solution 3: reuse
• Linked Open Data: (usually large) data repositories available on the web (for free or not), expressed using the RDF model
• Interoperability between these repositories
(their ontologies) must be defined
38. Linked Open Data in 2007
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
39. 2008
40. 2009
41. 2010
42. Good for Enterprise apps too!
Diagram source: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/
44. Key Enablers
• Open Data and Linked Open Data
• Advances in automatic content analysis (linguistics, image processing) and machine learning
• Classical logic and classical AI
• Computing power (Moore's law + MapReduce)
47. Nuxeo: an open source ECM vendor
• Our focus is Enterprise Content Management
• ECM as a platform for content applications
• Open source as an efficient development model
• Modern architecture for 21st-century business: "lean, mobile, social, interoperable"
• A social marketplace in action
• Innovation driven by a community of customers, partners, and our core developers
48. Nuxeo ECM - From Platform to Products
[Product stack diagram, bottom to top:]
• Content Infrastructure: Nuxeo Core, a lightweight, scalable, embeddable content repository
• Platform: Nuxeo Enterprise Platform, a complete set of components covering all aspects of ECM (Framework, Server, Aggregator)
• Horizontal Packages: Document Management, Digital Asset Management, Case Management, Structured Content
• Business Solutions: Correspondence Management, Contracts Management, Records Management, Invoice Processing
• Vertical solutions: Construction, Media, Government, Life Sciences
50. Goals for Semantic ECM
• Repurpose existing content better
• Improve search and collaboration
• Make information more contextual
• Extract and use information from content
• Leverage Open and Linked Data, contribute
• Make ECM users' content smarter!
→ Gain efficiency, effectiveness, and strategic positioning in the ECM market
52. IKS project
• European project under FP7, with 13 partners (6 SMEs) and an 8.5 MEUR budget
• Goal: create a semantic software “stack” that will be
used by CMS vendors to add semantic features to
their products
• Started in Jan. 2009, will last until Dec. 2012
• First tangible result: Apache Stanbol, already
integrated in a Nuxeo plugin
53. The Semantic Engine
• From unstructured content to Knowledge
• Language guessing
• Topic classification (Business, Sports, Media, ...)
• Named Entity extraction and linking
• Relationships and properties extraction
61. Training statistical models for NER with
Wikipedia and DBpedia
• Extract sentences with link positions in Wikipedia articles
• DBpedia to find the type of the target entity (Person, Location, Organization)
• Apache Pig scripts to compute the join + format the result
as training files for OpenNLP
• Apache OpenNLP to build and evaluate the models
• Apache Hadoop for distributed processing
• Apache Whirr for deployment and management on Amazon
EC2 cluster
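The formatting step of this pipeline can be sketched as follows. The real implementation ran as Apache Pig jobs on Hadoop; the sentence, link spans, and entity types below are invented for illustration. OpenNLP's NameFinder expects training sentences with entities wrapped in <START:type> ... <END> markers:

```python
# Sketch of the training-file formatting step only: given a sentence,
# the character spans of its wiki links, and each link target's
# DBpedia type, emit one line in the OpenNLP NameFinder training
# format. The example inputs are made up for illustration.

def to_opennlp(sentence, links):
    """links: sorted, non-overlapping (start, end, entity_type) spans."""
    out, pos = [], 0
    for start, end, etype in links:
        out.append(sentence[pos:start])              # text before the entity
        out.append(f"<START:{etype}> {sentence[start:end]} <END>")
        pos = end
    out.append(sentence[pos:])                       # trailing text
    return "".join(out)

line = to_opennlp(
    "Tim Berners-Lee founded the W3C in Geneva.",
    [(0, 15, "person"), (28, 31, "organization"), (35, 41, "location")],
)
```

A real pipeline would also tokenize the text the way OpenNLP expects; this sketch only shows how link spans become training annotations.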
66. Training statistical models for topic
classification from Wikipedia and DBpedia
• Filter category tree from DBpedia SKOS entries (~500k)
• Pig scripts to compute the joins with articles abstracts for
all the articles categorized in Wikipedia
• Export as 2.8GB TSV file to be indexed in Apache Solr
• Use Solr's MoreLikeThisHandler to find the top 5 most related Wikipedia categories for any kind of text
• Apache Whirr & Hadoop for deployment and management on
Amazon EC2 cluster
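As a rough illustration of the MoreLikeThis idea (Solr actually uses TF-IDF scoring over an inverted index), here is a toy word-overlap scorer; the category names and texts are invented stand-ins for the indexed Wikipedia abstracts:

```python
# Toy stand-in for Solr's MoreLikeThisHandler: score each category's
# text by word overlap with the input, return the best matches.
# Categories and texts are invented for illustration only.
from collections import Counter

CATEGORY_TEXTS = {
    "Sports":    "football match team goal player league season",
    "Business":  "company market revenue profit shares investor",
    "Computing": "software web data server network protocol",
}

def top_categories(text, k=2):
    """Return the k categories whose text best overlaps the input."""
    words = Counter(text.lower().split())
    scores = {
        cat: sum(words[w] for w in doc.split())
        for cat, doc in CATEGORY_TEXTS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The real system indexes ~500k category abstracts and lets Solr rank them; the point here is only the "find the most similar indexed text" pattern.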
67. What’s next?
• Integrate the R&D results into Stanbol / Nuxeo
• Work on user interfaces / high-level JavaScript toolkits for Linked Data editing
• http://github.com/bergie/VIE, based on Backbone.js
• Experiment / Integrate / Refine