9. Invented the web in 1989 (yeah!)
Invented the Semantic Web in 1994 (duh?)
10. Historical perspective
• From web 1.0: web of sites and pages,
aka the World Wide Web
• To web 2.0: web of people and of
participation, aka the Social Web (Blogs,
RSS, tags, Facebook, Wikipedia, etc.)
• To web 3.0: web of data, of meaning and
connected knowledge, aka the Semantic
Web
18. The traditional Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML
19. “To a computer, then, the web is a flat,
boring world devoid of meaning”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
20. “This is a pity, as in fact documents on the
web describe real objects and imaginary
concepts, and give particular relationships
between them”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
21. “Adding semantics to the web involves two things:
allowing documents which have information in
machine-readable forms, and allowing links to be
created with relationship values.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
22. “The Semantic Web is not a separate Web but an
extension of the current one, in which information
is given well-defined meaning, better enabling
computers and people to work in cooperation.”
Tim Berners-Lee, http://www.w3.org/Talks/WWW94Tim/
23. The traditional Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML
24. The semantic Web
• A principle: hypertext
• A protocol: HTTP
• An identification scheme: URNs/URIs
• A language: HTML → RDF
27. URIs and the
Web of Things
• URIs (Uniform Resource Identifiers) are
used to identify things (also called
entities) in the real world
• For instance: people, places, events,
companies, products, movies, etc.
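For instance, DBpedia (the RDF version of Wikipedia, discussed later in this deck) assigns a URI to each entity it describes. These two are real identifiers, shown only as examples:

```
http://dbpedia.org/resource/Tim_Berners-Lee   (a person)
http://dbpedia.org/resource/Paris             (a place)
```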
28. The RDF model
• RDF is used to describe relationships between objects, identified by their URIs
• A statement is a triple: Subject → Predicate → Object
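The triple model above can be sketched in a few lines of Python. This is a toy in-memory graph, not a real RDF store; the URIs are genuine FOAF/DBpedia identifiers used purely for illustration:

```python
# A minimal sketch of the RDF data model: each statement is a
# (subject, predicate, object) triple, with resources named by URIs.
# The graph here is just a Python list, not an actual triple store.

triples = [
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://xmlns.com/foaf/0.1/name",
     "Tim Berners-Lee"),
    ("http://dbpedia.org/resource/Tim_Berners-Lee",
     "http://dbpedia.org/ontology/knownFor",
     "http://dbpedia.org/resource/World_Wide_Web"),
]

def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]
```

A real store such as Apache Jena indexes triples for fast lookup; the linear scan above is only meant to show the shape of the data.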
31. SPARQL
• Query language for RDF databases
• Several implementations
• OSS: Apache Jena, Sesame, 4Store, Virtuoso, Mulgara, Redland, Open Anzo...
• Proprietary: 5Store, AllegroGraph, RDFStore, Stardog, Dydra, OWLIM...
• More expressive than SQL; scalability is still an open question
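To illustrate what a SPARQL engine does at its core, here is a toy Python matcher for a single triple pattern: terms starting with "?" are variables, everything else must match exactly. The ex: names are invented, and this is nowhere near a full engine such as Jena's ARQ; it is only a sketch of basic graph-pattern matching:

```python
# Toy illustration of SPARQL basic graph pattern matching.
# Variables ("?..."-prefixed strings) bind to anything; constants
# must match the triple exactly.

def match(pattern, triple):
    """Return variable bindings if `pattern` matches `triple`, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            bindings[p] = t
        elif p != t:
            return None
    return bindings

graph = [
    ("ex:TimBL", "ex:invented", "ex:WWW"),
    ("ex:TimBL", "ex:invented", "ex:SemanticWeb"),
    ("ex:Cerf",  "ex:invented", "ex:TCP_IP"),
]

# Analogue of: SELECT ?what WHERE { ex:TimBL ex:invented ?what }
pattern = ("ex:TimBL", "ex:invented", "?what")
results = []
for t in graph:
    b = match(pattern, t)
    if b is not None:
        results.append(b["?what"])
```

Real engines additionally join multiple patterns, plan the joins, and use indexes; the single-pattern scan above only shows the matching rule.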
34. Solution 1: “Lift”
• One can use HTML scraping and natural language processing (NLP) techniques to extract semantic information from existing content / sites
• Generic solutions: OpenCalais, Zemanta, Apache Stanbol
• Pro: no need to change existing content
• Con: error-prone, needs human checks
36. Solution 2: export
• RDFa and microformats are used to embed semantic information (expressed using the RDF model) into regular web pages
• RDFa does it using existing (rel) and additional (about, property, typeof) attributes
• Microformats only use standard HTML attributes (class)
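A hypothetical RDFa fragment (example.org names, real FOAF vocabulary) showing those attributes in place:

```html
<!-- Hypothetical snippet: the about/typeof/property/rel attributes
     add RDF statements on top of the visible HTML content. -->
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://example.org/people/alice" typeof="foaf:Person">
  <span property="foaf:name">Alice</span> knows
  <a rel="foaf:knows" href="http://example.org/people/bob">Bob</a>.
</div>
```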
37. Solution 3: reuse
• Linked Open Data: (usually large) data repositories available on the web (for free or not), expressed using the RDF model
• Interoperability between these repositories
(their ontologies) must be defined
38. Linked Open Data in 2007
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
39. 2008
40. 2009
41. 2010
42. Good for Enterprise apps too!
Diagram source: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/
44. Key Enablers
• Open Data and Linked Open Data
• Advances in automatic content analysis (linguistics, image processing) and machine learning
• Classical logic and classical AI
• Computing power (Moore's law + MapReduce)
47. Nuxeo: an open source ECM vendor
• Our focus is Enterprise Content Management
• ECM as a platform for content applications
• Open source as an efficient development model
• Modern architecture for 21st-century business: "lean, mobile, social, interoperable"
• A social marketplace in action
• Innovation driven by a community of customers, partners, and our core developers
48. Nuxeo ECM - From Platform to Products
[Product stack diagram, bottom to top:]
• Content Infrastructure: Nuxeo Core, a lightweight, scalable, embeddable content repository
• Platform: Nuxeo Enterprise Platform, a complete set of components covering all aspects of ECM (Framework, Server, Aggregator)
• Horizontal Packages: Document Management, Digital Asset Management, Case Management, Structured Content
• Business Solutions: Correspondence Management, Contracts Management, Records Management, Invoice Processing
• Vertical solutions: Construction, Media, Government, Life Sciences
50. Goals for Semantic ECM
• Repurpose existing content better
• Improve search and collaboration
• Make information more contextual
• Extract and use information from content
• Leverage Open and Linked Data, contribute
• Make ECM users' content smarter!
→ Gain efficiency, effectiveness, and strategic positioning in the ECM market
52. IKS project
• European project under FP7, with 13 partners (6 SMEs) and an 8.5 MEUR budget
• Goal: create a semantic software “stack” that will be
used by CMS vendors to add semantic features to
their products
• Started in Jan. 2009, will last until Dec. 2012
• First tangible result: Apache Stanbol, already
integrated in a Nuxeo plugin
53. The Semantic Engine
• From unstructured content to Knowledge
• Language guessing
• Topic classification (Business, Sports, Media, ...)
• Named Entity extraction and linking
• Relationships and properties extraction
61. Training statistical models for NER with
Wikipedia and DBpedia
• Extract sentences with link positions in Wikipedia articles
• DBpedia to find the type of the target entity (Person, Location, Organization)
• Apache Pig scripts to compute the join + format the result
as training files for OpenNLP
• Apache OpenNLP to build and evaluate the models
• Apache Hadoop for distributed processing
• Apache Whirr for deployment and management on Amazon
EC2 cluster
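The formatting step of this pipeline can be sketched as follows. The real implementation ran as Apache Pig jobs on Hadoop; the sentence, link spans, and entity types below are invented for illustration. OpenNLP's NameFinder expects training sentences with entities wrapped in <START:type> ... <END> markers:

```python
# Sketch of the training-file formatting step only: given a sentence,
# the character spans of its wiki links, and each link target's
# DBpedia type, emit one line in the OpenNLP NameFinder training
# format. The example inputs are made up for illustration.

def to_opennlp(sentence, links):
    """links: sorted, non-overlapping (start, end, entity_type) spans."""
    out, pos = [], 0
    for start, end, etype in links:
        out.append(sentence[pos:start])              # text before the entity
        out.append(f"<START:{etype}> {sentence[start:end]} <END>")
        pos = end
    out.append(sentence[pos:])                       # trailing text
    return "".join(out)

line = to_opennlp(
    "Tim Berners-Lee founded the W3C in Geneva.",
    [(0, 15, "person"), (28, 31, "organization"), (35, 41, "location")],
)
```

A real pipeline would also tokenize the text the way OpenNLP expects; this sketch only shows how link spans become training annotations.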
66. Training statistical models for topic
classification from Wikipedia and DBpedia
• Filter category tree from DBpedia SKOS entries (~500k)
• Pig scripts to compute the joins with articles abstracts for
all the articles categorized in Wikipedia
• Export as 2.8GB TSV file to be indexed in Apache Solr
• Use Solr's MoreLikeThisHandler to find the top 5 most related Wikipedia categories for any kind of text
• Apache Whirr & Hadoop for deployment and management on
Amazon EC2 cluster
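As a rough illustration of the MoreLikeThis idea (Solr actually uses TF-IDF scoring over an inverted index), here is a toy word-overlap scorer; the category names and texts are invented stand-ins for the indexed Wikipedia abstracts:

```python
# Toy stand-in for Solr's MoreLikeThisHandler: score each category's
# text by word overlap with the input, return the best matches.
# Categories and texts are invented for illustration only.
from collections import Counter

CATEGORY_TEXTS = {
    "Sports":    "football match team goal player league season",
    "Business":  "company market revenue profit shares investor",
    "Computing": "software web data server network protocol",
}

def top_categories(text, k=2):
    """Return the k categories whose text best overlaps the input."""
    words = Counter(text.lower().split())
    scores = {
        cat: sum(words[w] for w in doc.split())
        for cat, doc in CATEGORY_TEXTS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The real system indexes ~500k category abstracts and lets Solr rank them; the point here is only the "find the most similar indexed text" pattern.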
67. What’s next?
• Integrate the R&D results into Stanbol / Nuxeo
• Work on user interfaces / high-level JavaScript toolkits for Linked Data editing
• http://github.com/bergie/VIE, based on Backbone.js
• Experiment / Integrate / Refine