Lecture semantic lifting_presentation

Semantic CMS Community
Semantic Lifting for Traditional Content Resources
Lecturer
Organization

Date of presentation

Co-funded by the European Union
Copyright IKS Consortium


Part I: Foundations
(1) Introduction of Content Management
(2) Foundations of Semantic Web Technologies

Part II: Semantic Content Management
(3) Knowledge Interaction and Presentation
(4) Knowledge Representation and Reasoning
(5) Semantic Lifting
(6) Storing and Accessing Semantic Data

Part III: Methodologies
(7) Requirements Engineering for Semantic CMS
(8) Designing Semantic CMS
(9) Semantifying your CMS
(10) Designing Interactive Ubiquitous IS

www.iks-project.eu Copyright IKS Consortium

Page: 3

What is this Lecture about?
 We have learned ...
 ... how to build ontologies representing complex knowledge domains.
 ... a way to reason about knowledge.
 We need a way ...
 ... to extract knowledge from content in an automatic way
→ Semantic Lifting


Page: 4

Overview
 What is semantic lifting?
 Core concepts
 Scenarios
 Requirements
 Technologies
 Semantic Reengineering
 Semantic Enhancements of textual content


Page: 5

What is “Semantic Lifting”?
 Semantic Lifting refers to the process of associating
content items with suitable semantic objects as
metadata to turn “unstructured” content items into
semantic knowledge resources

 Semantic Lifting makes explicit “hidden” metadata in
content items


Page: 6

Semantic Lifting Targets
 Semantic Reengineering of structured data
 Semantic Lifting harmonizes metadata representations
 Semantic Lifting reengineers data from an existing resource so
that the data from the resource can be reused within a
semantic repository

 Semantic Content Enhancement
 Semantic Lifting generates additional metadata and annotations
by semantic analysis of content items
 Semantic Lifting classifies content objects by means of semantic
annotations


Page: 7

Structured Content
 Structured content provides implicit semantics through
the structure definition
 Table definitions in relational databases, XML
schemata, field definitions for address books,
calendars, etc.
 Application programs are designed to “know” how
to interpret the structures and the data within.
 Semantic Lifting is used for Reengineering to
support data exchange and seamless interoperability
between different systems

Page: 8

Unstructured Content
 Unstructured content
 Images, texts, videos, music, web pages composed
of various types of media items
 Meaningful only to humans, not to machines
 Content must be described semantically by metadata
to become meaningful to machines, e.g. what the text
or image is about.
 Semantic Lifting is used as content enhancement


Page: 9

Mixed Content
 No dichotomy of structured and unstructured content
 Structured databases are used to store unstructured
content types, such as texts, images etc.
 Documents can be composed of unstructured content
items such as free text and images as well as more
structured information, e.g. tables and charts

[Figure: a document combining free text and structured content]


Page: 10

Metadata: Variants
 Metadata exist in many forms:
 Free text descriptions
 Descriptive content-related keywords or tags from fixed vocabularies or
in free form
 Taxonomic and classificatory labels
 Media-specific metadata, such as MIME types, encoding, language, bit
rate
 Media-type-specific structured metadata schemes such as EXIF for
photos, IPTC tags for images, ID3 tags for MP3, MPEG-7 for videos,
etc.
 Content-related structured knowledge markup, e.g. to specify what
objects are shown in an image or mentioned in a text, what the actors
are doing, etc.


Page: 11

Metadata: Variants
 Inline metadata are part of content
 ID3 tags embedded in MP3 files
 Offline metadata are kept separate from content


Page: 12

Formal semantic metadata
 Data representation in a formalism with a formal
semantic interpretation that defines the concept of
(logical) entailment for reasoning:
 Soundness: conclusions are valid entailments
 Completeness: every valid entailment can be deduced
 Decidability: a procedure exists to determine whether a
conclusion can be deduced
 Embodiments:
 Logics
 Knowledge Representation Systems, Description Logics
 Semantic Web: RDF, OWL

Page: 13

“Semantics” in CMS
 CMS systems provide various methods to include
metadata
 Organize content in hierarchies
 Hierarchical taxonomies
 Attachment of properties to content items for metadata
 Content type definitions with inheritance

 These methods are used in CMS systems in an ad hoc
fashion without clear semantics. Therefore, no well-defined
reasoning is possible.


Page: 14

Semantic Lifting Usage
 Content Creation and Acquisition
 Authoring content
 Support content editors in providing metadata of specified types
 Uploading external content/documents
 automatic extraction and analysis, e.g. for indexing
 Importing content from external sources/documents
 Integration of external content into content repository
 Content needs to be transformed to match internal CMS structures and
metadata schemes
 Cross-referencing/linking among CMS content items and external
content
 Detect related or additional content
 Add pointers/links to related or additional content


Page: 15

 Access to external documents and content repositories
 Semantic harmonization with CMS semantic structures
 Semantic interoperability in data exchange with other content
repositories
 The CMS needs to understand the data structures used
by external services and programs
 E.g. synchronization of a local calendar from Outlook with an
external calendar based on the iCalendar format
 E.g. importing RDF from a Linked Data endpoint such as
DBpedia
 The CMS must present its data in a form understood by
external target services or programs

Page: 16

 Publishing content with metadata
 Metadata need to be transformed into a form compatible
with the publication format
 E.g. converting FreeDB metadata into ID3 tags for inclusion in
an MP3 file
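
The FreeDB-to-ID3 conversion mentioned above can be sketched as a simple key mapping. This is a minimal illustration, not a complete converter: the sample record is invented, the helper names `parse_xmcd` and `freedb_to_id3` are hypothetical, and the sketch assumes the usual xmcd convention that `DTITLE` holds "Artist / Album". The target keys are ID3v2.3 text frame identifiers; actually writing them into an MP3 file would require a tagging library, which is not shown.

```python
def parse_xmcd(text):
    """Parse 'KEY=value' lines of a FreeDB/xmcd-style record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines and comments
        key, value = line.split("=", 1)
        # Long values may be continued by repeating the key, so concatenate.
        record[key] = record.get(key, "") + value
    return record


def freedb_to_id3(record, track_index):
    """Build a dict of ID3v2.3 text frames for one track (index starts at 0)."""
    artist, _, album = record.get("DTITLE", "").partition(" / ")
    frames = {
        "TPE1": artist,                                  # lead performer
        "TALB": album,                                   # album title
        "TIT2": record.get(f"TTITLE{track_index}", ""),  # track title
        "TRCK": str(track_index + 1),                    # ID3 counts from 1
        "TYER": record.get("DYEAR", ""),                 # year
        "TCON": record.get("DGENRE", ""),                # genre
    }
    # Drop frames for which the source record had no value.
    return {frame: value for frame, value in frames.items() if value}


# Invented sample record, for illustration only.
SAMPLE = """\
# xmcd
DISCID=0a0b0c0d
DTITLE=Example Artist / Example Album
DYEAR=2009
DGENRE=Rock
TTITLE0=First Song
TTITLE1=Second Song
"""

frames = freedb_to_id3(parse_xmcd(SAMPLE), 1)
print(frames)
```

The point of the sketch is that publishing metadata is largely a schema mapping problem: the same facts are kept, but renamed and restructured for the target format.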


Page: 17

Publishing Web Content with
semantic metadata
 Augmenting web content with structured information becomes
increasingly important
 Several methods have emerged in recent years to include
structured metadata in Web pages
 Microformats
 RDFa
 Microdata (HTML5)
 Supported by the major search engines to improve search and
result presentation, e.g. Google (“Rich Snippets”), Bing, Yahoo


Page: 18

Augmenting Web Content
 The HTML code contains a review of a restaurant in plain text
using only line breaks for structuring
 Without specialized information extraction tools, a machine cannot
interpret it, e.g. recognize that it is a review (of what and when?), who the
reviewer was, etc.

<div>
L’Amourita Pizza
Reviewed by Ulysses Grant on Jan 6.
Delicious, tasty pizza on Eastlake!
L'Amourita serves up traditional wood-fired Neapolitan-style pizza,
brought to your table promptly and without fuss. An ideal neighborhood
pizza joint.
Rating: 4.5
</div>


Page: 19

Microformats
 Same text but additional span elements with class attributes to
encode the type of contained information (hReview) and the
properties of that type
<div class="hreview">
<span class="item">
<span class="fn">L’Amourita Pizza</span>
</span>
Reviewed by <span class="reviewer">Ulysses Grant</span> on
<span class="dtreviewed">
Jan 6<span class="value-title" title="2009-01-06"></span>
</span>.
<span class="summary">Delicious, tasty pizza on Eastlake!</span>
<span class="description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza, brought to your table promptly and without fuss.
An ideal neighborhood pizza joint.</span>
Rating:
<span class="rating">4.5</span>
</div>


Page: 20

RDFa
 Same text but additional attributes and span elements encoding an
RDF structure:
 namespace declaration of the used ontology
 RDF class encoded by typeof attribute and its properties by a
property attribute
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
<span property="v:itemreviewed">L’Amourita Pizza</span>
Reviewed by
<span property="v:reviewer">Ulysses Grant</span> on
<span property="v:dtreviewed" content="2009-01-06">Jan 6</span>.
<span property="v:summary">Delicious, tasty pizza on Eastlake!</span>
<span property="v:description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza, brought to your table promptly and without fuss.
An ideal neighborhood pizza joint.</span>
Rating:
<span property="v:rating">4.5</span>
</div>
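
To illustrate why such markup helps machines, the following sketch reads the `typeof`, `property` and `content` attributes from a shortened copy of the snippet above using only Python's standard `html.parser`. It is not a conforming RDFa processor: the class name `RdfaPropertyParser` is hypothetical, namespace prefixes are not resolved, and nested property elements are not handled.

```python
from html.parser import HTMLParser


class RdfaPropertyParser(HTMLParser):
    """Collect (property, value) pairs from elements with a `property` attribute.

    Simplification: property elements must not be nested, as in the slide.
    """

    def __init__(self):
        super().__init__()
        self.item_type = None   # value of the `typeof` attribute
        self.pairs = []         # extracted (property, value) pairs
        self._current = None    # property currently being read
        self._content = None    # explicit `content` attribute, if present
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "typeof" in attrs:
            self.item_type = attrs["typeof"]
        if "property" in attrs:
            self._current = attrs["property"]
            self._content = attrs.get("content")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._current is not None:
            # Prefer the machine-readable `content` value over the visible text.
            text = " ".join("".join(self._text).split())
            value = self._content if self._content is not None else text
            self.pairs.append((self._current, value))
            self._current = None


HTML = """<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
<span property="v:itemreviewed">L'Amourita Pizza</span>
Reviewed by <span property="v:reviewer">Ulysses Grant</span> on
<span property="v:dtreviewed" content="2009-01-06">Jan 6</span>.
Rating: <span property="v:rating">4.5</span>
</div>"""

parser = RdfaPropertyParser()
parser.feed(HTML)
parser.close()
print(parser.item_type, parser.pairs)
```

Note that the review date comes out as `2009-01-06`, taken from the `content` attribute, although the visible text only says "Jan 6". No language analysis is needed to recover it.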


Page: 21

Microdata (HTML5)
 Same text but additional attributes and span elements:
 A class declaration as value of an itemtype attribute and its
properties as values of an itemprop attribute

<div>
<div itemscope itemtype="http://data-vocabulary.org/Review">
<span itemprop="itemreviewed">L’Amourita Pizza</span>
Reviewed by <span itemprop="reviewer">Ulysses Grant</span> on
<time itemprop="dtreviewed" datetime="2009-01-06">Jan 6</time>.
<span itemprop="summary">Delicious, tasty pizza in Eastlake!</span>
<span itemprop="description">L'Amourita serves up traditional wood-fired
Neapolitan-style pizza,
brought to your table promptly and without fuss. An ideal neighborhood pizza
joint.</span>
Rating: <span itemprop="rating">4.5</span>
</div>
</div>


Page: 22

Lifting Requirements:
Overview
Top-level requirements
 Semantic Associations with Content
 Semantic Harmonization
 Semantic Linking
 Interactive Lifting
 Customizability
 Semantically Transparent Structured Content
Sources


Page: 23

Semantic Associations with
Content
 Unstructured content and information must be
supplied with structured semantic annotations and
metadata.
 Support for various content/media types
 Information extraction from text, topic classification, image
tagging, …
 Support for creation of semantic annotations in content
authoring


Page: 24

Semantic Harmonization
 Metadata and annotations must be harmonized with
requirements for semantic processing in the CMS
 Reengineering methods, interpreters and wrappers for all
types and formats of metadata and annotations, e.g. tags,
microformats, XML metadata (MPEG-7, …), ID3 tags,
EXIF data, …
 Ensure semantic interoperability of data and annotation
schemes within the CMS and across external resources
 Ontology mapping and harmonization of annotations
 External metadata
 Metadata generated by semantic analysis


Page: 25

Semantic Linking
 Lifting must enable the interlinking of content
objects by semantic relationships.
 Internal linking of content items within the CMS
 Links to external resources, e.g. Linked Open Data
 Establish semantic relatedness of content for different
views as well as different search, navigation and browsing
strategies, …
 Direct semantic links among content items and metadata
 Similarity relations over sets of content items
 Clustering of content items


Page: 26

Interactive Lifting
 Lifting must interact with CMS users.
 Suggest semantic annotations during content creation
 Support for various publishing formats such as microformats,
RDFa, etc.
 Automatic annotation (autotagging) with an option
for manual correction
 Learning capabilities and adaptability of automatic
annotation components from user feedback


Page: 27

Customizability
 Lifting components must be customizable by CMS
users/customers.
 Users must not be restricted to predefined vocabularies,
ontologies, …
 Domain ontologies, terminologies, tag sets are defined by
CMS users/customers.
 Browsers and editors for component resources are
necessary.


Page: 28

Transparent Structured
Content Sources
 Structured content sources need to be reengineered into
semantic resources
 Support uniform data access to structured content
repositories, e.g. SPARQL endpoints based on D2RQ
technologies for transparent access to RDF and non-RDF
databases
 Extraction of ontologies from database structures,
schemata, XML, resources, …
 Alignment and mapping of the descriptions


Page: 29

Semantic Reengineering of
structured data sources
 Focus on tools for reengineering structured data sources to RDF
representations
 Many tools and platforms exist for this purpose:
 D2R Server: exposes relational DBs as RDF
 Talis platform: Linked Open Data
 Triplify: like D2R but in PHP
 Virtuoso middleware
 Krextor/OntoCape: generating RDF from XML
 Various Transformers for inducing RDF ontologies and instance
data from XSD and XML
 More details in presentation on Knowledge
Representation (KReS)

Page: 30

Semantic Content
Enhancements: Overview
 Focus here is on textual content
 Metadata Extraction from existing content in various
formats to make embedded metadata explicit
 Information Extraction from textual content:
 Named Entities
 Coreference
 Relationships
 Classification and Clustering of content items
 Statistical methods and tools
 Semantic classification based on ontological definitions


Page: 31

Information Extraction
 Rule based approaches for shallow text analysis
 Usually based on Finite State technology: fast, robust
 Cascaded processing
 Based on templates as target structures to be filled
 Example platforms:
 GATE
 SProUT
 Can be used for nearly any kind of extraction/annotation task,
including Named-Entity-Recognition (NER)
 Easy customization


Page: 32

Information Extraction
 Semi-supervised learning approaches
 Rule induction from corpora
 Use example annotations as seeds for bootstrapping
 Pattern Rules learned from contextual features with
generalization over contexts


Page: 33

Named Entities
 Statistical Approaches: examples
 Lingpipe: Hidden Markov Models
 OpenNLP: Maximum Entropy Models
 Stanford NER: Conditional Random Fields

 Statistical models created by supervised learning techniques
 Large annotated corpora required
 Customization difficult except by re-annotation/re-training
 Not suitable for every type of named entity


Page: 34

NER Document Markup


Page: 35

NER Markup for a Web Page


Page: 36

IE Template
A Person Template (as a Typed Feature Structure) instantiated from text.
The template supports the extraction of various properties of a person.
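
A rough sketch of rule-based template filling, using regular expressions in place of a finite-state grammar. It does not use GATE or SProUT syntax; the slot names of `PersonTemplate`, the small list of positions and the company suffixes are assumptions chosen for illustration. The first group of patterns marks basic units, and a second pattern over those units fills the template, which mimics cascaded processing in a very reduced form.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonTemplate:
    """Target structure with slots to be filled from text (hypothetical slots)."""
    given_name: Optional[str] = None
    surname: Optional[str] = None
    position: Optional[str] = None
    organization: Optional[str] = None


# Stage 1: patterns for basic units (organization, position, person name).
ORG = r"(?P<org>(?:[A-Z]\w+ )+(?:Corp\.|Inc\.|Ltd\.))"
POSITION = r"(?P<position>CEO|CTO|President)"
PERSON = r"(?P<given>[A-Z][a-z]+) (?P<surname>[A-Z][a-z]+)"

# Stage 2: a rule over the stage-1 units that fills the template slots.
PERSON_RULE = re.compile(rf"{ORG} {POSITION} {PERSON}")


def fill_person_template(sentence):
    """Return a filled PersonTemplate, or None if the rule does not match."""
    match = PERSON_RULE.search(sentence)
    if match is None:
        return None
    return PersonTemplate(
        given_name=match.group("given"),
        surname=match.group("surname"),
        position=match.group("position"),
        organization=match.group("org"),
    )


sentence = "Microsoft Corp. CEO Steve Ballmer announced the release of Windows 7 today"
person = fill_person_template(sentence)
print(person)
```

The rule is easy to read and to customize, but it only covers the phrasing it was written for, which is the typical trade-off of hand-written extraction rules.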


Page: 37

Classification
 Assign a data item to some predefined class
 Statistical classification
 Numerous methods, e.g.:
 Bayes classifiers
 K-Nearest Neighbor (KNN)
 Support Vector Machines (SVM)


Page: 38

Semantic Classification
 Semantic classification in Knowledge Representation
Formalisms
 Infer the item's class from the item's properties by matching
them with the class definitions: Which classes allow for these
properties?
Assume that our ontology contains 2 classes with some properties:
SpatialThing: latitude, longitude
PopulatedPlace: population
Paderborn is an object with latitude “51°43′0″N”, longitude “8°46′0″E” and a
population of 146283.
Then we can infer that Paderborn is a SpatialThing, since those are the things that
have latitudes and longitudes in our ontology. We can also infer that it is a
PopulatedPlace, since those are the things that have a population.
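
The Paderborn inference can be sketched in a few lines. The sketch uses one simple reading of the class definitions: a class applies when all of its listed properties are present on the item. This is an assumption made for illustration; under RDFS domain reasoning, a single property with a declared domain would already be enough to conclude the class. The function name `infer_classes` is hypothetical.

```python
# Class definitions from the slide: class name -> properties of that class.
CLASS_DEFINITIONS = {
    "SpatialThing": {"latitude", "longitude"},
    "PopulatedPlace": {"population"},
}


def infer_classes(item_properties, class_definitions=CLASS_DEFINITIONS):
    """Return every class whose listed properties all occur on the item."""
    present = set(item_properties)
    return sorted(
        name
        for name, required in class_definitions.items()
        if required <= present  # subset test
    )


paderborn = {
    "latitude": "51°43′0″N",
    "longitude": "8°46′0″E",
    "population": 146283,
}

print(infer_classes(paderborn))
print(infer_classes({"population": 1000}))
```

Unlike the statistical classifiers, no training data is involved: the result follows from the class definitions alone.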

Page: 39

Clustering
 Detection of classes in a data set
 Partitioning data into classes in an unsupervised way with
 high intra-class similarity
 low inter-class similarity
 Main variants:
 Hierarchical clustering
 Agglomerative

 Partitioning clustering
 K-Means
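
A minimal sketch of K-Means on two-dimensional points, written in plain Python. The sample points and the fixed start centroids are assumptions chosen so that the run is deterministic; practical implementations pick start centroids randomly or with a seeding heuristic and work on high-dimensional feature vectors.

```python
def kmeans(points, centroids, iterations=10):
    """Plain K-Means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
                for c in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            (sum(x for x, _ in cluster) / len(cluster),
             sum(y for _, y in cluster) / len(cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Invented sample data with two visibly separate groups.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

centroids, clusters = kmeans(POINTS, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The number of clusters is fixed in advance by the number of start centroids, which is the main difference from agglomerative clustering, where the hierarchy can be cut at any level.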


Page: 40

Tools for Classification and
Clustering
 Generic:
 WEKA: Java library implementing several dozen methods
for data mining. Application to textual data requires special
preprocessing.
 Text:
 MALLET: Java library with implementations of major
methods for text and document classification and
clustering


Page: 41

Evaluation Measures
 Standard evaluation measures for IE/IR etc. systems:
 Accuracy: $acc = \frac{tp + tn}{tp + fp + tn + fn}$
 Precision: $prec = \frac{tp}{tp + fp}$
 Recall: $recall = \frac{tp}{tp + fn}$
 F-Measure: $F = \frac{2 \cdot prec \cdot recall}{prec + recall}$
 where tp = true positive, tn = true negative, fp = false positive, fn = false negative


Page: 42

Evaluation Measures:
Classification
 A confusion matrix which reports on the classification of
27 wines by grape variety. The reference in this case is
the true variety and the response arises from the blind
evaluation of a human judge.

Many-way Confusion Matrix (rows: reference, columns: response)

Reference          Cabernet  Syrah  Pinot  Precision  Recall  F-Measure
Cabernet               9       3      0      0.69      0.75     0.72
Syrah                  3       5      1      0.56      0.56     0.56
Pinot                  1       1      4      0.80      0.67     0.73
Macro average                                0.68      0.66     0.67
Overall accuracy                                                0.67

Precision for Cabernet = 9/(9+3+1)
Recall for Pinot = 4/(1+1+4)
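
The per-class values for the wine example can be recomputed from the confusion matrix. In this sketch, precision for a class divides the diagonal cell by its column sum, and recall divides it by its row sum. The function names are illustrative, and the code assumes every class occurs at least once in both the reference and the response, since otherwise a division by zero would occur.

```python
LABELS = ["Cabernet", "Syrah", "Pinot"]

# Rows: reference (true variety); columns: response (the judge's answer).
MATRIX = [
    [9, 3, 0],
    [3, 5, 1],
    [1, 1, 4],
]


def per_class_scores(matrix):
    """Return a (precision, recall, F-measure) tuple for every class."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        column_total = sum(row[i] for row in matrix)  # tp + fp
        row_total = sum(matrix[i])                    # tp + fn
        precision = tp / column_total
        recall = tp / row_total
        f_measure = 2 * precision * recall / (precision + recall)
        scores.append((precision, recall, f_measure))
    return scores


def accuracy(matrix):
    """Share of items on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, (p, r, f) in zip(LABELS, per_class_scores(MATRIX)):
    print(f"{label:9s} precision={p:.2f} recall={r:.2f} F={f:.2f}")
print(f"accuracy={accuracy(MATRIX):.2f}")
```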

Page: 43

Evaluation Measures: NER
 Reference annotations:
 [Microsoft Corp.] CEO [Steve Ballmer] announced the release of [Windows 7] today

 Recognized annotations:
 [Microsoft Corp.] [CEO] [Steve] Ballmer announced the release of Windows 7 [today]

Counts   Entities
TP  1    [Microsoft Corp.]
TN
FP  3    [CEO], [Steve], [today]
FN  2    [Windows 7], [Steve Ballmer]

Precision: 1/(1+3) = 0.25
Recall: 1/(1+2) = 0.33
F-Measure: 2*0.25*0.33/(0.25+0.33) = 0.28
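
The counts for the NER example can be reproduced with set operations under exact-match scoring, where a recognized entity only counts if it is identical to a reference entity. As a simplification, the sketch compares entity strings; real evaluations compare character offsets and entity types. The function name `evaluate_spans` is illustrative.

```python
def evaluate_spans(reference, recognized):
    """Exact-match scoring of recognized entities against a reference."""
    reference, recognized = set(reference), set(recognized)
    tp = len(reference & recognized)   # found and correct
    fp = len(recognized - reference)   # found but not in the reference
    fn = len(reference - recognized)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f_measure": f_measure,
    }


REFERENCE = ["Microsoft Corp.", "Steve Ballmer", "Windows 7"]
RECOGNIZED = ["Microsoft Corp.", "CEO", "Steve", "today"]

result = evaluate_spans(REFERENCE, RECOGNIZED)
print(result)
```

With unrounded intermediate values the F-measure is 2/7, about 0.286. The value 0.28 on the slide results from inserting the rounded recall of 0.33 into the formula. Note also that the partial match [Steve] is penalized twice, once as a false positive and once as a false negative for [Steve Ballmer].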

Page: 44

NER Evaluation
 Nobel Prize Corpus from NYT, BBC, CNN
 538 documents (average 735 words/document)
 28948 person, 16948 organization occurrences

             SProUT   Calais   Stanford NER   OpenNLP
Precision    77.26    94.22       73.21        57.69
Recall       65.85    86.66       73.62        42.86
F1           71.10    90.28       73.41        49.18


Page: 45

References
 Microformats: http://microformats.org/
 RDFa: http://www.w3.org/TR/xhtml-rdfa-primer/
 Google Rich Snippets:
http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html
 Linked Data: http://linkeddata.org/guides-and-tutorials
 Linked Data: Heath and Bizer, Linked Data: Evolving the Web into a Global Data
Space. Morgan & Claypool, 2011. (Online: http://linkeddatabook.com/book)
 Information Extraction: Moens, Information Extraction: Algorithms and Prospects in
a Retrieval Context. Springer 2006
 Text Mining: Feldman and Sanger, The Text Mining Handbook: Advanced
Approaches in Analyzing Unstructured Data, CUP, 2007

