Building and Using Knowledge Bases

WeST – Web Science & Technologies
University of Koblenz Landau, Germany


Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
& WeST Team

Institute WeST – Web Science & Technologies

Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS

Contact: Steffen Staab, staab@uni-koblenz.de

PhD thesis trauma 17 years ago

„Nach dem Auspacken der LPS 105 präsentiert sich dem
Betrachter ein stabiles Laufwerk, das genauso geringe
Außenmaße besitzt wie die Maxtor.“

Literal English gloss: "Having unwrapped the LPS 105 - reveals itself to the
onlooker - a stable disk drive, which has similarly small
outer dimensions as the Maxtor."


GENERAL MOTIVATION

General motivation is not information extraction,
but it is solving tasks!


General objective: Extracting to LOD

[Figure: LOD graph; visible edge labels: useAsExample, hasLivedIn]

Crucial to know: Ontologies nowadays reflect this structure
Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)

Ontologies – LEGO style

Most famous applications

• Steve Macbeth (Microsoft), in a discussion of Schema.org:
"about 7% of pages we crawl have mark-up"
http://www.w3.org/2012/06/06-schema-minutes.html
• LOD Cloud
• Google Knowledge Graph
• Bing gets its own knowledge graph
http://searchengineland.com/bing-britannica-partnership-123930

Example ontology-based application 1:

ANALYSIS OF
URBAN PARAMETERS


General objective: Analysing LOD

[Figure: LOD graph; visible edge labels: useAsExample, hasLivedIn]


http://lisa.west.uni-koblenz.de/lisa-demo/
A family's analysis of Koblenz LOD + OpenStreetMap data


An entrepreneur's analysis of Koblenz LOD + OpenStreetMap data

1st Prize, German Linked Open Gov Data Competition 2012


Example ontology-based application 2:

FACETED MULTIMEDIA
EXPLORATION


Making Web 2.0 More Accessible

[Schenk et al; JoWS 2009]
[Figure: facets used for exploration: GeoNames, Links, Location, Persons, low- to midlevel features, Knowledge, Tags]


Choosing between Koblenz – and Koblenz

Video at: http://vimeo.com/2057249

Contextual Information


Tag-based refinement


A tag view of "Koblenz" & "Castle"


Semantic Identity – Festung Ehrenbreitstein


Persons – Celebrities, FOAFers & Flickr Users

1st Prize, Billion Triples Challenge 2008
[Schenk et al; JoWS 2009]

Now on to information extraction:

OBSERVATIONS ON
INFORMATION EXTRACTION


Challenges & Opportunities for IE

Not all web pages are created equal



Some challenges are the same, e.g. finding type instances



Some challenges are the same, e.g. finding relation instances



Some contain concepts and their descriptions, some don't
No types here,
few relation types



Knowing that they are instances and of which type
• Textual indication
• Positional indication



To some extent
positional and layout
indications work across
languages and sites



owl:sameAs
We should not only think about
Web pages, but about Web sites




Comparing related work to our objectives

Related work objectives
• IE on Web pages
• Acquiring instances and relationship instances
• IE based on linear text

Our objectives
• IE on Web sites
• Acquiring items
• Classifying items into
  - Instances
  - Concepts
  - Relation instances
  - Relationships
• IE also based on spatial position

There is overlap, and of course there are exceptions in related work.

Outline

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• Implementation
• Evaluation

The Bio-Case

[Oro et al; VLDB 2010]




Presentation-oriented documents

• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query
for the user (or a tool!)


Related Work

Web query languages
• XPath 1.0 and XQuery 1.0
  - Established
  - Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• Generates XPath location paths of DOM nodes
• Can benefit from using Spatial XPath



Representing Spatial Relations between DOM Nodes



Idea: Use Spatial Relations among DOM Nodes


Spatial DOM (SDOM)


SXPath System Architecture


Querying for Relations Among Nodes

Rectangular Cardinal Relations (RCR)

r1 E:NE r2

Spatial models allow for expressing
disjunctive relations among regions
Topological Relations
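
A relation such as "r1 E:NE r2" can be computed from bounding boxes alone. The following is a minimal sketch, not the SXPath implementation: it assumes axis-aligned rectangles in page coordinates (y grows downward, so "north" means smaller y), and the names `Rect`, `bands`, `rcr` and the tile output order are illustrative choices made here. The bounding box of the reference region r2 splits the plane into nine tiles, and the relation lists every tile that r1 overlaps.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box in page coordinates (y grows downward)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


# (column, row) of a tile relative to the reference box: -1 before, 0 inside, 1 after.
TILE_NAMES = {
    (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
    (-1, 0): "W",   (0, 0): "B",  (1, 0): "E",
    (-1, 1): "SW",  (0, 1): "S",  (1, 1): "SE",
}
# Output order is an assumption of this sketch, chosen to print "E:NE" as on the slide.
TILE_ORDER = ["B", "N", "E", "S", "W", "NE", "SE", "SW", "NW"]


def bands(lo, hi, ref_lo, ref_hi):
    """Bands of the reference interval that [lo, hi] overlaps with positive length."""
    found = []
    if lo < ref_lo:
        found.append(-1)
    if hi > ref_lo and lo < ref_hi:
        found.append(0)
    if hi > ref_hi:
        found.append(1)
    return found


def rcr(r1, r2):
    """Rectangular cardinal relation of r1 with respect to the reference box r2."""
    cols = bands(r1.x_min, r1.x_max, r2.x_min, r2.x_max)
    rows = bands(r1.y_min, r1.y_max, r2.y_min, r2.y_max)
    tiles = {TILE_NAMES[(c, r)] for c in cols for r in rows}
    return ":".join(t for t in TILE_ORDER if t in tiles)


reference = Rect(0, 0, 10, 10)
node = Rect(12, -5, 20, 5)      # to the right of the reference, partly above it
print(rcr(node, reference))     # E:NE
```

A relation with several tiles, as in this example, is what makes the model disjunctive: the node lies partly east and partly north-east of the reference.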


XPath Example


SXPath Example


From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• Adopts intuitive path notation: axis::nodetest [pred]*
• Adds to XPath
  - spatial axes
  - spatial position functions
• Natural semantics for spatial querying




Complexity Results

• Formal model defined in the paper [Oro et al; VLDB 2010]




SXPath System


Summative User Study


Outline

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  - Spatial Data Model
  - Syntax & Semantics
  - Complexity
• Implementation
• Evaluation

The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  - Page-level wrapper induction
  - Site-wide wrapper generation
  - Error Correction by Mutual Reinforcement
• Conclusions and Future Directions

>1000 Life Science DBs, number growing quickly


Biochemical Web Sites: Observations - 1

Labeled Data

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)

Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16
Table 1: Data fields across 20 Biochemical Web sites



Dynamic Web Pages



Rich Site Structure



• Semantics is often only in the report, not in the underlying relational database
• Web Services
  - Survey: 11 of 100 databases [1] provide APIs
  - Incomplete coverage
  - Varying granularity
  - No semantics in the service description

[1] Databases indexed by the Nucleic Acids Research journal
(http://www3.oup.co.uk/nar/database/). The complete survey was available at
http://sabiork.villa-bosch.de/index.html/survey.html


Biochemical Web Sites: Extraction Tasks
[Mir et al; DILS 2009]
[Mir et al; ESWC 2010]

Induce Wrapper



Contributions

• Unsupervised Page-Level Wrapper Induction
• Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)
• (Acquiring the Schema/Ontology)
• Automatic Error Detection and Correction by Mutual Reinforcement


Page-Level Wrapper Induction – 1
D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47,…}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}

//*[text()]

D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18,… }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}
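
The sets above suggest a simple first pass: collect every text entry of each sample page, treat entries shared by all samples as "other" (labels and boilerplate, O) and the rest as data (D). The sketch below illustrates that reading under stated assumptions: the two miniature pages, `text_entries` and `split_data_and_other` are made up for illustration, and ElementTree's `iter()` stands in for the `//*[text()]` query.

```python
import xml.etree.ElementTree as ET

# Two hypothetical, heavily simplified sample pages of the same class.
PAGE_1 = (
    "<html><table>"
    "<tr><th>Entry</th><td>C00221</td></tr>"
    "<tr><th>Name</th><td>beta-D-Glucose</td></tr>"
    "<tr><th>Enzyme</th><td>1.1.1.47</td><td>3.2.1.21</td></tr>"
    "</table></html>"
)
PAGE_2 = (
    "<html><table>"
    "<tr><th>Entry</th><td>C00185</td></tr>"
    "<tr><th>Name</th><td>Cellobiose</td></tr>"
    "<tr><th>Enzyme</th><td>1.1.99.18</td><td>3.2.1.21</td></tr>"
    "</table></html>"
)


def text_entries(markup):
    """All non-empty text entries of a page, similar in spirit to //*[text()]."""
    root = ET.fromstring(markup)
    return {el.text.strip() for el in root.iter() if el.text and el.text.strip()}


def split_data_and_other(pages):
    """Entries shared by every sample go to O; page-specific entries go to D."""
    entries = [text_entries(page) for page in pages]
    other = set.intersection(*entries)
    data = [page_entries - other for page_entries in entries]
    return data, other


(d1, d2), o = split_data_and_other([PAGE_1, PAGE_2])
print(sorted(d1))  # page-specific entries of page 1
print(sorted(d2))  # page-specific entries of page 2
print(sorted(o))   # shared entries, including the enzyme number 3.2.1.21
```

Note that 3.2.1.21 lands in O in this sketch because it occurs on both sample pages, even though it is a data value, which mirrors the sets shown on the slide.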

Page-Level Wrapper Induction - 2

Reclassify – Growing Data Regions


D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18, 3.2.1.21 … }
O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}



Selecting Labels for Data
Data:
html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] ("1.1.1.47")

Candidate labels:
html/…./table[1]/tr[6]/th[1]/…/code[1]/ ("Reaction")
html/…./table[1]/tr[8]/th[1]/…/code[1]/ ("Enzyme")
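
One plausible way to pick between candidate labels is to prefer the label whose DOM path shares the longest prefix with the data path. The slide does not state the selection criterion, so this heuristic is an assumption; the helper names are illustrative, and the paths are written without the elided steps purely so the sketch can run.

```python
def steps(path):
    """Split a location path into its non-empty steps."""
    return [step for step in path.split("/") if step]


def shared_prefix_length(path_a, path_b):
    """Number of leading steps two paths have in common."""
    count = 0
    for a, b in zip(steps(path_a), steps(path_b)):
        if a != b:
            break
        count += 1
    return count


def select_label(data_path, label_paths):
    """Return the label whose path shares the longest prefix with the data path."""
    return max(
        label_paths,
        key=lambda label: shared_prefix_length(label_paths[label], data_path),
    )


# Hypothetical, non-elided versions of the paths used for illustration.
DATA_PATH = "html/table[1]/tr[8]/td[1]/code[1]/a[1]"
LABEL_PATHS = {
    "Reaction": "html/table[1]/tr[6]/th[1]/code[1]/",
    "Enzyme": "html/table[1]/tr[8]/th[1]/code[1]/",
}

print(select_label(DATA_PATH, LABEL_PATHS))  # Enzyme
```

"Enzyme" wins here because it sits in the same table row (tr[8]) as the data value, while "Reaction" only shares the table.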



Anchor the Path
Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
html/table[1]/tr[8]/td[1]/code[1]/a[1]
html/table[1]/tr[8]/td[1]/code[1]/a[2]

//*[text()='Enzyme'] ../…./../td[1]/code[1]/a[position()≥2]/text()

Pivot Relative Generalize
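
The pivot and relative steps can be sketched as plain path arithmetic: anchor on the label text, climb from the label node to the deepest ancestor it shares with the data node, then descend to the data. This is an illustrative sketch only; `relative_path` and `anchored_xpath` are names invented here, and the generalize step (covering a[1], a[2], ... with one position predicate) is deliberately left out.

```python
def relative_path(label_path, data_path):
    """Path from the label node to the data node via their deepest shared ancestor."""
    label = [step for step in label_path.split("/") if step]
    data = [step for step in data_path.split("/") if step]
    common = 0
    while common < min(len(label), len(data)) and label[common] == data[common]:
        common += 1
    climb = [".."] * (len(label) - common)   # one ".." per label step below the ancestor
    return "/".join(climb + data[common:])


def anchored_xpath(label_text, label_path, data_path):
    """Pivot on the label text instead of an absolute path from the document root."""
    return f"//*[text()='{label_text}']/{relative_path(label_path, data_path)}/text()"


LABEL_PATH = "html/table[1]/tr[8]/th[1]/code[1]/"
DATA_PATH = "html/table[1]/tr[8]/td[1]/code[1]/a[1]"

print(relative_path(LABEL_PATH, DATA_PATH))
# ../../td[1]/code[1]/a[1]
print(anchored_xpath("Enzyme", LABEL_PATH, DATA_PATH))
# //*[text()='Enzyme']/../../td[1]/code[1]/a[1]/text()
```

Anchoring on the label makes the expression robust against rows being added or omitted above it, since the absolute row index tr[8] no longer appears.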


Selected Sources

• KEGG, ChEBI, MSDChem
  - Basic qualitative data
  - Popular
  - Overlapping/complementary data


Wrapper Induction - Evaluation

SOURCE          #L   #D    #S   TP    FN    FP   P      R
KEGG Compound   10   762   3    411   351   46   89.9   53.9
                           15   759   3     0    100    99.6
KEGG Reaction   10   205   3    173   32    0    100    84.4
                           15   205   0     0    100    100
ChEBI           22   831   3    595   236   41   93.5   71.6
                           15   829   2     0    100    99.7
MSDChem         30   600   3    600   0     20   96.7   100
                           15   600   0     20   96.7   100
Average (based on final wrappers for each source)  99.1   99.8

Sources:
KEGG Compound: http://www.genome.jp/kegg/compound/
KEGG Reaction: http://www.genome.jp/kegg/reaction/
ChEBI: http://www.ebi.ac.uk/chebi
MSDChem: http://www.ebi.ac.uk/msd-srv/msdchem/

Table 2: Page-level wrapper induction results, 20 test pages
(L=Labels, D=Data entries, S=Training pages)
~9 samples – ~99% P, ~98% R
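
The P and R columns follow the usual definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). As a quick check, the short sketch below recomputes them for three of the rows trained on 3 pages; the function name is illustrative.

```python
def precision_recall(tp, fn, fp):
    """Precision and recall in percent from true positives, false negatives, false positives."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# (source, TP, FN, FP) for the rows with 3 training pages.
ROWS = [
    ("KEGG Compound", 411, 351, 46),
    ("KEGG Reaction", 173, 32, 0),
    ("MSDChem", 600, 0, 20),
]

for source, tp, fn, fp in ROWS:
    p, r = precision_recall(tp, fn, fp)
    print(f"{source}: P={p:.1f} R={r:.1f}")
```

For every row, TP + FN equals the number of data entries #D, which is why recall rises as more training pages reduce the false negatives.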


Site-Wide Wrapper Induction: Observations

Not all pages contain data (e.g. legal disclaimers,
contact pages, navigational menus)
• An efficient approach should ignore these pages
• We don't need to learn the entire site structure


Site-Wide Wrapper Induction: Observations - 2

Classified Link-Collections point to data-intensive
pages of the same class.


Site-Wide Wrapper Induction: Observations - 3

• Pages belonging to the same class describe the same concepts
  - Some concepts are sometimes omitted
  - Ordering is always the same


Site-Wide Wrapper Induction

1. Start with C0; S = {C0}
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if a new class is formed: if C0 != Ci (i>0), then S = S + Ci
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4

[Figure: start class C0 with link-collections L1, L2, L3 leading to page sets C1, C2, C3]

Navigation Steps
W = {(C0 → L1 → C0),
(C0 → L2 → C2),
(C0 → L3 → C3)}
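
The steps above can be sketched as a breadth-first traversal over page classes. Everything in this sketch is a stand-in chosen for illustration: the toy `SITE` dictionary, the function names, the use of a single target page per link-collection, and the rule that two pages belong to the same class when their label tuples are equal. A real run would call page-level wrapper induction on sets of target pages instead.

```python
from collections import deque

# Hypothetical toy site: labels a wrapper would find, plus classified link-collections.
SITE = {
    "compound_1": {
        "labels": ("Entry", "Name", "Reaction", "Enzyme"),
        "links": {"L1": "compound_2", "L2": "reaction_1", "L3": "enzyme_1"},
    },
    "compound_2": {"labels": ("Entry", "Name", "Reaction", "Enzyme"), "links": {}},
    "reaction_1": {"labels": ("Entry", "Definition", "Equation"), "links": {}},
    "enzyme_1": {"labels": ("Entry", "Name", "Class"), "links": {}},
}


def induce_wrapper(page):
    """Stand-in for page-level wrapper induction: the ordered labels of the page."""
    return SITE[page]["labels"]


def discover_site_structure(start_page):
    classes = {induce_wrapper(start_page): "C0"}    # step 1: S = {C0}
    navigation = []                                 # navigation steps W
    queue = deque([start_page])
    candidate = 0
    while queue:                                    # step 6: repeat for each new class
        page = queue.popleft()
        source = classes[induce_wrapper(page)]
        for link, target in SITE[page]["links"].items():   # step 2
            candidate += 1
            wrapper = induce_wrapper(target)                # step 3
            if wrapper not in classes:                      # step 4: new class?
                classes[wrapper] = f"C{candidate}"
                queue.append(target)
            navigation.append((source, link, classes[wrapper]))  # step 5
    return classes, navigation


classes, navigation = discover_site_structure("compound_1")
for step in navigation:
    print(" -> ".join(step))
# C0 -> L1 -> C0
# C0 -> L2 -> C2
# C0 -> L3 -> C3
```

The pages behind L1 yield the same wrapper as C0, so no new class is added and the navigation step loops back to C0, as in the first triple of W.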


Site-Wide Wrapper Induction – Evaluation
SOURCE    #C   #C'   #D     TP     FN     FP    P      R
MSDChem   1    1     N/A    N/A    N/A    N/A   N/A    N/A
ChEBI     3    1     1711   1195   516    0     100    69.8
KEGG      10   7     6223   5044   1179   188   97     81.1
Average                                         98.5   75.5

Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C=Classes, C'=Classes discovered, D=Data entries)


Error Detection and Correction

Observation: Certain data reappear on more
than one class of pages

• Reinforcement if reappearing data are correctly classified as Data
• Otherwise this points to a misclassification
  - Label-Data mismatch
    • Correction: introduce more samples
  - Label-Label mismatch
    • Cannot be detected

Where to go next?

• Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning

• Capture this generative model using machine learning
  - Relational learning
    • Markov logic programmes?
    • …?


Bibliography

• Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath –
Extending XPath towards Spatial Querying on Web
Documents. In: PVLDB – Proceedings of the VLDB
Endowment, 4(2): 129-140, 2010.
• Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for
Life Science Deep Web Databases. In: DILS-2009 – Proc.
of the Data Integration in the Life Sciences Workshop,
Manchester, UK, July 20-22, LNCS, Springer, 2009.
• Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised
Approach for Acquiring Ontologies and RDF Data from
Online Life Science Databases. In: 7th Extended Semantic
Web Conference (ESWC2010), Heraklion, Greece, May
30-June 3, 2010, pp. 319-333.


Thank you for your attention!
