There are high expectations for Linked Government Data—the practice of publishing public sector information on the Web using Linked Data formats. This slideset reviews some of the ongoing work in the US, UK, and within W3C, as well as activities within my institute (DERI, National University of Ireland, Galway).
Existing data management approaches assume control over schema, data and data generation, which is not the case in open, de-centralised environments such as the Web. The lack of control means that there are social processes necessary to generate 'ordo ab chao' and hence a new life cycle model is necessary.
Based on our experience with Linked Data publishing and consumption over the past years, we have identified the involved parties and fundamental phases, which allow for a multitude of so-called Linked Data life cycles.
If you want to hear me speak to the slides, you might want to check out the following videos on YouTube:
Part 1: http://www.youtube.com/watch?v=AFJSMKv5s3s
Part 2: http://www.youtube.com/watch?v=G6YJSZdXOsc
Part 3: http://www.youtube.com/watch?v=OagzNpDEPJg
Morning session talk at the second Keystone Training School "Keyword search in Big Linked Data", held in Santiago de Compostela.
https://eventos.citius.usc.es/keystone.school/
Exploration, visualization and querying of linked open data sources (Laura Po)
afternoon hands-on session talk at the second Keystone Training School "Keyword search in Big Linked Data" held in Santiago de Compostela.
https://eventos.citius.usc.es/keystone.school/
Linked Data Implementations—Who, What and Why? (OCLC)
Presented at the CNI Spring Membership Meeting in San Antonio, Texas, 4 April 2016. OCLC Research conducted an International Linked Data Survey for Implementers in 2014 and 2015, receiving responses from a total of 90 institutions in 20 countries. In the 2015 survey, 112 projects or services that consumed or published linked data were described (compared to 76 in 2014). This presentation summarizes the 2015 survey results: 1) which institutions have implemented or are implementing linked data; 2) what linked data sources institutions are consuming, and why; 3) what institutions are publishing, and why; 4) barriers and advice from the implementers.
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud (Ontotext)
This webinar will remove the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small-scale projects. We will show you how to build Semantic Search & Analytics proofs of concept by using managed services in the Cloud.
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage (Ontotext)
Scholars, book researchers and museum directors who try to find the underlying connections between resources face many challenges. Scholars in particular continuously emphasize the role of digital humanities and the value of linked data in cultural heritage information systems.
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data (Sören Auer)
Over the past 4 years, the Semantic Web activity has gained momentum with the widespread publishing of structured data as RDF. The Linked Data paradigm has therefore evolved from a practical research idea into a very promising candidate for addressing one of the biggest challenges of computer science: the exploitation of the Web as a platform for data and information integration. To translate this initial success into a world-scale reality, a number of research challenges need to be addressed: the performance gap between relational and RDF data management has to be closed, the coherence and quality of data published on the Web have to be improved, provenance and trust on the Linked Data Web must be established, and, generally, the entrance barrier for data publishers and users has to be lowered. This tutorial will discuss approaches for tackling these challenges. As an example of a successful Linked Data project we will present DBpedia, which leverages Wikipedia by extracting structured information and making this information freely accessible on the Web. The tutorial will also outline some recent advances in DBpedia, such as the mappings wiki, DBpedia Live, as well as the recently launched DBpedia benchmark.
Should We Expect a Bang or a Whimper? Will Linked Data Revolutionize Scholar Authoring and Workflow Tools?
Jeff Baer, Senior Director of Product Management, Research Development Services, ProQuest
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5, pp. 4-8, Sept.-Oct. 2014, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
Usage of Linked Data: Introduction and Application Scenarios (EUCLID project)
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge for how data can be published over the Web, how it can be queried, and what are the possible use cases and benefits. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
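The querying side of the presentation can be illustrated with a minimal sketch of triple-pattern matching, the operation underlying SPARQL. This is a toy, stdlib-only Python example; the triples and URI prefixes are invented stand-ins for MusicBrainz-like data, not the actual dataset.

```python
# A tiny in-memory "triple store": (subject, predicate, object) tuples.
# All URIs and property names below are illustrative, not real MusicBrainz terms.
TRIPLES = [
    ("ex:Nirvana", "rdf:type", "mo:MusicArtist"),
    ("ex:Nirvana", "foaf:name", "Nirvana"),
    ("ex:Nevermind", "mo:performer", "ex:Nirvana"),
    ("ex:Nevermind", "rdf:type", "mo:Record"),
]

def match(pattern, triples=TRIPLES):
    """Return all triples matching an (s, p, o) pattern; None acts as a variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which records did ex:Nirvana perform on?"
# (in SPARQL: SELECT ?rec WHERE { ?rec mo:performer ex:Nirvana })
records = [t[0] for t in match((None, "mo:performer", "ex:Nirvana"))]
print(records)  # ['ex:Nevermind']
```

A real Linked Data application would evaluate such patterns against a SPARQL endpoint rather than a Python list, but the matching semantics are the same.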
Efficient Practices for Large Scale Text Mining Process (Ontotext)
Text mining is essential when managing large-scale textual collections. It facilitates access to otherwise hard-to-organise, unstructured and heterogeneous documents, allows for the extraction of hidden knowledge, and opens new dimensions in data exploration.
In this webinar, Ivelina Nikolova, PhD, shares best practices and text analysis examples from successful text mining processes in domains such as news, financial and scientific publishing, the pharma industry and cultural heritage.
What Are Links in Linked Open Data? A Characterization and Evaluation of Link... (Armin Haller)
Linked Open Data promises to provide guiding principles to publish interlinked knowledge graphs on the Web in the form of findable, accessible, interoperable, and reusable datasets. In this talk I argue that while as such, Linked Data may be viewed as a basis for instantiating the FAIR principles, there are still a number of open issues that cause significant data quality issues even when knowledge graphs are published as Linked Data. In this talk I will first define the boundaries of what constitutes a single coherent knowledge graph within Linked Data, i.e., present a principled notion of what a dataset is and what links within and between datasets are. I will also define different link types for data in Linked datasets and present the results of our empirical analysis of linkage among the datasets of the Linked Open Data cloud. Recent results from our analysis of Wikidata, which has not been part of the Linked Open Data Cloud, will also be presented.
Research Data Sharing: A Basic Framework (Paul Groth)
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
A presentation focusing on the data analysis OCLC Research performed on 900K museum records, plus next steps for the nine project museums who now have the capacity to share standards-based records.
Advance Clustering Technique Based on Markov Chain for Predicting Next User M... (idescitation)
According to surveys, India is one of the leading countries in the world for technical and management education, with student numbers growing at a rate of 45% per annum. Advances in technology have a marked effect on the education system and help in upgrading higher education; some universities and colleges already use these technologies, and web logs are one of them. The main aim of this paper is to represent web logs using a clustering technique for predicting the next user movement and for user behavior analysis. The paper centres on a web log clustering technique based on Markov chain results, presenting an approach to web clustering (clustering web site users) and predicting their behavior on the next visit. Methodology: For generating effective results, web usage data from approximately 14 engineering colleges is used, and an advanced clustering approach is presented after optimizing other clustering approaches. Results: User behavior is predicted with the help of the advanced clustering approach based on FPCM and k-means; the proposed algorithm is used to mine and predict users' preferred paths. Existing approaches have been used to predict user behavior, but they are not sufficient because of their sensitivity to noise. With the help of the advanced clustering method, noise is reduced, providing more accurate results for predicting user behavior. Implementation: The algorithm was implemented in MATLAB, DTRG and Java. The experimental results validate the method's effectiveness in comparison with some previous studies.
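The Markov-chain core of such next-movement prediction can be sketched in a few lines of stdlib Python. This is not the paper's FPCM/k-means pipeline, only the first-order transition model it builds on; the session data is invented for illustration.

```python
from collections import defaultdict, Counter

# Toy web-log sessions: each list is the page sequence of one user session.
sessions = [
    ["home", "courses", "admissions", "fees"],
    ["home", "courses", "fees"],
    ["home", "admissions", "fees"],
    ["home", "courses", "admissions"],
]

# First-order Markov chain: count page-to-page transitions.
transitions = defaultdict(Counter)
for s in sessions:
    for cur, nxt in zip(s, s[1:]):
        transitions[cur][nxt] += 1

def predict_next(page):
    """Predict the most frequent successor of `page` in the training log."""
    if page not in transitions:
        return None
    return transitions[page].most_common(1)[0][0]

print(predict_next("home"))     # 'courses' (3 of 4 sessions go home -> courses)
print(predict_next("courses"))  # 'admissions'
```

Clustering users first (as the paper does) amounts to training one such transition table per user cluster instead of one global table, which reduces the noise a single mixed population introduces.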
Keynote address delivered on 23 March 2011 at the Workshop on Data Mining and Computational Biology in Bioinformatics, sponsored by DBT India and organised by the Unit of Simulation and Informatics, IARI, New Delhi.
I do not claim any originality for either the slides or their content, and in fact acknowledge various web sources.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
Findable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
Talk about Exploring the Semantic Web, and particularly Linked Data, and the Rhizomer approach. Presented August 14th 2012 at the SRI AIC Seminar Series, Menlo Park, CA
Finding knowledge, data and answers on the Semantic Web (ebiquity)
Web search engines like Google have made us all smarter by providing ready access to the world's knowledge whenever we need to look up a fact, learn about a topic or evaluate opinions. The W3C's Semantic Web effort aims to make such knowledge more accessible to computer programs by publishing it in machine understandable form.
As the volume of Semantic Web data grows software agents will need their own search engines to help them find the relevant and trustworthy knowledge they need to perform their tasks. We will discuss the general issues underlying the indexing and retrieval of RDF based information and describe Swoogle, a crawler based search engine whose index contains information on over a million RDF documents.
We will illustrate its use in several Semantic Web related research projects at UMBC, including a distributed platform for constructing end-to-end use cases that demonstrate the Semantic Web's utility for integrating scientific data. We describe ELVIS (the Ecosystem Location Visualization and Information System), a suite of tools for constructing food webs for a given location, and Triple Shop, a SPARQL query interface which searches the Semantic Web for data relevant to a given query. ELVIS functionality is exposed as a collection of web services, and all input and output data are expressed in OWL, thereby enabling its integration with Triple Shop and other Semantic Web resources.
Metadata Provenance Tutorial at SWIB 13, Part 1 (Kai Eckert)
The slides of part one of the Metadata Provenance Tutorial (Linked Data Provenance). Part 2 is here: http://de.slideshare.net/MagnusPfeffer/metadata-provenance-tutorial-part-2-modelling-provenance-in-rdf
Keynote presentation delivered at ELAG 2013 in Gent, Belgium, on May 29 2013. Discusses Research Objects and the relationship to work my team has been involved in during the past couple of years: OAI-ORE, Open Annotation, Memento.
Project Website: http://www.researchobject.org/
researchobject.org is a community project that has developed an approach to describe and package up all resources used as part of an investigation as Research Objects (ROs).
ROs provide two main features: a manifest, a consistent way to provide a well-typed, structured description of the resources used in an investigation; and a 'bundle', a mechanism for packaging up manifests with resources as a single, publishable unit.
ROs therefore carry the research context of an experiment (data, software, standard operating procedures (SOPs), models, etc.) and gather together the components of an experiment so that they are findable, accessible, interoperable and reproducible (FAIR). ROs combine software and data into an aggregative data structure consisting of well-described, reconstructable parts.
ROs have the potential to address a number of challenges pertinent to open research, including: a) supporting interoperability between infrastructures by using ROs as the primary mechanism for exchange and publication; b) supporting the evolution of research objects as a living collection, enabling provenance tracking; c) providing the ability to pivot around research object components (data, software, models) that are not restricted to the traditional publication.
Here we present work towards the development and adoption of ROs:
(i) A series of specifications and conventions, using community standards, for the RO manifest and RO bundles.
(ii) Implementations of Java, Python and Ruby APIs and tooling against those specifications;
(iii) Examples of representations of the RO models in various languages (e.g. JSON-LD, RDF, HTML).
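The manifest-plus-bundle idea above can be sketched with nothing but the Python standard library: a zip archive whose `.ro/manifest.json` entry describes the aggregated resources. The manifest fields and file names below follow the general shape of the RO bundle convention, but the specific resources and author are invented for this example; the official specifications and APIs are at researchobject.org.

```python
import json
import zipfile

# A hypothetical manifest describing two aggregated resources.
manifest = {
    "@context": "https://w3id.org/bundle/context",
    "id": "/",
    "aggregates": [
        {"uri": "data/results.csv", "mediatype": "text/csv"},
        {"uri": "workflow/analysis.py", "mediatype": "text/x-python"},
    ],
    "createdBy": {"name": "Example Researcher"},
}

# Package manifest and resources together as a single publishable unit.
with zipfile.ZipFile("example.bundle.zip", "w") as z:
    z.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
    z.writestr("data/results.csv", "x,y\n1,2\n")
    z.writestr("workflow/analysis.py", "print('analysis')\n")
```

The point of the structured manifest is that a consumer can discover what is in the bundle, and how the parts relate, without executing or unpacking anything.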
Profile-based Dataset Recommendation for RDF Data Linking (Mohamed Ben Ellefi)
With the emergence of the Web of Data, most notably Linked Open Data (LOD), an abundance of data has become available on the web. However, LOD datasets and their inherent subgraphs vary heavily with respect to their size, topic and domain coverage, their schemas and their data dynamicity (respectively schemas and metadata) over time. To this extent, identifying suitable datasets, which meet specific criteria, has become an increasingly important, yet challenging task to support issues such as entity retrieval, semantic search and data linking. Particularly with respect to the interlinking issue, the current topology of the LOD cloud underlines the need for practical and efficient means to recommend suitable datasets: currently, only well-known reference graphs such as DBpedia (the most obvious target), YAGO or Freebase show a high amount of in-links, while there exists a long tail of potentially suitable yet under-recognized datasets. This problem is due to the Semantic Web tradition in dealing with "finding candidate datasets to link to", where data publishers are used to identifying target datasets for interlinking.
While an understanding of the nature of the content of specific datasets is a crucial prerequisite for the mentioned issues, we adopt in this dissertation the notion of "dataset profile": a set of features that describe a dataset and allow the comparison of different datasets with regard to their represented characteristics. Our first research direction was to implement a collaborative filtering-like dataset recommendation approach, which exploits both existing dataset topic profiles and traditional dataset connectivity measures, in order to link LOD datasets into a global dataset-topic-graph. This approach relies on the LOD graph in order to learn the connectivity behaviour between LOD datasets. However, experiments have shown that the current topology of the LOD cloud is far from complete enough to be considered as a ground truth and consequently as learning data.
Facing the limits of the current LOD topology (as learning data), our research has led us to break away from the topic-profile representation of the "learn to rank" approach and to adopt a new approach for candidate dataset identification, where the recommendation is based on the intensional profile overlap between different datasets. By intensional profile, we understand the formal representation of a set of schema concept labels that best describe a dataset and can be potentially enriched
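The intensional-profile idea can be illustrated with a minimal stdlib-Python sketch: represent each dataset by its set of schema concept labels and rank candidates by set overlap. This is only an illustration of the overlap principle, not the dissertation's actual method; the datasets, labels and the choice of Jaccard similarity are assumptions made for the example.

```python
# Hypothetical schema-concept label sets standing in for intensional profiles.
profiles = {
    "datasetA": {"Person", "Organization", "Place"},
    "datasetB": {"Person", "Place", "Event"},
    "datasetC": {"Gene", "Protein"},
}

def jaccard(a, b):
    """Jaccard similarity between two label sets."""
    return len(a & b) / len(a | b)

def recommend(source, k=2):
    """Rank candidate datasets by profile overlap with `source`."""
    scores = {d: jaccard(profiles[source], p)
              for d, p in profiles.items() if d != source}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("datasetA"))  # ['datasetB', 'datasetC']
```

The appeal of this style of recommendation is exactly the point made above: it needs no existing link topology as training data, only descriptions of the datasets themselves.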
FAIR Data Prototype - Interoperability and FAIRness through a novel combinati... (Mark Wilkinson)
This slide deck accompanies the manuscript "Interoperability and FAIRness through a novel combination of Web technologies", submitted to PeerJ Computer Science: https://doi.org/10.7287/peerj.preprints.2522v1
It describes the output of the "Skunkworks" FAIR implementation group, who were tasked with building a prototype infrastructure that would fulfill the FAIR Principles for scholarly data publishing. We show how a novel combination of the Linked Data Platform, RDF Mapping Language (RML) and Triple Pattern Fragments (TPF) can be combined to create a scholarly publishing infrastructure that is markedly interoperable, at both the metadata and the data level.
This slide deck (or something close) will be presented at the Dutch Techcenter for Life Sciences Partners Workshop, November 4, 2016.
Spanish Ministerio de Economía y Competitividad grant number TIN2014-55993-R
Tutorial presented at 2012 ACM SIGHIT International Health Informatics Symposium (IHI 2012), January 28-30, 2012. http://sites.google.com/site/web2011ihi/participants/tutorials
This tutorial weaves together three themes and the associated topics:
[1] The role of biomedical ontologies
[2] Key Semantic Web technologies with focus on Semantic provenance and integration
[3] In-practice tools and real world use cases built to serve the needs of sleep medicine researchers, cardiologists involved in clinical practice, and work on vaccine development for human pathogens.
Semantic Similarity and Selection of Resources Published According to Linked ... (Riccardo Albertoni)
The position paper discusses the potential of exploiting linked data best practices to provide metadata documenting domain-specific resources created through verbose acquisition-processing pipelines. It argues that resource selection, namely the process of choosing a set of resources suitable for a given analysis or design purpose, must be supported by a deep comparison of their metadata. The semantic similarity proposed in our previous works is discussed for this purpose, and the main issues in making it scale up to the web of data are introduced. These issues matter beyond the re-engineering of our similarity, since they largely apply to every tool that is going to exploit information made available as linked data. A research plan and an exploratory phase facing the presented issues are described, remarking on the lessons we have learnt so far.
Engaging Information Professionals in the Process of Authoritative Interlinki... (Lucy McKenna)
Through the use of Linked Data (LD), Libraries, Archives and Museums (LAMs) have the potential to expose their collections to a larger audience and to allow for more efficient user searches. Despite this, relatively few LAMs have invested in LD projects and the majority of these display limited interlinking across datasets and institutions. A survey was conducted to understand Information Professionals' (IPs') position with regards to LD, with a particular focus on the interlinking problem. The survey was completed by 185 librarians, archivists, metadata cataloguers and researchers. Results indicated that, when interlinking, IPs find the process of ontology and property selection to be particularly challenging, and LD tooling to be technologically complex and unsuitable for their needs.
Our research is focused on developing an authoritative interlinking framework for LAMs with a view to increasing IP engagement in the linking process. Our framework will provide a set of standards to facilitate IPs in the selection of link types, specifically when linking local resources to authorities. The framework will include guidelines for authority, ontology and property selection, and for adding provenance data. A user-interface will be developed which will direct IPs through the resource interlinking process as per our framework. Although there are existing tools in this domain, our framework differs in that it will be designed with the needs and expertise of IPs in mind. This will be achieved by involving IPs in the design and evaluation of the framework. A mock-up of the interface has already been tested and adjustments have been made based on results. We are currently working on developing a minimal viable product so as to allow for further testing of the framework. We will present our updated framework, interface, and proposed interlinking solutions.
A lecture/conversation focusing on the first 12 years of Semantic Web - delivered on February 21, 2012.
See http://j.mp/SWIntro for more details. More detailed course material is at http://knoesis.org/courses/web3/
Research Objects: more than the sum of the parts (Carole Goble)
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
Thinking of getting a dog? Be aware that breeds like Pit Bulls, Rottweilers, and German Shepherds can be loyal and dangerous. Proper training and socialization are crucial to preventing aggressive behaviors. Ensure safety by understanding their needs and always supervising interactions. Stay safe, and enjoy your furry friends!
Knowledge Discovery (Laura Hollink)
1. Knowledge Discovery for the Semantic Web
An Application to Web Usage Mining
&
How to use semantics in the Preprocessing stage
[KDD pipeline figure: Input Data → Data Preprocessing and Transformation → Data Mining → Interpretation and Evaluation → Information / Taking Action.
Preprocessing: data fusion (multiple sources), data cleaning (noise, missing values), feature selection, dimensionality reduction, data normalization.
Evaluation: filtering patterns, visualization, statistical analysis (hypothesis testing, attribute evaluation, comparing learned models, computing confidence intervals).]
Claudia D’Amato - University of Bari, IT.
Laura Hollink - Centrum Wiskunde & Informatica, Amsterdam, NL.
3. An application to Web Usage Mining
Web Usage Mining = discovering patterns in logs of user interaction with Web resources
• logs typically contain an identifier for users (e.g. IP address), their queries and clicks
• What about usage of Linked Open Data?
• Can we use semantics to improve mining of Web usage?
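Usage mining starts from exactly such logs. A minimal sketch of turning raw server-log lines into (user, query) events; the log format, regex, and sample lines below are made up for illustration and do not reflect any real endpoint's logs.

```python
# Minimal sketch: parsing server-log lines into (user, query) events for
# usage mining. The log format and lines are illustrative assumptions.

import re
from urllib.parse import unquote

LOG_LINE = re.compile(r'(?P<ip>\S+) .* "GET /sparql\?query=(?P<q>\S+) HTTP')

raw_logs = [
    '203.0.113.7 - - [01/Jan/2015] "GET /sparql?query=SELECT%20%3Fs HTTP/1.1" 200',
    '198.51.100.2 - - [01/Jan/2015] "GET /sparql?query=ASK%20%7B%7D HTTP/1.1" 200',
]

def parse(lines):
    events = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:  # skip lines that are not SPARQL requests
            events.append((m.group("ip"), unquote(m.group("q"))))
    return events

events = parse(raw_logs)
```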
6. Mining Usage of Linked Open Data in USEWOD
USEWOD: http://usewod.org/ [B. Berendt, L. Hollink, M. Luczak-Roesch, et al.]
1. USEWOD workshop series @ ESWC / WWW since 2011
2. USEWOD dataset: server logs of DBpedia, BioPortal, LinkedGeoData, etc., and client-side logs from YASGUI.
(example removed)
8. Mining Usage of Linked Open Data in USEWOD
• Results of USEWOD: LOD usage mining for more efficient indexing [1], caching [2], auto-completion [3], etc.
[1] Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. An empirical study of real-world SPARQL queries. USEWOD @ WWW 2011.
[2] Lorey, J., & Naumann, F. Caching and prefetching strategies for SPARQL queries. USEWOD @ ESWC 2013.
[3] Kramer, K., Dividino, R. Q., & Gröner, G. SPACE: SPARQL Index for Efficient Autocompletion. ISWC (Posters & Demos) 2013.
[4] Rietveld, L., & Hoekstra, R. Man vs. Machine: Differences in SPARQL Queries. USEWOD @ ESWC 2014.
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
• Issues:
• what is the difference between queries by machines and humans? [4]
• what is the meaning of repeated queries by bots/tools?
• a lot of the usage is invisible due to data dump downloads [5]
9. Usage mining example 1: clustering rdf:properties in DBpedia
Instead of listing all DBpedia properties alphabetically, can we display them in a more meaningful way? Can we use query logs for this? [5]
Approach: hierarchical clustering of properties, where the distance between a pair of properties is based on how often they co-occur in a SPARQL query in the USEWOD2015 logs.
Evaluation: run an experiment to measure how quickly and accurately people identify facts when looking at the standard view or the clustered view.
Result: no significant differences ☹
[5] Huelss, J., & Paulheim, H. What SPARQL Query Logs Tell and do not Tell about Semantic Relatedness in LOD. NoISE @ ESWC 2015.
Disclaimer: simplified discussion of this paper!
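The approach can be sketched in a few lines. This is not the paper's code: the toy query log, the distance 1/(1 + co-occurrence count), and plain single-linkage agglomeration are illustrative assumptions.

```python
# Sketch: agglomerative clustering of properties with a distance derived
# from how often two properties co-occur in the same SPARQL query.

from itertools import combinations

# Each entry: the set of properties used together in one query (toy data).
query_logs = [
    {"dbo:birthPlace", "dbo:birthDate"},
    {"dbo:birthPlace", "dbo:birthDate"},
    {"dbo:starring", "dbo:director"},
    {"dbo:starring", "dbo:director"},
    {"dbo:birthPlace", "dbo:starring"},
]

def cooccurrence_distance(logs):
    """Distance between two properties shrinks as they co-occur more often."""
    counts = {}
    for props in logs:
        for a, b in combinations(sorted(props), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return lambda a, b: 1.0 / (1 + counts.get(tuple(sorted((a, b))), 0))

def single_linkage(items, dist, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [{i} for i in items]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        clusters[i] |= clusters.pop(j)
    return clusters

props = ["dbo:birthPlace", "dbo:birthDate", "dbo:starring", "dbo:director"]
clusters = single_linkage(props, cooccurrence_distance(query_logs), k=2)
```

On this toy log, the biography-related and film-related properties end up in separate clusters.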
15. Usage mining example 2: mining semantically enriched query logs
Data: queries and clicks on Yahoo! search engine.
Problem when mining ‘raw’ logs: low support of even the most frequent patterns
[5] Laura Hollink, Peter Mika and Roi Blanco. Web Usage Mining with Semantic Analysis. WWW 2013.
19. Usage mining example 2: mining semantically enriched query logs
Approach:
1. link queries to entities in LOD cloud
2. choose class of entity + selected properties
3. detect modifier words (download, trailer, cast, date, etc.)
1. Link queries to entities in LOD cloud:
• Freebase (has a lot of movie-related info)
• DBpedia (Wikipedia is widely used)
23. Usage mining example 2: mining semantically enriched query logs
• Sequential pattern mining on the class level using PrefixSpan.
• Discover frequent patterns on the class level:
• using the efficient PrefixSpan algorithm to mine all possible subsequence patterns
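A minimal PrefixSpan-style miner over class-level query sequences can illustrate the idea; the toy sessions and the bare-bones recursion below are my own sketch, not the WWW 2013 implementation.

```python
# Sketch of PrefixSpan-style sequential pattern mining: grow frequent
# prefixes and recurse on the projected (suffix) database.

def prefixspan(sequences, min_support, prefix=()):
    """Return frequent sequential patterns mapped to their support counts."""
    patterns = {}
    # Support of each candidate item = number of sequences containing it.
    counts = {}
    for seq in sequences:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item, count in counts.items():
        if count < min_support:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = count
        # Project: keep the suffix after the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.update(prefixspan(projected, min_support, new_prefix))
    return patterns

# Toy sessions: each query replaced by the class of its linked entity.
sessions = [
    ["Film", "Actor", "Film"],
    ["Film", "Actor"],
    ["Actor", "Film"],
]
patterns = prefixspan(sessions, min_support=2)
```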
26. Usage mining example 3: semantic patterns of query modification
• Goal: identify frequent query modifications in an image archive
• state of the art = 3 classes: generalization, specification, reformulation
• Approach:
1. link queries to entities in the LOD cloud
2. choose class of entity
3. determine shortest path between consecutive queries Q1 and Q2
4. rank property-paths according to support and confidence.
Conclusions:
• Identified patterns not visible on raw data.
• but “the method is only moderately successful in identifying the most prominent relations for a given query pair”
Hollink, V., Tsikrika, T., & de Vries, A. P. (2011). Semantic search log analysis: a method and a study on professional image search. JASIST 62(4), 691-713.
See also: Huurnink, B., Hollink, L., Van Den Heuvel, W., & De Rijke, M. (2010). Search behavior of media professionals at an audiovisual archive: A transaction log analysis. JASIST, 61(6), 1180-1197.
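Step 4 above can be sketched as follows. The path labels, the toy query pairs, and the exact support/confidence definitions (support = fraction of all pairs showing the path; confidence = fraction among pairs starting from the path's source class) are illustrative assumptions, not the JASIST definitions.

```python
# Sketch: rank property-paths between consecutive queries by support
# and confidence, over a toy set of (class of Q1, property-path) records.

from collections import Counter

query_pairs = [
    ("Person", "dbo:birthPlace"),
    ("Person", "dbo:birthPlace"),
    ("Person", "dbo:spouse"),
    ("Place", "dbo:country"),
]

def rank_paths(pairs):
    path_counts = Counter(p for _, p in pairs)
    class_counts = Counter(c for c, _ in pairs)
    class_of = {p: c for c, p in pairs}  # toy: each path has one start class
    ranked = []
    for path, n in path_counts.items():
        support = n / len(pairs)
        confidence = n / class_counts[class_of[path]]
        ranked.append((path, support, confidence))
    ranked.sort(key=lambda t: (-t[1], -t[2]))
    return ranked

ranking = rank_paths(query_pairs)
```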
29. The feature selection issue when using LOD
[KDD pipeline figure repeated; this section focuses on the Feature Selection step of Data Preprocessing and Transformation.]
30. Feature Selection
• Feature selection = limiting the number of features for faster computation times, more understandable models, better prediction value.
• Using Linked Open Data can lead to a large number of features per data point.
• a DBpedia resource easily has 50 property-value pairs.
• more are easily added using reasoning
• note: these numbers are not large compared to the number of features in DNA strings, or all words in a text corpus!
• Still, many of them are irrelevant or redundant.
31. Feature Selection Example
• Goal: learn a relation R between x and y.
• In this paper, R = ‘occupation’, ‘gender’, ‘instance_of’, ‘acted_in’, ‘genre’, ‘position_played_on_team’
• Approach: given a training set of pairs of x, y, learn a “whitelist” of properties in DBpedia, WikiData, YAGO and WordNet that indicate a relation R between x and y
• Cast as a subset selection problem:
• E = the set of possible properties
• local search over the power set of E (i.e. all subsets) to find the optimal subset.
Learning to Exploit Structured Resources for Lexical Inference. Vered Shwartz, Omer Levy, Ido Dagan and Jacob Goldberger. CoNLL 2015 (to appear).
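Subset selection by local search can be sketched with a simple hill climber; the scoring function, the property names, and the size penalty below are toy assumptions and not the CoNLL 2015 setup.

```python
# Sketch: local search over subsets of properties. Each step flips one
# property in or out of the current subset while the score improves.

def hill_climb(properties, score, max_iters=100):
    """Greedy local search over the power set of `properties`."""
    current = frozenset()
    best = score(current)
    for _ in range(max_iters):
        neighbors = [current ^ {p} for p in properties]  # add or remove one
        candidate = max(neighbors, key=score)
        if score(candidate) <= best:
            return current  # local optimum reached
        current, best = candidate, score(candidate)
    return current

# Toy objective: reward two "relevant" properties, penalize subset size.
relevant = {"dbo:occupation", "dbo:field"}
all_props = ["dbo:occupation", "dbo:field", "dbo:birthPlace", "dbo:height"]
score = lambda s: 2 * len(s & relevant) - len(s)

selected = hill_climb(all_props, score)
```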
32. Data Fusion
[KDD pipeline figure repeated; this section focuses on the Data Fusion step of Data Preprocessing and Transformation.]
38. Methods for Data Fusion (ontology alignment)
[figure: two ontology fragments with labelled classes to be matched]
39. Methods for Data Fusion: structural matchers
[figure: two ontology fragments with labelled classes to be matched]
• E.g. Similarity Flooding: the similarity of a matched pair s1 and t1 propagates to their respective neighbors s2 and t2.
• neighbors can be defined as subclasses, superclasses, instances, domain/ranges, etc.
• Structural measures are in practice never used standalone.
[10] Ngo, Duy Hoa, and Zohra Bellahsene. YAM++ - results for OAEI 2012. OAEI @ ISWC 2012.
[11] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. ICDE 2002.
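The propagation idea in [11] can be sketched as a fixed-point loop over node pairs; the toy pair graph, initial scores, damping factor, and max-normalization below are simplifying assumptions rather than the algorithm as published.

```python
# Sketch of the similarity-flooding idea: part of each matched pair's
# similarity flows to the pairs formed by its respective neighbors.

def similarity_flooding(pairs, neighbors, init, rounds=10, alpha=0.5):
    """`neighbors` maps a node pair to the pairs of its respective neighbors."""
    sim = dict(init)
    for _ in range(rounds):
        new = {}
        for p in pairs:
            inflow = sum(sim.get(q, 0.0) for q in neighbors.get(p, []))
            new[p] = sim.get(p, 0.0) + alpha * inflow
        top = max(new.values()) or 1.0  # keep scores in [0, 1]
        sim = {p: v / top for p, v in new.items()}
    return sim

# Toy example: s1-t1 is a strong (e.g. string) match; s2 is s1's neighbor,
# and t2 / t3 are t1's neighbor candidates.
pairs = [("s1", "t1"), ("s2", "t2"), ("s2", "t3")]
neighbors = {
    ("s1", "t1"): [("s2", "t2")],
    ("s2", "t2"): [("s1", "t1")],
    ("s2", "t3"): [],  # t3 is not connected to t1
}
init = {("s1", "t1"): 1.0, ("s2", "t2"): 0.2, ("s2", "t3"): 0.2}
sim = similarity_flooding(pairs, neighbors, init)
```

After flooding, the pair that is structurally supported by the strong match ends up scored above the unsupported one.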
41. Methods for Data Fusion: instance based matchers
[figure: two ontology fragments with labelled classes to be matched]
• Match classes based on similarity of their instances
• note: you need a way to assess similarity of the instances!
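An instance-based matcher can be sketched with Jaccard overlap of instance sets; the class names, instance identifiers, and threshold below are toy assumptions, and the sketch sidesteps the slide's caveat by assuming instances are already comparable via shared identifiers.

```python
# Sketch: match classes whose instance sets overlap enough (Jaccard).

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def match_classes(classes_a, classes_b, threshold=0.5):
    """Return class pairs whose instance sets are sufficiently similar."""
    return [
        (ca, cb)
        for ca, ia in classes_a.items()
        for cb, ib in classes_b.items()
        if jaccard(ia, ib) >= threshold
    ]

src = {"Architect": {"rem_koolhaas", "zaha_hadid"}, "Painter": {"vermeer"}}
tgt = {"architects": {"rem_koolhaas", "zaha_hadid", "le_corbusier"},
       "painters": {"vermeer", "monet"}}
matches = match_classes(src, tgt)
```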
44. Methods for Data Fusion: string based
• This is the most important feature in ontology alignment.
• “nearly all [ontology alignment systems] use a string similarity metric” [12]
• stopping and stemming is not helpful! Nor is using WordNet synonyms. [12]
• In [13] we took an even less semantic approach: linking based on URL syntax.
[figure: two ontology fragments with labelled classes to be matched]
[12] Cheatham, M., & Hitzler, P. String similarity metrics for ontology alignment. ISWC 2013.
[13] The debates of the European Parliament as Linked Open Data. Under review. See http://www.talkofeurope.eu/data/ for details.
http://www.dbpedia.org/page/Judith_Sargentini
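One common string metric from the family surveyed in [12] is normalized edit distance; a self-contained sketch (the example labels are mine):

```python
# Sketch of a string-based matcher: normalized Levenshtein similarity
# between two class labels.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def string_similarity(a, b):
    a, b = a.lower(), b.lower()
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

sim = string_similarity("Architect", "architects")
```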
47. Link types
Equality: SameAs, EquivalentClasses, EquivalentProperties
• “Den Haag” = “The Hague”
• wood-material = wood
• conf:has_the_last_name = edas:hasLastName
Hierarchical: rdfs:subClassOf, rdf:type, rdfs:subPropertyOf
• aat:Artist ⊇ wn:Artist
• tgn:Africa ∈ wn:Continent
Weaker semantics: skos:closeMatch / exactMatch / broadMatch / narrowMatch / relatedMatch
• geonames:Italy skos:closeMatch librarytopics:Italy
Domain-specific links: e.g. born-in, hasStyle, hasPart
• Van Gogh (ULAN) born-in Groot-Zundert (TGN)
50. Representation of links
[figure: a reified link “Link 001” with concept1 = architecten, concept2 = architects, link type = skos:exactMatch, link method = handmatig (manual), author = L. Hollink; contrasted with the plain triple “architecten skos:exactMatch architects”.]
• Open Question: how valid are the patterns we discover in data when the quality of the links is low?
• Even more important to be critical and evaluate the data
• source criticism
• tool criticism (see http://event.cwi.nl/toolcriticism/)
54. Evaluation of Data Fusion / Linking
1. Manually rating (a sample of) mappings
• relatively cheap and easy to interpret
• only precision, no recall
2. Comparison to a reference alignment
• precision and recall
• used in OAEI on the SEALS platform
• more expensive if a reference alignment has to be created (but: crowd sourcing!)
3. End-to-end evaluation (a.k.a. evaluating an application that uses the mappings)
• arguably the best method!
• need to have access to an application + users
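Option 2 reduces to set comparison once mappings are represented as pairs; a minimal sketch with made-up alignments:

```python
# Sketch: precision/recall of a produced alignment against a reference
# alignment, with each mapping encoded as a (source, target) pair.

def precision_recall(found, reference):
    found, reference = set(found), set(reference)
    tp = len(found & reference)  # correctly found mappings
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    return precision, recall

found = [("a:Architect", "b:architects"), ("a:Painter", "b:sculptors")]
reference = [("a:Architect", "b:architects"), ("a:Painter", "b:painters")]
p, r = precision_recall(found, reference)
```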
58. Evaluation of Data Fusion / Linking
• Comparison to a reference alignment: alternative measures:
• 1. instead of a binary classification into correct/incorrect mappings, take into account how wrong a link is:
• where r(a) is the semantic distance between correspondence a and correspondence a’ in the reference alignment, A is the number of correspondences.
• 2. weight score of mappings based on the frequency of their use
• e.g. from usage logs!
Laura Hollink, Mark van Assem, Shenghui Wang, Antoine Isaac, Guus Schreiber. Two Variations on Ontology Alignment Evaluation: Methodological Issues. ESWC 2008.
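The slide's own formula did not survive extraction. A hedged reconstruction consistent with the description of r(a) and |A| (an illustrative sketch of a distance-weighted precision in this spirit, not necessarily the exact formula from the ESWC 2008 paper) is:

```latex
% Each found correspondence a contributes 1 - r(a) instead of a 0/1 judgement:
P_{\mathrm{rel}} \;=\; \frac{1}{|A|} \sum_{a \in A} \bigl(1 - r(a)\bigr)
```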
63. Discovering links from text
Pointers to what happens in other communities
• Word2Vec: efficient deep learning algorithm to learn vector representations of words
• vector similarity captures semantics between words
• No explicit semantics, but we can’t deny that there is meaning there!
• Success seems to be mostly due to big data
Example: Vec(Madrid) - Vec(Spain) + Vec(France) is closer to Vec(Paris) than to any other vector
Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.
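The analogy arithmetic can be demonstrated with tiny hand-made vectors. Real word2vec embeddings have hundreds of learned dimensions; the 2-d vectors below are contrived so the example works, and only illustrate the mechanics.

```python
# Sketch of Vec(Madrid) - Vec(Spain) + Vec(France) ≈ Vec(Paris) with
# toy 2-d vectors (axes: country identity, "capital-ness").

import math

vectors = {
    "Spain":   (1.0, 0.0), "Madrid": (1.0, 1.0),
    "France":  (2.0, 0.0), "Paris":  (2.0, 1.0),
    "Germany": (3.0, 0.0), "Berlin": (3.0, 1.0),
}

def add(u, v): return tuple(a + b for a, b in zip(u, v))
def sub(u, v): return tuple(a - b for a, b in zip(u, v))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = add(sub(vectors[a], vectors[b]), vectors[c])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("Madrid", "Spain", "France")
```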
66. NELL: Never-Ending Language Learning
• several machine learning approaches to discover facts (beliefs) from text on the web
• string features, distribution of context words, html structure, visual image analysis.
• Running since 2010, has so far learned over 80 million beliefs
T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, J. Welling. Never-Ending Learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2015.
68. Research Task Format
Work in 6 groups of 10 students
• 5 people design an approach to association rules with semantics
• 5 people focus on how that approach should be evaluated
The idea is to work together! E.g. which measures are best for this approach? Which versions of the approach should be evaluated? Will this approach score high on these measures? In which cases?
• We would like one presentation per group of 10 people
• of 3 or 4 slides
• of max 4 minutes (less is fine too!)
• Send me the slides in PDF, with your group number in the title, by email to l.hollink@cwi.nl, today before 16:30.
• The presentation should show clearly:
1. the AR method
2. how did you take into account semantics?
3. the evaluation method
• BONUS: argue when and why your approach will score high.
• BONUS: discuss how the newly learned links can be represented and used.
Tips:
• you may pick a dataset that you will use as an example