Link to the PhD book: http://hdl.handle.net/1854/LU-8530790
When you look up something on the web without knowing exactly which document you are looking for, or when you would rather browse or 'surf' information, this is considered exploratory search (on the web).
Exploratory search also goes beyond simply looking something up when you want to expand and refine your initial search results.
For machines to be able to reveal relationships between the results of your searches and other queries, both the query and the information need to be aligned.
The web of today consists more and more of pieces of information interpretable by machines that support this kind of exploration.
This PhD thesis focuses on scenarios where users want to explore relationships.
In particular, it supports scenarios where search results are visualized following a workflow that allows users to narrow a search space down to the desired level of detail and then expand it again.
Furthermore, there is an emphasis on bridging how users see results (visually and textually) and how the same results are represented for machines.
This thesis also investigates a technique that optimises indirect connections between search results in terms of serendipity.
Finally, the techniques are applied to a use case in which information about scientific publications, conferences and researchers is related to each other.
Experimental results highlight the impact on search efficiency and effectiveness.
---
When users 'look something up' on the web without knowing exactly which documents they are looking for, or when they would rather browse through information, this is called exploratory search. Exploratory search also goes beyond looking something up when users want to investigate their initial search results further. Moreover, the Web increasingly consists of pieces of information described in such a way that machines can process them. For the searches and questions that users pose to also reveal relationships between results, both the query formulation and the information must be aligned with each other.
This doctoral thesis focuses on scenarios where users want to explore relationships. This involves visualizing search results and supporting a workflow that allows a search space to be narrowed down to the desired level of detail and then broadened again. In addition, attention is paid to bridging how users get to see things, visually or textually, and how the same things are represented for machines. This thesis investigates a technique that optimizes indirect relationships between search results with respect to their degree of serendipity. Finally, everything is applied in a use case where information about scientific publications, conferences and researchers is brought into relation with each other. Experimental results indicate the impact on search efficiency and effectiveness.
15. CLASSICAL RETRIEVAL MODEL OF SEARCH ENGINES
Diagram: a document is indexed as a document representation; an information ‘need’ is expressed as a query; the search engine ‘matches’ the query against the index.
[Bates, 1989; Robertson, 1977]
16. SEARCHING THE WEB
Impressive state of the art
Millions of results, almost always relevant results among the first 10
Incredibly fast: < 1 s
Billions of documents, spread across the globe, within a few
>50 billion estimated in index of largest search engines.
[van den Bosch, Bogers & Kunder, 2016]
17. LIMITATIONS OF CURRENT WEB SEARCH ENGINES
A. Further explore search results?
Exploratory search
B. What if the keywords intended something else?
Semantics
C. Combine different search results?
Find relationships
18. SEARCHING THE WEB: NEXT STEPS
A. Exploratory search
B. Semantics
C. Find relationships
24. SEMANTICS IN WEB DOCUMENTS
$E = mc^2$
General relativity
Theory of special relativity
Albert Einstein
Twin paradox
25. ANALOGY ON HOW MACHINES FIND THINGS IN WEB DOCUMENTS
Unique identification on the web:
Uniform Resource Identifier: URI
Unique identification in a printed atlas: 7 L
26. URIS FOR DATA ON THE WEB
$E = mc^2$
http://dbpedia.org/resource/Albert_Einstein
http://dbpedia.org/resource/Special_relativity
http://dbpedia.org/resource/Twin_Paradox
http://dbpedia.org/resource/General_relativity
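A minimal sketch of why URIs help machines, assuming a toy in-memory lookup: several spellings of a label resolve to one unambiguous identifier. The URIs are the ones listed on the slide; the `LABEL_TO_URI` table and the `resolve` and `local_name` helpers are hypothetical names introduced only for this illustration and are not a DBpedia API.

```python
from typing import Optional

# Toy lookup from human-readable labels to the DBpedia URIs shown on the slide.
# The table and helpers are illustrative assumptions, not part of DBpedia.
LABEL_TO_URI = {
    "Albert Einstein": "http://dbpedia.org/resource/Albert_Einstein",
    "Special relativity": "http://dbpedia.org/resource/Special_relativity",
    "Twin paradox": "http://dbpedia.org/resource/Twin_Paradox",
    "General relativity": "http://dbpedia.org/resource/General_relativity",
}


def resolve(label: str) -> Optional[str]:
    """Return the URI for a label, ignoring case; None if the label is unknown."""
    wanted = label.casefold()
    for known, uri in LABEL_TO_URI.items():
        if known.casefold() == wanted:
            return uri
    return None


def local_name(uri: str) -> str:
    """Return the last path segment of a URI, e.g. 'Albert_Einstein'."""
    return uri.rsplit("/", 1)[-1]


if __name__ == "__main__":
    print(resolve("albert einstein"))
    print(local_name(LABEL_TO_URI["Twin paradox"]))
```

Two differently cased labels end up at the same URI, which is what lets a machine treat them as the same resource.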
33. FIND RELATIONSHIPS
Non common properties? More distant relationships?
Not all things are related to each other and described within a single document.
34. PURPOSE OF THIS PHD RESEARCH
1. Efficiently reveal relationships between data.
2. Allow users to gradually refine their search queries.
3. Map the influence of different search actions on the search precision.
4. Determine the contribution of revealed relationships while searching.
Covered in part II and part III.
35. I. Searching the web
II. Reveal relationships
III. Visually explore relationships
39. ITERATIVE ALGORITHM TO FIND RELATIONSHIPS
Diagram: initialisation → filtering → find relationships → score relationships → next iteration? (…)
Diagram labels: index; RDF; relationships with different path lengths
40. ITERATIVE ALGORITHM TO FIND RELATIONSHIPS
Diagram: expand search space → filtering → find relationships → score relationships → next iteration? (…)
Diagram labels: index; RDF; relationships with different path lengths
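The loop in the two diagrams can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: the toy graph, the `ignored` stop list used for filtering, the path-length score and the stopping rule (`wanted` paths found) are all placeholders chosen to keep the example small.

```python
from collections import deque

# Toy undirected graph; the node names are illustrative, not thesis data.
GRAPH = {
    "Einstein": {"Special relativity", "General relativity", "Physics"},
    "Special relativity": {"Einstein", "Twin paradox"},
    "General relativity": {"Einstein", "Physics"},
    "Twin paradox": {"Special relativity"},
    "Physics": {"Einstein", "General relativity", "Newton"},
    "Newton": {"Physics"},
}


def find_paths(graph, allowed, start, goal, max_len):
    """Enumerate simple paths of at most max_len hops that stay inside `allowed`."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        for nxt in sorted(graph[path[-1]]):
            if nxt in allowed and nxt not in path:
                queue.append(path + [nxt])
    return paths


def iterative_search(graph, start, goal, max_iterations=4, wanted=1,
                     ignored=frozenset()):
    space = {start, goal}                                    # initialisation
    for iteration in range(1, max_iterations + 1):
        # expand search space: add the neighbours of every node seen so far
        space |= {n for node in list(space) for n in graph[node]}
        allowed = space - ignored                            # filtering
        # find relationships: paths of at most iteration + 1 hops
        paths = find_paths(graph, allowed, start, goal, iteration + 1)
        ranked = sorted(paths, key=len)                      # score relationships
        if len(ranked) >= wanted:                            # next iteration?
            return ranked
    return []


if __name__ == "__main__":
    for path in iterative_search(GRAPH, "Einstein", "Newton", wanted=2):
        print(" -> ".join(path))
```

Asking for more paths (`wanted=2`) forces an extra iteration, which is how relationships with different path lengths end up in the result.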
43. A PRIORI ESTIMATION
Heuristics: “the art of finding”
Examples:
Jaccard distance: difference in semantic relationships
Normalized (DBpedia) distance: based on common references
Confidence: possibility a resource does not occur if another already does
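As an illustration of the first heuristic, the standard Jaccard distance between two sets can be computed as below. How the thesis derives the sets from the linked data is not shown on the slide, so the property sets here are made-up placeholders.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|: 0.0 for identical sets, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return 1.0 - len(a & b) / len(union)


# Hypothetical sets of relationships attached to two resources.
einstein = {"field:physics", "type:scientist", "award:nobel"}
newton = {"field:physics", "type:scientist", "field:mathematics"}

if __name__ == "__main__":
    print(jaccard_distance(einstein, newton))
```

A smaller distance means the two resources share more of their relationships, so a pathfinding step could use it as an estimate of how close they are.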
44. A PRIORI ESTIMATION
Weights: assign value to a relationship
Examples:
Jaccard distance: difference in properties
Combined node degree: rare things
Jiang & Conrath: relations on the same level of abstraction
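To show how a weight can change which relationship is preferred, here is a small sketch of a degree-based edge weight combined with a minimal cost path search (Dijkstra). The formula `log(deg(u)) + log(deg(v))` is an assumption made for this example; the slide only names "combined node degree" and associates it with rare things. The graph is a toy.

```python
import heapq
import math

# Toy undirected graph: "Hub" is highly connected, the chain A-B-C-D is not.
GRAPH = {
    "A": {"Hub", "B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "Hub"},
    "Hub": {"A", "D", "X", "Y", "Z", "W"},
    "X": {"Hub"}, "Y": {"Hub"}, "Z": {"Hub"}, "W": {"Hub"},
}


def edge_weight(graph, u, v):
    """Assumed weight: links between well-connected nodes cost more."""
    return math.log(len(graph[u])) + math.log(len(graph[v]))


def min_cost_path(graph, start, goal):
    """Dijkstra over the weighted graph; returns (cost, path)."""
    queue = [(0.0, [start])]
    settled = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt in graph[node]:
            if nxt not in settled:
                step = edge_weight(graph, node, nxt)
                heapq.heappush(queue, (cost + step, path + [nxt]))
    return math.inf, []


if __name__ == "__main__":
    cost, path = min_cost_path(GRAPH, "A", "D")
    print(path, round(cost, 3))
```

In this toy graph the path with the fewest hops runs through the hub, while the minimal cost path avoids it and follows the rarer links A, B, C, D.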
46. EVALUATION: TRIVIA ABOUT (KNOWN) SCIENTISTS
A priori estimates evaluated according to some semantic ranking mechanisms and a user study.
Different relationships combined in a short ‘story’ about combinations of pairs:
Carl Linnaeus
Charles Darwin
Albert Einstein
Isaac Newton
Dataset
47. PATH SCORE: SEMANTIC RANKING
Chart panels: focus on semantic commonalities; focus on semantic differences; mixed
48. USER STUDY RESULTS
% preference relative to the median in pairwise A/B assessments.
49. EVALUATION: CONFERENCES & DIGITAL LIBRARIES IN COMPUTER SCIENCES
Check the precision of search results during the search.
Comparison between:
own method (minimal cost paths with optimal estimates)
the de facto baseline for many semantic applications, ‘Virtuoso’ (shortest paths)
Datasets
50. SEARCH PRECISION
Chart: precision per query and average precision for the own method and Virtuoso (baseline)
Baseline: more stable and on average similar
Own method: notably high scores for Q1, Q4 and Q7
51. I. Searching the web
II. Reveal relationships
III. Visually explore relationships
52. WHEN EXPLORATORY SEARCH?
When users
(i) Do not know exactly how to formulate the most suited search query;
(ii) Would rather browse or surf information than look up something specific.
53. FROM SEARCHING IN DOCUMENTS TO SEARCHING IN DATA
Diagram: query, search engine, search results (…)
54. FROM SEARCHING IN DOCUMENTS TO SEARCHING IN DATA
Diagram: query, search engine, search results (…), ?
55. EXPLORATORY SEARCH IN DATA
[Tvazorek et al., 2010] [Smith et al., 2005]
Via interaction with the underlying data structure
Network-based; tabular or faceted
58. EVALUATION: SCENARIO
Hypothesis
Revealing relationships among indirectly related computer science publications, conferences and researchers facilitates adding new relevant results to already found results.
Testing
A. Added value of revealing relationships among search results
B. Effectiveness and productivity of different search actions
Datasets
59. A. ADDED VALUE OF REVEALING RELATIONSHIPS AMONG SEARCH RESULTS
Effect with a simple and a complex query.
Simple
“Find a publication. Find a number of publications that have common co-authors with the found publication.”
Complex
“Find multiple persons that had a publication in two consecutive years in the same conference series.”
Search details to be filled in by the users.
The users were not aware whether the pathfinding functionality was activated or not.
60. A. ADDED VALUE OF REVEALING RELATIONSHIPS AMONG SEARCH RESULTS
Bar chart (axis 0 to 70): negative (%) and positive (%) for the simple query and the complex query
61. B. EFFECTIVENESS AND PRODUCTIVITY OF DIFFERENT SEARCH ACTIONS
Tested actions:
1. Keyword-based search query
2. Add a top related resource
3. Expand neighbours
4. Expand neighbour of neighbour
5. Expand further related resource
Example: search query ‘Einstein’; top related: Special Relativity, General Relativity, Twin Paradox
62. EFFECTIVENESS OF A SEARCH ACTION
Diagram: ‘all’ data, shown data, relevant shown data
Effectiveness here equals precision
$E = \frac{|\text{relevant shown data}|}{|\text{shown data}|}$
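The definition of E as precision can be written down directly. The sketch below assumes shown and relevant items are available as sets of identifiers; the item names and the value returned when nothing is shown are choices made for this example.

```python
def effectiveness(shown: set, relevant: set) -> float:
    """Precision: the share of shown items that are also relevant."""
    if not shown:
        return 0.0  # convention assumed here when nothing is shown
    return len(shown & relevant) / len(shown)


if __name__ == "__main__":
    shown = {"paper1", "paper2", "paper3", "paper4"}
    relevant = {"paper1", "paper3", "paper9"}
    print(effectiveness(shown, relevant))
```

Relevant items that are not shown (here `paper9`) do not affect E; only what is actually on screen counts.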
63. PRODUCTIVITY OF CONSECUTIVE SEARCH ACTIONS
Diagram: consecutive search actions 0, 1, 2, …, k, labelled with $E_0$
P = average increase of effectiveness after k search actions, measured from the second search action on (I)
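One possible reading of this definition, stated here as an assumption: with E measured after each action 0, 1, …, k, P is the mean of the step-to-step changes starting at the second action (index 1). Under that reading the sum telescopes, so P equals (E_k − E_0) / k. The function name and the sample values are made up for illustration.

```python
def productivity(e_per_action: list) -> float:
    """Mean change in effectiveness per step, counted from the second action on.

    e_per_action[i] is the effectiveness measured after search action i.
    """
    if len(e_per_action) < 2:
        return 0.0  # no second action yet, so nothing to average
    increases = [
        e_per_action[i] - e_per_action[i - 1]
        for i in range(1, len(e_per_action))
    ]
    return sum(increases) / len(increases)


if __name__ == "__main__":
    # Hypothetical effectiveness after actions 0..3.
    print(productivity([0.25, 0.5, 0.5, 0.75]))
```

A negative value would indicate that, on average, the extra actions diluted the shown results with irrelevant items.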
64. B. EFFECTIVENESS AND PRODUCTIVITY OF DIFFERENT SEARCH ACTIONS
Bar chart (axis 0 to 60): effectiveness (%) and productivity (%) per search action: lookup, add top related, neighbour expand, neighbour of neighbour expand, expand further related resource
66. EXPLORING SEMANTIC RELATIONSHIPS ON THE WEB
Compared searching the web vs. searching physical documents; impressive state of the art.
From searching to exploring via ‘berrypicking’, more possibilities than pure ‘lookup’.
Semantics:
the meaning of resources, aside from their expression, description or representation;
documents describe resources and consist of data;
‘linked’ data has a threefold structure ‘triples’ to express semantics.
Exploring relationships between resources is not trivial for non-common properties.
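A small sketch of the triple structure and of an indirect relationship, assuming a toy in-memory list of triples. The resource URIs are the DBpedia ones used earlier in the slides; the predicate namespace `http://example.org/property/` and the predicate names are placeholders, not real DBpedia properties, and the statements are illustrative rather than retrieved data.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


DBR = "http://dbpedia.org/resource/"
EX = "http://example.org/property/"  # placeholder namespace for predicates

# Illustrative statements only; not retrieved from DBpedia.
TRIPLES = [
    Triple(DBR + "Albert_Einstein", EX + "knownFor", DBR + "Special_relativity"),
    Triple(DBR + "Albert_Einstein", EX + "knownFor", DBR + "General_relativity"),
    Triple(DBR + "Twin_Paradox", EX + "partOf", DBR + "Special_relativity"),
]


def neighbours(triples, resource):
    """Resources directly linked to `resource`, in either direction."""
    linked = {t.obj for t in triples if t.subject == resource}
    linked |= {t.subject for t in triples if t.obj == resource}
    return linked


def common_neighbours(triples, a, b):
    """Resources through which a and b are indirectly related."""
    return neighbours(triples, a) & neighbours(triples, b)


if __name__ == "__main__":
    print(common_neighbours(TRIPLES, DBR + "Albert_Einstein", DBR + "Twin_Paradox"))
```

No single triple links the two resources directly, yet they are related through a shared third resource, which is the kind of relationship that a plain lookup does not surface.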
67. MOST IMPORTANT TAKEAWAYS
Alternative to searching in different data sources, each time using another search interface:
→ exploratory search via semantic relationships between data
The choice of heuristics and weights contributes to and influences the serendipity among results.
Focus on revealing semantic relationships
→ supporting visual exploratory search in data on the web
The techniques are mainly tested with data on:
→ encyclopedic facts from Wikipedia (DBpedia)
→ academic digital libraries (DBLP) and conferences (COLINDA)
Proposed techniques remain close to the structure of the linked data (RDF),
→ methods applicable in other domains that have linked data.