This document describes using collaborative knowledge bases like Wikipedia to support exploratory search tasks. It presents an approach that extracts concepts and their relationships from Wikipedia to build a concept network. Documents are then ranked based on their relationships to these concepts. An experiment ranks journal abstracts given a seed abstract, comparing the proposed Wikipedia-based approach to a maximal marginal relevance technique. The Wikipedia approach provided more diverse results while maintaining high relevance, showing potential for improving exploratory search.
7. Exploratory Search Task
Given a journal abstract, rank other abstracts
based on their relevancy to the seed abstract.
Evaluation is based on relevancy and diversity.
8. Concepts
Seed document → n-grams (1 to 3) → candidate concepts (n-grams that match
a Wikipedia page title and are connected through the ontology).

The matrices are related by a factorization, where each document row d holds
Tf-idf(D) weights and each concept row k holds Tf-idf(K) weights:

DOCUMENT–WORD D (D × W) = DOCUMENT–CONCEPT Θ (D × K) * CONCEPT–WORD B (K × W)

D: Documents, K: Concepts, W: Words; B is the concept–word matrix (rows β_k).
Documents are ranked by Argsort(row.sum(Θ)).
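As a sketch, the factorization above and the recovery of Θ can be written with NumPy. The shapes, random values, and the least-squares solver are illustrative assumptions; the slides do not specify how Θ is computed:

```python
import numpy as np

# Illustrative shapes (assumed): 4 documents, 3 concepts, 6 words.
rng = np.random.default_rng(0)
Theta_true = rng.random((4, 3))   # document-concept weights, Theta (D x K)
B = rng.random((3, 6))            # concept-word matrix, B (K x W)
D = Theta_true @ B                # document-word matrix, D (D x W) = Theta @ B

# Recover the document-concept weights by least squares: Theta @ B ~ D.
Theta, *_ = np.linalg.lstsq(B.T, D.T, rcond=None)
Theta = Theta.T
assert np.allclose(Theta @ B, D, atol=1e-8)
```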
9. EXTRACTING CONCEPT NETWORK
“Representation independence formally characterizes the encapsulation provided by language constructs for data abstraction and justifies reasoning by simulation. Representation independence has been shown for a variety of languages and constructs but not for shared references to mutable state; indeed it fails in general for such languages. This article formulates representation independence for classes, in an imperative, object-oriented language with pointers, subclassing and dynamic dispatch, class-oriented visibility control, recursive types and methods, and a simple form of module. An instance of a class is considered to implement an abstraction using private fields and so-called representation objects. Encapsulation of representation objects is expressed by a restriction, called confinement, on aliasing. Representation independence is proved for programs satisfying the confinement condition. A static analysis is given for confinement that accepts common designs such as the observer and factory patterns. The formalization takes into account not only the usual interface between a client and a class that provides an abstraction but also the interface (often called “protected”) between the class and its subclasses.”
11. WIKIPEDIA PAGES AS CONCEPTS
Solar System
“The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…”
(http://en.wikipedia.org/wiki/Solar_System)
Word Stem | Occ. | Freq.
abstract  |  53  | 0.056
program   |  44  | 0.046
langu     |  33  | 0.035
spec      |  16  | 0.017
comput    |  12  | 0.013
conceiv   |  12  | 0.013
dat       |  12  | 0.013
β_k = p(W_i | k) = N{W_i ∈ k} / Σ_i N{W_i ∈ k}

where N{W_i ∈ k} counts the occurrences of word W_i in the page of concept k.
β_k : per-concept word distribution
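A minimal sketch of this per-concept word distribution, computed from the tokens of a concept's Wikipedia page (the function name and token list are illustrative assumptions):

```python
from collections import Counter

def concept_word_distribution(page_tokens):
    """beta_k: p(W_i | k) = count of W_i in concept k's page / total count."""
    counts = Counter(page_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Tiny made-up page: "abstract" occurs 2 times out of 4 tokens.
beta = concept_word_distribution(["abstract", "program", "abstract", "langu"])
# beta["abstract"] == 0.5
```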
12. RANKING DOCUMENTS
Given the document–word matrix D (D × W) and the concept–word matrix
B (K × W) extracted from Wikipedia, solve for the document–concept
weights Θ (D × K):

DOCUMENT–WORD D (D × W) = DOCUMENT–CONCEPT Θ (D × K) * CONCEPT–WORD B (K × W)

D: Documents, K: Concepts, W: Words.
13. SORT DOCUMENTS
Using the same factorization,

DOCUMENT–WORD D (D × W) = DOCUMENT–CONCEPT Θ (D × K) * CONCEPT–WORD B (K × W)

D: Documents, K: Concepts, W: Words;
documents are sorted by the total concept weight per row: Argsort(row.sum(Θ)).
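The sort criterion Argsort(row.sum(Θ)) can be sketched with NumPy (the Θ values below are made up for illustration):

```python
import numpy as np

# Theta: document-concept weight matrix (illustrative values).
Theta = np.array([[0.1, 0.2],
                  [0.9, 0.4],
                  [0.3, 0.3]])

# Rank documents by total concept weight, highest first,
# mirroring Argsort(row.sum(Theta)) from the slides.
ranking = np.argsort(Theta.sum(axis=1))[::-1]
# ranking -> [1, 2, 0]
```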
14. EXPERIMENT
Given a journal abstract, rank other abstracts based on
their relevancy to the seed abstract.
• Data: 619 abstracts of the Journal of the ACM
(JACM) and their references.
• Task: Select the top-k (5, 10, 15, and 20) relevant abstracts.
• Observe: Relevancy (measured by LSA vector similarity) and
diversity (measured through the coverage of the references).
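Relevancy via LSA vector similarity reduces to cosine similarity between the seed's vector and each candidate's vector. A sketch, assuming the LSA vectors are already computed (the vectors below are made up):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumed: row i is the LSA vector of abstract i; the seed is row 0.
lsa = np.array([[1.0, 0.0, 0.5],
                [0.9, 0.1, 0.4],
                [0.0, 1.0, 0.0]])
seed = lsa[0]
scores = [cosine(seed, v) for v in lsa[1:]]
ranked = 1 + np.argsort(scores)[::-1]   # candidate indices, best first
```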
15. MAXIMAL MARGINAL RELEVANCE
• A measure to increase the diversity of documents retrieved by an IR system.
- Similarity to query: BM25 (Xapian [1])
- Similarity to results: LSA similarity (Gensim [2])
1. http://xapian.org
2. http://radimrehurek.com/gensim/
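A greedy MMR selection can be sketched as follows. The function name and inputs are assumptions: in the deck's setup, similarity to the query would come from BM25 (Xapian) and pairwise document similarity from LSA (Gensim):

```python
def mmr(sim_query, sim_docs, k, lam=0.7):
    """Greedy Maximal Marginal Relevance (Carbonell & Goldstein, 1998).

    sim_query: similarity of each document to the query (e.g., BM25 scores)
    sim_docs:  pairwise document similarities (e.g., LSA cosine), n x n
    lam:       trade-off between relevance and novelty
    Returns the indices of the k selected documents, in MMR selection order.
    """
    selected, remaining = [], list(range(len(sim_query)))
    while remaining and len(selected) < k:
        def score(i):
            # Penalize similarity to already-selected documents (novelty term).
            novelty = max(sim_docs[i][j] for j in selected) if selected else 0.0
            return lam * sim_query[i] - (1 - lam) * novelty
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam = 1 the ranking reduces to pure query relevance; lowering lam trades relevance for diversity, so a near-duplicate of an already-selected document is skipped in favor of a less similar one.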
18. CONCLUDING REMARKS
• Our Wikipedia-based technique provides high
diversity with low relevancy loss.
• Semantics embedded in concept networks
extracted from Wikipedia can improve
exploratory search tasks.
Editor's Notes
However, the majority of the inquiries go beyond simple fact checks:
- searches involving the cognitive processing and interpretation of new knowledge
- searches requiring critical assessment before being integrated into knowledge bases
- search-driven exploration activities
Exploratory search relies on other information/cognitive behaviors: sense-making (organizing and analyzing search results) and decision making.
p.24: This kind of ill-structured problems 1) begin with a lack of information necessary to develop a solution or even precisely define the problem, 2) have no single right approach for solution, 3) have problem definitions that change as new information is gathered, and 4) have no identifiable ‘correct’ solution [3]. -- Highlighted jul 19, 2013
It’s hard for search systems to identify concepts and their relationships --
Concepts are characterized as distributions over observed words in Wikipedia pages. Use posterior expectations / approximate posterior inference: Gibbs sampling, variational inference.
Ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences. Ontologies can be used to model concepts and their interrelationships (Lanzenberger et al., 2010). In this sense, ontologies represent the relevant aspects of context. To effectively comprehend cross-lingual corpora, tools that can explore the dependencies between language and context are needed.
Concepts are characterized as distributions over observed words in Wikipedia pages. Each topic is a distribution over words.
Today, most user searches are of an exploratory nature, in the sense that users are interested in retrieving pieces of information that cover many aspects of their information needs.
The principle is similar to TF-IDF, where query terms are weighted based on frequency in a document (tf) and across the corpus (idf). In addition, the ratio of the document length to the average document length is taken into account in K, and BM25 is parameterized for further optimization. We used Xapian's implementation of BM25 with default parameters.
Maximal Marginal Relevance (MMR): one approach to diversifying search results is to optimize them based on two criteria: similarity to the query (relevance) and dissimilarity to the other relevant documents (novelty). MMR \cite{Carbonell:1998ja}, for example, works on this principle: the similarity of a document to a query is adjusted based on its similarity to the other documents that are more similar to the query.
\begin{displaymath}
MMR = \underset{D_i \in R \setminus S}{\mathrm{argmax}} \left[ \lambda\, Sim_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} Sim_2(D_i, D_j) \right]
\end{displaymath}
can arguably be lessened, because the semantics strips away extraneous context while at the same time providing better diversity within the universe of relevant documents