A COMBINATION OF REDUCTION
AND EXPANSION APPROACHES 

TO HANDLE WITH LONG NATURAL
LANGUAGE QUERIES
22nd International Conference on Knowledge-Based and
Intelligent Information & Engineering Systems
A Combination of Reduction and Expansion Approaches to Handle Long Natural Language Queries
Mohamed ETTALEB (a), Chiraz LATIRI (b), Patrice BELLOT (b)
(a) University of Tunis El Manar, Faculty of Sciences of Tunis, LIPAH Research Laboratory, Tunis, Tunisia
(b) Aix-Marseille Université, CNRS, LIS UMR 7020, 13397 Marseille, France
Abstract
Most of the queries submitted to search engines are composed of keywords, but keywords are not enough for users to express their needs. Through verbose natural language queries, users can express complex or highly specific information needs. However, it is difficult for search engines to deal with this type of query. Moreover, the emergence of social media allows users to get opinions, suggestions or recommendations from other users about complex information needs. In order to increase the understanding of user needs, tasks such as the CLEF Social Book Search Suggestion Track were proposed from 2011 to 2016. The aim is to investigate techniques to support users in searching for books in catalogs of professional metadata and complementary social media. In this respect, we introduce in the present paper a statistical approach to deal with long verbose queries in Social Information Retrieval (SIR).
Open Access
Contents: XML TEI
https://www.openedition.org
Searching for Books: a difficult task
User needs: very diverse facets or aspects
— topic-oriented aspects
— with / without a precise context
e.g. arts in China during the 20th century
— books dealing with named entities: locations (the book is about a specific location OR the action takes place at this location), proper names…
— what are the most important / most popular books about…
— style / type / language aspects
— category: fiction, novel, essay, proceedings, position papers…
— target: for experts / for dummies / for children…
— and also…: well illustrated, cheap, short…
=> Keyword queries are not enough => verbose natural language queries
Book contents: long stories, metaphoric language, several topics…
=> Metadata (tags, ToC, indexes…), summaries, and reader reviews can help us
(IR-based) Book Suggestion System
Hypothesis: reviews can help to expand the queries
[Diagram: user needs and example books form the query; book content, metadata, summaries and reviews are indexed; the Information Retrieval System applies query Reduction and Expansion to the query and produces Book Suggestions]
http://social-book-search.humanities.uva.nl/#/overview
2 The Amazon collection
The document collection used for this year's Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN number, title, number of pages, etc. However, in this collection the most important content resides in social data. Indeed, Amazon is social-oriented, and users can comment on and rate products they purchased or own. Reviews are identified by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. Users can also assign tags of their own creation to a product. These tags are useful for refining the searches of other users in that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels, which are contained in the <browseNode> fields.

Table 1. Some facts about the Amazon collection.
Number of pages (i.e. books):                      2,781,400
Number of reviews:                                15,785,133
Number of pages that contain at least one review:  1,915,336
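The excerpt above names three XML fields that carry the social data. A minimal sketch of pulling them out of one record, assuming a simple layout around the <review>, <tag> and <browseNode> elements (the real collection nests more structure than this toy example):

```python
# Extract the social metadata of one Amazon book record, assuming a flat
# XML layout around the <review>, <tag> and <browseNode> fields named above.
import xml.etree.ElementTree as ET

def extract_social_fields(xml_string: str) -> dict:
    """Return the reviews, user tags and Amazon category labels of a record."""
    root = ET.fromstring(xml_string)
    return {
        "reviews": [r.text or "" for r in root.iter("review")],
        "tags": [t.text or "" for t in root.iter("tag")],
        "categories": [b.text or "" for b in root.iter("browseNode")],
    }

record = """<book>
  <review>A vivid account of desert warfare.</review>
  <tag>wwii</tag><tag>north-africa</tag>
  <browseNode>History</browseNode>
</book>"""
print(extract_social_fields(record))
```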
3 Retrieval model
3.1 Sequential Dependence Model
Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence Model (SDM). [excerpt ends here]
(a CLEF lab, 2011-2016)
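The excerpt stops before the SDM details, but the model is standard: unigram, ordered-bigram and unordered-window evidence are combined with fixed weights. Below is a sketch of how such a query can be materialised in Indri-style syntax; the (0.85, 0.10, 0.05) weights are the usual defaults from Metzler and Croft, not values stated in this excerpt:

```python
# Build an Indri-style Sequential Dependence Model query: unigrams, exact
# ordered bigrams (#1) and unordered windows of width 8 (#uw8).
def sdm_query(terms):
    bigrams = list(zip(terms, terms[1:]))
    unigrams = " ".join(terms)
    ordered = " ".join(f"#1({a} {b})" for a, b in bigrams)
    unordered = " ".join(f"#uw8({a} {b})" for a, b in bigrams)
    return (f"#weight( 0.85 #combine({unigrams})"
            f" 0.10 #combine({ordered})"
            f" 0.05 #combine({unordered}) )")

print(sdm_query(["battle", "gazala"]))
# #weight( 0.85 #combine(battle gazala) 0.10 #combine(#1(battle gazala))
#          0.05 #combine(#uw8(battle gazala)) )
```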
Examples of queries
Datasets
Aix-Marseille University: Amal Htait, Sébastien Fournier, and Patrice Bellot
Amazon collection of 2.8M records of professional and social metadata
Our proposal for dealing with queries
— 1. Combine query reduction and query expansion
— 2. Apply an Information Retrieval model over (meta)data and reviews
— Query Reduction: removes the less informative words / the most complex to deal with
— Query Expansion: fights the word-mismatch problem (a concept is described by different terms in user queries and in source documents)
— Association rules « query words ⇒ words in the reviews » = inter-term correlations (if the query words occur, then these words occur as well)
— Use of the examples given by the user: Pseudo-Relevance Feedback (Rocchio)
• stopword removal and stemming for the English language to reduce the verbose queries
• query expansion based on Association Rules between terms
• query expansion using similar books mentioned in the topics
4.1. Query Reduction
Removing stopwords has long been a standard query processing step [12]. We used three different stopword lists in this study: the standard stopword list, as well as two stopword lists based on morphosyntactic analysis and on the ranks of terms by some weight. [1] present several methods for automatically constructing a collection-dependent stopword list. Their methods generally involve ranking collection terms by some weight, then choosing some rank threshold above which terms are considered stopwords. [17] constructed a specific stopword list for a given collection and used the IDF statistical measure to rank terms and decide whether or not a term is a stopword. They then applied these techniques by removing from the query all words which occur on the stopword list. Our proposal is to reduce the verbose queries in two steps: first, all terms that appear in the standard stopword list are eliminated. Second, we apply a linguistic filtering method and run TreeTagger, a part-of-speech tagger, on the queries. We then select only words of noun type (nouns, proper nouns, etc.) and query words that form a noun phrase, such as "Syrian Civil War". The aim is to keep the appropriate words that can improve the quality of the user query.
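A minimal sketch of this two-step reduction (stopword removal, then part-of-speech filtering that keeps noun-type words). The paper runs TreeTagger; NLTK's tagger and stopword list are substituted here only to keep the example self-contained:

```python
# Two-step query reduction: drop standard stopwords, then keep only
# noun-type tokens (NN, NNS, NNP, NNPS); adjacent nouns form noun phrases.
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("stopwords", quiet=True)

def reduce_query(query: str) -> list:
    stops = set(stopwords.words("english"))
    tokens = [t for t in nltk.word_tokenize(query.lower())
              if t.isalpha() and t not in stops]
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

print(reduce_query("Does anyone know of a good book on the Battle of Gazala?"))
# e.g. ['book', 'battle', 'gazala']  (the paper's pipeline also keeps some
# modifiers inside noun phrases, e.g. "good book")
```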
4.2. Query Expansion using Association Rules between Terms (ART)
Query expansion is the process of adding relevant terms to the original queries to improve the performance of information retrieval systems. However, previous studies showed that automatic query expansion using association rules does not always lead to an improvement in performance.
The main idea of this approach is to extract a set of non-redundant rules, representing inter-term correlations in a contextual manner. We use the rules that convey the most interesting correlations amongst terms to extend the initial queries. Then, we extract a list of books for each query using BM25 scoring [25].
Association Rules Mining for Query Expansion

4.2.1. Representation and Query Expansion:
We represent a query q as a bag of terms. A set of candidate terms for q,

CT = {w1, ..., wm}

where each wi is a candidate term, is selected using association rules between terms, as detailed in the next section.

4.2.2. Association Rules:
An association rule between terms is an implication of the form R : T1 ⇒ T2, where T1 and T2 are subsets of τ := {t1, ..., tl}, a finite set of terms in the books collection, and T1 ∩ T2 = ∅. The termsets T1 and T2 are, respectively, called the premise and the conclusion of R, and the rule R is said to be based on the termset T equal to T1 ∪ T2. The support of a rule is:

Supp(R) = Supp(T)

while its confidence is computed as:

Conf(R) = Supp(T) / Supp(T1)

An association rule R is said to be valid if its confidence value, i.e., Conf(R), is greater than or equal to a user-defined threshold denoted minconf. This confidence threshold is used to exclude non-valid rules.

4.2.3. Candidate Terms Generation Approach based on Association Rules:
The main idea of this approach is to use the association rules mining technique to discover strong correlations between terms [4]. The set of query terms is expanded using the maximal possible set of terms located in the conclusion parts of the retained rules, while checking that the query terms are located in their premise parts. Illustrative examples of association rules are given in Table 2.

Table 2. Association Rules examples.
R                   Support  Confidence
military ⇒ warfare  83       0.741
romance ⇒ love      64       0.723

The process of generating candidate terms for a given query is performed in the following steps:
1) Selection of a sub-set of 12,000 books according to the query's subject. The books are represented only by their social information; we chose to select the title, the reviews and the tags as the content of a book.
2) Annotation of the selected books using TreeTagger. The choice of TreeTagger was based on the ability of this tool to recognize the nature (morphosyntactic category) of a word in its context.
3) Extraction of nouns (terms) from the annotated books, and removal of the most frequent ones.
4) Generation of the association rules using an efficient algorithm, Closed Association Rule Mining (CHARM) [29], which mines all the closed frequent termsets. As parameters, CHARM takes minsupp = 15 as the relative minimal support and minconf = 0.7 as the minimum confidence of the association rules [18]. Given the Zipf distribution of the collection, the maximum threshold of the support values is set experimentally in order to discard trivial terms, which occur in most of the documents and are therefore related to too many terms. On the other hand, the minimal threshold eliminates marginal terms, which occur in few documents and are therefore not statistically important when occurring in a rule. CHARM outputs the association rules with their support and confidence; Table 2 shows examples of its output.

As summarized on the slide: for each query (bag of words), a rule links a query term (premise) to a term found in book titles, reviews and metadata (conclusion); the expansion uses the maximal set of terms in the conclusion parts of the retained rules.
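A toy sketch of the support/confidence computation and of the expansion step, with each book reduced to the set of terms from its title, reviews and tags. The brute-force pairwise miner below stands in for CHARM, which mines closed frequent termsets far more efficiently; the minsupp and minconf values follow the paper:

```python
# Mine single-term association rules over book "transactions" and expand a
# query with the conclusion terms of valid rules whose premise it contains.
from itertools import permutations

MINSUPP, MINCONF = 15, 0.7

def support(termset, books):
    # Number of books whose term set contains every term of `termset`.
    return sum(1 for terms in books if termset <= terms)

def mine_pairwise_rules(vocabulary, books):
    """Yield valid rules as (premise, conclusion, supp, conf) tuples."""
    for t1, t2 in permutations(vocabulary, 2):
        supp_t = support({t1, t2}, books)
        supp_p = support({t1}, books)
        if supp_t >= MINSUPP and supp_p and supp_t / supp_p >= MINCONF:
            yield t1, t2, supp_t, supp_t / supp_p

def expand(query_terms, rules):
    # Add every conclusion term whose premise term appears in the query.
    extra = {t2 for t1, t2, _, _ in rules if t1 in query_terms}
    return list(query_terms) + sorted(extra - set(query_terms))
```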
Experiments
— Experimental setup
- Terrier Information Retrieval System
- BM25 model
In our experiments, we present experimental results on SBS 2014 and 2016 to compare the performances of the different components of our system. We used the Terrier framework developed at the University of Glasgow [22]. Terrier is a modular framework for large-scale IR applications; it provides indexing and retrieval functionalities. The BM25 model was used with the usual parameter values (b = 0, k3 = 1000, k1 = 2). Using the BM25 model, the score of a document D for a query Q is given by:

S(D, Q) = Σ_{t ∈ Q} [(k1 + 1) · w(t, d)] / [k1 + w(t, d)] · idf(t) · [(k3 + 1) · w(t, Q)] / [k3 + w(t, Q)]   (5)

where w(t, d) and w(t, Q) are respectively the weights of term t in document d and in query Q, and idf(t) is the inverse document frequency of term t, given as follows:

idf(t) = log [ (|D| − df(t) + 0.5) / (df(t) + 0.5) ]   (6)

where df(t) is the number of documents containing t, and |D| is the number of documents in the collection. We conducted three different runs, namely:
1. Run-RQ: we used only the reduced queries described in Section 4.1.
2. Run-QEART: we added the association rules between terms to extend the reduced queries.
3. Run-QEEB: query expansion using example books.

[4] http://social-book-search.humanities.uva.nl/#/data/suggestion
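Transcribing Eqs. (5)-(6) above directly, a minimal scoring sketch; w(t, d) and w(t, Q) are taken here as raw term frequencies, whereas Terrier additionally normalises w(t, d) by document length via the b parameter:

```python
# BM25 score of one document for one query, following Eqs. (5)-(6),
# with k1 = 2 and k3 = 1000 as stated in the text.
import math

def idf(df_t: int, n_docs: int) -> float:
    return math.log((n_docs - df_t + 0.5) / (df_t + 0.5))

def bm25(query_tf: dict, doc_tf: dict, df: dict, n_docs: int,
         k1: float = 2.0, k3: float = 1000.0) -> float:
    score = 0.0
    for t, wtq in query_tf.items():
        wtd = doc_tf.get(t, 0)
        if wtd == 0:
            continue
        score += ((k1 + 1) * wtd / (k1 + wtd)) * idf(df[t], n_docs) \
                 * ((k3 + 1) * wtq / (k3 + wtq))
    return score
```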
[…] relevant and contain terms that can be important to the query. In order to exploit these similar books, we expand again the queries processed in Section 4.2 by automatically adding terms from these similar books.
5. Experiments and results
In this section, we describe the experimental setup used for our experiments.
5.1. Experimental data
To evaluate our approach, the data provided by the CLEF SBS Suggestion track 2016 [4] are used.
• Documents: the document collection consists of 2.8 million book descriptions with metadata from Amazon and LibraryThing. Each document is represented by book title, author, publisher, publication year, library classification codes, and user-generated content in the form of user ratings and reviews.
• Queries: for the collection of queries from 2011 to 2016, the organizers of SBS used the LibraryThing forum to extract a different set of queries with relevance judgments for each year. In our case, we chose to combine the title with the narrative as the representation of the queries.
Year  #Queries  Fields
2011  211       Title, Group, Narrative, Type, Genre, Specificity
2012  96        Title, Group, Narrative, Type, Genre
2013  370       Title, Group, Narrative, Query
2014  672       Title, Group, Narrative, Mediated query
2015  178       Title, Group, Narrative, Mediated query
2016  119       Title, Group, Request
Table 3. The six years of topics used for the SBS Suggestion track.
For fair comparison, the queries and the corresponding relevance judgments of the other years are utilized as the […]

Table 4. Results of SBS 2014 with different strategies.
Strategy        NDCG@10  MAP     Improved
Baseline model  0.1041   0.0965  -
RQ              0.1158   0.1014  11.24%
QEART           0.1429   0.1153  23.4%
QEEB            0.1518   0.1194  6.23%

Table 5. Results of SBS 2016 with different strategies.
Strategy        NDCG@10  MAP     Improved
Baseline model  0.1175   0.0872  -
RQ              0.1240   0.0904  5.53%
QEART           0.1549   0.1013  24.92%
QEEB            0.1688   0.1054  8.97%
We used two topic sets provided by CLEF SBS in 2014 (680 topics) and 2016 (120 topics). We selected the title and narrative fields of each topic. First, we used the techniques described in Section 4.1 to remove the stopwords and keep the appropriate words in the query. Second, the reduced query was expanded by adding new terms using ART. In this step, we applied the CHARM algorithm with the following parameters: minsupp = 15 and minconf = 0.7. Then, we used the similar books for each topic and applied the pseudo-relevance feedback technique to expand the query again. The Rocchio function was used with its default parameter setting (β = 0.4), and the number of terms selected from each similar book was set to 10. Table 6 shows an example of both approaches: query reduction, and query expansion based on association rules between terms.
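A hedged sketch of the Rocchio step as described: the query vector is shifted towards the centroid of the example books, with β = 0.4 and the 10 strongest terms kept per similar book. Vectors are plain term-frequency counters here; the exact weighting inside Terrier may differ:

```python
# Rocchio-style pseudo-relevance feedback over example books.
from collections import Counter

def rocchio_expand(query_vec: Counter, similar_books: list,
                   beta: float = 0.4, terms_per_book: int = 10) -> Counter:
    expanded = Counter(query_vec)
    for book in similar_books:
        # Take the `terms_per_book` strongest terms of each example book
        # and add their beta-weighted average contribution to the query.
        for term, weight in book.most_common(terms_per_book):
            expanded[term] += beta * weight / len(similar_books)
    return expanded
```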
Original Query              Does anyone know of a good book on the Battle of Gazala?
Reduced Query               good book battle gazala
Query Expansion using ART   good book battle gazala / military history gazala war attack army
Table 6. Examples of the reduction and expansion approaches for handling verbose queries.
5.3. Experimental results
We first compare our baseline retrieval results with the results of the different expansion strategies, shown in Table 4 and Table 5, where the rows RQ, QEART and QEEB represent the results obtained by the reduced query, the reduced query expanded using association rules, and the QEART query further expanded using example books, respectively. As can be seen from the two tables, the proposed expansion strategies perform well and improve on the baseline to some extent. When we use the query reduction technique, the results are better than the baseline on both sets of topics. When additionally applying query expansion using pseudo-relevance feedback, the results are better across all the sets of topics: in terms of NDCG@10, the results increase from 0.1429 to 0.1518 on the 2014 set and from 0.1549 to 0.1688 on the 2016 set. From an overall perspective, the greatest relative improvement is obtained by the QEART strategy, with a gain of 24.92% on the 2016 set.
In the run that won in 2016 (Official run), the authors proposed a search framework which builds, at any moment, a reading list for any specific topic, where the relevance between topics and books, the book quality, the popularity timeliness and the result diversity are respectively embedded into vector representations based on user-generated content and statistics from social media. The obtained evaluation results also show that our proposed approaches offer interesting results.

Table 7. Comparison results on Social Book Search 2014.
Run           NDCG@10  MAP
Our run       0.1518   0.1194
Official run  0.1420   0.102
Medium run    0.096    0.068
Worst run     0.010    0.007

Table 8. Comparison results on Social Book Search 2016.
Run           NDCG@10  MAP
Our run       0.1688   0.1054
Official run  0.2157   0.1253
Medium run    0.0861   0.0524
Worst run     0.0018   0.0004

However, we noticed that QEART also worked well on the reviews; this is explained by the fact that the association rules allowed us to find the terms having a strong correlation with the query's terms.
Lastly, to further the effectiveness analysis, we present a gain and failure analysis of our approach. Table 9 presents the percentages of queries R+ and R− for which the QE techniques perform better, or lower/equal, than the different baselines in terms of NDCG@10. As depicted in Table 9, the average percentage for the set of queries R+ is about 67.40% for the SBS 2014 collection and 66.11% for the SBS 2016 collection. The highest percentage of R+ queries is reached when we combine the ART technique with PRF for QE. These results confirm the effectiveness of using association rules as well as PRF for query expansion, as shown in the literature.
                      QEART  QEEB
SBS 2014 collection
R+                    62.55  72.24
R−                    37.45  29.16
SBS 2016 collection
R+                    61.39  70.83
R−                    38.61  29.17
Table 9. Percentage of queries R+ and R− for each query set (better or lower/equal than the different baselines) in terms of NDCG.
(QEART with pseudo-relevance feedback)
[Slide 11 background: excerpt from a survey of recommender-systems evaluation; only the right column of the page is recoverable.]
[…] (1) precision, which indicates the proportion of relevant recommended items from the total number of recommended items, (2) recall, which indicates the proportion of relevant recommended items from the number of relevant items, and (3) F1, which is a combination of precision and recall.
Let Xu be the set of recommendations to user u, and Zu the set of n recommendations to user u. We express the evaluation measures precision, recall and F1 for recommendations obtained by making n test recommendations to the user u, taking a relevancy threshold θ. Assuming that all users accept n test recommendations:

precision = (1/#U) Σ_{u ∈ U} #{i ∈ Zu | r_{u,i} ≥ θ} / n   (4)

recall = (1/#U) Σ_{u ∈ U} #{i ∈ Zu | r_{u,i} ≥ θ} / ( #{i ∈ Zu | r_{u,i} ≥ θ} + #{i ∈ Zu^c | r_{u,i} ≥ θ} )   (5)

F1 = (2 · precision · recall) / (precision + recall)   (6)
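A worked instance of Eqs. (4)-(6) for a single user (the survey averages over all users in U): items among the n recommended ones rated at least θ count as hits:

```python
# Precision, recall and F1 of one user's top-n recommendations, Eqs. (4)-(6).
def precision_recall_f1(ratings: dict, recommended: list, theta: float, n: int):
    top_n = recommended[:n]
    hits = sum(1 for i in top_n if ratings.get(i, 0) >= theta)
    relevant = sum(1 for r in ratings.values() if r >= theta)
    p = hits / n
    r = hits / relevant if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(precision_recall_f1({"a": 5, "b": 2, "c": 4}, ["a", "b", "d"],
                          theta=4, n=3))
# (0.333..., 0.5, 0.4): one hit among three recommendations, two relevant items
```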
4.3. Quality of the list of recommendations: rank measures
When the number n of recommended items is not small, users give greater importance to the first items on the list of recommendations. The mistakes incurred on these items are more serious errors than those on the last items of the list. The ranking measures consider this situation. Among the ranking measures most often used are the following standard information retrieval measures: (a) half-life (7) [43], which assumes an exponential decrease in the interest of users as they move away from the recommendations at the top, and (b) discounted cumulative gain (8) [17], wherein the decay is logarithmic.

HL = (1/#U) Σ_{u ∈ U} Σ_{i=1}^{N} max(r_{u,p_i} − d, 0) / 2^{(i − 1)/(α − 1)}   (7)

DCG^k = (1/#U) Σ_{u ∈ U} ( r_{u,p_1} + Σ_{i=2}^{k} r_{u,p_i} / log2(i) )   (8)

p1, …, pn represents the recommendation list, r_{u,p_i} represents the true rating of the user u for the item p_i, k is the rank of the evaluated item, d is the default rating, and α is the number of the item on the list such that there is a 50% chance the user will review that item.
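A literal reading of Eq. (8) for one user: the rank-1 item contributes its full rating, and items at ranks 2..k are discounted by log2 of their rank:

```python
# DCG@k for a single user's ranked list of true ratings, following Eq. (8).
import math

def dcg_at_k(ranked_ratings: list, k: int) -> float:
    if not ranked_ratings:
        return 0.0
    head = ranked_ratings[0]  # rank 1, no discount
    tail = sum(ranked_ratings[i - 1] / math.log2(i)
               for i in range(2, min(k, len(ranked_ratings)) + 1))
    return head + tail

print(dcg_at_k([5, 4, 3], k=3))  # 5 + 4/log2(2) + 3/log2(3) ≈ 10.89
```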
4.4. Novelty and diversity
The novelty evaluation measure indicates the degree of difference between the items recommended to and known by the user. The diversity quality measure indicates the degree of differentiation among recommended items. Currently, novelty and diversity measures do not have a standard; therefore, different authors propose different metrics.

4.6. Reliability
The reliability of a prediction or a recommendation informs about how seriously we may take this prediction. When an RS recommends an item to a user with prediction 4.5 on a scale {1, …, 5}, this user hopes to be satisfied by this item. However, this prediction value (4.5 out of 5) does not reflect the degree of certainty with which the RS has concluded that the user will like this item. Indeed, this prediction of 4.5 is much more reliable if it has been obtained by means of 200 similar users than if it has been obtained from only two similar users. In Hernando et al. [96], a reliability measure is proposed according to the usual notion that the more reliable a prediction, the less liable it is to be wrong. Although this reliability measure is not a quality […]

Fig. 7. Recommender systems evaluation process. (J. Bobadilla et al., Knowledge-Based Systems 46 (2013) 109–132)
Conclusion
— Dealing with long and verbose natural language queries for information retrieval is still an open problem
— For book search: using social information (reviews) and metadata can efficiently replace the content of the books
— Combining query reduction / expansion is useful and even mandatory for long queries
— Association rules are efficient and effective
— Perspectives:
— Retrieving and analysing book reviews (aspect-based sentiment analysis)
[Excerpt from a forthcoming review-classification study (NLDB 2018); the left column of the page is only partially recoverable.] As a baseline classification model we used the naive Bayes classifier (…Savoy, 2010). The classifier chooses between two possible hypotheses, h0 = Review and h1 = ¬Review, selecting the class that has the maximum probability, see Equation (5), where |w| is the number of words included in the text:

h* = argmax_{hi} P(hi) · Π_{j=1}^{|w|} P(wj | hi)   (5)

The probabilities are estimated from the lexical frequencies over the whole collection. The first indexing scheme is BoW (Bag of Words), where all words are considered as features. We also used feature selection based on the normalized z-score, keeping the first 1000 words according to this score (after removing all words that appear fewer than 5 times). As the third approach, we suggested that the common features of the Review collection can be located in the named-entity distribution of the text.
Table 4: Results showing the performances of the classification models using different indexing schemes on the test set. The best values for the Review class are noted in bold and those for the ¬Review class are underlined.

                      Review                ¬Review
#  Model              R      P      F-M     R      P      F-M
1  NB                 65.5%  81.5%  72.6%   81.6%  65.7%  72.8%
   SVM (Linear)       99.6%  98.3%  98.9%   97.9%  99.5%  98.7%
   SVM (RBF)          89.8%  97.2%  93.4%   96.8%  88.5%  92.5%
   (C = 5.0, γ = 0.00185)
2  NB                 90.6%  64.2%  75.1%   37.4%  76.3%  50.2%
   SVM (Linear)       87.2%  81.3%  84.2%   75.3%  82.7%  78.8%
   SVM (RBF)          87.2%  86.5%  86.8%   83.1%  84.0%  83.6%
   (C = 32.0, γ = 0.00781)
3  NB                 80.0%  68.4%  73.7%   54.2%  68.7%  60.6%
   SVM (Linear)       77.0%  81.9%  79.4%   78.9%  73.5%  76.1%
   SVM (RBF)          81.2%  48.6%  79.9%   72.6%  75.8%  74.1%
   (C = 8.0, γ = 0.03125)
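A toy-scale reconstruction of the comparison behind this table: naive Bayes versus linear and RBF SVMs on bag-of-words features. scikit-learn is an assumption (the excerpt does not name a toolkit), and the four training texts are invented placeholders, not the study's data:

```python
# Naive Bayes vs. SVM review classification on bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = ["great plot and vivid characters", "the author builds a moving story",
         "limited time offer buy now", "subscribe today for weekly deals"]
labels = [1, 1, 0, 0]  # 1 = Review, 0 = not a review (placeholder data)

nb = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
svm = make_pipeline(CountVectorizer(),
                    SVC(kernel="rbf", C=5.0, gamma=0.00185)).fit(texts, labels)

print(nb.predict(["a beautifully told story"]))
print(svm.predict(["buy now and subscribe"]))
```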
Forthcoming: NLDB conference (Paris, 2018)
LREC 2014
IEEE-ACM WI 2018
https://lab.hypotheses.org
You can follow us / participate:

More Related Content

What's hot

Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
KU Leuven
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
KU Leuven
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
Lokesh Ramaswamy
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
9866825059
 

What's hot (20)

Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Aspects of broad folksonomies
Aspects of broad folksonomiesAspects of broad folksonomies
Aspects of broad folksonomies
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Text Data Mining
Text Data MiningText Data Mining
Text Data Mining
 
3. introduction to text mining
3. introduction to text mining3. introduction to text mining
3. introduction to text mining
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Text mining
Text miningText mining
Text mining
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
 
Big Data & Text Mining
Big Data & Text MiningBig Data & Text Mining
Big Data & Text Mining
 
Medical Persona Classification in Social Media
Medical Persona Classification in Social MediaMedical Persona Classification in Social Media
Medical Persona Classification in Social Media
 
Data Mining: Text and web mining
Data Mining: Text and web miningData Mining: Text and web mining
Data Mining: Text and web mining
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Textmining Information Extraction
Textmining Information ExtractionTextmining Information Extraction
Textmining Information Extraction
 
Last But Not Least - Managing The Indexing Process
Last But Not Least  - Managing The Indexing ProcessLast But Not Least  - Managing The Indexing Process
Last But Not Least - Managing The Indexing Process
 
Model of information retrieval (3)
Model  of information retrieval (3)Model  of information retrieval (3)
Model of information retrieval (3)
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Term weighting
Term weightingTerm weighting
Term weighting
 

Similar to A combination of reduction and expansion approaches to handle with long natural language queries

Similar to A combination of reduction and expansion approaches to handle with long natural language queries (20)

Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyrics
 
Dr.saleem gul assignment summary
Dr.saleem gul assignment summaryDr.saleem gul assignment summary
Dr.saleem gul assignment summary
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papers
 
Toxic Comment Classification
Toxic Comment ClassificationToxic Comment Classification
Toxic Comment Classification
 
Semantic Search Component
Semantic Search ComponentSemantic Search Component
Semantic Search Component
 
Interview_Search_Process (1).pptx
Interview_Search_Process (1).pptxInterview_Search_Process (1).pptx
Interview_Search_Process (1).pptx
 
Hc3612711275
Hc3612711275Hc3612711275
Hc3612711275
 
Social Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIASocial Phrases Having Impact in Altmetrics - SOPHIA
Social Phrases Having Impact in Altmetrics - SOPHIA
 
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATAIDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
IDENTIFYING THE SEMANTIC RELATIONS ON UNSTRUCTURED DATA
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
 
Survey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse DictionarySurvey On Building A Database Driven Reverse Dictionary
Survey On Building A Database Driven Reverse Dictionary
 
Word Embedding In IR
Word Embedding In IRWord Embedding In IR
Word Embedding In IR
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
 
G04124041046
G04124041046G04124041046
G04124041046
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Law and Economics.docx
Law and Economics.docxLaw and Economics.docx
Law and Economics.docx
 
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSINGRAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
RAPID INDUCTION OF MULTIPLE TAXONOMIES FOR ENHANCED FACETED TEXT BROWSING
 
Thesaurus 2101
Thesaurus 2101Thesaurus 2101
Thesaurus 2101
 

More from Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)

Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I)
 

More from Patrice Bellot - Aix-Marseille Université / CNRS (LIS, INS2I) (8)

Analyse de sentiment et classification par approche neuronale en Python et Weka
Analyse de sentiment et classification par approche neuronale en Python et WekaAnalyse de sentiment et classification par approche neuronale en Python et Weka
Analyse de sentiment et classification par approche neuronale en Python et Weka
 
Introduction à la fouille de textes et positionnement de l'offre logicielle
Introduction à la fouille de textes et positionnement de l'offre logicielleIntroduction à la fouille de textes et positionnement de l'offre logicielle
Introduction à la fouille de textes et positionnement de l'offre logicielle
 
Introduction générale sur les enjeux du Text and Data Mining TDM
Introduction générale sur les enjeux du Text and Data Mining TDMIntroduction générale sur les enjeux du Text and Data Mining TDM
Introduction générale sur les enjeux du Text and Data Mining TDM
 
Recommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenuRecommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenu
 
Scholarly Book Recommendation
Scholarly Book RecommendationScholarly Book Recommendation
Scholarly Book Recommendation
 
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
Infrastructures et recommandations pour les Humanités Numériques - Big Data e...
 
Huma-Num une Infrastructure pour les SHS
Huma-Num une Infrastructure pour les SHSHuma-Num une Infrastructure pour les SHS
Huma-Num une Infrastructure pour les SHS
 
OpenEdition Lab projects in Text Mining
OpenEdition Lab projects in Text MiningOpenEdition Lab projects in Text Mining
OpenEdition Lab projects in Text Mining
 

Recently uploaded

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 

Recently uploaded (20)

Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 

A combination of reduction and expansion approaches to handle with long natural language queries

  • 1. A COMBINATION OF REDUCTION AND EXPANSION APPROACHES 
 TO HANDLE WITH LONG NATURAL LANGUAGE QUERIES Available online at www.sciencedirect.com Procedia Computer Science 00 (2018) 000–000 www.elsevier.com/locate/procedia 22nd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems A Combination of Reduction and Expansion Approaches to to Handle with Long Natural Language queries Mohamed ETTALEBa , Chiraz LATIRIb , Patrice BELLOTb aUniversity of Tunis El Manar, Faculty of Sciences of Tunis, LIPAH research Laboratory, Tunis ,Tunisia bAix-Marseille Universit, CNRS, LIS UMR 7020, 13397, Marseille, France Abstract Most of the queries submitted to search engines are composed of keywords but it is not enough for users to express their needs. Through verbose natural language queries, users can express complex or highly specific information needs. However, it is di cult for search engine to deal with this type of queries. Moreover, the emergence of social medias allows users to get opinions, sug- gestions or recommendations from other users about complex information needs. In order to increase the understanding of user needs, tasks as the CLEF Social Book Search Suggestion Track have been proposed from 2011 to 2016. The aim is to investigate techniques to support users in searching for books in catalogs of professional metadata and complementary social media. In this re- spect, we introduce in the present paper a statical approach to deal with long verbose queries in Social Information Retrieval (SIR)
  • 3. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Searching for Books : a difficult task User needs : very diverse facets or aspects — topic oriented aspects — With / without a precise context
 eg. arts in China during the XXth century — Books dealing with named entities : locations (the book is about a specific location OR the action takes place at this location), proper names… — What are the most important / more popular books about … — style / type / language aspects — category : fiction, novel, essay, proceedings, position papers… — target : for experts / for dummies / for children … — and also… : well illustrated, cheap, short… => Keyword queries are not enough => verbose natural language queries Book Contents : long stories, metaphoric language, several topics… => Metadata (tags, ToC, indexes…), summaries, reader reviews can help us !3
  • 4. (IR based) Book Suggestion System !4 Hypothesis : Reviews can help to expand the queries Information Retrieval System Book content Metadata Summaries Reviews Needs Example books Query Index Book Suggestion Reduction Expansion
  • 5. P. Bellot (AMU-CNRS, LSIS-OpenEdition) !5 http://social-book-search.humanities.uva.nl/#/overview 2 The Amazon collection The document used for this year’s Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN num- ber, title, number of pages etc... However, in this collection the most important content resides in social data. Indeed Amazon is social-oriented, and user can comment and rate products they purchased or they own. Reviews are identi- fied by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. They can also assign tags of their creation to a product. These tags are useful for refining the search of other users in the way that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels that are contained in the <browseNode> fields. Table 1. Some facts about the Amazon collection. Number of pages (i.e. books) 2, 781, 400 Number of reviews 15, 785, 133 Number of pages that contain a least a review 1, 915, 336 3 Retrieval model 3.1 Sequential Dependence Model Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft’s Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependance a CLEF lab 2011-2016
  • 7. P. Bellot (AMU-CNRS, LSIS-OpenEdition) !7 Datasets Aix-Marseille University Amal Htait, Sébastien Fournier, and Patrice Bellot Amazon collection of 2.8M records of professional and social metadata
  • 8. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Our proposal for dealing with queries — 1- Combine query reduction and query expansion — 2- Apply an Information Retrieval Model over (meta)data and reviews — Query Reduction : removes the less informative words / the most complex to deal with — Query Expansion : fights against word mismatch problem (a concept is described by different terms in user queries and in source documents) — Association rules « query words / word in the reviews » 
 = inter-term correlations (if the query words occur then these words occur as well)
 — Use of the examples given by the user : Pseudo-Relevance Feedback (Rocchio) !8 • stopword removal and stemming for the English language to reduce the verbose queries • query expansion based on Associations rules between terms. • query expansion using similar books mentioned in the topics. 4.1. Query Reduction Removing stopwords has long been a standard query processing step[12]. We used three di↵erent stopwords lists in this study: the standard stopword list1 , as well as two stopwords lists based on morphosyntactic analysis and according to the ranks of terms by some weight. [1] present several methods for automatically constructing a collection dependent stopword list. Their methods generally involve ranking collection terms by some weight, then choosing some rank threshold above which terms are considered stopwords. [17] constructed a specific stopword list of a given collection and used statistic measure IDF to rank terms and decide which term is a stopword or not. Next, they applied these techniques by removing from the query all words which occur on the stopword list. Our proposal is to reduce the verbose queries based on two steps: first, all terms that appear in the standard stopwords list2 are eliminated. Second, we process the linguistic filtering method and execute TreeTagger3 a part of speech tagging on the queries. Then, we select only the particular words of noun type(nouns, proper nouns, etc.) and query words that have a form as noun phrase such as Syrian Civil War. The aim is to keep the appropriate words that can improve the quality of the user query. 4.2. Query expansion using Associations rules between terms(ART) Query expansion is the process of adding additional relevant terms to the original queries to improve the per- formance of information retrieval systems. However, previous studies showed that automatic query expansion using Association rules do not lead to an improvement in the performance. The main idea of this approach is to extract a set of non redundant rules, representing inter-terms correlations in a contextual manner. We use the rules that convey the most interesting correlations amongst terms, to extend the initial queries. Then, we extract a list of books for each query using the MB25 scoring [25]. 4.2.1. Representation and Query Expansion: We represent a query q as a bag of terms:
• 9. Association Rules Mining for Query Expansion !9

4.2. Query expansion using association rules between terms (ART)
Query expansion is the process of adding relevant terms to the original query to improve the performance of information retrieval systems. However, previous studies showed that automatic query expansion using association rules does not always lead to an improvement in performance. The main idea of our approach is to extract a set of non-redundant rules representing inter-term correlations in a contextual manner. We use the rules that convey the most interesting correlations amongst terms to extend the initial queries. Then, we retrieve a list of books for each query using BM25 scoring [25].

4.2.1. Representation and Query Expansion: We represent a query q as a bag of terms, and we associate with q a set of candidate terms CT = {w1, ..., wm}, where wi is a candidate term. The set CT is selected using association rules between terms, as detailed in the next section.

4.2.2. Association Rules: An association rule between terms is an implication of the form R : T1 ⇒ T2, where T1 and T2 are subsets of τ, τ := {t1, ..., tl} being the finite set of terms in the books collection, and T1 ∩ T2 = ∅. T1 and T2 are, respectively, called the premise and the conclusion of R. The rule R is said to be based on the termset T equal to T1 ∪ T2. The support of a rule is Supp(R) = Supp(T), while its confidence is computed as:
Conf(R) = Supp(T) / Supp(T1)
An association rule R is said to be valid if its confidence value, i.e., Conf(R), is greater than or equal to a user-defined threshold denoted minconf. This confidence threshold is used to exclude non-valid rules.

4.2.3. Candidate Terms Generation Approach based on Association Rules: The main idea of this approach is to use the association rules mining technique to discover strong correlations between terms [4]. The set of query terms is expanded using the maximal possible set of terms located in the conclusion parts of the retained rules, while checking that the query terms are located in their premise parts. Illustrative examples of association rules are given in Table 2.

R                    Support   Confidence
military ⇒ warfare   83        0.741
romance ⇒ love       64        0.723
Table 2. Association Rules examples.

The process of generating candidate terms for a given query is performed in the following steps:
1) Selection of a sub-set of 12000 books according to the query's subject. The books are represented only by their social information; we chose to select the title, the reviews and the tags as the content of a book.
2) Annotating the selected books using TreeTagger. The choice of TreeTagger was based on the ability of this tool to recognize the nature (morphosyntactic category) of a word in its context.
3) Extraction of nouns (terms) from the annotated books, and removal of the most frequent ones.
4) Generating the association rules using an efficient algorithm, Closed Association Rule Mining (CHARM) [29], which mines all the closed frequent termsets. As parameters, CHARM takes minsupp = 15 as the relative minimal support and minconf = 0.7 as the minimum confidence of the association rules [18]. Given the Zipf distribution of the collection, the maximum threshold on the support values is set experimentally in order to discard trivial terms, which occur in most of the documents and are therefore related to too many terms. On the other hand, the minimal threshold eliminates marginal terms, which occur in few documents and are therefore not statistically important when occurring in a rule. CHARM outputs the association rules with their support and confidence; Table 2 shows examples of this output.

For each query (bag of words), the expansion thus uses the maximal set of terms found in the conclusion parts of rules whose premises match query terms occurring in the book titles, reviews and metadata. A minimal sketch of this candidate-term generation is given below.
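An illustrative sketch of the candidate-term generation, not the CHARM algorithm itself: it mines only single-term rules by brute force over a toy corpus, applying the support/confidence definitions of Section 4.2.2. The thresholds are lowered for the toy data (the paper uses minsupp = 15 and minconf = 0.7), and the corpus layout is hypothetical.

```python
# Brute-force mining of single-term association rules a => b, then
# query expansion with the conclusions of valid rules whose premise
# appears among the query terms.
from itertools import permutations

MINSUPP, MINCONF = 2, 0.7   # toy thresholds; the paper uses 15 and 0.7

def mine_rules(docs):
    """docs: list of term sets, one per book (title + reviews + tags)."""
    support = {}
    for terms in docs:
        for t in terms:
            support[(t,)] = support.get((t,), 0) + 1
        for a, b in permutations(sorted(terms), 2):
            support[(a, b)] = support.get((a, b), 0) + 1
    rules = []
    for key, supp in support.items():
        if len(key) != 2:
            continue
        a, b = key
        conf = supp / support[(a,)]      # Conf(R) = Supp(T) / Supp(T1)
        if supp >= MINSUPP and conf >= MINCONF:
            rules.append((a, b, supp, conf))   # rule a => b
    return rules

def expand_query(query_terms, rules):
    extra = {concl for prem, concl, _, _ in rules if prem in query_terms}
    return list(query_terms) + sorted(extra - set(query_terms))

docs = [{"military", "warfare", "history"}, {"military", "warfare"},
        {"romance", "love"}, {"romance", "love", "novel"}]
print(expand_query(["military", "history"], mine_rules(docs)))
# -> ['military', 'history', 'warfare']
```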
• 10. Experiments — Experimental setup - Terrier Information Retrieval System - BM25 model !10

In order to exploit these similar books, we expand again the queries processed in Section 4.2 by automatically adding terms from these similar books.

5. Experiments and results
In this section, we describe the experimental setup we used for our experiments.

5.1. Experimental data
To evaluate our approach, the data provided by the CLEF SBS Suggestion track (http://social-book-search.humanities.uva.nl/#/data/suggestion) are used.
• Documents: the document collection consists of 2.8 million book descriptions with metadata from Amazon and LibraryThing. Each document is represented by book title, author, publisher, publication year, library classification codes, and user-generated content in the form of user ratings and reviews.
• Queries: for the collections of queries from 2011 to 2016, the organizers of SBS used the LibraryThing forum to extract a different set of queries with relevance judgments for each year. In our case, we chose to combine the title with the narrative as the representation of the queries.

Year   #Queries   Fields
2011   211        Title, Group, Narrative, type, genre, specificity
2012   96         Title, Group, Narrative, type, genre
2013   370        Title, Group, Narrative, Query
2014   672        Title, Group, Narrative, mediated query
2015   178        Title, Group, Narrative, mediated query
2016   119        Title, Group, Request
Table 3. The six years of topics used for the SBS Suggestion track.

For fair comparison, the queries and the corresponding relevance judgments of the other years are utilized as well.

5.2. Experimental setup
In our experiments, we present experimental results on the SBS 2014 and SBS 2016 collections to compare the performances of the different components of our system. We used the Terrier framework developed at the University of Glasgow [22]. Terrier is a modular platform for large-scale IR applications; it provides indexing and retrieval functionalities. The BM25 model was used with the usual parameter values (b = 0, k3 = 1000, k1 = 2). Using the BM25 model, the score of a document D for a query Q is given by:

S(D, Q) = Σ_{t ∈ Q} [(k1 + 1) · w(t, d)] / [k1 + w(t, d)] · idf(t) · [(k3 + 1) · w(t, Q)] / [k3 + w(t, Q)]

where w(t, d) and w(t, Q) are respectively the weights of term t in document d and in query Q, and idf(t) is the inverse document frequency of term t, given as follows:

idf(t) = log [(|D| − df(t) + 0.5) / (df(t) + 0.5)]

where df(t) is the number of documents containing t, and |D| is the number of documents in the collection (a minimal scoring sketch is given at the end of this subsection). We conducted three different runs, namely:
1. Run-RQ: we used only the reduced queries described in Section 4.1.
2. Run-QEART: we added the terms from association rules between terms to extend the reduced queries.
3. Run-QEEB: query expansion using example books.

Strategy         NDCG@10   MAP      Improvement
Baseline model   0.1041    0.0965   –
RQ               0.1158    0.1014   11.24%
QEART            0.1429    0.1153   23.4%
QEEB             0.1518    0.1194   6.23%
Table 4. Results of SBS 2014 with different strategies.

Strategy         NDCG@10   MAP      Improvement
Baseline model   0.1175    0.0872   –
RQ               0.1240    0.0904   5.53%
QEART            0.1549    0.1013   24.92%
QEEB             0.1688    0.1054   8.97%
Table 5. Results of SBS 2016 with different strategies.

We used two topic sets provided by CLEF SBS in 2014 (680 topics) and 2016 (120 topics).
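As referenced in Section 5.2, a minimal sketch of the BM25 scoring formula above — not Terrier's implementation — assuming w(t, d) and w(t, Q) are raw term frequencies (with b = 0, the document-length normalization term vanishes):

```python
# BM25 scoring with the paper's parameters k1 = 2 and k3 = 1000.
import math

K1, K3 = 2.0, 1000.0

def idf(term: str, docs: list[set]) -> float:
    """idf(t) = log((|D| - df(t) + 0.5) / (df(t) + 0.5))."""
    df = sum(1 for d in docs if term in d)
    return math.log((len(docs) - df + 0.5) / (df + 0.5))

def bm25_score(doc_tf: dict, query_tf: dict, docs: list[set]) -> float:
    """Score one document (term -> tf) against a query (term -> tf)."""
    score = 0.0
    for t, wq in query_tf.items():
        wd = doc_tf.get(t, 0)
        if wd == 0:
            continue
        score += ((K1 + 1) * wd / (K1 + wd)) * idf(t, docs) \
                 * ((K3 + 1) * wq / (K3 + wq))
    return score
```

Ranking a collection then amounts to sorting documents by bm25_score for the (reduced and expanded) query.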
We selected the title and narrative fields for each topic. First, we used the techniques described in Section 4.1 to remove the stop-words and keep the appropriate words in the query. Secondly, the reduced query was expanded by adding new terms using ART; in this step, we applied the CHARM algorithm with the following parameters: minsupp = 15 and minconf = 0.7. Then, we used the similar books given for each topic and applied the pseudo-relevance feedback technique to expand the query again. The Rocchio function was used with its default parameter setting (β = 0.4), and the number of terms selected from each similar book was set to 10 (see the Rocchio sketch below). Table 6 shows an example of both approaches: the reduced query and the query expansion based on association rules between terms.

Original Query              Does anyone know of a good book on the Battle of Gazala?
Reduced Query               good book battle gazala
Query Expansion using ART   good book battle gazala / military history gazala war attack army
Table 6. Examples of the reduction and expansion approaches for handling verbose queries.

5.3. Experimental results
We first compare our baseline retrieval results with the results of the different expansion strategies, shown in Table 4 and Table 5, where the columns RQ, QEART and QEEB represent the results obtained by the reduced query, the reduced query expanded using association rules, and the expansion of QEART using example books, respectively. As can be seen from the two tables, the proposed expansion strategies perform well and improve over the baseline. When we use the query reduction technique, the results are better than the baseline on both sets of topics. We also observe that applying query expansion using pseudo-relevance feedback improves the results further across all the sets of topics: in terms of NDCG@10, the results increase from 0.1429 to 0.1518 on the 2014 set and from 0.1549 to 0.1688 on the 2016 set. From an overall perspective, the greatest relative improvement is obtained by the QEART strategy, with a gain of 24.92% on the 2016 topic set.
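As referenced above, a minimal sketch of the Rocchio-style feedback used for QEEB, under stated assumptions: only the positive-feedback component (weighted by β = 0.4) is modeled, the top 10 terms of each similar book feed the expansion, and the function names and data layout are hypothetical simplifications of the paper's setup.

```python
# Rocchio-style positive feedback over example ("similar") books.
from collections import Counter

BETA, TERMS_PER_BOOK = 0.4, 10

def rocchio_expand(query_terms: list[str],
                   example_books: list[Counter]) -> list[str]:
    """Expand a query with feedback terms from example books."""
    feedback = Counter()
    for book_tf in example_books:
        # keep only the 10 most frequent terms of each similar book,
        # weighted by beta and averaged over the books
        for term, tf in book_tf.most_common(TERMS_PER_BOOK):
            feedback[term] += BETA * tf / len(example_books)
    expanded = Counter({t: 1.0 for t in query_terms})  # original terms
    expanded.update(feedback)                          # add feedback mass
    return [t for t, _ in expanded.most_common()]

books = [Counter({"military": 5, "history": 3, "war": 2})]
print(rocchio_expand(["battle", "gazala"], books))
# -> ['military', 'battle', 'gazala', 'history', 'war']
```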
• 11. P. Bellot (AMU-CNRS, LSIS-OpenEdition) !11

In the run that wins in 2016 (Official run), the authors proposed a search framework which builds, at any moment, a reading list for any specific topic, where the relevance between topics and books, the quality of the books, their popularity and timeliness, and the diversity of the results are respectively embedded into vector representations based on user-generated contents and statistics on social media. The obtained evaluation results also show that our proposed approaches offer interesting results.

Run            NDCG@10   MAP
Our run        0.1518    0.1194
Official run   0.1420    0.102
Medium run     0.096     0.068
Worst run      0.010     0.007
Table 7. Comparison results on Social Book Search 2014.

Run            NDCG@10   MAP
Our run        0.1688    0.1054
Official run   0.2157    0.1253
Medium run     0.0861    0.0524
Worst run      0.0018    0.0004
Table 8. Comparison results on Social Book Search 2016.

However, we noticed that QEART also worked well on the reviews; this is explained by the fact that the association rules allowed us to find the terms having a strong correlation with the query's terms. Lastly, to further the effectiveness analysis, we present a gain and failure analysis of our approach. Table 9 presents the percentages of queries R+ and R− for which the QE techniques perform better, or lower/equal, than the different baselines in terms of NDCG@10. As depicted in Table 9, the average percentage for the set of queries R+ is about 67.40% for the SBS 2014 collection and 66.11% for the SBS 2016 collection. The highest percentage of R+ queries is reached when we combine the ART technique with PRF for QE (QEART with pseudo-relevance feedback). These results confirm the effectiveness of using association rules as well as PRF for query expansion, as shown in the literature. A small sketch of this gain/failure computation follows Table 9.

Run     QEART   QEEB
SBS 2014 collection
R+      62.55   72.24
R−      37.45   29.16
SBS 2016 collection
R+      61.39   70.83
R−      38.61   29.17
Table 9. Percentage of queries R+ and R− for each query set (better or lower/equal than the different baselines) in terms of NDCG@10.
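For completeness, a small sketch of the gain/failure computation behind Table 9, assuming hypothetical per-query NDCG@10 dictionaries for a QE run and its baseline:

```python
# R+ = % of queries the QE run improves; R- = % lower or equal.
def gain_failure(qe_ndcg: dict, baseline_ndcg: dict) -> tuple[float, float]:
    n = len(baseline_ndcg)
    r_plus = sum(1 for q, s in qe_ndcg.items() if s > baseline_ndcg[q])
    return 100 * r_plus / n, 100 * (n - r_plus) / n

qe = {"q1": 0.31, "q2": 0.12, "q3": 0.25}
base = {"q1": 0.22, "q2": 0.12, "q3": 0.18}
print(gain_failure(qe, base))  # -> (66.67, 33.33), up to rounding
```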
[Slide excerpt: a page from J. Bobadilla et al., "Recommender systems survey", Knowledge-Based Systems 46 (2013) 109–132, on recommender system evaluation — precision, recall and F1, ranking measures (half-life, discounted cumulative gain), novelty and diversity, and reliability; Fig. 7: Recommender systems evaluation process.]
• 12. P. Bellot (AMU-CNRS, LSIS-OpenEdition) Conclusion
— Dealing with long and verbose natural language queries for information retrieval is still an open problem
— For Book Search: using social information (reviews) and metadata can efficiently replace the content of the books
— Combining Query Reduction / Expansion is useful and even mandatory for long queries
— Association Rules are efficient and effective
— Perspectives:
— Retrieving and analysing book reviews (aspect-based sentiment analysis) !12

[Slide excerpt: review classification results from related work — a naive Bayes baseline vs. SVM (linear and RBF) classifiers under different indexing schemes (bag-of-words, feature selection of the first 1000 words by normalized z-score after removing words appearing fewer than 5 times, and named-entity distribution), reporting recall/precision/F-measure for the Review vs. non-Review classes on a test set. To appear: NLDB conference (Paris, 2018).]

LREC 2014 — IEEE-ACM WI 2018 — https://lab.hypotheses.org — You can follow us / participate: