SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

LAST UPDATED: 26 OCTOBER 2012


Marina Santini

Exploratory Query-log Analysis Workshop
Organized by Findwise, AB - www.findwise.com/
Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
Lund, Sweden

SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.

Query Logs and Actionable Intelligence:
Questions to LinkedIn-ers
• “Can anyone suggest references about mining query logs for BI and
CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer
Experience Management]

• Applying Findability to Mine Query Logs for BI: Preliminaries “How
can I profitably use query logs for making better business decisions
and predict future trends?” (14th May 2012)

• Mining Query Logs: Query Disambiguation & Understanding
through a KB “some linguistic problems can be sorted out -- for
example those related to sublanguage, terminology, multi-word
expressions, etc. -- through a dictionary-shaped knowledge base
where the different uses of language are stored and continually
updated. I will call this knowledge base DaisyKB” (21st May 2012)

My preliminary reflections based on
this info…
• “The average length of a search query was 2.4 terms"

• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same
user were repeat queries and that 87% of the time the user would click on the
same result. This suggests that many users use repeat queries to revisit or re-find
information. This analysis is confirmed by a Bing search engine blog post telling
about 30% queries are navigational queries."

• “… much research has shown that query term frequency distributions conform to
the power law, or long tail distribution curves. That is, a small portion of the terms
observed in a large query log (e.g. > 100 million queries) are used most often,
while the remaining terms are used less often individually."

• “… in a recent study in 2011 it was found that the average length of queries has
grown steadily over time and average length of non-English languages queries had
increased more than English queries."

Then came the corpus…
• Enterprise query logs: VGR (27 August 2012)
– easier to handle and interpret than general-
purpose search engines’ query logs!

So… that’s the Outline
1. The query log genre
2. Actionable Intelligence
3. A possible use case
4. Preliminary conclusions

What is a (textual) genre?
• Simply simply simply put:
– A genre is a class of text

What characterizes a genre?

1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural
artifact (culture here includes society, media,
technology, etc.)

Genre Characterization
1. Name formation: a genre must indicate a class, a family (for genre name
formation, see Görlach, 2004). Recent webgenres:
blogs, tweets, chatlogs, etc.
2. Community: a genre is not something individual. A genre is a textual form
that is used and recognized by a community (vs. style, which can be
individualized). Ex: blogs → bloggers and blog readers; academic home
pages → academics; etc.
3. Task: a genre meets a RECURRENT communication need. Ex: a personal
home page tells us something about a person; a technical blog is
informative about a specific technology; etc.
4. Conventions: ex : a personal blog is made of posts organized in
chronological order where a blogger communicates personal and
subjective views on some facts.
5. Expectations: when reading a personal blog, readers expect to read
something personal (personal facts or personal opinions) and expect the
possibility to leave a comment if they wish to do so.
6. A genre is a cultural artifact: it might evolve over time (see the History of
Blog by Rebecca Blood, 2000) or disappear if society changes (ex:
chansons de geste). New genres emerge with new media, new
technologies, new information needs.

The query log genre is…
a novel and fully-emerged webgenre
1. Name: in line with other digital genres (ex: web log
→ blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search
engine
4. Conventions: short texts written in ”keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based,
internet-based society OR a subproduct of search
engines

The query log genre:
Linguistic and Textual Conventions
• Length: short text (a query log can be seen as
a corpus of very short texts, shorter than
tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax : OCCASIONALLY (usually no articles, no
prepositions, no subclauses, etc.)

Query Log Genre: The Benefits
• Expressed in a ”lean” sublanguage, the
keywordese:
– reduced morphology
– reduced syntax
– short texts
– Mostly Nouns and Verbs
• Reduced size: compare a 2-year collection of
emails vs a 2-year collection of query logs
• = REDUCED SIZE, REDUCED PRE-PROCESSING;
NO DATA CLEANING!

Expectations: a text written by a user for a
search engine to find relevant information
• The texts (queries) must express information
needs aka users’ intents

• It is good practice to be cautious with the
interpretation of users’ intents. However…
• If we mine query logs with a simple
quantitative approach, it is possible to extract
recurrent information needs and build upon
them…

Actionable Intelligence

• It must be accurate, and verifiably so
• It must be timely
• It must be comprehensive
• It must be comprehensible
• It must give the ability to act on that information straightaway

I would argue:
a Query Log is an ”Actionable” Corpus

• Let’s see…

Mining query logs for actionable
intelligence:
Description and Basic Statistics
• Corpus Time frame: 2010-2011 (2 years)

• “These logs come from the search at hittavard.vgregion.se.
The biggest bulk should come from 1177.se. The rest
should be from vgregion.se. The target audience are both
VGR (Västra Götalands Region) users/employees as well as
the general public, as it is a public site. The internal files are
searches made from within the VGR…”

• Corpus size:
– size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
– number of queries = 249,243
– number of words = 306,453
• Average query length: 1.23 words
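The average query length above is simply the number of words divided by the number of queries. A minimal Python sketch of that calculation follows; the function name `basic_stats` and the toy log are illustrative assumptions, while the two totals used in the cross-check are the figures reported on the slide.

```python
def basic_stats(queries):
    """Return (number of queries, number of words, average words per query)."""
    n_queries = len(queries)
    # Whitespace splitting is an assumption; the slides do not say how words were counted.
    n_words = sum(len(q.split()) for q in queries)
    avg = n_words / n_queries if n_queries else 0.0
    return n_queries, n_words, avg


# Toy log, invented for illustration (the query strings are taken from the slides).
toy_log = ["egenremiss", "mina vårdkontakter", "webbisar", "förnya recept"]
toy_stats = basic_stats(toy_log)
print(toy_stats)

# Cross-check of the reported corpus figures: words / queries.
reported_avg = 306_453 / 249_243
print(round(reported_avg, 2))
```

On the reported totals the ratio rounds to 1.23 words per query, which matches the slide.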

Case study enterprise search – VGR

FINDWISE SLIDESHARE:
http://www.slideshare.net/findwise/case-study-enterprise-search-vgr

http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/

Business Decision:
Improve Search Quality and Usability to
increase Users’ Satisfaction &
Competitiveness

How?
• The simplest approach…

ANALYZE THE HEAD

(1) Take the Top-Ranked Queries

(2) Use them as TAGS (metadata
creation)
Tags are keywords describing the content
1. egenremiss
2. mina vårdkontakter
3. webbisar
4. sjukresor
5. vårdgaranti
6. sjukresa
7. mammografi
8. vårdval
9. influensa
10. urinvägsinfektion
11. halsfluss
12. förnya recept
13. magkatarr
14. vattkoppor
15. byta vårdcentral
16. blanketter
17. svinkoppor
18. reseersättning
19. klamydia
20. feber
21. högkostnadsskydd
22. vinterkräksjukan
23. patientombudsman
24. öroninflammation
25. logga in
26. frikort
27. hosta
28. magsjuka
29. njursten
30. als
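Steps (1) and (2) can be sketched in a few lines of Python: count identical queries and keep the head of the ranking as tags. This is only an illustration under assumptions; the function name `top_queries`, the lowercasing and trimming, and the toy log are not from the original study.

```python
from collections import Counter


def top_queries(queries, k):
    """Return the k most frequent queries as (query, count) pairs."""
    # Light normalization (trim + lowercase) is an assumption made for this sketch.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(k)


# Toy log, invented for illustration (query strings taken from the slides).
toy_log = ["egenremiss", "Egenremiss", "sjukresor", "egenremiss ", "webbisar", "sjukresor"]

# The head of the ranking becomes the tag vocabulary.
tags = [query for query, _ in top_queries(toy_log, 2)]
print(tags)
```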

(3) Use TAG metadata to automatically
annotate only documents selected by
users
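One way to picture this step, as a sketch only: if the log also records which document a user selected after a query (an assumption; the corpus described in these slides contains only the queries), each selected document can inherit the query as a tag whenever that query belongs to the tag list. The names `annotate_clicked_documents`, `click_log` and the `doc-…` identifiers are hypothetical.

```python
from collections import defaultdict


def annotate_clicked_documents(click_log, tags):
    """Map each selected document to the tags that users typed before selecting it."""
    tag_set = set(tags)
    doc_tags = defaultdict(set)
    for query, doc_id in click_log:
        q = query.strip().lower()
        # Only queries that belong to the tag vocabulary are used as annotations.
        if q in tag_set:
            doc_tags[doc_id].add(q)
    return {doc_id: sorted(found) for doc_id, found in doc_tags.items()}


# Hypothetical (query, selected document) pairs, invented for illustration.
click_log = [
    ("egenremiss", "doc-17"),
    ("sjukresor", "doc-42"),
    ("sjukresa", "doc-42"),
    ("hosta", "doc-99"),      # not in the tag list below, so doc-99 stays untagged
    ("Egenremiss", "doc-17"),
]
tags = ["egenremiss", "sjukresor", "sjukresa"]

annotations = annotate_clicked_documents(click_log, tags)
print(annotations)
```

Documents that nobody selected receive no tags, which is the point of annotating only user-selected documents.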

(4) Use most frequent queries to create
a query suggester
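A very simple suggester could be a prefix lookup over the frequent queries, ranked by how often each was typed. The sketch below assumes this prefix-matching design; `build_suggester`, the `min_count` threshold and the toy log are illustrative choices, not the method used in the study.

```python
from collections import Counter


def build_suggester(queries, min_count=2):
    """Return a suggest(prefix) function backed by the frequent queries."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    frequent = [(q, c) for q, c in counts.items() if c >= min_count]
    # Most frequent first; alphabetical order breaks ties so the output is stable.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))

    def suggest(prefix, limit=5):
        p = prefix.strip().lower()
        return [q for q, _ in frequent if q.startswith(p)][:limit]

    return suggest


# Toy log, invented for illustration (query strings taken from the slides).
toy_log = ["sjukresor"] * 3 + ["sjukresa"] * 2 + ["svinkoppor"] * 2 + ["njursten"]
suggest = build_suggester(toy_log)
print(suggest("sjuk"))
print(suggest("s"))
```

Rare queries (here `njursten`, seen once) are left out, so the suggester only proposes what many users have already asked for.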

(5) If you want, you can sort queries
automatically into query types and
build…
• a taxonomy

• The categories of the taxonomy can be also used
to annotate existing documents automatically
(another layer of METADATA)
– TAGS describe the content
– CATEGORIES IN A TAXONOMY organize the content
– Categories can be hierarchical whereas tags cannot
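The difference between flat tags and hierarchical categories can be sketched as follows. The two-level category paths (`health/conditions`, `services/travel`, `services/referrals`) and the assignment of queries to them are hypothetical and hand-made for this example; the slides do not give the actual taxonomy.

```python
# Hypothetical taxonomy: category path -> queries assigned to it.
TAXONOMY = {
    "health/conditions": {"influensa", "halsfluss", "vattkoppor"},
    "services/travel": {"sjukresor", "sjukresa", "reseersättning"},
    "services/referrals": {"egenremiss"},
}


def categorize(query):
    """Return the category path of a query, or None if it is not in the taxonomy."""
    q = query.strip().lower()
    for path, members in TAXONOMY.items():
        if q in members:
            return path
    return None


def with_ancestors(path):
    """Expand 'a/b' into ['a', 'a/b'] so a document can be filed at every level."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


category = categorize("Sjukresor")
print(category, with_ancestors(category))
```

A tag such as `sjukresor` only says what a document is about; the category path also places it under a broader group, which is what makes the taxonomy an extra layer of metadata.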

If you want, you can give the taxonomy
to document creators, so they can
annotate the text with metadata
• … in short you will have a multilabelled corpus
that can be used with machine learning.

The importance of metadata to
structure unstructured data & to
extract actionable intelligence
• From Unstructured Data to Actionable Intelligence by
Ramana Rao, 2003

• “We access information for various purposes and in
various ways according to our purpose. Sometimes
we’re surveying an area of knowledge, trying to get a
general understanding of what it’s about or what’s
available. At other times we’re searching for specific
answers. […] It is this range of purpose and context
that we can better address by providing a richer set of
information access tools based on exploiting
metadata.”

Linguistic Remarks
• At the top of the frequency list:
– Nouns
– Compounds
– A+N
– V+N

• More complex constructions at the bottom

In this case, automatic annotation can
help a lot

Benefit for the Search Provider
• Mining query logs to extract user-created knowledge, i.e.
queries that can be used as tags (metadata)
• Quickly create domain-specific taxonomies you can
capitalize upon, especially for new client companies
working in related fields
• Enhancements of current search products
• Inexpensive creation of annotated corpora: document
annotation through query logs is a simple technique
that in a short time will build massive annotated
corpora to use for machine learning, which will allow
more sophisticated search refinements.

Benefits for Clients & End Users
• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, friendlier, more exhaustive
and more accurate.
• If this happens, clients will spend less on customer
care. If you find what you need online, there is no
need to call a helpdesk or customer care service.

Query Pre-processing?
Absolutely YES
• Normalization
– egen remiss & egenremiss
– Spelling correction
• Terminology expansion (domain-dependent)
– anemi & blodbrist (ex: taken from Freberg Heppin, 2010;
ex: painkiller & analgesic)
– Stemming/Lemmatization (blanketter → blankett;
sjukresor → sjukresa)
If you want… Nj
• Compound decomposition & Tokenization. Text chunks
(such as queries) are more informative and less
ambiguous than single words. No need to tokenize
or decompose, if RECALL is ok.
• Ontology? Uhm.. not sure we need a semantic
structure here….
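The normalization, lemmatization and terminology-expansion steps can be sketched with small lookup tables. The tables below contain only the example pairs given on the slide; a dictionary lookup is an assumption made to keep the sketch short (a real system would more likely use a spelling corrector, a lemmatizer and a domain term list), and all names are hypothetical.

```python
# Lookup tables built from the slide's own examples only.
VARIANTS = {"egen remiss": "egenremiss"}                      # spacing variants
LEMMAS = {"blanketter": "blankett", "sjukresor": "sjukresa"}  # plural -> singular
SYNONYM_SETS = [{"anemi", "blodbrist"}]                       # domain terminology


def normalize(query):
    """Lowercase, collapse whitespace, then map variants and inflected forms."""
    q = " ".join(query.lower().split())
    q = VARIANTS.get(q, q)
    # Whole-query lookup: the query is deliberately not split into tokens.
    return LEMMAS.get(q, q)


def expand(query):
    """Return the normalized query together with its known synonyms."""
    q = normalize(query)
    for synonyms in SYNONYM_SETS:
        if q in synonyms:
            return set(synonyms)
    return {q}


print(normalize("  Egen  Remiss "))
print(normalize("Sjukresor"))
print(expand("blodbrist"))
```

Working on the whole query string, rather than on single tokens, follows the remark that text chunks are more informative than single words.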

Tokenization? Domain-dependent?
Top query frequencies          Top word frequencies
• 21388 egenremiss             • 21565 egenremiss
• 17360 mina vårdkontakter     • 17717 vårdkontakter
• 10553 webbisar               • 17407 mina
• 8787 sjukresor               • 10567 webbisar
• 7345 vårdgaranti             • 8880 sjukresor
• 3938 sjukresa                • 7357 vårdgaranti
• 3734 mammografi              • 4044 sjukresa
• 3723 vårdval                 • 3763 vårdcentral
• 3653 influensa               • 3754 mammografi
• 2908 urinvägsinfektion       • 3732 influensa
• 2803 halsfluss               • 3730 vårdval
• 2542 förnya recept           • 2932 urinvägsinfektion
• 2460 magkatarr               • 2819 halsfluss
• 2394 vattkoppor              • 2805 recept
• 2274 byta vårdcentral        • 2543 förnya
• 2256 blanketter              • 2463 magkatarr
• 1878 svinkoppor              • 2413 vattkoppor
• 1840 reseersättning          • 2349 i
• 1653 klamydia                • 2296 byta
• 1559 feber                   • 2269 blanketter
• 1525 högkostnadsskydd        • 1881 svinkoppor
• 1420 vinterkräksjukan        • 1840 reseersättning
• 1405 patientombudsman        • 1802 feber
• 1326 öroninflammation        • 1666 klamydia
• 1252 logga in                • 1571 högkostnadsskydd
• 1251 frikort                 • 1422 vinterkräksjukan
• 1199 hosta                   • 1405 patientombudsman
• 1193 magsjuka                • 1383 hepatit
• 1184 njursten                • 1338 öroninflammation
• 1167 als                     • 1331 frikort
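The two rankings differ because one counts whole queries and the other counts the words inside them. A short Python sketch of both counts follows; `query_and_word_counts`, the whitespace tokenization and the toy log are assumptions made for illustration.

```python
from collections import Counter


def query_and_word_counts(queries):
    """Count whole queries and, separately, the whitespace-separated words in them."""
    query_counts = Counter()
    word_counts = Counter()
    for raw in queries:
        q = " ".join(raw.lower().split())
        if not q:
            continue
        query_counts[q] += 1           # the query as one unit
        word_counts.update(q.split())  # the query broken into tokens
    return query_counts, word_counts


# Toy log, invented for illustration (query strings taken from the slides).
toy_log = [
    "mina vårdkontakter", "mina vårdkontakter",
    "byta vårdcentral", "vårdcentral",
    "förnya recept", "recept",
]
query_counts, word_counts = query_and_word_counts(toy_log)
print(query_counts.most_common(3))
print(word_counts.most_common(3))
```

Tokenizing splits a multi-word query such as "mina vårdkontakter" into two separate entries ("mina", "vårdkontakter"), and a word such as "vårdcentral" (3763 occurrences as a word) gathers counts from several different queries, including "byta vårdcentral" (2274 occurrences as a query).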

Different search users’ behaviour:
Enterprise vs. Web?
[Figure: two panels compared. VGR: Swedish – Enterprise Search; Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search]

Preliminary reflections revisited…
• “The average length of a search query was 2.4 terms” … uhm.. it depends: enterprise vs. web

• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were
repeat queries and that 87% of the time the user would click on the same result. This
suggests that many users use repeat queries to revisit or re-find information. This analysis is
confirmed by a Bing search engine blog post telling about 30% queries are navigational
queries." … not investigated

• "much research has shown that query term frequency distributions conform to the power
law, or long tail distribution curves. That is, a small portion of the terms observed in a large
query log (e.g. > 100 million queries) are used most often, while the remaining terms are
used less often individually." … definitely yes

• "in a recent study in 2011 it was found that the average length of queries has grown steadily
over time and average length of non-English languages queries had increased more than
English queries." … uhm.. it depends: enterprise vs. web + language

Conclusions from this Exploration
• Query logs are a genre that is easier to exploit for
extracting actionable intelligence than longer text genres
(e.g. email collections).

• Query logs are a good, handy and economic source of
information for actionable business decisions, such as:
– keeping a cutting-edge profile on the market,
– enhancing enterprise search usability (query suggester/autofill),
– disambiguation,
– annotation and taxonomy creation
– preventing huge costs for customer helpdesk and similar services
through a cutting-edge search functionality!

• Future: More and diversified use cases…

THANK YOU FOR YOUR ATTENTION

QUESTIONS?
