This document summarizes an exploratory study on mining query logs for actionable intelligence. It discusses query logs as a genre and how they can be analyzed to extract useful metadata, such as frequently searched terms, that can then be used to tag and categorize documents automatically. By analyzing the top queries from an enterprise search log, the study identified keywords that could help structure the search experience, improve search quality and user satisfaction, and provide actionable intelligence for business decisions.
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
1. LAST UPDATED: 26 OCTOBER 2012
SearchInFocus
Exploratory Study on Query Logs and Actionable Intelligence
Marina Santini
Exploratory Query-log Analysis Workshop
Organized by Findwise, AB - www.findwise.com/
Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
Lund, Sweden
SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.
2. Query Logs and Actionable Intelligence:
Questions to LinkedIn-ers
• “Can anyone suggest references about mining query logs for BI and
CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer
Experience Management]
• Applying Findability to Mine Query Logs for BI: Preliminaries “How
can I profitably use query logs for making better business decisions
and predict future trends?” (14th May 2012)
• Mining Query Logs: Query Disambiguation & Understanding
through a KB “some linguistic problems can be sorted out -- for
example those related to sublanguage, terminology, multi-word
expressions, etc. -- through a dictionary-shaped knowledge base
where the different uses of language are stored and continually
updated. I will call this knowledge base DaisyKB” (21st May 2012)
3. My preliminary reflections based on
this info…
• “The average length of a search query was 2.4 terms"
• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same
user were repeat queries and that 87% of the time the user would click on the
same result. This suggests that many users use repeat queries to revisit or re-find
information. This analysis is confirmed by a Bing search engine blog post telling
about 30% queries are navigational queries."
• “… much research has shown that query term frequency distributions conform to
the power law, or long tail distribution curves. That is, a small portion of the terms
observed in a large query log (e.g. > 100 million queries) are used most often,
while the remaining terms are used less often individually."
• “… in a recent study in 2011 it was found that the average length of queries has
grown steadily over time and average length of non-English languages queries had
increased more than English queries."
4. Then came the corpus…
• Enterprise query logs: VGR (27 August 2012)
– easier to handle and interpret than general-
purpose search engines’ query logs!
5. So… that’s the Outline
1. The query log genre
2. Actionable Intelligence
3. A possible use case
4. Preliminary conclusions
6. What is a (textual) genre?
• Simply simply simply put:
– A genre is a class of text
7. What characterizes a genre?
1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural
artifact (culture here includes society, media,
technology, etc.)
8. Genre Characterization
1. Name formation: a genre must indicate a class, a family (for genre name
formation, see Görlach, 2004). Recent webgenres:
blogs, tweets, chatlogs, etc.
2. Community: a genre is not something individual. A genre is a textual form
that is used and recognized by a community (vs. style, which can be
individualized). Ex: blogs → bloggers and blog readers; academic home
pages → academics; etc.
3. Task: a genre meets a RECURRENT communication need. Ex: the personal
home page genre tells us something about a person; a technical blog is
informative about a specific technology; etc.
4. Conventions: ex : a personal blog is made of posts organized in
chronological order where a blogger communicates personal and
subjective views on some facts.
5. Expectations: when reading a personal blog, readers expect to read
something personal (personal facts or personal opinions) and expect the
possibility to leave a comment if they wish to do so.
6. A genre is a cultural artifact: it might evolve over time (see the History of
Blog by Rebecca Blood, 2000) or might disappear if society changes (ex:
chansons de geste). New genres emerge with new media, new
technologies, new information needs.
9. The query log genre is…
a novel and fully-emerged webgenre
1. Name: in line with other digital genres (ex: web log →
blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search
engine
4. Conventions: short texts written in ”keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based,
internet-based society OR a subproduct of search
engines
10. The query log genre:
Linguistic and Textual Conventions
• Length: short text (a query log can be seen as
a corpus of very short texts, shorter than
tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: ”keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax : OCCASIONALLY (usually no articles, no
prepositions, no subclauses, etc.)
11. Query Log Genre: The Benefits
• Expressed in a ”lean” sublanguage, the
keywordese:
– reduced morphology
– reduced syntax
– short texts
– Mostly Nouns and Verbs
• Reduced size: compare a 2-year collection of
emails vs a 2-year collection of query logs
• = REDUCED SIZE, REDUCED PRE-PROCESSING;
NO DATA CLEANING!
12. Expectations: a text written by a user for a
search engine to find relevant information
• The texts (queries) must express information
needs aka users’ intents
• It is good practice to be cautious with the
interpretation of users’ intents. However…
• If we mine query logs with a simple
quantitative approach, it is possible to extract
recurrent information needs and build upon
them…
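A minimal sketch of what such a simple quantitative approach could look like: count how often each query occurs and read the most frequent ones as recurrent information needs. The function name `top_queries` and the sample queries are assumptions made for illustration; they are not taken from the VGR log.

```python
from collections import Counter


def top_queries(queries, n=3):
    """Return the n most frequent queries, ignoring case and surrounding spaces."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(n)


# Hypothetical sample log (illustrative only, not real VGR data).
sample_log = ["blanketter", "sjukresor", "Blanketter", "remiss", "blanketter ", "sjukresor"]

for query, frequency in top_queries(sample_log):
    print(query, frequency)
```

Lowercasing and trimming are the only clean-up steps here, which fits the claim that query logs need little pre-processing; anything beyond that is a separate design decision.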
13. Actionable Intelligence
• It must be accurate, and verifiably so
• It must be timely
• It must be comprehensive
• It must be comprehensible
• It must come with some ability to act on that information straightaway
14. I would argue:
a Query Log is an ”Actionable” Corpus
• Let’s see…
15. Mining query logs for actionable
intelligence:
Description and Basic Statistics
• Corpus Time frame: 2010-2011 (2 years)
• “These logs come from the search at hittavard.vgregion.se.
The biggest bulk should come from 1177.se. The rest
should be from vgregion.se. The target audience are both
VGR (Västra Götalands Region) users/employees as well as
the general public, as it is a public site. The internal files are
searches made from within the VGR…”
• Corpus size:
– size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
– number of queries = 249,243
– number of words = 306,453
• Average query length: 1.23 words
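The average query length follows directly from the two corpus counts above (number of words divided by number of queries). A short check of the arithmetic, with variable names chosen here for illustration:

```python
# Corpus counts as reported on the slide.
number_of_queries = 249_243
number_of_words = 306_453

# Average query length = total words / total queries.
average_query_length = number_of_words / number_of_queries

print(f"{average_query_length:.2f} words per query")
```

The result rounds to the 1.23 words reported on the slide.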
16. Case study enterprise search – VGR
FINDWISE SLIDESHARE:
http://www.slideshare.net/findwise/case-study-enterprise-search-vgr
http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/
24. 4) Use most frequent queries to create
a query suggester
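One possible way to turn the most frequent queries into a query suggester is a frequency-ranked prefix match. This is only a sketch under assumptions: the class name `QuerySuggester`, its `suggest` method and the sample queries are hypothetical, and a production suggester would need an index rather than a linear scan.

```python
from collections import Counter


class QuerySuggester:
    """Suggest past queries that start with what the user has typed so far."""

    def __init__(self, queries):
        # Frequency of each logged query, lowercased and trimmed.
        self.counts = Counter(q.strip().lower() for q in queries if q.strip())

    def suggest(self, prefix, limit=5):
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]


# Hypothetical sample log (illustrative only).
suggester = QuerySuggester(
    ["sjukresor", "sjukresa", "sjukresor", "sjukintyg", "blanketter", "sjukresor", "sjukintyg"]
)
print(suggester.suggest("sjuk"))
```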
25. 5) If you want, you can sort queries
automatically into query types and
build…
• a taxonomy
• The categories of the taxonomy can be also used
to annotate existing documents automatically
(another layer of METADATA)
– TAGS describe the content
– CATEGORIES IN A TAXONOMY organize the content
– Categories can be hierarchical whereas tags cannot
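As a rough illustration of the two metadata layers, the sketch below annotates a document with tags (query terms found in the text) and with the taxonomy categories those terms belong to. The taxonomy, its category names and the sample sentence are assumptions invented for this example, and the whitespace-based word matching is deliberately naive.

```python
# Hypothetical taxonomy: category -> frequent query terms assigned to it.
TAXONOMY = {
    "forms": ["blanketter", "remiss"],
    "travel": ["sjukresor"],
}


def annotate(text, taxonomy):
    """Return the tags (matching terms) and categories found in a document."""
    words = set(text.lower().split())
    tags = sorted(
        term for terms in taxonomy.values() for term in terms if term in words
    )
    categories = sorted(
        category
        for category, terms in taxonomy.items()
        if any(term in words for term in terms)
    )
    return {"tags": tags, "categories": categories}


document = "Blanketter och information om sjukresor"
print(annotate(document, TAXONOMY))
```

A document can receive several tags and several categories at once, which is what makes the resulting corpus multi-labelled.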
26. If you want, you can give the taxonomy
to document creators, so they can
annotate the text with metadata
• … in short you will have a multilabelled corpus
that can be used with machine learning.
27. The importance of metadata to
structure unstructured data & to
extract actionable intelligence
• From Unstructured Data to Actionable Intelligence by
Ramana Rao, 2003
• ” We access information for various purposes and in
various ways according to our purpose. Sometimes
we’re surveying an area of knowledge, trying to get a
general understanding of what it’s about or what’s
available. At other times we’re searching for specific
answers. […] It is this range of purpose and context
that we can better address by providing a richer set of
information access tools based on exploiting
metadata.”
28. Linguistic Remarks
• At the top of the frequency list:
– Nouns
– Compounds
– A+N
– V+N
• More complex constructions at the bottom
31. Benefit for the Search Provider
• Mining query logs to extract user-created knowledge, i.e.
queries that can be used as tags (metadata)
• Quickly create domain-specific taxonomies you can
capitalize upon, especially for new client companies
working in related fields
• Enhancements of current search products
• Inexpensive creation of annotated corpora: document
annotation through query logs is a simple technique
that in a short time will build massive annotated
corpora to use for machine learning, which will allow
more sophisticated search refinements.
32. Benefits for Clients & End Users
• Somebody said: SEARCH MUST BE MIND READER!
• BUT ALSO faster, more friendly, more exhaustive
and more accurate.
• If this happens, clients will spend less on customer
care. If you find what you need online, there is no
need to call a helpdesk or customer care service.
33. Query Pre-processing ?
Absolutely YES
• Normalization
– egen remiss & egenremiss
– Spelling correction
• Terminology expansion (domain-dependent)
– anemi & blodbrist (ex: taken from Friberg Heppin, 2010;
ex: painkiller & analgesic)
– Stemming/Lemmatization (blanketter → blankett;
sjukresor → sjukresa)
If you want…
• Compound decomposition & Tokenization. Text chunks
(such as queries) are more informative and less
ambiguous than single words. No need to tokenize
or decompose, if RECALL is ok.
Nj
• Ontology? Uhm.. not sure we need a semantic
structure here….
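The pre-processing steps on this slide can be sketched with plain lookup tables applied to the whole query, in line with the remark that queries can be treated as chunks without tokenization. The tables below hold only the slide's own examples; the table names, the function `normalize_query` and the choice of which form counts as canonical (for instance anemi rather than blodbrist) are assumptions.

```python
# Hypothetical lookup tables, filled with the examples from the slide.
COMPOUND_VARIANTS = {"egen remiss": "egenremiss"}
LEMMAS = {"blanketter": "blankett", "sjukresor": "sjukresa"}
SYNONYMS = {"blodbrist": "anemi"}


def normalize_query(query):
    """Map a raw query to a canonical form using whole-query lookups."""
    # Lowercase and collapse repeated whitespace.
    q = " ".join(query.lower().split())
    # Merge spelling variants of compounds.
    q = COMPOUND_VARIANTS.get(q, q)
    # Reduce inflected forms to their lemma.
    q = LEMMAS.get(q, q)
    # Map domain synonyms to one preferred term.
    q = SYNONYMS.get(q, q)
    return q


for raw in ["Egen  remiss", "Blanketter", "blodbrist", "sjukresor"]:
    print(raw, "->", normalize_query(raw))
```

Queries that appear in none of the tables pass through unchanged, so the tables can grow gradually as new variants show up in the log.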
35. Different search users’ behaviour:
Enterprise vs. Web?
• VGR: Swedish – Enterprise Search
• Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search
36. Preliminary reflections revisited…
• “The average length of a search query was 2.4 terms” … uhm.. It depends: enterprise vs. web
• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were
repeat queries and that 87% of the time the user would click on the same result. This
suggests that many users use repeat queries to revisit or re-find information. This analysis is
confirmed by a Bing search engine blog post telling about 30% queries are navigational
queries." … not investigated
• "much research has shown that query term frequency distributions conform to the power
law, or long tail distribution curves. That is, a small portion of the terms observed in a large
query log (e.g. > 100 million queries) are used most often, while the remaining terms are
used less often individually." … definitely yes
• "in a recent study in 2011 it was found that the average length of queries has grown steadily
over time and average length of non-English languages queries had increased more than
English queries." … uhm.. It depends: enterprise vs. web + language
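A rough way to look for the long-tail pattern in a log is to measure how much of the total query volume is covered by the few most frequent queries. The sketch below is only an indicator, not a test for a power law; the function name `head_share` and the sample data are assumptions for illustration.

```python
from collections import Counter


def head_share(queries, top_k):
    """Fraction of all query occurrences covered by the top_k distinct queries."""
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head = sum(count for _, count in counts.most_common(top_k))
    return head / total


# Hypothetical log: one very frequent query and a tail of rarer ones.
sample_log = ["blanketter"] * 6 + ["sjukresor"] * 2 + ["remiss", "egenremiss"]
print(head_share(sample_log, top_k=1))
```

A high share for a small `top_k` is what one would expect from a long-tailed distribution: a handful of queries dominate, and the remaining ones are each used rarely.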
37. Conclusions from this Exploration
• Query logs are a genre that is easier to exploit than other digital
genres for extracting actionable intelligence.
• Query logs are a good, handy and economic source of
information for actionable business decisions, such as:
– keeping a cutting-edge profile on the market,
– enhancing enterprise search usability (query suggester/autofill),
– disambiguation,
– annotation and taxonomy creation
– preventing huge costs for customer helpdesk and similar services
through a cutting-edge search functionality!
• Future: More and diversified use cases…
Query logs are an important source of information to surmise users' intents. Although Karlgren (2010) points out that “There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]”, some linguistic problems could be sorted out by applying more advanced text/content analytics, such as register/sublanguage identification and terminology classification (see Friberg Heppin, 2011). In this presentation, I will argue that query logs can be considered a digital textual genre like emails, blogs, chats, tweets and so forth. All these genres contain unstructured information that, still today, is difficult to leverage satisfactorily. The hypothesis that I would like to put forward in this workshop is that query logs might be easier to exploit to extract useful information and actionable intelligence than other digital genres.
What are the expectations from an insändare (letter to the editor), from an interview, from a review, from an editorial?
Expressed in “keywordese”, i.e. the kind of sublanguage/jargon we use to communicate with search engines (that is, a language without articles, without prepositions and other stop words, without much syntax or hedges, etc.), query logs are skimmed texts that require no cleaning from redundancies or rhetorical ornaments, and reduced pre-processing.
For information to be actionable, it must have at least four characteristics: it must be accurate, and verifiably so; it must be timely; it must be comprehensive; it must be comprehensible. These are necessary but not sufficient conditions; to make information truly actionable, the information must be accompanied by tools that allow you to act on the information. In a perfect world, you would also have tools that allow you to monitor the effect of your actions and to receive feedback. But at the very least, you need information that is accurate, timely, comprehensive and comprehensible, with some ability to act on that information straightaway.
Not big data
User groups: patients, relatives, doctors, administrative staff, help desk, etc.