LAST UPDATED: 26 OCTOBER 2012




                          SearchInFocus
Exploratory Study on Query Logs and Actionable Intelligence



                                   Marina Santini



                        Exploratory Query-log Analysis Workshop
                    Organized by Findwise, AB - www.findwise.com/
              Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
                                      Lund, Sweden

 SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.
Query Logs and Actionable Intelligence: Questions to LinkedIn-ers
• “Can anyone suggest references about mining query logs for BI and
  CEM?” (3rd May 2012) [BI = Business Intelligence; CEM = Customer
  Experience Management]

• Applying Findability to Mine Query Logs for BI: Preliminaries. “How
  can I profitably use query logs for making better business decisions
  and predict future trends?” (14th May 2012)

• Mining Query Logs: Query Disambiguation & Understanding
  through a KB. “some linguistic problems can be sorted out -- for
  example those related to sublanguage, terminology, multi-word
  expressions, etc. -- through a dictionary-shaped knowledge base
  where the different uses of language are stored and continually
  updated. I will call this knowledge base DaisyKB” (21st May 2012)
My preliminary reflections based on this info…
•   “The average length of a search query was 2.4 terms”

•   “A 2005 study of Yahoo's query logs revealed 33% of the queries from the same
    user were repeat queries and that 87% of the time the user would click on the
    same result. This suggests that many users use repeat queries to revisit or re-find
    information. This analysis is confirmed by a Bing search engine blog post telling
    about 30% queries are navigational queries.”

•   “… much research has shown that query term frequency distributions conform to
    the power law, or long tail distribution curves. That is, a small portion of the terms
    observed in a large query log (e.g. > 100 million queries) are used most often,
    while the remaining terms are used less often individually.”

•   “… in a recent study in 2011 it was found that the average length of queries has
    grown steadily over time and average length of non-English languages queries had
    increased more than English queries.”
Then came the corpus…
• Enterprise query logs: VGR (27 August 2012)
  – easier to handle and interpret than general-
    purpose search engines’ query logs!
So… that’s the Outline
1.   The query log genre
2.   Actionable Intelligence
3.   A possible use case
4.   Preliminary conclusions
What is a (textual) genre?
• Simply simply simply put:
  – A genre is a class of text
What characterizes a genre?

1.   Must have a name
2.   Must be recognized within a community
3.   Must be produced during a task
4.   Must have conventions
5.   Must raise expectations
6.   Can change over time. It is a cultural
     artifact (culture here includes society, media,
     technology, etc.)
Genre Characterization
1.   Name formation: a genre must indicate a class, a family (for genre name
     formation, see Görlach, 2004). Recent webgenres:
     blogs, tweets, chatlogs, etc.
2.   Community: a genre is not something individual. A genre is a textual form
     that is used and recognized by a community (whereas style can be
     individual). Ex.: blogs → bloggers and blog readers; academic home
     pages → academics; etc.
3.   Task: a genre meets a RECURRENT communication need. Ex.: a personal
     home page tells us something about a person; a technical blog is
     informative about a specific technology; etc.
4.   Conventions: ex : a personal blog is made of posts organized in
     chronological order where a blogger communicates personal and
     subjective views on some facts.
5.   Expectations: when reading a personal blog, readers expect to read
     something personal (personal facts or personal opinions) and expect the
     possibility to leave a comment if they wish to do so.
6.   A genre is a cultural artifact: it might evolve over time (see the history of
     blogs by Rebecca Blood, 2000) or disappear if society changes (ex.:
     chansons de geste). New genres emerge with new media, new
     technologies, new information needs.
The query log genre is…
 a novel and fully-emerged webgenre
1. Name: in line with other digital genres (ex.: web log →
    blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search
   engine
4. Conventions: short texts written in “keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based,
   internet-based society OR a subproduct of search
   engines
The query log genre: Linguistic and Textual Conventions
• Length: short text (a query log can be seen as
  a corpus of very short texts, shorter than
  tweets, mobile text messages, chat logs, etc.)
• Sublanguage/Jargon: “keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax: OCCASIONAL (usually no articles, no
  prepositions, no subclauses, etc.)
Query Log Genre: The Benefits
• Expressed in a ”lean” sublanguage, the
  keywordese:
  –   reduced morphology
  –   reduced syntax
  –   short texts
  –   Mostly Nouns and Verbs
• Reduced size: compare a 2-year collection of
  emails vs. a 2-year collection of query logs
• = REDUCED SIZE, REDUCED PRE-PROCESSING;
  NO DATA CLEANING!
Expectations: a text written by a user for a
search engine to find relevant information
 • The texts (queries) must express information
   needs aka users’ intents

 • It is good practice to be cautious with the
   interpretation of users’ intents. However…
 • If we mine query logs with a simple
   quantitative approach, it is possible to extract
   recurrent information needs and build upon
   them…
Actionable Intelligence

•   It must be accurate and verifiable
•   It must be timely
•   It must be comprehensive
•   It must be comprehensible
•   It must be possible to act on the information straightaway
I would argue:
a Query Log is an ”Actionable” Corpus

 • Let’s see…
Mining query logs for actionable intelligence:
Description and Basic Statistics
• Corpus Time frame: 2010-2011 (2 years)

• “These logs come from the search at hittavard.vgregion.se.
  The biggest bulk should come from 1177.se. The rest
  should be from vgregion.se. The target audience are both
  VGR (Västra Götalands Region) users/employees as well as
  the general public, as it is a public site. The internal files are
  searches made from within the VGR…”

• Corpus size:
   – size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
   – number of queries = 249,243
   – number of words = 306,453
• Average query length: 1.23 words
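
Basic statistics of this kind take only a few lines to compute. The following is a minimal sketch, assuming the log has already been reduced to one query string per entry and that words are separated by whitespace. The function name `basic_stats` and the sample queries are illustrative and not part of the study.

```python
def basic_stats(queries):
    """Return (number of queries, number of words, average words per query)."""
    n_queries = len(queries)
    # Assumption: a "word" is any whitespace-separated token.
    n_words = sum(len(q.split()) for q in queries)
    avg = n_words / n_queries if n_queries else 0.0
    return n_queries, n_words, avg


# Illustrative sample only; the real corpus is not reproduced here.
sample = ["egenremiss", "mina vårdkontakter", "förnya recept", "feber"]
print(basic_stats(sample))

# Cross-check of the reported figures: 306,453 words over 249,243 queries.
reported_avg = 306_453 / 249_243
print(round(reported_avg, 2))
```

The last two lines simply divide the reported word count by the reported query count, which is how the average query length on this slide is obtained.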
Case study enterprise search – VGR

   FINDWISE SLIDESHARE:
   http://www.slideshare.net/findwise/case-study-enterprise-search-vgr




http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/
Business Decision:
Improve Search Quality and Usability to Increase Users’ Satisfaction & Competitiveness
How?
• The simplest approach…

         ANALYZE THE HEAD
(1) Take the Top-Ranked Queries
(2) Use them as TAGS (metadata creation)
Tags are keywords describing the content
1.    egenremiss
2.    mina vårdkontakter
3.    webbisar
4.    sjukresor
5.    vårdgaranti
6.    sjukresa
7.    mammografi
8.    vårdval
9.    influensa
10.   urinvägsinfektion
11.   halsfluss
12.   förnya recept
13.   magkatarr
14.   vattkoppor
15.   byta vårdcentral
16.   blanketter
17.   svinkoppor
18.   reseersättning
19.   klamydia
20.   feber
21.   högkostnadsskydd
22.   vinterkräksjukan
23.   patientombudsman
24.   öroninflammation
25.   logga in
26.   frikort
27.   hosta
28.   magsjuka
29.   njursten
30.   als
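
Steps (1) and (2) amount to a frequency count over the log. Here is a minimal sketch, assuming one query string per log entry. Lowercasing and whitespace trimming are assumptions added for illustration, and `top_queries` is a hypothetical helper name.

```python
from collections import Counter


def top_queries(queries, n=30):
    """Return the n most frequent queries as (query, count) pairs."""
    # Assumption: case and surrounding whitespace are not significant.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(n)


# Tiny illustrative log; the real log holds about 249,000 queries.
log = ["egenremiss", "Egenremiss ", "sjukresor", "egenremiss", "feber", "sjukresor"]
tags = [query for query, _ in top_queries(log, n=2)]
print(tags)
```

The resulting `tags` list is the candidate tag vocabulary; how many of the top-ranked queries to keep is a design choice, and the slide uses the top 30.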
(3) Use TAG metadata to automatically annotate only documents selected by users
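
One way to picture this step is shown below. It is a sketch under the assumption that the log also records which document the user opened after each query. The (query, document id) pairs and the name `annotate_clicked_documents` are hypothetical, since the slides do not describe the click data format.

```python
from collections import defaultdict


def annotate_clicked_documents(click_log, tags):
    """Map each clicked document to the set of tag-queries that led to it."""
    tag_set = set(tags)
    annotations = defaultdict(set)
    for query, doc_id in click_log:
        # Only queries promoted to tags are used as metadata.
        if query in tag_set:
            annotations[doc_id].add(query)
    return dict(annotations)


# Hypothetical click log: (query, id of the document the user selected).
click_log = [
    ("egenremiss", "doc-17"),
    ("sjukresor", "doc-42"),
    ("sjukresa", "doc-42"),
    ("öppettider", "doc-03"),  # not a tag, so doc-03 stays unannotated
]
annotations = annotate_clicked_documents(
    click_log, tags=["egenremiss", "sjukresor", "sjukresa"]
)
print(annotations)
```

Because only selected documents are annotated, a wrong click propagates a wrong tag, which is one reason to review the result for mismatches and ambiguity.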
Watch out!
Mismatch or Ambiguity?
(4) Use most frequent queries to create a query suggester
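
A frequency-ranked prefix match is probably the simplest form such a suggester can take. The sketch below is an illustration, not the suggester used at VGR. The class name `QuerySuggester`, the alphabetical tie-break and the sample data are assumptions.

```python
from collections import Counter


class QuerySuggester:
    """Suggest past queries that start with the typed prefix, most frequent first."""

    def __init__(self, queries):
        self.counts = Counter(q.strip().lower() for q in queries if q.strip())

    def suggest(self, prefix, k=5):
        prefix = prefix.strip().lower()
        if not prefix:
            return []
        matches = [(q, c) for q, c in self.counts.items() if q.startswith(prefix)]
        # Sort by descending frequency, then alphabetically for stable ties.
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        return [q for q, _ in matches[:k]]


# Illustrative data; frequencies are made up for the example.
suggester = QuerySuggester(
    ["sjukresor"] * 3 + ["sjukresa"] * 2 + ["svinkoppor", "feber"]
)
print(suggester.suggest("sjuk"))
```

A linear scan over distinct queries is enough for a log of this size; a trie or sorted index would only matter for much larger vocabularies.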
(5) If you want, you can sort queries automatically into query types and build…
• a taxonomy

• The categories of the taxonomy can be also used
  to annotate existing documents automatically
  (another layer of METADATA)
  – TAGS describe the content
  – CATEGORIES IN A TAXONOMY organize the content
  – Categories can be hierarchical whereas tags cannot
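
As an illustration of the difference between tags and categories, the sketch below sorts queries into categories with a simple lookup. The two category names and their member terms are hypothetical examples chosen from the top-ranked queries; the slides do not specify the taxonomy or the sorting method.

```python
# Hypothetical two-category taxonomy; a real one would be built from the log.
TAXONOMY = {
    "conditions and symptoms": {"influensa", "feber", "halsfluss", "hosta"},
    "administration and services": {"egenremiss", "sjukresor", "frikort", "blanketter"},
}


def categorize(query, taxonomy=TAXONOMY, default="uncategorized"):
    """Return the first category whose term set contains the query."""
    q = " ".join(query.lower().split())
    for category, terms in taxonomy.items():
        if q in terms:
            return category
    return default


for q in ["Feber", "frikort", "webbisar"]:
    print(q, "->", categorize(q))
```

An exact-match lookup is the crudest possible sorter; it is shown only to make the tag/category distinction concrete.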
If you want, you can give the taxonomy
    to document creators, so they can
     annotate the text with metadata
• … in short you will have a multilabelled corpus
  that can be used with machine learning.
The importance of metadata to
     structure unstructured data & to
       extract actionable intelligence
• From Unstructured Data to Actionable Intelligence by
  Ramana Rao, 2003

• “We access information for various purposes and in
  various ways according to our purpose. Sometimes
  we’re surveying an area of knowledge, trying to get a
  general understanding of what it’s about or what’s
  available. At other times we’re searching for specific
  answers. […] It is this range of purpose and context
  that we can better address by providing a richer set of
  information access tools based on exploiting
  metadata.”
Linguistic Remarks
• At the top of the frequency list:
  – Nouns
  – Compounds
  – A+N
  – V+N


• More complex constructions at the bottom
Syntactic Patterns
In this case, automatic annotation can
                help a lot
Benefit for the Search Provider
• Mining query logs to extract user-created knowledge, i.e.
  queries that can be used as tags (metadata)
• Quickly create domain-specific taxonomies you can
  capitalize upon, especially for new client companies
  working in related fields
• Enhancements of current search products
• Inexpensive creation of annotated corpora: document
  annotation through query logs is a simple technique
  that in a short time will build massive annotated
  corpora to use for machine learning, which will allow
  more sophisticated search refinements.
Benefits for Clients & End Users
• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, friendlier, more exhaustive
  and more accurate.
• If this happens, clients will spend less on customer
  care. If you find what you need online, there is no
  need to call a helpdesk or customer care service.
Query Pre-processing?
Absolutely YES
• Normalization
   – egen remiss & egenremiss
   – Spelling correction
• Terminology expansion (domain-dependent)
   – anemi & blodbrist (ex. taken from Freberg Heppin, 2010;
     ex: painkiller & analgesic)
   – Stemming/Lemmatization (blanketter → blankett;
     sjukresor → sjukresa)
If you want… Nj
• Compound decomposition & Tokenization. Text chunks
  (such as queries) are more informative and less
  ambiguous than single words. No need to tokenize
  or decompose, if RECALL is ok.
• Ontology? Uhm.. not sure we need a semantic
  structure here….
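
The pre-processing steps listed above can be sketched with small lookup tables. This is a minimal illustration seeded only with the examples from the slide. The table names and the functions `normalize` and `expand` are hypothetical, and a real system would use a spelling corrector, a lemmatizer and a domain terminology instead of hand-written dictionaries.

```python
# Hypothetical lookup tables, seeded with the examples from the slide.
SPACING_VARIANTS = {"egen remiss": "egenremiss"}
LEMMAS = {"blanketter": "blankett", "sjukresor": "sjukresa"}
SYNONYMS = {"anemi": {"blodbrist"}, "blodbrist": {"anemi"}}


def normalize(query):
    """Lowercase, collapse whitespace, merge spacing variants, lemmatize tokens."""
    q = " ".join(query.lower().split())
    q = SPACING_VARIANTS.get(q, q)
    return " ".join(LEMMAS.get(tok, tok) for tok in q.split())


def expand(query):
    """Return the normalized query plus any domain synonyms known for it."""
    q = normalize(query)
    return {q} | SYNONYMS.get(q, set())


print(normalize("Egen  Remiss"))
print(normalize("Sjukresor"))
print(expand("anemi"))
```

Normalizing before counting merges variants such as sjukresor and sjukresa, which appear as separate entries in the raw top-30 list.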
Tokenization? Domain-dependent?
Top query frequencies             Top word frequencies
•    21388   egenremiss           •   21565   egenremiss
•    17360   mina vårdkontakter   •   17717   vårdkontakter
•    10553   webbisar             •   17407   mina
•    8787    sjukresor            •   10567   webbisar
•    7345    vårdgaranti          •   8880    sjukresor
•    3938    sjukresa             •   7357    vårdgaranti
•    3734    mammografi           •   4044    sjukresa
•    3723    vårdval              •   3763    vårdcentral
•    3653    influensa            •   3754    mammografi
•    2908    urinvägsinfektion    •   3732    influensa
•    2803    halsfluss            •   3730    vårdval
•    2542    förnya recept        •   2932    urinvägsinfektion
•    2460    magkatarr            •   2819    halsfluss
•    2394    vattkoppor           •   2805    recept
•    2274    byta vårdcentral     •   2543    förnya
•    2256    blanketter           •   2463    magkatarr
•    1878    svinkoppor           •   2413    vattkoppor
•    1840    reseersättning       •   2349    i
•    1653    klamydia             •   2296    byta
•    1559    feber                •   2269    blanketter
•    1525    högkostnadsskydd     •   1881    svinkoppor
•    1420    vinterkräksjukan     •   1840    reseersättning
•    1405    patientombudsman     •   1802    feber
•    1326    öroninflammation     •   1666    klamydia
•    1252    logga in             •   1571    högkostnadsskydd
•    1251    frikort              •   1422    vinterkräksjukan
•    1199    hosta                •   1405    patientombudsman
•    1193    magsjuka             •   1383    hepatit
•    1184    njursten             •   1338    öroninflammation
•    1167    als                  •   1331    frikort
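
The two columns differ only in the counting unit: whole queries on the left, whitespace-separated words on the right. Here is a minimal sketch of both counts, assuming one query string per log entry. The function name and the sample log are illustrative.

```python
from collections import Counter


def query_and_word_frequencies(queries):
    """Count whole queries and individual words over the same log."""
    cleaned = [" ".join(q.lower().split()) for q in queries if q.strip()]
    query_freq = Counter(cleaned)
    word_freq = Counter(word for q in cleaned for word in q.split())
    return query_freq, word_freq


# Illustrative log only.
log = [
    "mina vårdkontakter",
    "mina vårdkontakter",
    "byta vårdcentral",
    "vårdcentral",
    "hepatit b",
]
query_freq, word_freq = query_and_word_frequencies(log)
print(query_freq.most_common(3))
print(word_freq.most_common(3))
```

Word counts pool occurrences across different queries, which is why words such as vårdcentral and hepatit rank among the top words without ranking as high as standalone queries, while function words such as mina and i rise in the list even though they carry little meaning on their own.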
Different search users’ behaviour: Enterprise vs. Web?
• VGR: Swedish – Enterprise Search
• Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search
Preliminary reflections revisited…
•   “The average length of a search query was 2.4 terms” … uhm… it depends: enterprise vs. web

•   “A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were
    repeat queries and that 87% of the time the user would click on the same result. This
    suggests that many users use repeat queries to revisit or re-find information. This analysis is
    confirmed by a Bing search engine blog post telling about 30% queries are navigational
    queries.” … not investigated

•   “much research has shown that query term frequency distributions conform to the power
    law, or long tail distribution curves. That is, a small portion of the terms observed in a large
    query log (e.g. > 100 million queries) are used most often, while the remaining terms are
    used less often individually.” … definitely yes

•   “in a recent study in 2011 it was found that the average length of queries has grown steadily
    over time and average length of non-English languages queries had increased more than
    English queries.” … uhm… it depends: enterprise vs. web + language
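
The “definitely yes” on the long-tail remark can be made concrete with the numbers already on the slides. The sketch below computes how much of the log is covered by the most frequent queries, using the top-30 query frequencies and the reported total of 249,243 queries. It is a quick look at how concentrated the head is, not a formal power-law fit, and the function name `head_share` is illustrative.

```python
# Top-30 query frequencies as listed on the slide, and the reported corpus size.
TOP_30_COUNTS = [
    21388, 17360, 10553, 8787, 7345, 3938, 3734, 3723, 3653, 2908,
    2803, 2542, 2460, 2394, 2274, 2256, 1878, 1840, 1653, 1559,
    1525, 1420, 1405, 1326, 1252, 1251, 1199, 1193, 1184, 1167,
]
TOTAL_QUERIES = 249_243


def head_share(counts, total, k):
    """Fraction of all query occurrences covered by the k most frequent queries."""
    top_k = sorted(counts, reverse=True)[:k]
    return sum(top_k) / total


for k in (1, 10, 30):
    print(k, round(head_share(TOP_30_COUNTS, TOTAL_QUERIES, k), 3))
```

On these figures the 30 most frequent queries account for a little under half of all query occurrences, which is the kind of head-heavy distribution the quoted remark describes.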
Conclusions from this Exploration
• Query logs are a genre that is comparatively easy to exploit for
  extracting actionable intelligence: the texts are short, lean and
  need little pre-processing.

• Query logs are a good, handy and economic source of
  information for actionable business decisions, such as:
   –    keeping a cutting-edge profile on the market,
   –   enhancing enterprise search usability (query suggester/autofill),
   –   disambiguation,
   –   annotation and taxonomy creation
   –   preventing huge costs for customer helpdesk and similar services
       through a cutting-edge search functionality!

• Future: More and diversified use cases…
THANK YOU FOR YOUR ATTENTION




         QUESTIONS?

More Related Content

What's hot

Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Grace Hui Yang
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
El Habib NFAOUI
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
Sumit Sony
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
Peter Mika
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
Peter Mika
 
"Mass Surveillance" through Distant Reading
"Mass Surveillance" through Distant Reading"Mass Surveillance" through Distant Reading
"Mass Surveillance" through Distant Reading
Shalin Hai-Jew
 
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
IJET - International Journal of Engineering and Techniques
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
IOSR Journals
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagement
STIinnsbruck
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices Final
Marianne Sweeny
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
Krish_ver2
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
KU Leuven
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
ijsrd.com
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
Thanh Tran
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
Peter Mika
 
58 64
58 6458 64
Riding The Semantic Wave
Riding The Semantic WaveRiding The Semantic Wave
Riding The Semantic Wave
Kaniska Mandal
 
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
IJET - International Journal of Engineering and Techniques
 
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
IJDKP
 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' information
csandit
 

What's hot (20)

Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
 
Text mining introduction-1
Text mining   introduction-1Text mining   introduction-1
Text mining introduction-1
 
Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012Semantic Search tutorial at SemTech 2012
Semantic Search tutorial at SemTech 2012
 
SemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorialSemTech 2011 Semantic Search tutorial
SemTech 2011 Semantic Search tutorial
 
"Mass Surveillance" through Distant Reading
"Mass Surveillance" through Distant Reading"Mass Surveillance" through Distant Reading
"Mass Surveillance" through Distant Reading
 
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET-V2I1P1] Authors:Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Semantic engagement
Semantic engagementSemantic engagement
Semantic engagement
 
Enterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices FinalEnterprise Search Share Point2009 Best Practices Final
Enterprise Search Share Point2009 Best Practices Final
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...Extracting and Reducing the Semantic Information Content of Web Documents to ...
Extracting and Reducing the Semantic Information Content of Web Documents to ...
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
 
Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012Semantic Search overview at SSSW 2012
Semantic Search overview at SSSW 2012
 
58 64
58 6458 64
58 64
 
Riding The Semantic Wave
Riding The Semantic WaveRiding The Semantic Wave
Riding The Semantic Wave
 
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
[IJET V2I3P13] Authors: Anshika, Sujit Tak, Sandeep Ugale, Abhishek Pohekar
 
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
A SEMANTIC METADATA ENRICHMENT SOFTWARE ECOSYSTEM BASED ON TOPIC METADATA ENR...
 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' information
 

Similar to SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Melissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLabMelissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLab
University of Edinburgh
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
petrknoth
 
Intl190 kahler guide
Intl190 kahler guideIntl190 kahler guide
Intl190 kahler guide
Annelise Sklar
 
Pratt Sils LIS653 4 Fall 2007
Pratt Sils LIS653 4 Fall 2007Pratt Sils LIS653 4 Fall 2007
Pratt Sils LIS653 4 Fall 2007
PrattSILS
 
Project literature search
Project literature searchProject literature search
Slawek Korea
Slawek KoreaSlawek Korea
Slawek Korea
Slawek
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
unyil96
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)
UNCResearchHub
 
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Enterprise Knowledge
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
Carsten Eickhoff
 
Data analysis – qualitative data presentation 2
Data analysis – qualitative data   presentation 2Data analysis – qualitative data   presentation 2
Data analysis – qualitative data presentation 2
Azura Zaki
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
thenmozhip8
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
SamuelKetema1
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
lljohnston
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
RAJU852744
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
herminaprocter
 
SPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #failSPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #fail
Ben van Mol
 
SharePoint Saturday Belgium 2013 Intranet search fail
SharePoint Saturday Belgium 2013 Intranet search failSharePoint Saturday Belgium 2013 Intranet search fail
SharePoint Saturday Belgium 2013 Intranet search fail
BIWUG
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
Libcorpio
 
[Jaalouk, Vivas-Thomas] SR15 Poster
[Jaalouk, Vivas-Thomas] SR15 Poster[Jaalouk, Vivas-Thomas] SR15 Poster
[Jaalouk, Vivas-Thomas] SR15 Poster
Luciana Jaalouk
 

Similar to SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence (20)

Melissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLabMelissa Terras' Report on the #UKMHLiveLab
Melissa Terras' Report on the #UKMHLiveLab
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
 
Intl190 kahler guide
Intl190 kahler guideIntl190 kahler guide
Intl190 kahler guide
 
Pratt Sils LIS653 4 Fall 2007
Pratt Sils LIS653 4 Fall 2007Pratt Sils LIS653 4 Fall 2007
Pratt Sils LIS653 4 Fall 2007
 
Project literature search
Project literature searchProject literature search
Project literature search
 
Slawek Korea
Slawek KoreaSlawek Korea
Slawek Korea
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)
 
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019Taxonomy 101: Presented at Taxonomy Boot Camp 2019
Taxonomy 101: Presented at Taxonomy Boot Camp 2019
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Data analysis – qualitative data presentation 2
Data analysis – qualitative data   presentation 2Data analysis – qualitative data   presentation 2
Data analysis – qualitative data presentation 2
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
chapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.pptchapter 1-Overview of Information Retrieval.ppt
chapter 1-Overview of Information Retrieval.ppt
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx16     Decision Support and Business Intelligence Systems (9th E.docx
16     Decision Support and Business Intelligence Systems (9th E.docx
 
SPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #failSPSBE14 Intranet Search #fail
SPSBE14 Intranet Search #fail
 
SharePoint Saturday Belgium 2013 Intranet search fail
SharePoint Saturday Belgium 2013 Intranet search failSharePoint Saturday Belgium 2013 Intranet search fail
SharePoint Saturday Belgium 2013 Intranet search fail
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
 
[Jaalouk, Vivas-Thomas] SR15 Poster
[Jaalouk, Vivas-Thomas] SR15 Poster[Jaalouk, Vivas-Thomas] SR15 Poster
[Jaalouk, Vivas-Thomas] SR15 Poster
 


SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

  • 1. LAST UPDATED: 26 OCTOBER 2012 SearchInFocus Exploratory Study on Query Logs and Actionable Intelligence Marina Santini Exploratory Query-log Analysis Workshop Organized by Findwise, AB - www.findwise.com/ Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST) Lund, Sweden SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.
  • 2. Query Logs and Actionable Intelligence: Questions to LinkedIn-ers • “Can anyone suggest references about mining query logs for BI and CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer Experience Management] • Applying Findability to Mine Query Logs for BI: Preliminaries “How can I profitably use query logs for making better business decisions and predict future trends?” (14th May 2012) • Mining Query Logs: Query Disambiguation & Understanding through a KB “some linguistic problems can be sorted out -- for example those related to sublanguage, terminology, multi-word expressions, etc. -- through a dictionary-shaped knowledge base where the different uses of language are stored and continually updated. I will call this knowledge base DaisyKB” (21st May 2012)
  • 3. My preliminary reflections based on this info… • “The average length of a search query was 2.4 terms" • "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries." • “… much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually." • “… in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries."
  • 4. Then came the corpus… • Enterprise query logs: VGR (27 August 2012) – easier to handle and interpret than general-purpose search engines’ query logs!
  • 5. So… that’s the Outline 1. The query log genre 2. Actionable Intelligence 3. A possible use case 4. Preliminary conclusions
  • 6. What is a (textual) genre? • Simply simply simply put: – A genre is a class of text
  • 7. What characterizes a genre? 1. Must have a name 2. Must be recognized within a community 3. Must be produced during a task 4. Must have conventions 5. Must raise expectations 6. Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)
  • 8. Genre Characterization 1. Name formation: a genre must indicate a class, a family (for genre name formation, see Görlach, 2004). Recent webgenres: blogs, tweets, chatlogs, etc. 2. Community: a genre is not something individual. A genre is a textual form that is used and recognized by a community (whereas style can be individual). Ex: blogs → bloggers and blog readers; academic home pages → academics; etc. 3. Task: a genre meets a RECURRENT communication need. Ex: the personal home page genre tells us something about a person; a technical blog is informative about a specific technology; etc. 4. Conventions: ex: a personal blog is made of posts organized in chronological order where a blogger communicates personal and subjective views on some facts. 5. Expectations: when reading a personal blog, readers expect to read something personal (personal facts or personal opinions) and expect the possibility to leave a comment if they wish to do so. 6. A genre is a cultural artifact: it might evolve over time (see the History of Blog by Rebecca Blood, 2000), or might disappear if society changes (ex: chansons de geste). New genres emerge with new media, new technologies, new information needs.
  • 9. The query log genre is… a novel and fully-emerged webgenre 1. Name: in line with other digital genres (ex: web log → blog) 2. Community: internet users, IR practitioners 3. Task: information needs specified in a search engine 4. Conventions: short texts written in ”keywordese” 5. Expectations: to find relevant information 6. Cultural artifact: a product of our media-based, internet-based society OR a subproduct of search engines
  • 10. The query log genre: Linguistic and Textual Conventions • Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.) • Sublanguage/Jargon: ”keywordese” • Register: neutral • Morphology: LITTLE • Syntax: OCCASIONALLY (usually no articles, no prepositions, no subclauses, etc.)
  • 11. Query Log Genre: The Benefits • Expressed in a ”lean” sublanguage, the keywordese: – reduced morphology – reduced syntax – short texts – mostly nouns and verbs • Reduced size: compare a 2-year collection of emails vs a 2-year collection of query logs • = REDUCED SIZE, REDUCED PRE-PROCESSING; NO DATA CLEANING!
  • 12. Expectations: a text written by a user for a search engine to find relevant information • The texts (queries) must express information needs aka users’ intents • It is good practice to be cautious with the interpretation of users’ intents. However… • If we mine query logs with a simple quantitative approach, it is possible to extract recurrent information needs and build upon them…
  • 13. Actionable Intelligence • It must be accurate, and verifiably so • It must be timely • It must be comprehensive • It must be comprehensible • It must come with some ability to act on that information straightaway
  • 14. I would argue: a Query Log is an ”Actionable” Corpus • Let’s see…
  • 15. Mining query logs for actionable intelligence: Description and Basic Statistics • Corpus Time frame: 2010-2011 (2 years) • “These logs come from the search at hittavard.vgregion.se. The biggest bulk should come from 1177.se. The rest should be from vgregion.se. The target audience are both VGR (Västra Götalands Region) users/employees as well as the general public, as it is a public site. The internal files are searches made from within the VGR…” • Corpus size: – size = 3,167 KB (only queries) (BIG DATA is usually > 1TB) – number of queries = 249,243 – number of words = 306,453 • Average query length: 1.23 words
  • 16. Case study enterprise search – VGR FINDWISE SLIDESHARE: http://www.slideshare.net/findwise/case-study-enterprise-search-vgr http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/
  • 17. Business Decision: Improve Search Quality and Usability to increase Users’ Satisfaction & Competitiveness
  • 18. How? • The simplest approach… ANALYZE THE HEAD
  • 19. (1) Take the Top-Ranked Queries
  • 20. (2) Use them as TAGS (metadata creation). Tags are keywords describing the content. 1. egenremiss 2. mina vårdkontakter 3. webbisar 4. sjukresor 5. vårdgaranti 6. sjukresa 7. mammografi 8. vårdval 9. influensa 10. urinvägsinfektion 11. halsfluss 12. förnya recept 13. magkatarr 14. vattkoppor 15. byta vårdcentral 16. blanketter 17. svinkoppor 18. reseersättning 19. klamydia 20. feber 21. högkostnadsskydd 22. vinterkräksjukan 23. patientombudsman 24. öroninflammation 25. logga in 26. frikort 27. hosta 28. magsjuka 29. njursten 30. als
  • 21. 3) Use TAG metadata to automatically annotate only documents selected by users
  • 24. 4) Use most frequent queries to create a query suggester
  • 25. 5) If you want, you can sort queries automatically into query types and build… • a taxonomy • The categories of the taxonomy can be also used to annotate existing documents automatically (another layer of METADATA) – TAGS describe the content – CATEGORIES IN A TAXONOMY organize the content – Categories can be hierarchical whereas tags cannot
  • 26. If you want, you can give the taxonomy to document creators, so they can annotate the text with metadata • … in short you will have a multilabelled corpus that can be used with machine learning.
  • 27. The importance of metadata to structure unstructured data & to extract actionable intelligence • From Unstructured Data to Actionable Intelligence by Ramana Rao, 2003 • ” We access information for various purposes and in various ways according to our purpose. Sometimes we’re surveying an area of knowledge, trying to get a general understanding of what it’s about or what’s available. At other times we’re searching for specific answers. […] It is this range of purpose and context that we can better address by providing a richer set of information access tools based on exploiting metadata.”
  • 28. Linguistic Remarks • At the top of the frequency list: – Nouns – Compounds – A+N – V+N • More complex constructions at the bottom
  • 30. In this case, automatic annotation can help a lot
  • 31. Benefit for the Search Provider • Mining query logs to extract user-created knowledge, i.e. queries that can be used as tags (metadata) • Quickly create domain-specific taxonomies you can capitalize upon, especially for new client companies working in related fields • Enhancements of current search products • Inexpensive creation of annotated corpora: document annotation through query logs is a simple technique that in a short time will build massive annotated corpora to use for machine learning, which will allow more sophisticated search refinements.
  • 32. Benefits for Clients & End Users • Somebody said: SEARCH MUST BE A MIND READER! • BUT ALSO faster, friendlier, more exhaustive and more accurate. • If this happens, clients will spend less on customer care. If you find what you need online, there is no need to call a helpdesk or customer care service.
  • 33. Query Pre-processing? Absolutely YES: • Normalization – egen remiss & egenremiss – Spelling correction • Terminology expansion (domain-dependent) – anemi & blodbrist (ex: taken from Friberg Heppin, 2010; ex: painkiller & analgesic) – Stemming/Lemmatization (blanketter → blankett; sjukresor → sjukresa) If you want…: • Compound decomposition & Tokenization. Text chunks (such as queries) are more informative and less ambiguous than single words. No need to tokenize or decompose, if RECALL is ok. • Ontology? Uhm.. not sure we need a semantic structure here….
  • 34. Tokenization? Domain-dependent? Top query frequencies: 21388 egenremiss; 17360 mina vårdkontakter; 10553 webbisar; 8787 sjukresor; 7345 vårdgaranti; 3938 sjukresa; 3734 mammografi; 3723 vårdval; 3653 influensa; 2908 urinvägsinfektion; 2803 halsfluss; 2542 förnya recept; 2460 magkatarr; 2394 vattkoppor; 2274 byta vårdcentral; 2256 blanketter; 1878 svinkoppor; 1840 reseersättning; 1653 klamydia; 1559 feber; 1525 högkostnadsskydd; 1420 vinterkräksjukan; 1405 patientombudsman; 1326 öroninflammation; 1252 logga in; 1251 frikort; 1199 hosta; 1193 magsjuka; 1184 njursten; 1167 als. Top word frequencies: 21565 egenremiss; 17717 vårdkontakter; 17407 mina; 10567 webbisar; 8880 sjukresor; 7357 vårdgaranti; 4044 sjukresa; 3763 vårdcentral; 3754 mammografi; 3732 influensa; 3730 vårdval; 2932 urinvägsinfektion; 2819 halsfluss; 2805 recept; 2543 förnya; 2463 magkatarr; 2413 vattkoppor; 2349 i; 2296 byta; 2269 blanketter; 1881 svinkoppor; 1840 reseersättning; 1802 feber; 1666 klamydia; 1571 högkostnadsskydd; 1422 vinterkräksjukan; 1405 patientombudsman; 1383 hepatit; 1338 öroninflammation; 1331 frikort.
  • 35. Different search users’ behaviour: Enterprise vs. Web? [Figure panels: VGR: Swedish – Enterprise Search; Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search]
  • 36. Preliminary reflections revisited… • “The average length of a search query was 2.4 terms” … uhm.. It depends: enterprise vs. web • “A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries.” … not investigated • “much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually.” … definitely yes • “in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries.” … uhm.. It depends: enterprise vs. web + language
  • 37. Conclusions from this Exploration • Query logs are a genre that is easier to exploit for extracting actionable intelligence. • Query logs are a good, handy and economic source of information for actionable business decisions, such as: – keeping a cutting-edge profile on the market, – enhancing enterprise search usability (query suggester/autofill), – disambiguation, – annotation and taxonomy creation, – preventing huge costs for customer helpdesk and similar services through a cutting-edge search functionality! • Future: More and diversified use cases…
  • 38. THANK YOU FOR YOUR ATTENTION QUESTIONS?
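
The basic statistics on slide 15 and the two frequency lists on slide 34 can be reproduced with very little code. The following is a minimal sketch, not the procedure actually used for the VGR corpus: it assumes the log is available as a plain list of query strings, that lowercasing and whitespace splitting are acceptable, and the function name `query_log_stats` and the toy counts are invented for illustration. With the figures reported on slide 15, the same ratio gives 306,453 / 249,243 ≈ 1.23 words per query.

```python
from collections import Counter


def query_log_stats(queries, top_n=3):
    """Return basic statistics for a list of raw query strings.

    Assumption: light normalization only (trim + lowercase) and
    whitespace tokenization; no stemming or compound splitting.
    """
    cleaned = [q.strip().lower() for q in queries if q.strip()]
    # Whole queries counted as units ("top query frequencies").
    query_counts = Counter(cleaned)
    # Single words counted separately ("top word frequencies").
    word_counts = Counter(w for q in cleaned for w in q.split())
    n_queries = len(cleaned)
    n_words = sum(word_counts.values())
    return {
        "queries": n_queries,
        "words": n_words,
        "avg_length": n_words / n_queries if n_queries else 0.0,
        "top_queries": query_counts.most_common(top_n),
        "top_words": word_counts.most_common(top_n),
    }


# Toy sample: terms borrowed from the slides, counts invented.
sample = (
    ["egenremiss"] * 3
    + ["mina vårdkontakter"] * 2
    + ["byta vårdcentral", "vårdcentral"]
)
stats = query_log_stats(sample)
print(stats["avg_length"], stats["top_queries"], stats["top_words"])
```

Counting whole queries and single words side by side shows where tokenization changes the picture: a multi-word query such as "mina vårdkontakter" is one unit in the first list and two entries in the second.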

Editor's Notes

  1.  Query logs are an important source of information to surmise users' intents. Although Karlgren (2010) points out that “There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]”, some linguistic problems could be sorted out by applying more advanced text/content analytics, such as register/sublanguage identification and terminology classification (see Friberg Heppin, 2011). In this presentation, I will argue that query logs can be considered a digital textual genre like emails, blogs, chats, tweets and so forth. All these genres contain unstructured information that, still today, is difficult to leverage satisfactorily. The hypothesis that I would like to put forward in this workshop is that query logs might be easier to exploit to extract useful information and actionable intelligence than other digital genres.
  2. What are the expectations from an insändare (letter to the editor), from an interview, from a review, from an editorial?
  3. Expressed in “keywordese”, i.e. the kind of sublanguage/jargon we use to communicate with search engines (that is, a language without articles, without prepositions and other stop words, without much syntax or hedges, etc.), query logs are skimmed texts that require no cleaning from redundancies or rhetorical ornaments, and reduced pre-processing.
  4. For information to be actionable, it must have at least four characteristics: it must be accurate, and verifiably so; it must be timely; it must be comprehensive; it must be comprehensible. These are necessary but not sufficient conditions; to make information truly actionable, the information must be accompanied by tools that allow you to act on the information. In a perfect world, you would also have tools that allow you to monitor the effect of your actions and to receive feedback. But at the very least, you need information that is accurate, timely, comprehensive and comprehensible, with some ability to act on that information straightaway.
  5. Not big data
  6. Usergroups: patients, relatives, doctors, administrative staff, help desk, etc.
  7. 20 occurrences…
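
The query suggester proposed on slide 24 and in the conclusions can be prototyped directly from the most frequent queries. The sketch below is only one possible reading of that idea: prefix matching, the `min_count` cut-off, and the names `build_suggester` and `suggest` are all assumptions made for illustration, and the toy log uses terms from the slides with invented counts.

```python
from collections import Counter


def build_suggester(queries, min_count=2):
    """Build a prefix-based suggester from the head of a query log.

    Assumption: only queries seen at least `min_count` times are
    offered as suggestions; the threshold is illustrative.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    frequent = {q: c for q, c in counts.items() if c >= min_count}

    def suggest(prefix, limit=5):
        p = prefix.strip().lower()
        if not p:
            return []
        matches = [(q, c) for q, c in frequent.items() if q.startswith(p)]
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


# Toy log: terms from the slides, counts invented.
log = (
    ["sjukresor"] * 3
    + ["sjukresa"] * 2
    + ["sjukgymnast"]
    + ["egenremiss"] * 4
)
suggest = build_suggester(log)
print(suggest("sjuk"))
```

Restricting suggestions to the head of the frequency distribution keeps rare or misspelled queries out of the suggestion list, at the cost of ignoring the long tail.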