Search analytics – Understanding the long tail
SIKM Leaders July 2012




Lee Romero
blog.leeromero.org
July 2012
About me

My background and early career are both in software engineering.

I've worked in the knowledge management field for the last 12+
    years – almost all of it in the technology of KM

I’ve worked with various search solutions for the last 7-8 years –
   and spent most of that time trying to figure out how to measure
   their usefulness and improve them in any way I can.

I’ve spoken at both Enterprise Search Summit and Taxonomy Boot
   Camp twice.

My writings on search analytics have been featured by a number of
  experts in the field including Lou Rosenfeld and Avi Rappoport

Search Analytics

Definition: Search analytics is the field of analyzing and
  aggregating usage statistics of your search solution to
  understand user behavior and to improve the experience.

Some search analytics are focused on SEO / SEM activities (for
  internet searches).

The focus here will be on enterprise search, so I will primarily focus
  on improving the user experience.

Further, I will primarily focus here on keyword search and
  understanding the user language found in search logs

Always remember – analytics without action does not have much
   value.

The challenge of your search log
Understanding your search log

For enterprise search solutions [1], the “80-20” rule does not hold

The language variability is very high in a couple of ways (covered
  in the next few slides)

Yet having a good understanding of the language, frequency and
  commonality in your search log is critical to being able to make
  sustainable improvements to your search

The remainder of this presentation first provides some evidence
  supporting my claim and then will cover some ideas and research
  into this problem



[1] This does not seem to apply equally to e-commerce solutions

Some facts about search terms

There’s an adage that goes something like, “80% of your
  searches are from 20% of your search terms”
    • Equivalently, some will say that you can make significant impact by paying
      attention to a few of your most common terms (you can, but in limited ways)


Fact: in enterprise search solutions the curve is much shallower:

    This chart shows the inverted power curve for two different solutions
    I’m currently working with


In the second case, it takes 13% of terms to cover 50% of searches,
    and that is over 7000 distinct terms in a typical month!
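
One way to reproduce this kind of figure from your own log is sketched below. It is a minimal illustration, not the tooling behind the chart: the function name `share_of_terms_to_cover` and the sample counts are made up, and it assumes the log has already been aggregated into a count of searches per distinct term.

```python
from collections import Counter


def share_of_terms_to_cover(term_counts, target=0.5):
    """Return the fraction of distinct terms needed to cover `target` of all searches.

    term_counts: Counter mapping each distinct search term to its number of searches.
    Terms are taken from most to least frequent until the target share is reached.
    """
    total = sum(term_counts.values())
    if total == 0:
        return 0.0
    covered = 0
    for rank, (_term, count) in enumerate(term_counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank / len(term_counts)
    return 1.0


# Made-up sample: 4 distinct terms, 10 searches in total.
sample = Counter({"expense report": 5, "sap": 3, "org chart": 1, "holiday list": 1})
half = share_of_terms_to_cover(sample, target=0.5)
```

Running this for a range of targets (10%, 20%, ... 100%) gives the points of the curve, so two solutions can be compared on the same axes.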
Some facts about search terms: part 2

Another myth: a large percent of searches repeat over and over
  again

Fact: on enterprise search solutions, there is surprisingly little
  commonality month-to-month

Over a recent six month period, which saw a total of ~289K distinct
  search terms, only 11% of terms occurred in more than 1 month!
                         # of months   # terms   % of terms
                         1             257665    89.2%
                         2             17994     6.2%
                         3             5790      2.0%
                         4             2900      1.0%
                         5             2019      0.7%
                         6             2340      0.8%
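
A table like this can be derived with a few lines of code. The sketch below assumes you can produce, for each month, the set of distinct terms searched in that month; the function name `months_seen_histogram` and the sample data are illustrative only.

```python
from collections import Counter


def months_seen_histogram(monthly_terms):
    """Count how many distinct terms appear in exactly 1, 2, 3, ... months.

    monthly_terms: list with one collection of search terms per month.
    Returns a Counter mapping "number of months" to "number of distinct terms".
    """
    months_per_term = Counter()
    for terms in monthly_terms:
        # set() ensures a term is counted at most once per month
        months_per_term.update(set(terms))
    return Counter(months_per_term.values())


# Made-up sample covering three months.
sample_months = [
    {"sap", "expense report"},
    {"sap", "org chart"},
    {"sap"},
]
histogram = months_seen_histogram(sample_months)
```

Dividing each count by the total number of distinct terms gives the percentage column.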


Some facts about search terms: part 3

Another myth: a good percentage of your search terms will repeat in
  sequential periods

Fact: There is much more churn even month-to-month than you
  might expect – in the period covered below, only about 13% of
  terms repeated from one month to the next (covering about 36%
  of searches)
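
The month-to-month figures can be computed along the following lines. This is a sketch under one assumption that the slide does not spell out: both percentages are measured against the later month (share of its distinct terms, and share of its searches, that also appeared in the month before). The name `repeat_rate` is hypothetical.

```python
def repeat_rate(prev_counts, curr_counts):
    """Compare two consecutive months of search-term counts.

    prev_counts, curr_counts: dicts mapping search term -> number of searches.
    Returns (share of current distinct terms also seen in the previous month,
             share of current searches made with those repeated terms).
    """
    total_searches = sum(curr_counts.values())
    if not curr_counts or total_searches == 0:
        return 0.0, 0.0
    repeated = [term for term in curr_counts if term in prev_counts]
    term_share = len(repeated) / len(curr_counts)
    search_share = sum(curr_counts[term] for term in repeated) / total_searches
    return term_share, search_share
```

A low term share combined with a noticeably higher search share is the pattern described above: the few terms that do repeat are the frequent ones.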




What to do with your search log?

The summary of the previous slides:

• It is hard to understand a decent percentage of terms within a
  given time period (month)!
• If you could do that, the problem during the next time period isn’t
  that much easier!

The next sections describe a couple of research projects I’ve been
working on to tackle these issues




Understanding your users’ information needs
Categorizing your users’ language

Given the challenges previously laid out, using the search log to
   understand user needs seems very challenging

Beyond the first several dozen terms, it is hard to understand what
  users are looking for
     • And those several dozen terms cover a vanishingly small percentage of all
       searches!


However, it would be very useful to understand your users’
  information needs if we could somehow understand the entirety
  of the search log

How do we handle this? Categorize the search terms!



Categorizing your users’ language, p2

So we need to categorize search terms to really be able to
  understand our users’ information needs.

To do this, we face two challenges

     1. What categorization scheme should we use?
     2. How do we apply categorization in a repeatable, scalable and manageable
        way?


For the first challenge, I would recommend you use your taxonomy
  (you do have one, right?)

The second challenge is a bit more difficult but is addressed later in
  this deck


Categories to use

Proposal: Start with your own taxonomy and its vocabularies as the
  categories into which search terms are grouped

Some searches will not fit into any of these categories, so you can
  anticipate the need to add further categories


As an aside, this exercise actually provides a great measurement
  tool for your taxonomy
     • You can quantitatively assess the percent of your users’ language that is
       classifiable with your taxonomy
     • A number you may wish to drive up over time (through evolution of your
       taxonomy)
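
As a rough illustration of that measurement, the sketch below computes the classifiable share both by distinct terms and by search volume. The function name `taxonomy_coverage` and the input shapes are assumptions made for the example: one dict of searches per term, and one dict giving the categories (possibly none) assigned to each term.

```python
def taxonomy_coverage(term_counts, term_categories):
    """Return (share of distinct terms, share of searches) that are classified.

    term_counts: dict mapping search term -> number of searches.
    term_categories: dict mapping search term -> set of assigned categories;
        a missing term or an empty set means "not classifiable".
    """
    total_terms = len(term_counts)
    total_searches = sum(term_counts.values())
    if total_terms == 0 or total_searches == 0:
        return 0.0, 0.0
    classified = [term for term in term_counts if term_categories.get(term)]
    classified_searches = sum(term_counts[term] for term in classified)
    return len(classified) / total_terms, classified_searches / total_searches
```

Tracking both numbers month over month shows whether changes to the taxonomy are actually capturing more of the users’ language.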




Automating categorization

Now we turn to the hairier challenge – how can we categorize
  search terms?

To describe the problem, we have:
     1. A set of categories, which may be hierarchically related (most taxonomies
        are)
     2. A set of search terms, as entered by users, that need to be assigned to
        those categories

[Diagram: a hierarchy of categories on one side and a list of search terms on the other, with “?” marks between them for the unknown assignments]




Automating categorization, p2

The proposed solution is based on a couple of concepts:
     1. You can think of this categorization problem as search!
     2. You are taking each search term and searching in an index in which the
        potential search results are categories!


Question: What is the “body” of what you are searching?
Answer: Previously-categorized search terms!

Using this approach, you can consider the set of previously-
   categorized search terms as a corpus against which to search
     •    You can apply all of the same heuristics to this search as any search:
         • Word matching (not string matching)
         • Stemming
         • Relevancy (word ordering, proximity, # of matches, etc.)
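
To make the “categorization as search” idea concrete, here is a minimal sketch. It is not the actual implementation: the crude plural stripping stands in for a real stemmer, the relevancy score is a simple word-overlap ratio (Jaccard) rather than anything that considers ordering or proximity, and the names `words` and `match_categories` are invented for the example.

```python
import re
from collections import defaultdict


def words(term):
    """Split a term into lower-cased words, with crude plural stripping.

    The suffix rule is a stand-in for a real stemmer (an assumption of this sketch).
    """
    result = set()
    for word in re.findall(r"[a-z0-9]+", term.lower()):
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        result.add(word)
    return result


def match_categories(term, categorized):
    """Search the previously categorized terms for a new search term.

    categorized: dict mapping a previously categorized term -> set of categories.
    Returns a list of (category, score) pairs, best score first.
    """
    query = words(term)
    scores = defaultdict(float)
    for known_term, categories in categorized.items():
        known_words = words(known_term)
        shared = query & known_words
        if not shared:
            continue
        # Word matching, not string matching: score by overlap of word sets
        score = len(shared) / len(query | known_words)
        for category in categories:
            scores[category] = max(scores[category], score)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A real search engine could replace both functions: index the categorized terms as documents, with the category as a field, and run each new term as a query.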


Automating categorization, p3

Here’s a depiction of this solution



[Diagram: the category hierarchy, with a set of previously categorized terms attached to each category, connected through a matching step to the list of search terms]

The red oval in the diagram represents the “matching” process: it takes as
input the search terms to be categorized and the set of categories along with
previously-matched search terms, and produces as output a set of categories
associated with the new search terms

Automating categorization, p4: Bootstrapping

This approach depends on matching to previously-categorized terms
     • Every time you categorize a new search term, you expand the set of
       categorized terms, enabling more matches in the future


Bootstrapping: You can take the names of the categories (the terms
  in your taxonomy) as the first set of “categorized search terms”
     • This allows you to start with no search terms having been categorized at all
     • You run a first round of matching against the categories to find first-level
       matches
     • Take those that seem like “good” matches and pull those into the set of
       categorized search terms for a second iteration, etc.
     • Using this in initial testing resulted in 10% of distinct terms from a month
       being associated with at least one category


Another aspect: Any manual categorization of common search terms
  will add to the success of categorization

Automating categorization, p5: Iterative

[Diagram: the same category hierarchy and search terms as before, with “New categorizations” feeding back into each category’s set of previously categorized terms]

This approach also needs to be applied iteratively
     • You start with a set of categorized search terms and a new set of
       (uncategorized) search terms
     • You then apply this matching to the uncategorized search terms, getting a set
       of newly-categorized search terms (with some measure of probability of
       “correctness” of the match, i.e., relevancy)
     • You pull in the newly-categorized search terms and run the matching process
       again
     • Each time, as you expand the set of categorized search terms (from a
       previous match), you increase the possibility of more matches (in
       subsequent matches)
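
The bootstrapping and iteration steps can be sketched together as a loop. This is an illustration under stated assumptions, not the actual tool: a fixed score threshold stands in for the human judgment of which matches are “good”, the score is a plain word-overlap ratio, each term is given only its single best category, and all names (`best_match`, `categorize_iteratively`, `threshold`, `max_rounds`) are hypothetical.

```python
import re


def words(term):
    """Lower-cased set of words in a term."""
    return set(re.findall(r"[a-z0-9]+", term.lower()))


def best_match(term, categorized):
    """Return (category, score) for the closest previously categorized term."""
    query = words(term)
    best = (None, 0.0)
    for known_term, category in categorized.items():
        known_words = words(known_term)
        union = query | known_words
        score = len(query & known_words) / len(union) if union else 0.0
        if score > best[1]:
            best = (category, score)
    return best


def categorize_iteratively(taxonomy_names, search_terms, threshold=0.5, max_rounds=10):
    """Bootstrap from category names, then repeat matching until nothing new is found.

    Returns (dict of term -> category, set of terms still uncategorized).
    """
    # Bootstrapping: each category name is the first "categorized term" for itself
    categorized = {name: name for name in taxonomy_names}
    pending = set(search_terms) - set(categorized)
    for _ in range(max_rounds):
        accepted = {}
        for term in sorted(pending):
            category, score = best_match(term, categorized)
            if category is not None and score >= threshold:
                accepted[term] = category
        if not accepted:
            break  # no new matches, so further rounds cannot help
        # Newly categorized terms become part of the corpus for the next round
        categorized.update(accepted)
        pending -= set(accepted)
    return categorized, pending
```

With the taxonomy term “project management”, the search term “project management plan” is matched in the first round; “management plan template” only clears the threshold in the second round, once “project management plan” is part of the categorized set. That is the effect the iteration is meant to achieve.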




Automating categorization, p6: Iterative

It will be beneficial to have a human review the set of matches for
    each iteration and determine if they are accurate enough
     • The measurement of relevancy is intended to do this but would likely only be
       partially successful


Over time, using this process, you build up a larger and larger set of
  categorized search terms
     • This makes it more likely in future iterations that more terms will be
       categorizable




Automating categorization, p7: No matches

There will always be search terms that do not get matched.
     • This may be because the terminology used does not match
     • This may be because there are no categories in the global taxonomy that
       would be useful for categorization


The first issue would require a human to recognize the association
  (thus, categorizing the term and then enabling matches on future
  uses of that term)

The second issue would require adding in new categories (not part
  of the global taxonomy)
     • And then categorizing the term into the newly-added category(ies)




Summary

With this approach, we can take a set of search terms at any time
  and categorize them (partially) automatically
     • Over time, the accuracy of the matching will improve through human review-
       and-approval of matches
We then are able to relate these information needs to a variety of
  other pieces of data:
     • Volume of content available to users – significant mismatches can highlight
       need for new content
     • Rating of content in these categories – can highlight that a particular area of
       interest has content but it isn’t quality content
     • Downloads of content in these categories – could highlight navigational
       issues (e.g., when a category is much more highly represented in search
       than in downloads)
This does not require directly working with end-users and is scalable



Additional benefits: Measuring your taxonomy

As mentioned earlier, part of the challenge will be that there will be
  terms that do not match the starting categories (i.e., the global
  taxonomy)

This actually highlights some valuable insight obtainable from this:
     • We can identify gaps in our taxonomy (terms requiring new categories)
     • We can identify areas of our taxonomy where we have many search terms
       associated with a taxonomy term and consider if we need to either add or
       split search terms in order to better match our users’ real language
     • We can identify areas of the taxonomy that are of little use in terms of the
       language used by our users




Additional benefits: Linguistic statistics

Word counts – independent of term usage, what are the most common individual
   words?

     Word          Distinct Terms   Searches
     management              3128       8283
     sap                     1931       3873
     strategy                1414       3728
     business                1558       3599
     it                      1343       2992
     process                 1515       2920
     data                    1264       2899
     project                 1249       2823
     model                   1296       2791
     plan                     987       2170

Word networks – we can understand the inter-relationships between individual
   words (which pairs occur commonly together, which words occur commonly
   for a given word)




These are not as much about information needs as about understanding the language
   users use (so this insight can help shape categorization)

These are also very useful to prioritize your efforts in reviewing your search logs
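
Both statistics can be derived in one pass over the aggregated log. The sketch below assumes a dict of searches per distinct term and splits terms on whitespace only; the function name `word_stats` is made up for the example.

```python
from collections import Counter
from itertools import combinations


def word_stats(term_counts):
    """Compute per-word counts and word-pair co-occurrence from search terms.

    term_counts: dict mapping search term -> number of searches.
    Returns three Counters:
      distinct_terms[word] - number of distinct terms containing the word
      searches[word]       - number of searches containing the word
      pairs[(a, b)]        - number of searches containing both words (a < b)
    """
    distinct_terms = Counter()
    searches = Counter()
    pairs = Counter()
    for term, count in term_counts.items():
        term_words = sorted(set(term.lower().split()))
        for word in term_words:
            distinct_terms[word] += 1
            searches[word] += count
        # Every unordered pair of words within the same term co-occurs
        for pair in combinations(term_words, 2):
            pairs[pair] += count
    return distinct_terms, searches, pairs
```

`searches.most_common(10)` gives a table like the word-count one above, and `pairs.most_common()` is the raw material for a word network.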
Additional benefits: Comparing to your content space

With the statistics described in the previous slide, you could
  conceivably compare them to the same analysis applied to your
  “content space”

For example, derive the statistics for the titles of content available in
  your search
     • Do you find significant differences? This could represent differences in the
       names people apply to things and what they expect to use to find the content


Another interesting angle is to use other controlled lists as the
  matched terms in a category
     • People names (applied this and found about 8% of terms match a person’s
       name)
     • Client names



Understanding the quality of your users’ experience
The Problem

Search sucks!

Yes, the common refrain from many users – “search doesn’t return
  what I’m looking for” or “I can never find what I’m looking for”

There are many tools available to improve the users’ experience,
   including:
 • Improving the UI
 • Improving the content included
 • Manipulating settings in the engine to modify relevancy
   calculations, possibly even the engine itself

The challenge for many of these is, once you make a change, how
  do you know it has improved the results?

A solution?

One way to assess the impact is to have a set of users perform
  either a set of pre-defined searches or a set of their own searches
  and then evaluate the quality of results

The challenge with this is that it is very labor intensive, can take a
  long calendar time and is hard to do iteratively.

An alternative could be to automate this evaluation!

It is important to keep in mind that this is not about the relevancy of
    the results or determining whether the engine is returning the
    “right” items
  • It’s about assessing the user-perceived quality of a set of
     results given a set of criteria for a search


Automating evaluation

The idea is to automate some of the analysis of the quality of the
  result set by examining properties of the result set

This approach attempts to perform a simple test similar to what a
   human user would do in scanning a set of search results
 • It uses the data returned by the search engine and displayed on
   the first page of results
 • It does not do a “deep” review of content




The approach

The algorithm takes the following approach:

• For each search term, it executes the query against the search
  engine and retrieves the results
  ‒For each individual result, it calculates a quality score from 0.0 to
   1.0 (a higher score implies the result looks like a better result)
  ‒The individual scores for a search term’s set of results are
   averaged to get a single score for that search term

• In addition, the current POC outputs data in a tabular format
  including most of the individual elements returned by the search
  engine along with the derived score




What are we looking at in assessing quality?
Facets that influence quality
• Focusing primarily on user-visible aspects
[Screenshot of a results page annotated with: first page, result set size, snippet, title, age, uniqueness of title]

What are we looking at in assessing quality?
Factors that influence quality
•    Only examining the first page of results
•    Similarity / dissimilarity of keywords to title
•    Similarity / dissimilarity of keywords to excerpt
•    Uniqueness of titles within the result set (just first page)
•    Size of total result set
•    Age of results

• Looking for specific “known” targets
• (one “cheat”) Presence of keywords in “concepts” identified by
  engine
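
A minimal sketch of how a few of these factors could be combined into a 0.0 to 1.0 score per result, and averaged per search term, follows. Everything specific here is an assumption made for illustration: the weights, the two-year age cutoff, the simple word-overlap measure, and the names `Result`, `overlap`, `score_result` and `score_term`. Result set size, known targets and engine “concepts” are left out because the deck does not say how they are scored.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    snippet: str
    age_days: int


def overlap(query, text):
    """Fraction of the query's words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Hypothetical weights; they sum to 1.0 so each score stays between 0.0 and 1.0.
WEIGHTS = {"title": 0.4, "snippet": 0.3, "unique": 0.2, "age": 0.1}


def score_result(query, result, title_counts, max_age_days=730):
    """Score one result using only data visible on the results page."""
    unique = 1.0 if title_counts[result.title.lower()] == 1 else 0.0
    freshness = max(0.0, 1.0 - result.age_days / max_age_days)
    return (WEIGHTS["title"] * overlap(query, result.title)
            + WEIGHTS["snippet"] * overlap(query, result.snippet)
            + WEIGHTS["unique"] * unique
            + WEIGHTS["age"] * freshness)


def score_term(query, first_page):
    """Average the per-result scores for the first page of results."""
    if not first_page:
        return 0.0
    title_counts = Counter(result.title.lower() for result in first_page)
    total = sum(score_result(query, result, title_counts) for result in first_page)
    return total / len(first_page)
```

Keeping the weights in one table makes it easy to adjust them later and re-run the same set of search terms.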




What are we looking at in assessing quality?
Others that may be explored
• Balance across sources of content (does it match overall ratio?)

• Ratings of individual results

• Web domain of content (following an internet expectation that “some sources are
  better than others”)

• Match of terms could be altered to consider synonyms
• Examining taxonomy values
  ‒ Could apply matching to taxonomy values?
  ‒ Could be a “bonus” to items that have taxonomy?
• May want to make weights (e.g., impact of age) consider source or class of
  content
• Currently, in our search engine, best bets are automatically included.
  ‒ Would prefer to have them not included to see where they end up organically.
• Also, in our search engine, the exact order on a page has not been replicated so
  we can’t include the exact order as a factor

Validating the approach
Does this reflect how a human user would perceive the quality?
• This idea seems reasonable, but do we really have a way to
  determine if it is valid?
  ‒Or, do we run the risk that this would lead to “local maxima” for
   the factors measured but not meaningfully improve the user’s
   experience?

• So far, I have 2 independent ways to assess this
  ‒Comparing the results of this against a human assessment
  ‒Comparing the results of this against other factors that have been
   used as indicators of quality in the past




Validating the approach, p2
Comparing against a human assessment
• One of our on-going operations in GCKM is to review the quality of
  results for a very small number of terms
  ‒The chart below takes the output of the most recent such review for a
    subset of our “super search terms” and compares it against the
    programmatically calculated quality
  ‒There is at least a correlation between the automated score (the Y axis)
    and the manual score (the X axis)

[Scatter chart: Automated Score (Y axis, 0 to 0.8) vs. Manual Score (X axis, 0 to 1.2); linear fit y = 0.2781x + 0.3826, R² = 0.5803]
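
For anyone wanting to run the same check on their own data, the trend line and R² of such a chart come from an ordinary least-squares fit. The sketch below is a plain-Python version; the function name `linear_fit` is illustrative, and it assumes paired lists of manual and automated scores with at least two different manual values.

```python
def linear_fit(xs, ys):
    """Least-squares line through (x, y) points.

    Returns (slope, intercept, r_squared).
    Assumes the xs are not all equal and the ys are not all equal.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared: share of the variance in y explained by the fitted line
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Here `xs` would be the manual scores and `ys` the automated scores for the same search terms.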



Validating the approach, p3
Comparing against searches/term
• Within our search program, we use the ratio of searches per visit
  for a term as an indicator of the quality of the results
  ‒A user looking at more pages of results for a term indicates
    that it is harder for the user to find what they are looking for
  ‒The following chart displays a comparison between searches/visit
    (X-axis) and the automated quality score (Y-axis)
  ‒Again, we can see that there is a correlation, though perhaps not as
    strong as compared to the manual review

[Scatter chart: Quality (automated score, Y axis, 0 to 80) vs. searches/visit (X axis, shown from 50 down to 0) with a linear trend line; y = -0.6857x + 55.234, R² = 0.5225]



Validating the approach, p4
Summing up
• At this point, I am confident that the quality assessment we are
  producing automatically is reflecting the user’s general experience.
  ‒On individual items, it can vary significantly but in aggregate it
   appears to be valid
  ‒I have not yet dug into this but the automation enables the
   weights of each factor to be adjusted and it’s possible that we can
   get the automated score closer still to the “real” quality of results
   through adjusting weights




Additional benefits of this tool
Better analysis
• Given that this utility can output data in a spreadsheet format, this
  presents some other capabilities
  ‒Estimate total “search impressions” for specific targets
      • Analyze “search impressions” vs. usage
  ‒Analyze spread of returned results across sources
  ‒Analyze quality along a variety of dimensions (source,
   taxonomy values, etc.)
  ‒Compare result sets between terms that should show
   similar results
      • E.g., how similar are the results really for two synonyms?
  ‒Also, compare result sets along a temporal dimension
      • How much change is there from one month (week) to the next?
  ‒Analyze factors by depth into the “long tail”
  ‒Evaluate the quality of results for auto-complete terms
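
As one example of the result-set comparisons listed above, the sketch below measures how similar two first pages of results are. The choice of measure (Jaccard overlap of result identifiers such as URLs) and the name `result_set_similarity` are assumptions for illustration; the deck does not prescribe a particular measure.

```python
def result_set_similarity(results_a, results_b):
    """Jaccard similarity of two result sets, from 0.0 (disjoint) to 1.0 (identical).

    results_a, results_b: iterables of result identifiers, e.g. URLs.
    Ordering on the page is ignored.
    """
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0  # two empty result sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

The same function works for both comparisons: two synonyms searched at the same time, or one term searched in two different months.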

Quality of results split by taxonomy on the content
Better analysis - examples
• Quality of results averaged over the service area assigned to
  content

[Bar chart: Quality by Service Area of content, for Enterprise Applications, Human Capital, Outsourcing (Consulting), Strategy & Operations, and Technology Integration, with the overall average marked; Y axis from 33.0 to 38.0]




Quality of results by depth into the “long tail”
Better analysis - examples
• A chart of the quality of the result pages by how far into the long
  tail a search term is

[Chart: Quality by Depth into the "long tail"; quality (Y axis, 0.0 to 60.0) by depth into the long tail (X axis, 0 to 17500), with a power trend line y = 55.685x^(-0.14), R² = 0.5253]

Quality over time – comparing before and after an upgrade
Better analysis - examples
• This chart shows the # of terms by their change in quality through
  an upgrade of our search engine – overall change was +2%!
[Histogram: Change in Quality through an upgrade; number of terms (Y axis, 0 to 450) by percent change in quality (X axis, -46% to 81%), split into “Worse” and “Better”]




And, finally

For more about search analytics, I would highly recommend:

• “Search Analytics for Your Site” by Lou Rosenfeld
• www.searchtools.com – edited by Avi Rappoport

Also, you can find my own writings on search analytics (along with a
variety of other KM topics) on my blog:
• blog.leeromero.org




42

More Related Content

Similar to SIKM Leaders July 2012 - Understanding your Search Log

SEO Keyword Research & Mapping
SEO Keyword Research & Mapping SEO Keyword Research & Mapping
SEO Keyword Research & Mapping Vivastream
 
Classification, Tagging & Search
Classification, Tagging & SearchClassification, Tagging & Search
Classification, Tagging & SearchJames Melzer
 
Keyword research
Keyword researchKeyword research
Keyword researchStudent
 
Catégorisation automatisée de contenus documentaires : la ...
Catégorisation automatisée de contenus documentaires : la ...Catégorisation automatisée de contenus documentaires : la ...
Catégorisation automatisée de contenus documentaires : la ...butest
 
Search Analytics for Content Strategists
Search Analytics for Content StrategistsSearch Analytics for Content Strategists
Search Analytics for Content StrategistsLouis Rosenfeld
 
Personalizing Content Using Taxonomy with Megan Gilhooly, Vice President Cust...
Personalizing Content Using Taxonomy with Megan Gilhooly, Vice President Cust...Personalizing Content Using Taxonomy with Megan Gilhooly, Vice President Cust...
Personalizing Content Using Taxonomy with Megan Gilhooly, Vice President Cust...LavaConConference
 
Keyword Intelligence - Socialize West 2011
Keyword Intelligence - Socialize West 2011Keyword Intelligence - Socialize West 2011
Keyword Intelligence - Socialize West 2011Ron Jones
 
Ditch the Keyword Based Content Strategy
Ditch the Keyword Based Content StrategyDitch the Keyword Based Content Strategy
Ditch the Keyword Based Content Strategysemrush_webinars
 
Ditch the Keyword Based SEO Content Strategy
Ditch the Keyword Based SEO Content StrategyDitch the Keyword Based SEO Content Strategy
Ditch the Keyword Based SEO Content StrategyNicole Hess
 
Introduction to Enterprise Search
Introduction to Enterprise SearchIntroduction to Enterprise Search
Introduction to Enterprise SearchFindwise
 
Using AI to understand search intent
Using AI to understand search intentUsing AI to understand search intent
Using AI to understand search intentAritra Mandal
 
Analyzing Queries to Find Revenue Opportunities
Analyzing Queries to Find Revenue OpportunitiesAnalyzing Queries to Find Revenue Opportunities
Analyzing Queries to Find Revenue OpportunitiesBill Hunt
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

SIKM Leaders July 2012 - Understanding your Search Log

  • 1. Search analytics: Understanding the long tail
      SIKM Leaders, July 2012
      Lee Romero, blog.leeromero.org
  • 2. About me
      My background and early career are both in software engineering.
      I’ve worked in the knowledge management field for the last 12+ years, almost all of it in the technology of KM.
      I’ve worked with various search solutions for the last 7-8 years and spent most of that time trying to figure out how to measure their usefulness and improve them in any way I can.
      I’ve spoken at both Enterprise Search Summit and Taxonomy Boot Camp twice.
      My writings on search analytics have been featured by a number of experts in the field, including Lou Rosenfeld and Avi Rappoport.
  • 3. Search Analytics
      Definition: Search analytics is the field of analyzing and aggregating usage statistics of your search solution to understand user behavior and to improve the experience.
      Some search analytics are focused on SEO / SEM activities (for internet searches).
      The focus here is on enterprise search, so I will concentrate primarily on improving the user experience.
      Further, I will primarily focus on keyword search and on understanding the user language found in search logs.
      Always remember: analytics without action does not have much value.
  • 4. The challenge of your search log
  • 5. Understanding your search log
      For enterprise search solutions [1], the “80-20” rule is not true.
      The language variability is very high in a couple of ways (covered in the next few slides).
      Yet having a good understanding of the language, frequency and commonality in your search log is critical to being able to make sustainable improvements to your search.
      The remainder of this presentation first provides some evidence supporting my claim and then covers some ideas and research into this problem.
      [1] This does not seem to apply equally to e-commerce solutions.
  • 6. Some facts about search terms
      There’s an anecdote that goes something like, “80% of your searches are from 20% of your search terms.”
      • Equivalently, some will say that you can make significant impact by paying attention to a few of your most common terms (you can, but in limited ways).
      Fact: in enterprise search solutions the curve is much shallower.
      [Chart: the inverted power curve for two different solutions I’m currently working with]
      In the second case, it takes 13% of terms to cover 50% of searches, and that is over 7000 distinct terms in a typical month!
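The coverage figure quoted above (the share of distinct terms needed to cover half of all searches) can be computed directly from a term-frequency table. The following is a minimal sketch, assuming the search log has already been reduced to a mapping of term to search count; the function name and the toy log are hypothetical.

```python
from collections import Counter


def share_of_terms_for_coverage(term_counts, target=0.5):
    """Fraction of distinct terms (most frequent first) needed to cover
    `target` of all searches. `term_counts` maps term -> number of searches."""
    total = sum(term_counts.values())
    if total == 0:
        return 1.0
    covered = 0
    ranked = Counter(term_counts).most_common()
    for rank, (_term, count) in enumerate(ranked, start=1):
        covered += count
        if covered >= target * total:
            return rank / len(ranked)
    return 1.0


# Hypothetical toy log: 10 searches spread over 4 distinct terms.
toy_log = {"benefits": 5, "sap": 3, "timesheet": 1, "travel policy": 1}
print(share_of_terms_for_coverage(toy_log, target=0.5))
```

A shallow curve shows up as a large return value: the closer it gets to 0.5 for a 50% target, the less the most common terms dominate.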
  • 7. Some facts about search terms: part 2
      Another myth: a large percent of searches repeat over and over again.
      Fact: on enterprise search solutions, there is surprisingly little commonality month-to-month.
      Over a recent six month period, which saw a total of ~289K distinct search terms, only 11% of terms occurred in more than 1 month!

      # of months   # terms   % of terms
      1             257665    89.2%
      2             17994     6.2%
      3             5790      2.0%
      4             2900      1.0%
      5             2019      0.7%
      6             2340      0.8%
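A table like the one above can be derived by counting, for each distinct term, the number of months in which it appears. This is a sketch under the assumption that each month's log is available as a set of distinct terms; the function names are illustrative only.

```python
from collections import Counter


def months_seen_distribution(monthly_terms):
    """`monthly_terms` is a list with one collection of distinct terms per
    month. Returns a Counter mapping number-of-months -> number-of-terms."""
    months_per_term = Counter()
    for terms in monthly_terms:
        months_per_term.update(set(terms))  # each term counts once per month
    return Counter(months_per_term.values())


def share_in_multiple_months(distribution):
    """Share of distinct terms that occurred in more than one month."""
    total = sum(distribution.values())
    return (total - distribution.get(1, 0)) / total if total else 0.0


# The counts from the table above, keyed by number of months.
slide_table = {1: 257665, 2: 17994, 3: 5790, 4: 2900, 5: 2019, 6: 2340}
print(round(share_in_multiple_months(slide_table), 3))
```

Applied to the table's own numbers, this reproduces the roughly 11% of terms that occur in more than one month.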
  • 8. Some facts about search terms: part 3
      Another myth: a good percentage of your search terms will repeat in sequential periods.
      Fact: there is much more churn even month-to-month than you might expect. In the period covered below, only about 13% of terms repeated from one month to the next (covering about 36% of searches).
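One way to measure this month-to-month churn is sketched below. The slide does not say which month is used as the denominator, so this version assumes both shares are taken relative to the later month; that choice, and the function name, are assumptions.

```python
def repeat_rate(previous_counts, current_counts):
    """Return (share of the current month's distinct terms that also appeared
    in the previous month, share of the current month's searches they cover).
    Both arguments map term -> number of searches in that month."""
    if not current_counts:
        return 0.0, 0.0
    repeated = set(previous_counts) & set(current_counts)
    term_share = len(repeated) / len(current_counts)
    search_share = (sum(current_counts[t] for t in repeated)
                    / sum(current_counts.values()))
    return term_share, search_share


# Hypothetical two-month example.
june = {"sap": 3, "benefits": 1}
july = {"sap": 4, "timesheet": 1, "travel policy": 1, "org chart": 2}
print(repeat_rate(june, july))
```

The gap between the two numbers is the interesting part: repeating terms tend to be the frequent ones, so they cover a larger share of searches than of distinct terms.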
  • 9. What to do with your search log?
      The summary of the previous slides:
      • It is hard to understand a decent percentage of terms within a given time period (month)!
      • Even if you could do that, the problem during the next time period isn’t that much easier!
      The next sections describe a couple of research projects I’ve been working on to tackle these issues.
  • 11. Categorizing your users’ language
      Given the challenges previously laid out, using the search log to understand user needs seems very challenging.
      Beyond the first several dozen terms, it is hard to understand what users are looking for.
      • And those several dozen terms cover a vanishingly small percentage of all searches!
      However, it would be very useful for understanding your users’ information needs if we could somehow understand the entirety of the search log.
      How do we handle this? Categorize the search terms!
  • 12. Categorizing your users’ language, p2
      So we need to categorize search terms to really be able to understand our users’ information needs. To do this, we face two challenges:
      1. What categorization scheme should we use?
      2. How do we apply categorization in a repeatable, scalable and manageable way?
      For the first challenge, I would recommend you use your taxonomy (you do have one, right?).
      The second challenge is a bit more difficult but is addressed later in this deck.
  • 13. Categories to use
      Proposal: Start with your own taxonomy and its vocabularies as the categories into which search terms are grouped.
      Some searches will not fit into any of these categories, so you can anticipate the need to add further categories.
      As an aside, this exercise actually provides a great measurement tool for your taxonomy:
      • You can quantitatively assess the percent of your users’ language that is classifiable with your taxonomy
      • This is a number you may wish to drive up over time (through evolution of your taxonomy)
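The taxonomy measurement mentioned above can be expressed as two percentages: the share of distinct terms and the share of searches that fall into at least one category. A minimal sketch, assuming a term-to-count mapping and a term-to-categories mapping (both hypothetical data structures, not something the deck prescribes):

```python
def taxonomy_coverage(term_counts, term_categories):
    """Return (share of distinct terms, share of searches) that have at least
    one category. `term_counts` maps term -> searches; `term_categories`
    maps term -> set of category names (missing or empty = unclassified)."""
    total_terms = len(term_counts)
    total_searches = sum(term_counts.values())
    if total_terms == 0 or total_searches == 0:
        return 0.0, 0.0
    classified = [t for t in term_counts if term_categories.get(t)]
    return (len(classified) / total_terms,
            sum(term_counts[t] for t in classified) / total_searches)


# Hypothetical example data.
counts = {"sap": 6, "expense report": 3, "xyzzy": 1}
categories = {"sap": {"Technology"}, "expense report": {"Finance"}}
print(taxonomy_coverage(counts, categories))
```

Tracking both numbers over time shows whether changes to the taxonomy are actually capturing more of the users' language.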
  • 14. Automating categorization
      Now we turn to the hairier challenge: how can we categorize search terms?
      To describe the problem, we have:
      1. A set of categories, which may be hierarchically related (most taxonomies are)
      2. A set of search terms, as entered by users, that need to be assigned to those categories
      [Diagram: a list of search terms and a list of categories, with the mapping between them marked “?”]
  • 15. Automating categorization, p2
      The proposed solution is based on a couple of concepts:
      1. You can think of this categorization problem as search!
      2. You are taking each search term and searching in an index in which the potential search results are categories!
      Question: What is the “body” of what you are searching?
      Answer: Previously-categorized search terms!
      Using this approach, you can consider the set of previously-categorized search terms as a corpus against which to search.
      • You can apply all of the same heuristics to this search as to any search:
        ‒ Word matching (not string matching)
        ‒ Stemming
        ‒ Relevancy (word ordering, proximity, # of matches, etc.)
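The "categorization as search" idea above can be sketched as follows. This is an illustration rather than the matching process used in the original work: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, and the Jaccard word overlap used as the relevancy score is an assumption (it ignores word ordering and proximity).

```python
import re


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def words(term):
    """Word matching, not string matching: lower-case, split, stem."""
    return {stem(w) for w in re.findall(r"[a-z0-9]+", term.lower())}


def categorize(term, categorized_terms):
    """Search the corpus of previously categorized terms.
    `categorized_terms` maps a known term -> set of categories.
    Returns category -> best relevancy score (Jaccard word overlap)."""
    query = words(term)
    scores = {}
    for known, categories in categorized_terms.items():
        known_words = words(known)
        union = query | known_words
        if not union:
            continue
        score = len(query & known_words) / len(union)
        if score > 0:
            for category in categories:
                scores[category] = max(scores.get(category, 0.0), score)
    return scores


# Hypothetical corpus of previously categorized terms.
corpus = {"project management": {"Project Management"}, "sap": {"SAP"}}
print(categorize("sap projects", corpus))
```

Because "projects" is stemmed to "project", the query matches the "project management" entry even though the strings differ.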
  • 16. Automating categorization, p3
      Here’s a depiction of this solution.
      [Diagram: search terms, categories and previously categorized terms, joined by a red oval]
      The red oval represents the “matching” process: it takes as input the search terms to be categorized and the set of categories along with previously-matched search terms, and produces as output a set of categories associated with the new search terms.
  • 17. Automating categorization, p4: Bootstrapping
      This approach depends on matching to previously-categorized terms.
      • Every time you categorize a new search term, you expand the set of categorized terms, enabling more matches in the future
      Bootstrapping: You can take the names of the categories (the terms in your taxonomy) as the first set of “categorized search terms”.
      • This allows you to start with no search terms having been categorized at all
      • You run a first round of matching against the categories to find first-level matches
      • Take those that seem like “good” matches and pull those into the set of categorized search terms for a second iteration, etc.
      • Using this in initial testing resulted in 10% of distinct terms from a month being associated with at least one category
      Another aspect: Any manual categorization of common search terms will add to the success of categorization.
  • 18. Automating categorization, p5: Iterative
      [Diagram: search terms, categories, previously categorized terms and new categorizations]
  • 19. Automating categorization, p5: Iterative
      This approach also needs to be applied iteratively.
      • You start with a set of categorized search terms and a new set of (uncategorized) search terms
      • You then apply this matching to the uncategorized search terms, getting a set of newly-categorized search terms (with some measure of probability of “correctness” of the match, i.e., relevancy)
      • You pull in the newly-categorized search terms and run the matching process again
      • Each time, as you expand the set of categorized search terms (from a previous match), you increase the possibility of more matches (in subsequent matches)
  • 20. Automating categorization, p6: Iterative
      It will be beneficial to have a human review the set of matches for each iteration and determine if they are accurate enough.
      • The measurement of relevancy is intended to do this but would likely only be partially successful
      Over time, using this process, you build up a larger and larger set of categorized search terms.
      • This makes it more likely in future iterations that more terms will be categorizable
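The bootstrapping and iteration described in the last few slides can be sketched as a loop. This is a simplified illustration, not the original implementation: it uses plain word overlap without stemming, a hypothetical acceptance threshold of 0.5 in place of human review, and made-up example terms.

```python
def word_set(term):
    return set(term.lower().split())


def best_match(term, categorized):
    """Return (categories, score) for the best-scoring previously categorized
    term(s); ties at the best score merge their categories."""
    query = word_set(term)
    best_categories, best_score = set(), 0.0
    for known, categories in categorized.items():
        union = query | word_set(known)
        if not union:
            continue
        score = len(query & word_set(known)) / len(union)
        if score > best_score:
            best_categories, best_score = set(categories), score
        elif score == best_score and score > 0:
            best_categories |= categories
    return best_categories, best_score


def categorize_iteratively(search_terms, category_names, threshold=0.5,
                           max_rounds=10):
    # Bootstrapping: category names act as the first "categorized terms".
    categorized = {name.lower(): {name} for name in category_names}
    remaining = [t for t in search_terms if t.lower() not in categorized]
    for _ in range(max_rounds):
        newly = {}
        for term in remaining:
            categories, score = best_match(term, categorized)
            if categories and score >= threshold:
                newly[term.lower()] = categories
        if not newly:
            break
        # In practice a human would review `newly` before accepting it.
        categorized.update(newly)
        remaining = [t for t in remaining if t.lower() not in newly]
    return categorized, remaining


terms = ["sap training", "sap training schedule", "cafeteria menu"]
categorized, unmatched = categorize_iteratively(terms, ["SAP", "Training"])
print(categorized["sap training schedule"], unmatched)
```

In this toy run, "sap training schedule" is too weak a match against the bare category names in the first round, but matches once "sap training" has been pulled into the categorized set, which is the effect the iteration is meant to produce. "cafeteria menu" stays unmatched.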
  • 21. Automating categorization, p7: No matches
      There will always be search terms that do not get matched.
      • This may be because the terminology used does not match
      • This may be because there are no categories in the global taxonomy that would be useful for categorization
      The first issue would require a human to recognize the association (thus categorizing the term and then enabling matches on future uses of that term).
      The second issue would require adding in new categories (not part of the global taxonomy).
      • And then categorizing the term into the newly-added category(ies)
  • 22. Summary
      With this approach, we can take a set of search terms at any time and categorize them (partially) automatically.
      • Over time, the accuracy of the matching will improve through human review-and-approval of matches
      We then are able to relate these information needs to a variety of other pieces of data:
      • Volume of content available to users: significant mismatches can highlight need for new content
      • Rating of content in these categories: can highlight that a particular area of interest has content but it isn’t quality content
      • Downloads of content in these categories: could highlight navigational issues (e.g., when a category is much more highly represented in search than in downloads)
      This does not require directly working with end-users and is scalable.
  • 23. Additional benefits: Measuring your taxonomy
      As mentioned earlier, part of the challenge will be that there will be terms that do not match the starting categories (i.e., the global taxonomy).
      This actually highlights some valuable insight obtainable from this:
      • We can identify gaps in our taxonomy (terms requiring new categories)
      • We can identify areas of our taxonomy where we have many search terms associated with a taxonomy term and consider if we need to either add or split search terms in order to better match our users’ real language
      • We can identify areas of the taxonomy that are of little use in terms of the language used by our users
  • 24. Additional benefits: Linguistic statistics
      Word counts: independent of term usage, what are the most common individual words?

      Word         Distinct Terms   Searches
      management   3128             8283
      sap          1931             3873
      strategy     1414             3728
      business     1558             3599
      it           1343             2992
      process      1515             2920
      data         1264             2899
      project      1249             2823
      model        1296             2791
      plan         987              2170

      Word networks: we can understand the inter-relationships between individual words (which pairs occur commonly together, which words occur commonly for a given word).
      These are not as much about information needs as about understanding the language users use (so this insight can help shape categorization).
      These are also very useful to prioritize your efforts in reviewing your search logs.
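Word counts and word pairs like those described above can be gathered in one pass over a term-frequency table. A minimal sketch, assuming whitespace tokenization and a term-to-count mapping; the function name and example terms are hypothetical.

```python
from collections import Counter
from itertools import combinations


def word_statistics(term_counts):
    """Return three Counters: word -> number of distinct terms containing it,
    word -> number of searches containing it, and (word, word) -> number of
    searches in which both words occur (pairs are alphabetically ordered)."""
    distinct_terms = Counter()
    searches = Counter()
    pairs = Counter()
    for term, count in term_counts.items():
        term_words = sorted(set(term.lower().split()))
        for word in term_words:
            distinct_terms[word] += 1
            searches[word] += count
        for pair in combinations(term_words, 2):
            pairs[pair] += count
    return distinct_terms, searches, pairs


# Hypothetical term-frequency table.
log = {"project management": 5, "change management": 2, "project plan": 1}
distinct_terms, searches, pairs = word_statistics(log)
print(searches.most_common(3))
print(pairs.most_common(3))
```

The pair counter is the raw material for the "word network": for a given word, the pairs containing it show which other words it most often appears with.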
  • 25. Additional benefits: Comparing to your content space
      With the statistics described in the previous slide, you could conceivably compare them to the same analysis applied to your “content space”.
      For example, derive the statistics for the titles of content available in your search.
      • Do you find significant differences? This could represent differences between the names people apply to things and what they expect to use to find the content
      Another interesting angle is to use other controlled lists as the matched terms in a category:
      • People names (I applied this and found about 8% of terms match a person’s name)
      • Client names
  • 26. Understanding the quality of your users’ experience
  • 27. The Problem
      Search sucks!
      Yes, the common refrain from many users: “search doesn’t return what I’m looking for” or “I can never find what I’m looking for”.
      There are many tools available to improve the users’ experience, including:
      • Improving the UI
      • Improving the content included
      • Manipulating settings in the engine to modify relevancy calculations, possibly even the engine itself
      The challenge for many of these is, once you make a change, how do you know it has improved the results?
  • 28. A solution?
      One way to assess the impact is to have a set of users perform either a set of pre-defined searches or a set of their own searches and then evaluate the quality of results.
      The challenge with this is that it is very labor intensive, can take a long calendar time and is hard to do iteratively.
      An alternative could be to automate this evaluation!
      It is important to keep in mind that this is not about the relevancy of the results or determining whether the engine is returning the “right” items.
      • It’s about assessing the user-perceived quality of a set of results given a set of criteria for a search
  • 29. Automating evaluation
      The idea is to automate some of the analysis of the quality of the result set by examining properties of the result set.
      This approach attempts to perform a simple test similar to what a human user would do in scanning a set of search results.
      • It uses the data returned by the search engine and displayed on the first page of results
      • It does not do a “deep” review of content
  • 30. The approach
      The algorithm takes the following approach:
      • For each search term, it executes the query against the search engine and retrieves the results
        ‒ For each individual result, it calculates a quality score from 0.0 to 1.0 (a higher score implies the result looks like a better result)
        ‒ The individual scores for a search term’s set of results are averaged to get a single score for that search term
      • In addition, the current POC outputs data in a tabular format including most of the individual elements returned by the search engine along with the derived score
  • 31. What are we looking at in assessing quality?
      Facets that influence quality
      • Focusing primarily on user-visible aspects:
        ‒ First page
        ‒ Result set size
        ‒ Snippet
        ‒ Title
        ‒ Age
        ‒ Uniqueness of title
  • 32. What are we looking at in assessing quality?
      Factors that influence quality
      • Only examining the first page of results
      • Similarity / dissimilarity of keywords to title
      • Similarity / dissimilarity of keywords to excerpt
      • Uniqueness of titles within the result set (just first page)
      • Size of total result set
      • Age of results
      • Looking for specific “known” targets
      • (one “cheat”) Presence of keywords in “concepts” identified by engine
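To make the scoring idea concrete, here is a small sketch that combines four of the factors listed above (keyword overlap with the title, keyword overlap with the excerpt, title uniqueness on the page, and age) into a score between 0.0 and 1.0 per result, then averages per search term. The weights, the five-year freshness horizon and the `Result` structure are all hypothetical; the deck does not specify how the factors are actually weighted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Result:
    title: str
    snippet: str
    age_days: int


# Hypothetical weights; they sum to 1.0 so every score stays in [0.0, 1.0].
WEIGHTS = {"title": 0.4, "snippet": 0.3, "unique_title": 0.15, "fresh": 0.15}
MAX_AGE_DAYS = 5 * 365  # assumed horizon after which age earns no credit


def keyword_overlap(query, text):
    """Share of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def score_result(query, result, title_is_unique):
    freshness = max(0.0, 1.0 - result.age_days / MAX_AGE_DAYS)
    return (WEIGHTS["title"] * keyword_overlap(query, result.title)
            + WEIGHTS["snippet"] * keyword_overlap(query, result.snippet)
            + WEIGHTS["unique_title"] * (1.0 if title_is_unique else 0.0)
            + WEIGHTS["fresh"] * freshness)


def score_term(query, first_page):
    """Average the per-result scores for the first page of results."""
    if not first_page:
        return 0.0
    title_counts = Counter(r.title.lower() for r in first_page)
    scores = [score_result(query, r, title_counts[r.title.lower()] == 1)
              for r in first_page]
    return sum(scores) / len(scores)


page = [
    Result("Travel policy", "Company travel policy for employees", 0),
    Result("Travel policy", "old", 3650),
]
print(score_term("travel policy", page))
```

Keeping the weights in one table makes it easy to experiment with them later without touching the scoring logic.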
  • 33. What are we looking at in assessing quality?
      Others that may be explored
      • Balance across sources of content (does it match overall ratio?)
      • Ratings of individual results
      • Web domain of content (following an internet expectation that “some sources are better than others”)
      • Match of terms could be altered to consider synonyms
      • Examining taxonomy values
        ‒ Could apply matching to taxonomy values?
        ‒ Could be a “bonus” to items that have taxonomy?
      • May want to make weights (e.g., impact of age) consider source or class of content
      • Currently, in our search engine, best bets are automatically included
        ‒ Would prefer to have them not included to see where they end up organically
      • Also, in our search engine, the exact order on a page has not been replicated so we can’t include the exact order as a factor
  • 34. Validating the approach
      Does this reflect how a human user would perceive the quality?
      • This idea seems reasonable, but do we really have a way to determine if it is valid?
        ‒ Or, do we run the risk that this would lead to “local maximums” for the factors measured but not meaningfully improve the user’s experience?
      • So far, I have 2 independent ways to assess this
        ‒ Comparing the results of this against a human assessment
        ‒ Comparing the results of this against other factors that have been used as indicators of quality in the past
  • 35. Validating the approach, p2
      Comparing against a human assessment
      • One of our on-going operations in GCKM is to review the quality of results for a very small number of terms
        ‒ The below takes the output of the most recent of these reviews for a subset of our “super search terms” and compares it against the programmatically calculated quality
        ‒ There is at least a correlation between the automated score (the Y axis) and the manual score (the X axis)
      [Chart: Automated Score (Y axis) vs. Manual Score (X axis), with linear fit y = 0.2781x + 0.3826, R² = 0.5803]
  • 36. Validating the approach, p3
      Comparing against searches/term
      • Within our search program, we use the ratio of searches per visit for a term as an indicator of the quality of the results
        ‒ The more pages of results a user looks at for a term, the harder it appears to be for the user to find what they are looking for
        ‒ The following chart displays a comparison between searches/visit (X-axis) and the automated quality score (Y-axis)
        ‒ Again, we can see that there is a correlation, though perhaps not as strong as compared to the manual review
      [Chart: Quality vs. searches/visit, with trend line “Linear (Quality)”: y = -0.6857x + 55.234, R² = 0.5225]
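For reference, the trend lines and R² values quoted on the two validation charts are what an ordinary least-squares fit produces. For a simple linear fit, the correlation coefficient is the square root of R² with the sign of the slope, so R² = 0.5803 corresponds to r ≈ 0.76 and R² = 0.5225 with a negative slope to r ≈ -0.72. The sketch below shows how such a fit could be computed from paired scores; the sample data is made up and the function name is hypothetical.

```python
from statistics import mean


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept.
    Returns (slope, intercept, r_squared). Requires at least two points
    and some variation in both xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up paired scores: manual score (x) vs. automated score (y).
manual = [0.2, 0.4, 0.6, 0.8, 1.0]
automated = [0.45, 0.48, 0.60, 0.58, 0.66]
print(linear_fit(manual, automated))
```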
  • 37. Validating the approach, p4
      Summing up
      • At this point, I am confident that the quality assessment we are producing automatically is reflecting the user’s general experience
        ‒ On individual items, it can vary significantly but in aggregate it appears to be valid
        ‒ I have not yet dug into this, but the automation enables the weights of each factor to be adjusted and it’s possible that we can get the automated score closer still to the “real” quality of results through adjusting weights
  • 38. Additional benefits of this tool
      Better analysis
      • Given that this utility can output data in a spreadsheet format, this presents some other capabilities
        ‒ Estimate total “search impressions” for specific targets
          • Analyze “search impressions” vs. usage
        ‒ Analyze spread of returned results across sources
        ‒ Analyze quality along a variety of dimensions (source, taxonomy values, etc.)
        ‒ Compare result sets between terms that should show similar results
          • E.g., how similar are the results really for two synonyms?
        ‒ Compare result sets along a temporal dimension
          • How much change is there from one month (week) to the next?
        ‒ Analyze factors by depth into the “long tail”
        ‒ Evaluate the quality of results for auto-complete terms
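Comparing result sets, whether for two synonyms or for the same term a month apart, can be reduced to a single overlap number. A minimal sketch using Jaccard similarity over result identifiers; the choice of metric and the example URLs are assumptions, not something specified in the deck.

```python
def result_overlap(results_a, results_b):
    """Jaccard similarity of two first-page result sets, each given as a
    collection of result identifiers (e.g., URLs). 1.0 means identical sets,
    0.0 means nothing in common."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical first-page results for two terms expected to be synonyms.
vacation = ["/hr/leave-policy", "/hr/holiday-calendar", "/hr/pto-form"]
pto = ["/hr/pto-form", "/hr/leave-policy", "/payroll/faq"]
print(result_overlap(vacation, pto))
```

The same function applies to the temporal comparison: pass in last month's and this month's results for one term.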
  • 39. Quality of results split by taxonomy on the content
      Better analysis - examples
      • Quality of results averaged over the service area assigned to content
      [Chart: “Quality by Service Area of content”, showing average quality for Enterprise Applications, Human Capital, Outsourcing (Consulting), Strategy & Operations and Technology Integration against the overall average; Y axis from 33.0 to 38.0]
  • 40. Quality of results by depth into the “long tail”
      Better analysis - examples
      • A chart of the quality of the result pages by how far into the long tail a search term is
      [Chart: “Quality by Depth into the "long tail"”, quality (0.0 to 60.0) against term rank (0 to 17500), with power-law fit y = 55.685x^-0.14, R² = 0.5253]
  • 41. Quality over time: comparing before and after an upgrade
      Better analysis - examples
      • This chart shows the # of terms by their change in quality through an upgrade of our search engine; overall change was +2%!
      [Chart: “Change in Quality through an upgrade”, number of terms (0 to 450) by change in quality (from -46% to +81%), split into “Worse” and “Better”]
  • 42. And, finally
      For more about search analytics, I would highly recommend:
      • “Search Analytics for your Site” by Lou Rosenfeld
      • www.searchtools.com, edited by Avi Rappoport
      Also, you can find my own writings on search analytics (along with a variety of other KM topics) on my blog:
      • blog.leeromero.org