Clustering Search Log Data
 

Presented at the Harvard ABCD-WWW/CMS session, Nov. 15, 2012

A previous version of this talk was presented at Enterprise Search Europe, May 2012

    Clustering Search Log Data: Presentation Transcript

    • Clustering Search Query Log Data to Improve Search
      Sophy Bishop & Ravi Mynampaty
      Copyright © President & Fellows of Harvard College.
    • Agenda
      • Background
      • Five W’s of Clustering
        o What, why, who, how, when
      • Is it really repeatable?
      • Questions
    • About Information Management Services (IMS)
      Analytics; Lifecycle Mgmt.; Metadata Mgmt.; Taxonomy; Search Dev.
      - Standards - Best Practices - User Needs - Service Models
    • Inspired by… Chapters 8 & 9
    • About this talk…
      • Case study on how we are improving search and browse by performing clustering exercises on your search query data
      • Not rocket science
      • High-level overview
      • You can follow this method, with your own insights and tweaks
      • You can kick this off next week at your work
    • What is clustering?
      A process for organizing and analyzing search log data that:
      • Is repeatable, low-cost, scalable, simple
      • Yields actionable results
      • Supports constant incremental improvement to search
    • What’s clustering good for?
      • Ensures results for high-frequency queries
      • Improves metadata and taxonomy
      • Informs and validates decision making in site IA
      • Informs editorial/curatorial activities
      • Provides feedback for search suggestions
        o Autosuggest, synonym lists, no-hits page suggestions
      But more on this later...
    • So how do I cluster search queries?
      A simple set of steps:
      Create query report → Cluster queries → Determine # clusters to analyze → Analyze queries → Draw conclusions and ACT
    • Step 1: Create a query report
      We started with the site with the most traffic
      • Upper-bound limit
      • One year’s data by quarter
      • Cut off tail at frequency < 10
      HBS Working Knowledge FY12 Use Snapshot: Overall Traffic
      Page Views:                 6,439,485
      Visits:                     3,635,746
      Unique visitors:            2,734,620
      On-site searches:           174,425
      Views per Visit:            1.77
      Local Search visit rate:    5%
      Organic Search visit rate:  46%
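As a rough illustration of Step 1, the sketch below counts raw query strings and cuts off the tail at frequency < 10, as the slide describes. The function name `build_query_report`, the `min_frequency` parameter, and the sample log are hypothetical; the talk does not say which tool produced the actual report.

```python
from collections import Counter

def build_query_report(raw_queries, min_frequency=10):
    """Count each query string and drop the long tail below min_frequency."""
    counts = Counter(raw_queries)
    report = [(q, n) for q, n in counts.items() if n >= min_frequency]
    # Highest-frequency queries first; ties broken alphabetically.
    report.sort(key=lambda item: (-item[1], item[0]))
    return report

# Made-up log: one entry per search performed.
sample_log = ["branding"] * 12 + ["brand"] * 10 + ["brand equity"] * 3
print(build_query_report(sample_log))
```

Here "brand equity" falls below the cutoff and is dropped, which keeps the later clustering work focused on queries that recur.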
    • Step 2: Cluster the queries
    • Step 2 (cont’d): Three levels of clustering
      Level      Method                Example
      Narrow     Simple normalization  Eliminate grammatical, spelling, typos, and punctuation differences
      Mid-level  Group by subject      management, finance, decision making
      Broad      Group by facet        topic, name, date, content type
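The narrow level can be partly automated. The sketch below is one possible approach, not the authors' procedure: it lowercases, replaces punctuation with spaces, collapses whitespace, and merges the frequencies of queries that become identical. Spelling and typo correction are not handled and would still need a dictionary or human review. The function names and the frequencies are made up.

```python
import re
from collections import defaultdict

def normalize(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = query.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())

def narrow_cluster(report):
    """Merge (query, frequency) rows whose normalized forms are identical."""
    merged = defaultdict(int)
    for query, freq in report:
        merged[normalize(query)] += freq
    return dict(merged)

# Hypothetical frequencies for illustration only.
report = [
    ("Balanced Scorecard", 40),
    ("balanced scorecard", 25),
    ("balanced-scorecard", 10),
    ("ethics", 15),
]
print(narrow_cluster(report))
```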
    • Step 2 (cont’d): Levels → Tasks Enabled
      Tasks (columns): Improve your base for query analysis; Ensure representation of major clusters on your site; Improve Metadata/Index/Taxonomy; Improve Search Suggestions
      Narrow (simple):               X X X
      Mid-level (group by subject):  X X X
      Broad (group by facet):        X X
    • Step 2 (cont’d): Narrow Clustering Example
    • Step 2 (cont’d): Mid-level Example
      Cluster: brand
      Query                                       Frequency
      branding                                    245
      brand                                       160
      brand management                            73
      consumer branding                           57
      global brand                                32
      service brands                              24
      brand image retail bank                     17
      employer branding                           16
      brand management professional services      16
      global branding                             13
      b2b branding                                13
      importance of branding                      12
      brand 2002                                  12
      brand equity                                11
      brand image                                 11
      [Chart: query frequencies for the “customer” and “brand” clusters]
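A mid-level cluster like "brand" can be approximated with a simple rule, though the talk implies the grouping relied on human judgment. The sketch below assumes a hand-maintained table of word stems per subject (`SUBJECT_STEMS`, hypothetical) and assigns each query to the first subject whose stem begins one of its words. The "brand" frequencies come from the slide; the other rows are invented.

```python
from collections import defaultdict

# Hypothetical stem table; a real one would be curated by a person.
SUBJECT_STEMS = {
    "brand": ("brand",),
    "innovation": ("innovat",),
}

def assign_subject(query, subject_stems=SUBJECT_STEMS):
    """Return the first subject whose stem starts any word of the query, else None."""
    tokens = query.lower().split()
    for subject, stems in subject_stems.items():
        if any(tok.startswith(stem) for tok in tokens for stem in stems):
            return subject
    return None

def mid_level_clusters(report):
    """Group (query, frequency) rows by assigned subject; None holds unassigned queries."""
    clusters = defaultdict(list)
    for query, freq in report:
        clusters[assign_subject(query)].append((query, freq))
    return dict(clusters)

report = [
    ("branding", 245), ("brand", 160), ("brand management", 73),
    ("global brand", 32),
    ("open innovation", 20),   # invented frequency
    ("mount everest", 11),     # invented frequency
]
clusters = mid_level_clusters(report)
print(clusters["brand"])
```

Queries left under `None` are the ones a person would review by hand.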
    • Step 2 (cont’d): Broad Clustering Example
    • Step 2 (cont’d): List of facets we used
      Facet                         Example
      content type                  case studies, cases, working papers, articles, newspaper
      date                          2011, world in 2030
      demographic characteristics   women, Gen Y, gender, baby boomers
      event                         economic crisis
      format                        podcast, video
      geographic area               india, japan, mount everest
      industry                      global wine industry
      job type/role                 independent director, entrepreneur, ceo, phd economist
      organization name             ikea, zara, toyota
      person name                   michael porter, kanter, sebenius
      product name / brand name     ipad
      product/commodity             coffee, wine, cement
      topic                         this covers the majority of keywords
      work                          faculty work, ex: publication name, title of a case
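Broad clustering by facet could be bootstrapped from term lists like the examples above. The sketch below is an assumption about how that might look, not the authors' method: `FACET_TERMS` is a small hypothetical lookup seeded with a few of the slide's examples, and any query with no match falls back to "topic", since the slide notes that topic covers the majority of keywords.

```python
# Hypothetical lookup seeded from the slide's examples; a real list would be much longer.
FACET_TERMS = {
    "format": ["podcast", "video"],
    "geographic area": ["india", "japan", "mount everest"],
    "organization name": ["ikea", "zara", "toyota"],
    "person name": ["michael porter", "kanter", "sebenius"],
}

def facets_for(query, facet_terms=FACET_TERMS):
    """Return the sorted facets whose terms appear as whole words or phrases in the query."""
    # Pad with spaces so "india" does not match inside "indiana".
    padded = " " + " ".join(query.lower().split()) + " "
    found = {
        facet
        for facet, terms in facet_terms.items()
        if any(" " + term + " " in padded for term in terms)
    }
    return sorted(found) if found else ["topic"]

print(facets_for("toyota in japan"))
print(facets_for("negotiation"))
```

A query can carry more than one facet, so the function returns a list rather than a single label.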
    • Step 3: Choose # clusters to analyze
      Number of Clusters Analyzed   Analyze Top Hits   Improve Metadata/Taxonomy/Index   Supply Search Suggestions
      50                            X
      150                           X                  X
      300+                          X                  X                                 X
    • Small # Clusters can cover a lot of your data
      Number of top clusters   % Total Queries
      Top 20 clusters          14
      Top 30 clusters          18
      Top 50 clusters          26
      Top 100 clusters         37
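A coverage figure of this kind is the summed frequency of the top N clusters divided by the total number of searches. The sketch below shows the arithmetic on made-up numbers; `top_cluster_coverage` and its inputs are hypothetical names, and `total_queries` is assumed to include the long tail.

```python
def top_cluster_coverage(cluster_totals, total_queries, top_n):
    """Percentage of all searches that fall in the top_n largest clusters."""
    ranked = sorted(cluster_totals.values(), reverse=True)
    return 100.0 * sum(ranked[:top_n]) / total_queries

# Invented totals: three clusters out of 400 searches overall.
cluster_totals = {"innovation": 50, "leadership": 30, "ethics": 20}
print(top_cluster_coverage(cluster_totals, total_queries=400, top_n=2))
```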
    • Now you have your clusters… What do you do with them? TAKE ACTION!
    • Analyze Top (“Short Head”) Clusters
      Clustering has created a condensed and reliable list of your top search queries
      • Are they what you thought they would be?
      • Does the information on your site accurately represent the top searches?
      • Are you fulfilling user needs?
    • Use your clusters: Improve Site Navigation
      Examine the short head of clusters, basically:
      • For each cluster, add up the frequencies of queries
      • Reorder clusters by cumulative frequency, descending
      • Ensure top clusters are accounted for in your navigation
      • Use cluster topics as browse/navigation headers/footers for your website
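The first three steps above can be sketched as follows. The input shape (a list of `(cluster, query, frequency)` rows), both function names, and the sample navigation list are assumptions for illustration; the "open innovation" frequency is invented.

```python
from collections import Counter

def rank_clusters(clustered_queries):
    """Sum query frequencies per cluster and return clusters by descending total."""
    totals = Counter()
    for cluster, _query, freq in clustered_queries:
        totals[cluster] += freq
    return totals.most_common()

def missing_from_navigation(ranked, nav_topics, top_n=10):
    """List top clusters that have no matching entry in the site navigation."""
    nav = {topic.lower() for topic in nav_topics}
    return [cluster for cluster, _total in ranked[:top_n] if cluster.lower() not in nav]

rows = [
    ("brand", "branding", 245),
    ("brand", "brand", 160),
    ("innovation", "open innovation", 500),  # invented frequency
]
ranked = rank_clusters(rows)
print(ranked)
print(missing_from_navigation(ranked, nav_topics=["Innovation", "Leadership"]))
```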
    • WK Top Clusters
      Cluster                           Frequency
      innovation                        867
      balanced scorecard                794
      leadership                        570
      cases                             545
      social media                      508
      negotiation                       470
      knowledge management              457
      ethics                            448
      apple                             430
      corporate social responsibility   398
    • Use your clusters: Improve Taxonomy
      • Missing categories in browse taxonomy
        o “Balanced Scorecard”
        o “Ethics”
        o “Social media”
      • Second-level topics in the WK context
    • Mid-level clustering: Informs editorial/curatorial activities
      • “Featured Topics”
        o What topics to highlight this week/month/year
        o News items to focus on
        o What research guides to create
        o How to formulate queries for the topics
    • Use your clusters: Improve Synonym Handling
      • Clustered list provides synonyms for taxonomy
      • Requires human judgment and standards/guidelines for synonyms – in our case, synonyms are exact
      • Map to one "like term" in the search engine
      • Example: Balanced Scorecard, BSC, Balanced score card kaplan and norton -> Balanced Scorecard
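Mapping exact synonyms to one "like term" amounts to a lookup table. The sketch below is a stand-in for whatever synonym file the search engine actually consumes, which the talk does not specify; `SYNONYMS` and `to_preferred_term` are hypothetical names, and only exact (case-insensitive) matches are mapped, in line with the slide's note that synonyms are exact.

```python
# Hypothetical synonym table built from a reviewed cluster list.
SYNONYMS = {
    "bsc": "balanced scorecard",
    "balanced score card": "balanced scorecard",
}

def to_preferred_term(query, synonyms=SYNONYMS):
    """Return the preferred term for an exact synonym match, else the cleaned query."""
    key = " ".join(query.lower().split())
    return synonyms.get(key, key)

print(to_preferred_term("BSC"))
print(to_preferred_term("Balanced  Score Card"))
print(to_preferred_term("ethics"))
```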
    • Use your clusters: Improve no-hits page
    • Time Commitment
      • 2 hours to 2 weeks
      • Variables include:
        o What kind of information you want to gather
        o How broad or narrow you want your clusters
        o How many queries you analyze
      • In our case ~2 person-weeks
        o We had Sophy Bishop
        o Intern, MSLIS student
    • Results vs. Time Invested
      Time Invested   Analyze top clusters   Update Taxonomy   Create New Metadata   Determine New Search Suggestions
      2 Hours         X                      X
      6 Hours         X                      X                 X
      One Week        X                      X                 X                     X
    • Next Steps: Autosuggest
      • Your top clusters probably make up a large percentage of what people are looking for
        o Use them to establish/supplement auto-suggest!
      • Example: suggestions for “innovation”
        o innovation and leadership
        o disruptive innovation
        o innovation management
        o open innovation
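One way to seed suggestions from a cluster is to offer its most frequent multi-word queries for a typed term. The sketch below assumes a `(query, frequency)` report; the `suggest` function, the `limit` parameter, and all frequencies are hypothetical, and a production autosuggest would normally live in the search engine's own configuration.

```python
def suggest(term, report, limit=4):
    """Return up to `limit` longer queries containing `term` as a word, most frequent first."""
    term = term.lower()
    matches = [
        (query, freq)
        for query, freq in report
        if term in query.lower().split() and query.lower() != term
    ]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _freq in matches[:limit]]

# Invented frequencies for illustration.
report = [
    ("innovation", 300),
    ("disruptive innovation", 80),
    ("open innovation", 40),
    ("innovation management", 55),
    ("leadership", 200),
]
print(suggest("innovation", report))
```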
    • Next Steps: New Access Structures
      • Needed an obvious way to search podcasts
        o Put in best bets for now
      • A lot of people searching for article titles
        o Considering simple interface/approach for select field-specific search, e.g. “title”
      • Consider adding other facets to browse taxonomy where we have entities tagged
        o “company name”, “job type/class”, etc.
    • Next Steps
      • SEO Optimization Input
        o Advise authors to use top cluster terms in Titles, Abstracts, Keywords
        o Report on clusters in our monthly analytics reports to faculty (“Top search topics/subjects in May 2012 were…”; “Searchers found your works with following queries”)
      • Repeat process on other sites/content
    • Summary
      • Establish a plan/process, but be willing to tweak as you go
      • Keep it very simple
      • Play with your data – the more we played, the better we understood what benefits could be realized by levels of clustering and effort
      • Tuning process/results
        o Build staging/working prototypes
        o Repeat process on other sites
      • TAKE ACTION!
    • Thank you! Questions?
      sophybishop@gmail.com @sophreads
      searchguy@hbs.edu @ravimynampaty