Search Analytics for Fun and Profit
  An Event Apart, Chicago, Illinois, August 27, 2007
  Lou Rosenfeld, www.rosenfeldmedia.com
Who I Am
  - Information architecture consultant to Fortune 500s
  - Publisher and founder, Rosenfeld Media
  - Blog at www.louisrosenfeld.com
  - Co-author, Information Architecture for the World Wide Web (3rd ed., 2006; O’Reilly)
  - New book: Search Analytics for Your Site: Conversations with your customers (2008; Rosenfeld Media): www.rosenfeldmedia.com/books/searchanalytics
Anatomy of a Search Log (from Google Search Appliance)
  Critical elements (highlighted in pink on the slide): IP address, time/date stamp, query, and # of results:
  XXX.XXX.X.104 - - [10/Jul/2006:10:25:46 -0800] "GET /search?access=p&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&ud=1&site=AllSites&ie=UTF-8&client=www&oe=UTF-8&proxystylesheet=www&q=lincense+plate&ip=XXX.XXX.X.104 HTTP/1.1" 200 971 0 0.02
  XXX.XXX.X.104 - - [10/Jul/2006:10:25:48 -0800] "GET /search?access=p&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&ie=UTF-8&client=www&q=license+plate&ud=1&site=AllSites&spell=1&oe=UTF-8&proxystylesheet=www&ip=XXX.XXX.X.104 HTTP/1.1" 200 8283 146 0.16
  XXX.XXX.XX.130 - - [10/Jul/2006:10:24:38 -0800] "GET /search?access=p&entqr=0&output=xml_no_dtd&sort=date%3AD%3AL%3Ad1&ud=1&site=AllSites&ie=UTF-8&client=www&oe=UTF-8&proxystylesheet=www&q=regional+transportation+governance+commission&ip=XXX.XXX.X.130 HTTP/1.1" 200 9718 62 0.17
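The four critical elements can be pulled out of a log line like the ones above with a short parser. This is a minimal sketch, not the appliance's documented format: it assumes the four numbers after the request are HTTP status, bytes sent, number of results, and elapsed seconds (inferred from the slide's highlighting), and the names LINE_RE and parse_line are hypothetical. The sample line is shortened from the slide for readability.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit, parse_qs

# Assumed layout: ip - - [timestamp] "GET url HTTP/x.y" status bytes results seconds
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"GET (?P<url>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d+) (?P<size>\d+) (?P<results>\d+) (?P<secs>[\d.]+)'
)

def parse_line(line):
    """Return ip, time, query and result count for one log line, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # parse_qs decodes "+" and %XX escapes, so "lincense+plate" becomes "lincense plate"
    params = parse_qs(urlsplit(m.group("url")).query)
    return {
        "ip": m.group("ip"),
        "time": datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z"),
        "query": params.get("q", [""])[0],
        "results": int(m.group("results")),
    }

# Shortened version of the first log line on the slide.
sample = ('XXX.XXX.X.104 - - [10/Jul/2006:10:25:46 -0800] '
          '"GET /search?access=p&q=lincense+plate&ip=XXX.XXX.X.104 HTTP/1.1" '
          '200 971 0 0.02')
record = parse_line(sample)
```

Here record["query"] is the misspelled "lincense plate" and record["results"] is 0, which is the kind of zero-result query the later slides focus on.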
The Zipf Curve:  Short Head, Middle Torso, Long Tail
Keep It In Proportion
  7218  campus map
  5859  map
  5184  im west
  4320  library
  3745  study abroad
  3690  schedule of courses
  3584  bookstore
  3575  spartantrak
  3229  angel
  3204  cata
What’s the Sweet Spot?
  Rank   Cumul. %   Count   Query
  1      1.40       7218    campus map
  14     10.53      2464    housing
  42     20.18      1351    webenroll
  98     30.01      650     computer center
  221    40.05      295     msu union
  500    50.02      124     hotels
  7877   80.00      7       department of surgery
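A rank and cumulative-percentage table like the one above can be derived from raw query counts. The sketch below is one way to do it; the function name cumulative_share is hypothetical, and the sample uses only four of the counts from the previous slide, so its percentages will not match the slide, which is based on the full log.

```python
from collections import Counter

def cumulative_share(counts):
    """Return (rank, query, count, cumulative %) rows, most frequent first."""
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (query, n) in enumerate(counts.most_common(), start=1):
        running += n
        rows.append((rank, query, n, round(100 * running / total, 2)))
    return rows

# Four counts taken from the "Keep It In Proportion" slide; a real log has thousands.
counts = Counter({"campus map": 7218, "map": 5859, "im west": 5184, "library": 4320})
rows = cumulative_share(counts)
```

Reading down the cumulative column shows how few distinct queries cover a given share of all searches, which is where to stop when deciding how many queries to tune by hand.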
Topical Patterns and Seasonal Changes
Where will you Capture Search Queries?
  - The search logs that your search engine naturally captures and maintains as searches take place
  - Search keywords or phrases that your users execute, that you capture into your own local database
  - Search keywords or phrases that your commercial search solution captures, records, and reports on (Mondosoft, Visual Sciences, Ultraseek, Google Appliance, etc.)
Querying your Queries: Getting started
  - What are the most frequent unique queries?
  - Are frequent queries retrieving quality results?
  - Click-through rates per frequent query?
  - Most frequently clicked result per query?
  - Which frequent queries retrieve zero results?
  - What are the referrer pages for frequent queries?
  - Which queries retrieve popular documents?
  - What interesting patterns emerge in general?
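Two of these starter questions (most frequent unique queries, and frequent queries with zero results) need only the query text and the result count. A minimal sketch follows, assuming each log record has already been reduced to a dict with "query" and "results" keys; that record shape and the function names are assumptions, not part of any tool named in the talk.

```python
from collections import Counter

def normalize(query):
    """Fold case and surrounding whitespace so near-identical queries count together."""
    return query.strip().lower()

def frequent_queries(records, top=10):
    """Most frequent unique queries."""
    return Counter(normalize(r["query"]) for r in records).most_common(top)

def zero_result_queries(records, top=10):
    """Most frequent queries that returned no results."""
    return Counter(
        normalize(r["query"]) for r in records if r["results"] == 0
    ).most_common(top)

# Queries and result counts taken from the sample log lines earlier in the deck.
records = [
    {"query": "lincense plate", "results": 0},
    {"query": "license plate", "results": 146},
    {"query": "regional transportation governance commission", "results": 62},
    {"query": "License Plate", "results": 146},
]
top = frequent_queries(records)
dead_ends = zero_result_queries(records)
```

Click-through and referrer questions need extra fields (clicked URL, referring page) that the sample log lines do not show, so they are left out here.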
Tune your Questions: From generic to specific
  Netflix asks:
  - Which movies most frequently searched?
  - Which of them most frequently clicked through?
  - Which of them least frequently added to queue?
Diagnose This: Fixing and improving the UX
  - User Research
  - Content Development
  - Interface Design: search entry interface, search results
  - Retrieval Algorithm Modification
  - Navigation Design
  - Metadata Development
User Research: What do they want?…
  - SA is a true expression of users’ information needs (often surprising: e.g., SKU #s at clothing retailer; URLs at IBM)
  - Provides context by displaying aspects of single search sessions
User Research: …what else do they want?… BBC provides reports to determine other terms searched within same session (tracked by cookies)
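A "what else did they search for" report of this kind can be approximated by grouping queries by session and counting what co-occurs with a given term. This is a sketch under assumptions, not the BBC's implementation: it assumes each record is a (session_id, query) pair, with the session id coming from a cookie, and the sample data is invented for illustration.

```python
from collections import Counter, defaultdict

def co_searched(records, term):
    """Count other queries issued in the same sessions as `term`."""
    sessions = defaultdict(set)
    for session_id, query in records:
        sessions[session_id].add(query.strip().lower())
    others = Counter()
    for queries in sessions.values():
        if term in queries:
            others.update(queries - {term})
    return others

# Hypothetical (session_id, query) pairs.
records = [
    ("s1", "weather"), ("s1", "travel news"),
    ("s2", "weather"), ("s2", "travel news"), ("s2", "radio 4"),
    ("s3", "football"),
]
related = co_searched(records, "weather")
```

Each query is counted once per session, so a user who repeats the same search does not inflate the totals.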
User Research: …who wants it?…
  - Specific segments’ needs, as determined by:
      - Security clearance
      - IP address
      - Job function
      - Account information
  - Alternatively, you may be able to extrapolate segments directly from SA:
      - Pages they initiate searches from
User Research: …who wants it?… BBC’s top  queries report from children’s section of site
User Research: …and when do they want it?
  - Time-based variation (and clustered queries) from MSU
  - By hour, by day, by season
  - Helps determine “best bets” development
  - Also can help tune main page and other editorial content
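Time-based variation can be surfaced by bucketing queries on the timestamp already present in the log. The sketch below buckets by hour; the (datetime, query) record shape and the function name are assumptions, and the same idea extends to weekday or month by changing the bucket key.

```python
from collections import Counter, defaultdict
from datetime import datetime

def top_queries_by_hour(records, top=3):
    """Map each hour of day (0-23) to its most frequent queries."""
    by_hour = defaultdict(Counter)
    for when, query in records:
        by_hour[when.hour][query.strip().lower()] += 1
    return {hour: counts.most_common(top) for hour, counts in sorted(by_hour.items())}

# Hypothetical records; the queries are borrowed from the MSU top-ten slide.
records = [
    (datetime(2006, 7, 10, 9, 5), "schedule of courses"),
    (datetime(2006, 7, 10, 9, 40), "Schedule of Courses"),
    (datetime(2006, 7, 10, 22, 15), "library"),
]
hourly = top_queries_by_hour(records)
```

A query that dominates one bucket but not others is a candidate for a time-limited best bet or a seasonal main-page link.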
Content Development: Do we have the right content?
  - From www.behaviortracking.com
  - Analyze 0 result queries
  - Does the content exist?
      - If so, there are titling, wording, metadata, or indexing problems
      - If not, why not?
Content Development: Are we featuring the right stuff?
  - Track clickthroughs to determine which results should rise to the top (example: SLI Systems)
  - Also suggests which “best bets” to develop to address common queries
  - BBC removes navigation pages from search results
Search Entry Interface Design: “The Box” or something else?
  - Identify “dead end” points (e.g., 0 hits, 2000 hits) where assistance could be added
  - Query syntax helps you select search features to expose (e.g., use of Boolean operators)
Search Results Interface Design: Which results where?
  - #10 result is clicked through more often than #s 6, 7, 8, and 9 (ten results per page)
  - From SLI Systems (www.sli-systems.com)
Search Results Interface Design: How to sort results?
  - Financial Times has found that users often include dates in their queries
  - Obvious but effective improvement: allow users to sort by date
Search System: What to change?
  - Add functionality: Financial Times added spell checking
  - Retrieval algorithm modifications
      - Financial Times weights company names higher
      - Netflix determines better weighting for unique terms and phrases
  - Deloitte, Barnes & Noble, Vanguard demonstrate that basic improvements (e.g., Best Bets) are insufficient (and justify increased $$$)
Navigation: Any improvements?
  - Michigan State University builds A-Z index automatically based on frequent queries
Navigation: Where does it fail?
  - Track and study pages (excluding main page) where search is initiated
  - What do they search? (e.g., acronyms, jargon)
  - Are there other issues that would cause a “dead end”? (e.g., tagging and titling problems)
  - Are there user studies that could test/validate problems on these pages? (e.g., “Where did you want to go next?”)
Metadata Development: How do searchers express their needs?
  - Tone and jargon (e.g., “cancer” vs. “oncology,” “lorry” vs. “truck,” acronyms)
  - Syntax (e.g., Boolean, natural language, keyword)
  - Length (e.g., number of terms/query; Long Tail queries longer and more complex than Short Head)
  - Everything we know from analyzing folksonomic tags applies here, and vice versa
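Length and syntax are the easiest of these to measure directly. A rough sketch follows; it treats only the uppercase tokens AND, OR and NOT as Boolean operators, which is a simplifying assumption since engines differ in the syntax they accept, and the sample queries are illustrative.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set; adjust per engine

def query_shape(queries):
    """Return mean terms per query and the share of queries using Boolean operators."""
    if not queries:
        return {"mean_terms": 0.0, "boolean_share": 0.0}
    lengths = [len(q.split()) for q in queries]
    boolean = sum(1 for q in queries if BOOLEAN_OPERATORS & set(q.split()))
    return {
        "mean_terms": sum(lengths) / len(lengths),
        "boolean_share": boolean / len(queries),
    }

sample = ["campus map", "library", "housing AND parking", "study abroad"]
shape = query_shape(sample)
```

Running the same function separately on Short Head and Long Tail queries gives a quick check on whether the tail really is longer and more complex in a given log.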
Metadata Development: Which values and attributes?
  - Uncover hierarchy and identify
      - Metadata values (e.g., mobile vs. cell)
      - Metadata attributes (e.g., genre, region)
      - Content types (e.g., spec, price sheet)
  - SA combines with AI tools for clustering, enabling concept searching and thesaurus development
Metadata Development: Leveraging differences in the curve
  - Variations in information needs emerge between Short Head and Long Tail
  - Example: Deloitte intranet’s “known-item” queries are common; research topics are infrequent
  - Chart labels: known-item queries; research queries
Organizational Impact: Educational opportunities
  - “Reverse engineer” performance problems
  - Vanguard
      - Tests “best” results for common queries
      - Determines why these results aren’t retrieved or clicked-through
      - Demonstrates problem and solutions to content owners/authors benefits
  - Sandia Labs does same, only with top results that are losing rank in search results pages
Organizational Impact: Reexamining assumptions
  - Financial Times learns about breaking stories from their logs by monitoring spikes in company names and individuals’ names and comparing with their current coverage
  - Discrepancy = possible breaking story; reporter is assigned to follow up
  - Next step? Assign reporters to “beats” that emerge from SA
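The spike-versus-coverage comparison can be sketched as two small steps: flag terms whose count today is well above their usual daily average, then drop the ones already being covered. This illustrates the idea only and is not the Financial Times' method; the ratio and min_count thresholds, the function names, and the sample data are all assumptions.

```python
def spiking_terms(today, baseline_daily_avg, ratio=3.0, min_count=20):
    """Return (term, count, usual) for terms searched far more often than usual."""
    spikes = []
    for term, count in today.items():
        usual = baseline_daily_avg.get(term, 0.0)
        # max(usual, 1.0) keeps brand-new terms from dividing the threshold to zero
        if count >= min_count and count >= ratio * max(usual, 1.0):
            spikes.append((term, count, usual))
    return sorted(spikes, key=lambda row: row[1], reverse=True)

def uncovered(spikes, covered_terms):
    """Spiking terms with no current coverage: possible breaking stories."""
    return [term for term, _, _ in spikes if term not in covered_terms]

# Hypothetical counts for illustration.
today = {"acme corp": 120, "markets": 300, "jane doe": 45}
baseline = {"acme corp": 10.0, "markets": 280.0}
spikes = spiking_terms(today, baseline)
leads = uncovered(spikes, covered_terms={"acme corp"})
```

In this sample "markets" is popular but not unusual, "acme corp" is spiking but already covered, and "jane doe" is left as the lead to follow up.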
SA as User Research Method: Sleeper, but no panacea
  - Benefits
      - Non-intrusive
      - Inexpensive and (usually) accessible
      - Large volume of “real” data
      - Represents actual usage patterns
  - Drawbacks
      - Provides an incomplete picture of usage: was user satisfied at session’s end?
      - Difficult to analyze: where are the commercial tools?
  - Complements qualitative methods (e.g., persona development, task analysis, field studies)
SA Headaches: What gets in the way?
  - Problems*
      - Lack of time
      - Few useful tools for parsing logs, generating reports
      - Tension between those who want to perform SA and those who “own” the data (chiefly IT)
      - Ignorance of the method
      - Hard work and/or boredom of doing analysis
  - Most of these are going away…
  * From summer 2006 survey (134 responses), available at book site.
Please Share Your SA Knowledge: Visit our book in progress site
  - Search Analytics for Your Site: Conversations with your Customers, by Louis Rosenfeld and Richard Wiggins (Rosenfeld Media, 2008)
  - Site URL: www.rosenfeldmedia.com/books/searchanalytics/
  - Feed URL: feeds.rosenfeldmedia.com/searchanalytics/
Contact Information
  Louis Rosenfeld
  Rosenfeld Media, LLC
  705 Carroll Street, #2L
  Brooklyn, NY 11215 USA
  +1.718.306.9396
  [email_address]
  www.louisrosenfeld.com
  www.rosenfeldmedia.com
