Better Search Engine Testing - Eric Pugh

See conference video - …

"I know it when I see it".

This phrase was coined by a Supreme Court Justice in reference to obscenity, but he might as well have been talking about relevancy and search engine results. Testing search engines is rarely a binary process of "it works, it doesn't work"; instead, it draws on our human skills to design tests that capture the intangibles that make up a great search engine implementation! The behavior of a search engine changes as the data changes, so a search that returns one set of results today may return a different set tomorrow. Is that a bug? Or just a finely tuned search engine responding to changes in the data it searches? Search engine testing often focuses on the very first layer of functionality, "Do I get results?", without digging deeper into "Do I get great, relevant results?".



  • 1. BETTER SEARCH ENGINE TESTING
    STPCON 2011 | EPUGH@O19S.COM | @DEP4B
    WHY AM I QUALIFIED TO BE UP HERE?
      • President of OpenSource Connections
      • Contributor to CruiseControl and Continuum CI projects
      • Member of Apache Software Foundation
      • Presenter at conferences (OSCON, ApacheCON, jTDS, ExpoQA, STPcon 2009!)
  • 4. AGENDA
      • Why is Search Becoming More Important?
      • What is a Search Engine?
      • Techniques for Testing
      • Wrap Up
    WHY IS SEARCH BECOMING MORE IMPORTANT?
  • 5. INFORMATION IS EXPLODING
      • “information workers ... are each bombarded with 1.6 gigabytes of information on average every day through emails, reports, blogs, text messages, calls and more”.
    UNSTRUCTURED
      • emails, spreadsheets, documents, presentations, images, databases
      • 75% unstructured to 25% structured
  • 6. MANAGING DATA IS EXPENSIVE
      • 1 GB costs $0.20 to store
      • 1 GB costs $3,500 to manage
    WHAT DOES $3,500 BUY YOU?
      • 69% of respondents felt 50% or less of data could be found online
      • Knowledge workers spend 25% of their time engaged in search-related activities.
  • 7. WHY NOT JUST USE GOOGLE
      • We don’t want 44 million results, we want 1
      • We want “the” answer, not “an” answer
      • We tolerate inefficiency in Internet search
      • We are happy to “satisfice”
    As John Allen Paulos puts it: “The Internet is the world's largest library. It's just that all the books are on the floor.”
    WHAT IS A SEARCH ENGINE?
  • 11. CONTENT INDEXING
      • Creating an index by crawling the content directories, databases, and other repositories using an automated process (either pushing or pulling changes)
      • Create an index, which is a searchable key to a collection.
      • In Enterprise Search, the indexing mechanism should be able to access company private data (with access privileges maintained)
      • Control the indexing schedule: being able to index rapidly changing content quickly, other content more slowly.
      • rather than having the bot look for the data.
    CONTENT INDEXING
      • Indexing may also support
          • Metadata extraction
          • Auto-summarization, which analyzes the collection and groups its content into categories or clusters.
      • Metadata in turn becomes facets that can be used to tune the query to put emphasis on that category.
  • 13. FORMATTING
    FACETING
      • Faceted or "guided navigation" leverages metadata fields and values to provide users with visible options for narrowing or refining their query. - Peter Morville, Search Patterns
  • 14. Search Stack: User Interface, Search Engine, Data
    HOW DO WE TEST?
  • 15. HOW DO WE TEST?
      • Querying
      • Formatting
      • Content Indexing
      • Performance
    WHO SHOULD TEST?
  • 16. CHALLENGES
      • Competing business stakeholders:
          • Tester: When I search for “lamp shades”, I used to see these documents, now I see a different set.
          • Business Owner: How do I know that the new search engine is better?
          • User: My pet feature “search within these results” works differently.
          • Marketing Guy: I want to control the results so the current marketing push for toilet paper brand X always shows up at the top.
    CHALLENGES
      • Stakeholders want a better search implementation, but perversely often want it to all work “the exact same way”.
      • Getting agreement across all the stakeholders on the project vision, and on the metrics, is a challenge.
  • 17. PERFECT SEARCH TESTER WOULD BE ALL OF
      • Mathematician
      • Librarian
      • UX Expert
      • Writer
      • Programmer
      • Business Analyst
      • Systems Engineer
      • Geographer!
      • Psychologist
    KNOWLEDGE TRANSFER
      • If you don’t have the perfect team already, bring in experts and do domain knowledge transfer.
      • Learn the vocabulary of search to better communicate together
          • “auto complete” vs “auto suggest”
      • Do “Search for Content Team” brownbag sessions!
  • 18. QUERY TESTING
      • Often called “relevancy testing”
    TWO SCHOOLS OF THOUGHT
      • “One True Answer”
      • “I know it when I see it”
  • 19. “ONE TRUE ANSWER”
      • Absolute Truth / Matrix / Grid / TREC / Relevancy Assertions
          • The correct answers for each search are known ahead of time
          • Human judges often decide these correct answers, stored as Relevancy Assertions
          • Can be labor intensive to set up
      • A “Numerical Grade” is produced for comparison
    PROBLEMS WITH THIS APPROACH
      • Open to gaming. TREC competition is swamped by “academic” search engine efforts that don’t work in the real world.
      • Need a well understood data set with generally accepted answers.
      • Is it better to have an engine that gives modestly relevant results almost all the time, or an engine that gives really good answers sometimes, better on average than the other engine, but sometimes gives back complete garbage?
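The "One True Answer" approach on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's tooling: the `ASSERTIONS` table, the document ids, the `grade_engine` helper, and the `fake_search` stand-in are all hypothetical, and the grading formula (average share of asserted documents found in the top results) is just one simple way to produce a numerical grade.

```python
from typing import Callable, Dict, List, Set

# Hypothetical relevancy assertions: for each test query, the document ids
# that human judges agreed are correct answers. All ids are made up.
ASSERTIONS: Dict[str, Set[str]] = {
    "lamp shades": {"doc-12", "doc-40", "doc-7"},
    "prune juice": {"doc-3"},
}


def grade_engine(
    search: Callable[[str], List[str]],
    assertions: Dict[str, Set[str]],
    top_n: int = 10,
) -> float:
    """Return a 0-100 grade: the average share of asserted documents
    that appear in the first top_n results of each query."""
    shares = []
    for query, expected in assertions.items():
        returned = set(search(query)[:top_n])
        shares.append(len(expected & returned) / len(expected))
    return 100.0 * sum(shares) / len(shares)


def fake_search(query: str) -> List[str]:
    """Stand-in for a real engine call; returns canned, ranked doc ids."""
    canned = {
        "lamp shades": ["doc-12", "doc-99", "doc-7"],
        "prune juice": ["doc-3", "doc-8"],
    }
    return canned.get(query, [])


grade = grade_engine(fake_search, ASSERTIONS)
print(f"grade: {grade:.1f}")
```

Running the same assertions against two versions of an engine gives two grades to compare, which is also where the "open to gaming" problem comes from: an engine can be tuned to the assertion set rather than to real users.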
  • 20. A/B TESTING
    Engine version 1 and version 2!
      • Tracks explicit or implicit preferences between engines A/B
      • Often dispenses with the notion of the "correct" answer
      • Can be easier to set up, but some fear the best answers will be missed by both engines
    RELEVANCY
      • Do we have any defined relevancy metrics?
      • Relevancy is like porn.....
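A minimal sketch of tallying A/B preferences, assuming each tester (or click log) yields a vote of "A", "B", or "tie" per query. The vote format and the `tally_preferences` and `preference_share` helpers are assumptions made for illustration; note that no "correct" answer is needed anywhere.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally_preferences(votes: Iterable[Tuple[str, str]]) -> Counter:
    """Count votes per engine. Each vote is (query, choice), where choice
    is "A", "B", or "tie"."""
    counts = Counter({"A": 0, "B": 0, "tie": 0})
    for _query, choice in votes:
        if choice not in counts:
            raise ValueError(f"unknown choice: {choice!r}")
        counts[choice] += 1
    return counts


def preference_share(counts: Counter, engine: str) -> float:
    """Share of decided (non-tie) votes that went to the given engine."""
    decided = counts["A"] + counts["B"]
    return counts[engine] / decided if decided else 0.0


# Made-up votes comparing engine version 1 (A) with version 2 (B).
votes = [
    ("lamp shades", "A"),
    ("prune juice", "B"),
    ("car seat", "B"),
    ("stroller", "tie"),
    ("baby fever", "B"),
]
counts = tally_preferences(votes)
print(counts, preference_share(counts, "B"))
```

With only a handful of votes the share says little; the sample size concerns raised later in the deck apply directly here.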
  • 21. I KNOW IT WHEN I SEE IT!
    BEYOND PRECISION AND RECALL: HOW ENGINES ARE
      • Binary vs. Non-Binary Grading Systems
          • Early TREC had binary judgements, only Yes/No on whether each doc was related to a test search
          • More choices were later added
          • A system can use letter grades (A, B, C, D and F) or numeric grades
          • Another style asks testers to sort documents in their preferred order
  • 22. CLASSIC MEASUREMENTS OF SEARCH RELEVANCY
      • Recall: "Did I find all the documents I expected to get back? What percent?"
      • Precision: "Did the system bring back other documents that weren't relevant? What percent were on target?"
    NEWER IDEAS
      • Rank: The order of documents that were returned
          • Generally a 1 in 20 match in the #1 spot is better than a 50% rate where all matches are on the second page.
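The two classic measurements, plus one simple way to account for rank, can be sketched as follows. Reciprocal rank is my choice of rank-aware measure for illustration; the slide does not name a specific formula. The two result lists mirror the slide's comparison: one relevant match in 20 at the #1 spot versus a 50% hit rate where every match sits on the second page (assuming ten results per page).

```python
from typing import List, Set


def precision(returned: List[str], relevant: Set[str]) -> float:
    """Share of returned documents that are relevant."""
    if not returned:
        return 0.0
    hits = sum(1 for doc in returned if doc in relevant)
    return hits / len(returned)


def recall(returned: List[str], relevant: Set[str]) -> float:
    """Share of relevant documents that were returned."""
    if not relevant:
        return 0.0
    return len(set(returned) & relevant) / len(relevant)


def reciprocal_rank(returned: List[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant document, or 0.0 if none."""
    for position, doc in enumerate(returned, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


relevant = {f"hit-{i}" for i in range(10)}

# List A: 20 results, only the very first one is relevant.
list_a = ["hit-0"] + [f"miss-{i}" for i in range(19)]
# List B: 20 results, page one is all misses, page two is all hits.
list_b = [f"miss-{i}" for i in range(10)] + [f"hit-{i}" for i in range(10)]

for name, results in (("A", list_a), ("B", list_b)):
    print(
        name,
        precision(results, relevant),
        recall(results, relevant),
        reciprocal_rank(results, relevant),
    )
```

List B wins on both precision and recall, yet a user who never leaves page one sees nothing useful; the rank-aware measure is the only one of the three that favors list A.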
  • 23. INTERACTIVITY: WHAT NAVIGATORS OR VISUALIZATION WERE GIVEN
      • Facets and sorting: Clickable filters and sort options
      • Unsupervised Clustering: Related terms or phrases, or related searches
      • Spelling and thesaurus suggestions
    SUBJECT DISAMBIGUATION, SENTIMENT, CONFLICTING INFORMATION, CROWD HINTS
      • kidney bean or kidney cell
      • "best football team in the UK"
  • 24. SOURCES OF VARIANCE, AKA "PROBLEMS"
    Note: this is talking about comparing search engine A to search engine B, but I am thinking more in the context of search engine v1 to v2!
  • 25. DIFFERENT GOALS
      • Perfect/Human vs. Best vs. Acceptable vs. Better than X
      • Constrained vs. Unconstrained Resources (time, cpu, storage)
    SAMPLE SIZE
      • Amount of Data
          • Fixed set or growing over time
      • Number of Testers (AB or Relevancy Judgments)
      • Number of Searches
  • 26. VERTICAL VS. HORIZONTAL CONTENT
      • One extreme: Specific demo may cover just one discipline, for example Medical Journals
      • Other extreme: Internet covers vastly disparate domains
    USERS
      • Experienced vs. New Searcher
      • Subject Expert vs. Novice
      • Spelling, typing and computer proficiency
      • Interface Medium (large visual display, small text display, audible, Braille, etc)
      • Amount of Effort to understand Search
      • Willingness to Iterate
      • Searching for specific answer vs. General Exploration
  • 27. TYPE OF SEARCHES
      • Length / 1 or 2 words
      • Full question
      • Sample text
      • Internet Boolean
      • Advanced Boolean / Syntax / Proximity
          • Wildcard, Regex, etc.
    PUNCTUATION
      • Chemical
      • Source Code
      • Units of Measure
      • Literal vs. Search Operator
  • 28. NOT EVEN GETTING INTO MULTI LINGUAL SEARCH
      • How do I test in languages I don’t understand?
    GROK YOUR RESULTS
  • 29. FORMATTING TESTING
      • Directly builds on most of our existing test skills.
    PERSONAS & SCENARIOS
  • 30. Persona 1: Going to be a mom
    "Oh my God I’m actually Pregnant!"
    Needy
      • What’s next? What am I supposed to do? Guidance please!
      • To interact with people going through the same thing.
    Narrative (Self Introduction): Hi all, Im very new to this but i couldnt help but share my excitement. I have just found out today that i am pregnant. It wasnt planned, me and my partner of a year and a half were going to wait until we had our own place and were married first but it looks like we have done it the other way round. My only concern is that i dont really know how my boyfriend feels about it. I know we need to discuss the options but i have really already made up my mind about what i want to do. There is so much to consider, money, a decent place to live, being ready but i know i am ready and have been for a long time ( I get extremely broody when i see my friends kids) Should i just tell him how i feel or go with how he feels because i dont want to lose him. He is a loving partner who would stand by me through anything i just don’t want him to feel like i am tying him down!!! I suppose i am feeling very happy but also very confused at the same time!!!
    __f468_maternite1-Oh-my-god-i-m-pregnant.html
      • Scenarios that typify - planned to get pregnant, but hasnt done any research
      • Catch phrases - Nervous but excited, giddy, "Where do I start?"
      • Tag lines - Wants to share, has a million questions
      • Likely to say - Guide me, help me get off to a good start
    Persona 2: New Mom
    "Are my kids sick or is this condition normal? How do I…?"
    Demanding
      • How do I ensure my baby is latching on correctly?
      • What type of stroller should I buy? What brand of car seat is best?
    Narrative: I have been hearing about women who claim that thier 2, 2 1/2 or 3 year old is not ready for the potty. They claim its a nightmare and are waiting for their children to come around. Maybe I grew up in the twilight zone, but I had always assumed that potty training was something that is just done. Its done when: a) The child in question can sleep through the night and stay dry. b) The child in question can speak to you, in full sentences. like, "apple juice, please" or "wanna go to the park" or "momma I wanna hold you..." c) The child in question knows they are soiled and can ask to be changed. Barring any of those things, a child is ready to be placed on the potty. using the potty was never negotiable in my family. When we hit the above milestones my mother trained us. We just did it. If we complained she never put diapers on us, she just kept directing us back to the potty. Her methods of redirection may be controversial (she told my brother that unless he was a big boy he would not get a happy meal. Boys who pooed on themselves got sad meals... lol!!! He straightened up and started using the potty at 2 1/2) but she was never abusive or anything she just DIDNT ASK US. it was time to potty and that was it. The reasoning was that I used to drink from a bottle, and sleep with my mother, and such, now I dont. I also used to crap my pants, and that is no longer allowed after a certain point. My question is this: why ask children if they are ready to use the potty, sometimes after they are clearly ready to use it (with language tools and bladder control)? Why is it treated like something that is negotiable or that the child has a choice of either coming around to it or not? I understand that children are sensitive and you have to follow their lead, at times. But allowing them to shit …
    Photo caption: This picture captures my life perfectly: an adult beverage sitting on a book about underpants.
    Scenarios that Typify
      • Likely to say - Are my kids sick or is this condition normal?
      • Describes herself - wants to be a good mother, looking for expert advice, wants to get ideas from other moms
      • Narrative - could be working mom, could be stay at home mom
      • Questions likely to ask - wants to ask questions/get expert advice
  • 31. Scenario 1: Find old answer
    “I know I went through this before with my first child, but cannot recall the answer”
    Preamble: Experienced mom has a déjà vu moment about a previous problematic experience with her first child. She has a partial recollection of a piece of information related to the answer she seeks but she needs help in pulling
    Success Factors
      • Speed of Comprehension
      • Directness to destination
      • Reduced:
          • Number of queries
          • Number of results
      • Indirect Knowledge Transfer
    Thinking aloud in the Family Room
      • Panel 1 (baby crying): Hhhm I know I had the same issue with josh, but what the heck did I do?
        After hours of frustration mother home alone has a partial epiphany as to her child’s problem.
      • Panel 2: Josh had not started to cry non-stop for 3 hours when it finally dawned on me that he had not had a movement for 3 days . . . Let’s try querying that . . . “no poop” . . . Not likely . . . Uumm . . . “constipation”? Oh, might help to specify who as well . . . “baby” . . .
        Mother starts to type in query but suggest-as-you-type search box hints to her to be more specific.
      • Panel 3: Very nice – lists out related concerns for constipation. Let’s see: ‘symptoms’, ‘cures’, ‘when to call the doctor’, ‘what other moms are saying’, ‘topic overview’. Ok – I’ll take ‘cures’ Alex for a 300 points and my personal sanity! Water . . . fruit juice . . . high-fiber baby foods - Ahhh prune juice . . . prune juice! Now why didn’t I remember that!
        Structured results quickly tip off the mother to the assorted aspects of constipation. She focuses in on one of the aspects and has total recollection of her previous experience.
    Scenario 2: Urgent Question
    “It’s 2am and I don’t know who to ask?”
    Preamble: Mother of twins finds herself panicked in the early morning hours with a new situation.
    Success Factors
      • Speed of Comprehension
      • Directness to answer
    Crying in the Kitchen
      • Panel 1 (babies crying): Crap! Who am I supposed to ask at this hour! Why is it nobody is open when I need them?!
        In the middle of the night, a mother of twins finds herself alone, overwhelmed, and in dire need of an answer.
      • Panel 2 (babies crying): I don’t have to read hundreds of pages on the internet . . . I just need a quick concise answer . . . at what temperature do I need to be worried . . ?! Please [BabyCenter] show me the answer . . !
        Mother starts to type in a query but notices the suggest-as-you-type search box lets her narrow her question, boosting her confidence she is going to get the answer she needs.
      • Panel 3 (babies crying): ‘102’ . . . thank god! We’re safe . . . Ahhh . . . that’s helpful - other conditions to know about . . . That’s thorough: ‘What will the doctor do?’ Interesting: ‘If fever is a defense against infection, is it really a good idea to try to bring it down?’ Let me bookmark this for later.
        The mother zooms in on the specific answer she seeks. But then she notices collateral knowledge she takes note of for later reading.
  • 32. CONTENT INDEXING TESTING
      • Leverages our normal testing skills. And typically what it really means is “Performance Testing”.
      • Lots of “integration” testing.
    PERFORMANCE TESTING
  • 33. LEVELS OF SCALING
      • Scale High
          • There is a quickly hit point of diminishing returns!
      • Scale Wide
          • The safety valve for lots of load.
      • Scale Deep
          • Scaling Deep? You are doing some crazy stuff with huge indexes!!
    SCALE WIDE (SLAVES)
      • Too many inbound queries!
      • slaves poll master for changes
      • index and config files transferred
      • ALL JAVA!
  • 34. SCALE WIDE (SHARDING)
      • Too large of an index to query
      • Split index over multiple Search servers
          • A -> M: Server 1, N -> Z: Server 2
          • uniqueId.hash % numServers
      • Relevancy typically balanced shards
      • Request split across shards, results aggregated to single response
    SCALE DEEP
      • Combine both scaling wide to handle number of queries with sharding to handle size of indexes!
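The sharding slide's two ideas, routing a document with `uniqueId.hash % numServers` and aggregating per-shard results into a single response, can be sketched as below. This is an illustration under stated assumptions, not how any particular search server implements it: `shard_for` and `merge_results` are hypothetical helpers, CRC32 is used only because Python's built-in `hash()` for strings changes between processes, and the merge assumes scores from different shards are directly comparable.

```python
import heapq
import zlib
from typing import List, Tuple

Hit = Tuple[float, str]  # (relevancy score, document id)


def shard_for(unique_id: str, num_servers: int) -> int:
    """Pick the shard for a document: stable hash of the id, modulo the
    number of servers (the slide's uniqueId.hash % numServers)."""
    return zlib.crc32(unique_id.encode("utf-8")) % num_servers


def merge_results(per_shard: List[List[Hit]], limit: int) -> List[Hit]:
    """Aggregate each shard's hits into one response, best score first.
    Assumes scores are comparable across shards."""
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit[0])


# Index time: decide where each (made-up) document lives.
for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "-> server", shard_for(doc_id, 2))

# Query time: every shard answers, then the results are merged.
shard_responses = [
    [(0.9, "doc-a"), (0.2, "doc-b")],
    [(0.7, "doc-c")],
]
print(merge_results(shard_responses, limit=2))
```

For testing, the point to note is that a sharded query touches every shard, so both a slow shard and a shard with skewed scores show up in the merged response.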
  • 35. WRAP UP
    OSC APPROACH TO SEARCH
      • Methodology: Concurrent Streams of Work (User Interface, Search Engine, Data)
      • Iteration 2 Story: Operationalize Solr. Deploy Solr into BabyCenter Test Environment
      • Iteration 2 Story: Search Analysis. Integrate Solr into Community UI, A/B Testing
      • Iteration 2 Story: Search Experience. Conceptual Model (Personas, etc) & Mockups
  • 36. OSC APPROACH TO SEARCH
    RESOURCES
      • Google-for-Enterprise-Knowledge
      • and-precision/
  • 37. SEARCHPATTERNS.ORG
    THANK YOU!
      • twitter: dep4b
      • speakerrate:
      • email: