Better Search Engine Testing

  2. 2. WHY AM I QUALIFIED TO BE UP HERE? • President of OpenSource Connections • Contributor to CruiseControl and Continuum CI projects • Member of Apache Software Foundation • Presenter at conferences (OSCON, ApacheCon, jTDS, ExpoQA, STPcon 2009!)
  7. 7. AGENDA • Why is Search Becoming More Important? • What is a Search Engine? • Techniques for Testing • Wrap Up
  9. 9. INFORMATION IS EXPLODING • “information workers ... are each bombarded with 1.6 gigabytes of information on average every day through emails, reports, blogs, text messages, calls and more”.
  10. 10. UNSTRUCTURED • Emails, spreadsheets, documents, presentations, images, databases • 75% unstructured to 25% structured
  11. 11. MANAGING DATA IS EXPENSIVE • 1 GB costs $0.20 to store • 1 GB costs $3500 to manage
  12. 12. WHAT DOES $3500 BUY YOU? • 69% of respondents felt 50% or less of data could be found online • Knowledge workers spend 25% of their time engaged in search-related activities.
  13. 13. WHY NOT JUST USE GOOGLE • We don’t want 44 million results, we want 1 • We want “the” answer, not “an” answer • We tolerate inefficiency in Internet search • We are happy to “satisfice” • As John Allen Paulos puts it: “The Internet is the world’s largest library. It’s just that all the books are on the floor.”
  16. 16. CONTENT INDEXING • Creating an index by crawling the content directories, databases and other repositories using an automated process (either pushing or pulling changes) • The index is a searchable key to a collection. • In Enterprise Search, the indexing mechanism should be able to access company private data (with access privileges maintained) • Control the indexing schedule: index rapidly changing content quickly, other content more slowly. • Pushing changes to the index, rather than having the bot look for the data.
  17. 17. CONTENT INDEXING • Indexing may also support • Metadata extraction • Auto-summarization, which analyzes the collection and groups its content into categories or clusters. • Metadata in turn becomes facets that can be used to tune the query to put emphasis on that category.
  23. 23. HOW DO WE TEST?
  24. 24. HOW DO WE TEST? • Querying • Formatting • Content Indexing • Performance
  25. 25. WHO SHOULD TEST?
  26. 26. CHALLENGES • Competing business stakeholders: • Tester: When I search for “lamp shades”, I used to see these documents, now I see a different set. • Business Owner: How do I know that the new search engine is better? • User: My pet feature “search within these results” works differently. • Marketing Guy: I want to control the results so the current marketing push for toilet paper brand X always shows up at the top.
  27. 27. CHALLENGES • Stakeholders want a better search implementation, but perversely often want it to all work “the exact same way”. • Getting all the stakeholders to agree on the project vision, and on the metrics, is a challenge.
  28. 28. PERFECT SEARCH TESTER WOULD BE ALL OF • Mathematician • Business Analyst • Librarian • Systems Engineer • UX Expert • Geographer! • Writer • Psychologist • Programmer
  29. 29. KNOWLEDGE TRANSFER • If you don’t have the perfect team already, bring in experts and do domain knowledge transfer. • Learn the vocabulary of search to communicate better together • “auto complete” vs “auto suggest” • Do “Search for Content Team” brownbag sessions!
  30. 30. QUERY TESTING • Often called “relevancy testing”
  31. 31. TWO SCHOOLS OF THOUGHT • “One True Answer” • “I know it when I see it”
  32. 32. “ONE TRUE ANSWER” • Absolute Truth / Matrix / Grid / TREC / Relevancy Assertions • The correct answers for each search are known ahead of time • Human judges often decide these correct answers, stored as Relevancy Assertions • Can be labor intensive to set up • A “Numerical Grade” is produced for comparison
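A minimal sketch of how relevancy assertions might be stored and turned into a numerical grade. Everything here is hypothetical and invented for illustration: the `ASSERTIONS` data, the `fake_search` stand-in for the engine under test, and the `grade` function. A real suite would call the actual engine and would likely use a richer scoring rule.

```python
# Hypothetical relevancy assertions: for each query, the document IDs that
# human judges decided should appear in the top results.
ASSERTIONS = {
    "lamp shades": {"doc-12", "doc-40"},
    "table lamps": {"doc-7"},
}


def fake_search(query):
    """Stand-in for the engine under test; returns ranked document IDs."""
    canned = {
        "lamp shades": ["doc-12", "doc-99", "doc-40"],
        "table lamps": ["doc-3", "doc-5", "doc-8"],
    }
    return canned.get(query, [])


def grade(search, assertions, top_n=10):
    """Return the fraction of asserted documents found in the top_n results."""
    expected = 0
    found = 0
    for query, must_have in assertions.items():
        top = set(search(query)[:top_n])
        expected += len(must_have)
        found += len(must_have & top)
    return found / expected if expected else 0.0


score = grade(fake_search, ASSERTIONS)
print(f"grade: {score:.2f}")
```

Running the same assertions against two engine versions gives two grades that can be compared directly, which is the appeal of this school of thought.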
  33. 33. PROBLEMS WITH THIS APPROACH • Open to gaming. TREC competition is swamped by “academic” search engine efforts that don’t work in the real world. • Need a well-understood data set with generally accepted answers. • Is it better to have an engine that gives modestly relevant results almost all the time, or an engine that gives really good answers sometimes, better on average than the other engine, but sometimes gives back complete garbage?
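The closing question on this slide can be made concrete with a little arithmetic. The per-query scores below are made up for illustration (0.0 is garbage, 1.0 is a perfect answer); the sketch only shows that a single average can hide the worst case.

```python
from statistics import mean

# Hypothetical per-query relevancy scores for the same six test searches.
steady_engine = [0.6, 0.6, 0.5, 0.6, 0.6, 0.6]  # modestly relevant, almost always
spiky_engine = [1.0, 1.0, 0.0, 1.0, 0.9, 0.0]   # great sometimes, garbage sometimes

for name, scores in [("steady", steady_engine), ("spiky", spiky_engine)]:
    print(f"{name}: mean={mean(scores):.2f} worst={min(scores):.2f}")
```

Reporting the worst case (or the number of queries below some threshold) next to the mean lets stakeholders decide which trade-off they actually want.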
  34. 34. A/B TESTING • Engine version 1 and version 2! • Tracks explicit or implicit preferences between engines A and B • Often dispenses with the notion of the "correct" answer • Can be easier to set up, but some fear the best answers will be missed by both engines
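A minimal sketch of tallying A/B preferences, assuming each tester (or each logged session) yields one vote of "A", "B" or "tie". The vote list and the `tally_preferences` helper are hypothetical; how votes are collected, explicitly or implicitly, is left out.

```python
from collections import Counter


def tally_preferences(votes):
    """Count votes and return (counts, share of decided votes won by engine A)."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    share_a = counts["A"] / decided if decided else 0.0
    return counts, share_a


# Hypothetical votes from eight side-by-side comparisons.
votes = ["A", "B", "B", "tie", "B", "A", "B", "tie"]
counts, share_a = tally_preferences(votes)
print(dict(counts), f"A preferred in {share_a:.0%} of decided comparisons")
```

Note that nothing here says whether either engine returned the best possible answer, which is exactly the fear mentioned on the slide.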
  35. 35. RELEVANCY • Do we have any defined relevancy metrics? • Relevancy is like porn.....
  36. 36. I KNOW IT WHEN I SEE IT!
  37. 37. BEYOND PRECISION AND RECALL: HOW ENGINES ARE • Binary vs. Non-Binary Grading Systems • Early TREC had binary judgements, only Yes/No on whether each doc was related to a test search • More choices were later added • A system can use letter grades (A, B, C, D and F) or numeric grades • Another style asks testers to sort documents in their preferred order
  38. 38. CLASSIC MEASUREMENTS OF SEARCH RELEVANCY • Recall: "Did I find all the documents I expected to get back? What percent?" • Precision: "Did the system bring back other documents that weren’t relevant? What percent were on target?"
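The two questions above translate directly into two ratios. This is a small sketch with made-up document IDs; `precision_recall` is an illustrative helper, not part of any particular search library.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    returned = set(returned)
    relevant = set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returned four documents, three were expected.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four returned documents are on target, and two of the three expected documents were found.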
  39. 39. NEWER IDEAS • Rank: The order of documents that were returned • Generally, 1 match in 20 that sits in the #1 spot is better than a 50% match rate where all the matches are on the second page.
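One common way to reward position is reciprocal rank (1 divided by the position of the first relevant result). The slide does not name a specific metric, so treat this as one possible choice. The sketch recreates the slide's comparison with made-up results, assuming 10 results per page.

```python
def reciprocal_rank(results, relevant):
    """Return 1/position of the first relevant result, or 0.0 if none is found."""
    for position, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def precision(results, relevant):
    """Fraction of returned results that are relevant."""
    if not results:
        return 0.0
    return sum(1 for doc in results if doc in relevant) / len(results)


relevant = {"best"} | {f"good-{i}" for i in range(10)}

# Engine A: 1 match in 20, but it is in the #1 spot.
engine_a = ["best"] + [f"junk-{i}" for i in range(19)]
# Engine B: 50% match rate, but every match is on the second page.
engine_b = [f"junk-{i}" for i in range(10)] + [f"good-{i}" for i in range(10)]

for name, results in [("A", engine_a), ("B", engine_b)]:
    print(name, precision(results, relevant), reciprocal_rank(results, relevant))
```

Engine B wins on precision, while Engine A wins on reciprocal rank, matching the intuition that the first result matters most.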
  40. 40. INTERACTIVITY: WHAT NAVIGATORS OR VISUALIZATION WERE GIVEN • Facets and sorting: Clickable filters and sort options • Unsupervised Clustering: Related terms or phrases, or related searches • Spelling and thesaurus suggestions
  41. 41. SUBJECT DISAMBIGUATION, SENTIMENT, CONFLICTING INFORMATION, CROWD HINTS • kidney bean or kidney cell • "best football team in the UK"
  43. 43. DIFFERENT GOALS • Perfect/Human vs. Best vs. Acceptable vs. Better than X • Constrained vs. Unconstrained Resources (time, CPU, storage)
  44. 44. SAMPLE SIZE • Amount of Data • Fixed set or growing over time • Number of Testers (A/B or Relevancy Judgments) • Number of Searches
  45. 45. VERTICAL VS. HORIZONTAL CONTENT • One extreme: Specific demo may cover just one discipline, for example Medical Journals • Other extreme: Internet covers vastly disparate domains
  46. 46. USERS • Experienced vs. New Searcher • Subject Expert vs. Novice • Spelling, typing and computer proficiency • Interface Medium (large visual display, small text display, audible, Braille, etc) • Amount of Effort to understand Search • Willingness to Iterate • Searching for specific answer vs. General Exploration
  47. 47. TYPE OF SEARCHES • Length / 1 or 2 words • Full question • Sample text • Internet Boolean • Advanced Boolean / Syntax / Proximity • Wildcard, Regex, etc.
  48. 48. PUNCTUATION • Chemical • Source Code • Units of Measure • Literal vs. Search Operator
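A sketch of the "literal vs. search operator" problem, assuming a Lucene-style query syntax in which characters such as `+`, `(` and `/` are operators. The character set and the `escape_literal` helper are assumptions for illustration; check the syntax of the engine you are actually testing.

```python
# Characters treated as operators in a Lucene-style query syntax (an assumption).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_literal(text):
    """Backslash-escape every operator character so it is searched literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


# Test inputs in the spirit of the slide: source code, chemistry, units of measure.
for raw in ["C++", "H2O (liquid)", "if (a && b)", "5 m/s"]:
    print(f"{raw!r:18} -> {escape_literal(raw)!r}")
```

Useful test cases compare the results for the raw and the escaped form of the same input: if they differ, the engine is interpreting the punctuation as an operator.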
  49. 49. NOT EVEN GETTING INTO MULTILINGUAL SEARCH • How do I test in languages I don’t understand?
  51. 51. FORMATTING TESTING • Directly builds on most of our existing test skills.
  53. 53. Persona 1: Going to be a mom
  • “Oh my God, I’m actually Pregnant!”
  • Needy
  • What’s next? What am I supposed to do? Guidance please!
  • To interact with people going through the same thing.
  • Scenarios that typify - planned to get pregnant, but hasn’t done any research
  • Catch phrases - Nervous but excited, giddy, “Where do I start?”
  • Tag lines - Wants to share, has a million questions
  • Likely to say - Guide me, help me get off to a good start
  • Narrative (Self Introduction): “Hi all, I’m very new to this but I couldn’t help but share my excitement. I have just found out today that I am pregnant. It wasn’t planned, me and my partner of a year and a half were going to wait until we had our own place and were married first but it looks like we have done it the other way round. My only concern is that I don’t really know how my boyfriend feels about it. I know we need to discuss the options but I have really already made up my mind about what I want to do. There is so much to consider, money, a decent place to live, being ready, but I know I am ready and have been for a long time (I get extremely broody when I see my friends’ kids). Should I just tell him how I feel or go with how he feels because I don’t want to lose him. He is a loving partner who would stand by me through anything, I just don’t want him to feel like I am tying him down!!! I suppose I am feeling very happy but also very confused at the same time!!!”
  • http:// ... my-god-i-m-pregnant.html
  54. 54. Persona 2: New Mom
  • “Are my kids sick or is this condition normal? How do I…?”
  • Demanding
  • How do I ensure my baby is latching on correctly?
  • What type of stroller should I buy? What brand of car seat is best?
  • Scenarios that Typify
  • Likely to say - Are my kids sick or is this condition normal?
  • Describes herself - wants to be a good mother, looking for expert advice, wants to get ideas from other moms
  • Narrative - could be working mom, could be stay at home mom
  • Questions likely to ask - sometimes wants to ask questions/get expert advice
  • “This picture captures my life perfectly: an adult beverage sitting on a book about underpants.”
  • Narrative: “I have been hearing about women who claim that their 2, 2 1/2 or 3 year old is not ready for the potty. They claim it’s a nightmare and are waiting for their children to come around. Maybe I grew up in the twilight zone, but I had always assumed that potty training was something that is just done. It’s done when: a) The child in question can sleep through the night and stay dry. b) The child in question can speak to you, in full sentences, like "apple juice, please" or "wanna go to the park" or "momma I wanna hold you..." c) The child in question knows they are soiled and can ask to be changed. Barring any of those things, a child is ready to be placed on the potty. Using the potty was never negotiable in my family. When we hit the above milestones my mother trained us. We just did it. If we complained she never put diapers on us, she just kept directing us back to the potty. Her methods of redirection may be controversial (she told my brother that unless he was a big boy he would not get a happy meal. Boys who pooed on themselves got sad meals... lol!!! He straightened up and started using the potty at 2 1/2) but she was never abusive or anything, she just DIDN’T ASK US. It was time to potty and that was it. The reasoning was that I used to drink from a bottle, and sleep with my mother, and such, now I don’t. I also used to crap my pants, and that is no longer allowed after a certain point. My question is this: why ask children if they are ready to use the potty, after they are clearly ready to use it (with language tools and bladder control)? Why is it treated like something that is negotiable or that the child has a choice of either coming around to it or not? I understand that children are sensitive and you have to follow their lead, at times. But allowing them to shit
  55. 55. Scenario 1: Find old answer
  • “I know I went through this before with my first child, but cannot recall the answer”
  • Preamble: Experienced mom has a déjà vu moment about a previous problematic experience with her first child. She has a partial recollection of a piece of information related to the answer she seeks but she needs help in pulling
  • Success Factors • Speed of Comprehension • Directness to destination • Reduced: • Number of queries • Number of results • Indirect Knowledge Transfer
  • Thinking aloud in the Family Room
  • 1) “wwwaaaaaaaa wwwwaaaaaa . . . ggg wwwaaaaaaaa . . .” “Hhhm, I know I had the same issue with Josh, but what the heck did I do?” After hours of frustration, mother home alone has a partial epiphany as to her child’s problem.
  • 2) “Josh had not started to cry non-stop for 3 hours when it finally dawned on me that he had not had a movement for 3 days . . . Let’s try querying that . . . “no poop” . . . Not likely . . . Uumm . . . “constipation”? Oh, might help to specify who as well . . . “baby” . . .” Mother starts to type in query but suggest-as-you-type search box hints to her to be more specific.
  • 3) “Very nice, lists out related concerns for constipation. Let’s see: ‘symptoms’, ‘cures’, ‘when to call the doctor’, ‘what other moms are saying’, ‘topic overview’. Ok, I’ll take ‘cures’ Alex for 300 points and my personal sanity! Water . . . fruit juice . . . high-fiber baby foods . . . Ahhh prune juice . . . prune juice! Now why didn’t I remember that!” Structured results quickly tip off the mother to the assorted aspects of constipation. She focuses in on one of the aspects and has total recollection of her previous experience.
  56. 56. Scenario 2: Urgent Question
  • “It’s 2am and I don’t know who to ask?”
  • Preamble: Mother of twins finds herself panicked in the early morning hours with a new situation.
  • Success Factors • Speed of Comprehension • Directness to answer
  • Crying in the Kitchen
  • 1) “wwwaaaaaaaa wwwwaaaaaa . . . ggg wwwaaaaaaaaa . . .” “Crap! Who am I supposed to [ask] at this hour! Why is it nobody is open when I need them?!” In the middle of the night, a mother of twins finds herself alone, overwhelmed, and in dire need of an answer.
  • 2) “I don’t have to read hundreds of pages on the internet . . . I just need a quick concise answer . . . at what temperature do I need to be worried . . ?! Please [BabyCenter] show me the answer . . !” Mother starts to type in a query but notices the suggest-as-you-type search box lets her narrow her question, boosting her confidence she is going to get the answer she needs.
  • 3) “‘102’ . . . thank god! We’re safe. Ahhh . . . that’s helpful, other conditions to know about . . . That’s thorough: ‘What will the doctor do?’ Interesting: ‘If fever is a defense against infection, is it really a good idea to try to bring it down?’ Let me bookmark this for later.” The mother zooms in on the specific answer she seeks. But then she notices collateral knowledge she takes note of for later reading.
  57. 57. CONTENT INDEXING TESTING • Leverages our normal testing skills, and typically what it really means is “Performance Testing”. • Lots of “integration” testing.
  59. 59. LEVELS OF SCALING • Scale High • You quickly hit a point of diminishing returns! • Scale Wide • The safety valve for lots of load. • Scale Deep • Scaling Deep? You are doing some crazy stuff with huge indexes!!
  60. 60. SCALE WIDE (SLAVES) • Too many inbound queries! • Slaves poll master for changes • Index and config files transferred • ALL JAVA!
  61. 61. SCALE WIDE (SHARDING) • Too large of an index to query • Split index over multiple Search servers • A -> M: Server 1, N -> Z: Server 2 • uniqueId.hash % numServers • Relevancy: typically balanced shards • Request split across shards, results aggregated to single response
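A toy sketch of the sharding idea above: pick a shard with `hash % numServers`, send the query to every shard, and merge the partial results into one response. All names here (`shard_for`, `index_documents`, `search_shard`, `search_all`) and the term-count scoring are hypothetical; `zlib.crc32` is used as the hash only because it is stable across runs, which is an assumption and not what any particular search server does.

```python
import heapq
import zlib


def shard_for(unique_id, num_servers):
    """Pick a shard with hash % num_servers, using a hash that is stable across runs."""
    return zlib.crc32(unique_id.encode("utf-8")) % num_servers


def index_documents(docs, num_servers):
    """Distribute {doc_id: text} over num_servers in-memory shards."""
    shards = [dict() for _ in range(num_servers)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_servers)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Toy scoring: count occurrences of the term; returns (score, doc_id) pairs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def search_all(shards, term, rows=10):
    """Send the request to every shard and aggregate into a single ranked response."""
    partial = [hit for shard in shards for hit in search_shard(shard, term)]
    return heapq.nlargest(rows, partial)


docs = {"doc-a": "lamp shade", "doc-b": "lamp lamp table", "doc-c": "toilet paper"}
shards = index_documents(docs, num_servers=2)
print(search_all(shards, "lamp"))
```

A useful test is that the merged response is the same whatever the number of shards. With real engines that is not guaranteed, because term statistics can differ per shard, which is one reason sharded relevancy needs its own tests.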
  62. 62. SCALE DEEP • Combine scaling wide, to handle the number of queries, with sharding, to handle the size of indexes!
  63. 63. WRAP UP
  68. 68. THANK YOU! • twitter: dep4b