
CrowdFlower - Best Practices for Building a World-Class Search Engine



Table of Contents

1. Introduction
2. Designing an intuitive search engine
3. Understanding and optimizing results
4. Improving your search relevance algorithm
5. How CrowdFlower can help
6. Conclusion
INTRODUCTION

Internal search has a direct and tangible impact on revenue. Put simply, users who search are more likely to convert than users who only browse. This is especially true for ecommerce companies, as searchers on those sites are many times more likely to purchase than browsers.

It follows, then, that ecommerce sites would spend a lot of institutional capital to create the most accurate, most intuitive, and most powerful search experience they can. For the biggest companies (sites like Amazon, eBay, and Walmart) this is a gigantic undertaking. They have massive, fluid product catalogs and, since they sell such an incredible breadth of goods, their algorithms need to be exceptional. A user can buy the same book in many places; they're likely to keep coming back if finding it at a competitive price is simple and painless.

In this How To Guide, we'll take lessons and best practices from some of the biggest ecommerce sites in the world and teach you how to hone your search experience. Here's what we'll cover:

1. Designing an intuitive search interface: More than just a search box, a great search interface includes autocorrect, faceting, quality product images, filterable results pages, and more.
2. Understanding and optimizing results: We'll cover the most common metrics data scientists use to grade and test their search algorithms, namely click data, per-result relevance, and whole page relevance.
3. Improving your search relevance algorithm: Learn how to score and test your search algorithms.
4. How CrowdFlower can help: We'll explain how crowd-based approaches can score, test, and improve your search relevance, regardless of the metric you use. We'll present real jobs and real use cases to show you how.

Let's start with your search interface.
"Shoppers who use site search on ecommerce sites convert at two to three times the rate of those who don't use it." (MarketingSherpa)

"Walmart has seen a 10% to 15% increase in shoppers completing purchases for products found via search queries entered on its website." (InformationWeek)
DESIGNING AN INTUITIVE SEARCH INTERFACE

Essentially every website has search functionality, but when we're talking about your search interface, it's important to note that we're not simply referring to your site's search box. Instead, we're talking about everything from that box to the facets, categories, autocomplete, autosuggest, and all other functionality that helps users navigate to the products or pages they want to see. Here are six essential features you should include in your front-end search experience.

Autocomplete

The autocomplete feature in a search box provides users query suggestions as they start typing their query. PrinterLand, a leading printer retailer in the UK, observed that customers who land on an autocomplete page suggestion are six times more likely to convert than those who don't. And although most ecommerce websites today include autocomplete, some include messy data like duplicates or suggestions that have no products to display.

Amazon's autocomplete is an industry leader. It not only autocompletes popular search queries, but actually includes categories, so that broader searches can be narrowed before users even hit the results page. "Black shoes" on its own isn't the most specific search term and, because of that, a less advanced search interface might surface some women's shoes, some toddler's shoes, some men's dress shoes, and so on. Because Amazon populates a drop-down with autocompletes and suggested categories, users can navigate to more specific results without having to pare down their search on the results page itself.
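As a sketch of the basic idea (not Amazon's actual system), the simplest form of autocomplete is a prefix index over past queries, ranked by popularity; the query log below is illustrative:

```python
from collections import defaultdict

def build_autocomplete(query_log):
    """Index past queries under every prefix, keeping popularity counts."""
    index = defaultdict(lambda: defaultdict(int))
    for query in query_log:
        q = query.lower().strip()
        for i in range(1, len(q) + 1):
            index[q[:i]][q] += 1
    return index

def suggest(index, prefix, limit=5):
    """Return the most popular completions for a typed prefix."""
    candidates = index.get(prefix.lower(), {})
    return sorted(candidates, key=candidates.get, reverse=True)[:limit]

log = ["black shoes", "black shoes", "black sneakers", "blender"]
idx = build_autocomplete(log)
print(suggest(idx, "bla"))  # ['black shoes', 'black sneakers']
```

A production system would also need to drop duplicates and suppress suggestions with no matching products, which is exactly the messy data the paragraph above warns about.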
Spellcheck

Of course, some users will either type through an autocomplete or they might be searching for something your autocomplete algorithm doesn't totally understand. This leads to misspellings, especially when searchers are typing brand names. In these cases, it's important to suggest product names and model numbers that may not exactly match the user's query, but are similar. The idea is similar to the one we discussed with autocompletion: you're removing another click or another couple of steps from the user's experience.

For example, take a look at the screenshot from Walmart. Their site is smart enough to recognize that a user typing "sandisc" actually means "SanDisk" and populates results based on that understanding. There's also a link to click if, for some reason, that misspelling happened to be accurate.

Related Searches

Even for searchers who spell queries correctly, there's a chance they might use terminology that doesn't exactly match the terms in your company's product catalog. For example, if a user types in "suitcase" instead of "luggage" and your database doesn't account for synonyms, it's possible that the user might be shown a blank search results page. According to Baymard, an ecommerce usability research firm, some "70% of ecommerce websites require users to search by the exact jargon for the product type that the website uses." That's a problem you'll have to solve for.

eBay does this really well. In the screenshot below, a user who types "suitcase" is shown not only multiple suitcase-related terms (like "suitcase set" and "vintage suitcase") but synonyms as well ("luggage" and "travel bag").
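Both ideas can be sketched with Python's standard library; the catalog terms and synonym groups below are illustrative samples, not a real catalog:

```python
import difflib

# Hand-curated synonym groups (illustrative) mapping user vocabulary
# to the catalog's own jargon
SYNONYMS = {"suitcase": ["luggage", "travel bag", "suitcase set"]}

# Known brand/product terms from the catalog (illustrative sample)
CATALOG_TERMS = ["sandisk", "samsung", "seagate", "suitcase", "luggage"]

def correct_spelling(query, cutoff=0.8):
    """Suggest the closest catalog term for a likely misspelling."""
    matches = difflib.get_close_matches(query.lower(), CATALOG_TERMS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else query

def expand_query(query):
    """Add related terms so exact-jargon mismatches don't yield
    an empty results page."""
    return [query] + SYNONYMS.get(query.lower(), [])

print(correct_spelling("sandisc"))  # 'sandisk'
print(expand_query("suitcase"))
# ['suitcase', 'luggage', 'travel bag', 'suitcase set']
```

Production spellcheckers usually weight candidates by query popularity as well as edit distance, but fuzzy string matching is the core mechanism.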
Mapping these categories together often requires some manual work, but it's well worth it. Setting rules so that your algorithm surfaces these sorts of related searches keeps users from re-typing to fit your particular nomenclature, or from simply bouncing from your site altogether to a competitor that understands what they want in the first place.

Faceting

Facets are the filtering options you'll often see on the left side of a search results page. They let users better refine their results without re-searching, surfacing similar products, color choices, add-ons, and more. This gives buyers additional choices, showcases more of your product catalog at a glance, and provides searchers with all the options they need to stop looking and start buying.

Still, less than half of ecommerce sites use faceting. It's often a bigger engineering ask than most of the other best practices on this list, and it requires a really well-organized taxonomy to function properly. For example, your products should have color data if you want to sort by color, or brand data to sort by brand. And while faceting alone might not be the reason to populate your taxonomy with this information, having all that data for all your products will give you many more weights to test your algorithm with in the future.
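Once products carry attribute data, computing the facet counts shown beside a results page is straightforward. A minimal sketch, assuming a hypothetical result set with color and brand attributes:

```python
from collections import Counter

# Hypothetical result set with attribute data per product
results = [
    {"name": "Oxford", "color": "black", "brand": "Nike"},
    {"name": "Runner", "color": "black", "brand": "Adidas"},
    {"name": "Loafer", "color": "brown", "brand": "Nike"},
]

def facet_counts(results, fields):
    """Count how many results carry each value of each facetable attribute."""
    return {f: Counter(r[f] for r in results if f in r) for f in fields}

print(facet_counts(results, ["color", "brand"]))
# {'color': Counter({'black': 2, 'brown': 1}),
#  'brand': Counter({'Nike': 2, 'Adidas': 1})}
```

Search engines compute these counts over the full matching set, not just the visible page, but the aggregation itself is this simple; the hard part is the well-organized taxonomy the text describes.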
Images

Product images are a necessity for any ecommerce site. According to an Internet Retail Conference Exhibition (IRCE) report, 75% of consumers cited image quality as a critical or very important feature on ecommerce sites. The ability to see color choices and alternate views was cited by 68% and 66%, respectively. Larger pictures, higher quality pictures, and multiple angles on products simply drive more sales.

Take a look at the screenshot from Amazon above. Note that even for an item as simple as a black shoe, there are five hi-res options to choose from. Additionally, there are multiple color choices to choose from on the page itself. This mirrors the experience at a brick-and-mortar store about as well as possible online and gives users additional options without further searching.

Snippet Extraction

Snippets are the blocks of information listed under or to the side of each product on a search results page. They contain data like price, ratings, reviews, product summaries, colors, size, and more. The screenshot from Target does a nice job of illustrating product snippets: users can see at a glance the price, the sizes, the available colors, the brand, and relevant reviews.

All this data should be present in your existing database but, much like with faceting above, if it doesn't exist (or if your database is fairly messy), consider building out that taxonomy so your front-end experience can provide the right information to prospective buyers. Conversely, bad snippets can confuse users and drastically increase bounce rate.
UNDERSTANDING AND OPTIMIZING RESULTS

The other half of the search experience is, of course, relevant results. After all, the best interface in the world isn't going to drive conversions if the results your algorithm surfaces are misleading, jumbled, or inexact. In this section, we'll discuss the best practices we've found our customers use to iterate on their search experience, some metrics you should consider when improving and testing your algorithms, and a few counter-intuitive lessons we've learned from real-world use cases.

As most data scientists and search architects know, the process of improving search relevance is never truly done. Rather, it's an ongoing exercise. But a great search experience is absolutely worth the effort. We noted this above, but it's important to reiterate that shoppers who use site search on ecommerce domains are two to three times more likely to purchase than those who don't. Which is to say: browsers browse and searchers buy.

Past that, it's not simply about returning a great page of results (though that, of course, is important), but rather making sure the best results are as high on the page as you can get them. That's because clicks aren't evenly distributed across all your results. In fact, the first result on your page will usually get twice the clicks of your second, which gets twice the clicks of your third, and so on down the page (though it's worth noting that the final result, which generally sits before the "Next" button or the pagination links, does get more interaction than the few before it). You can see the average click distribution in the chart to the right.

Before you start really digging in, it's vital that you know exactly what metrics your site should optimize for. For example, are you worried about certain kinds of products over others? Are you concerned with pages on a holistic level? Do you want to account for seasonality? Surface more expensive products? The list goes on and on.
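The click falloff described above (each result drawing roughly half the clicks of the one above it) can be sketched as a normalized geometric model; the decay rate here is an illustrative assumption, not a measured constant:

```python
def expected_click_share(n_results=10, decay=0.5):
    """Model the described falloff: each position gets `decay` times
    the clicks of the one above it, normalized to sum to 1."""
    raw = [decay ** i for i in range(n_results)]
    total = sum(raw)
    return [w / total for w in raw]

shares = expected_click_share()
print(round(shares[0] / shares[1], 1))  # 2.0: first result gets twice the second
```

Real distributions vary by site and by query (and this simple model ignores the bump at the last visible result), so measure your own click logs before relying on anything like it.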
Popular Search Relevance Metrics Worth Optimizing

The reason you want to consider this up front is fairly practical: you'll need a particular metric to show to your CEO or other stakeholders to prove that your relevance tuning is making tangible improvements. This will keep you from having to drop everything for ad-hoc requests. For example, say your CEO's son searches for "black iPad" on your website and half the results are for other black tablets in addition to the iPad. Your CEO might ask why your search experience is returning "bad" results when in fact you'd optimized your search for product categories (tablets) and colors before showing white
iPads. He might ask you to redo your strategy based on that single query. Reacting to something like that can be disastrous long-term. Decide which metrics are the most important for your site and stick to them. You'll have clear goals and a clear number to optimize toward. Here are three commonly used metrics that smart ecommerce companies use to decide if their search relevance efforts are bearing fruit.

Metric 1: Click Data

Click data is simply what your users click on after they make certain searches. In other words, if a search for "black shoes" consistently results in searchers clicking on a certain link, you can assume that link is a quality result for that query.

Click data is one of the easier ways to fine-tune your search relevance algorithm. It's cheap to collect a ton of it, because all you're doing is keeping track of what users do on your site. Which is to say that while some of the metrics below require human scoring, click data simply doesn't.

On the other hand, click data can sometimes be difficult to interpret. In fact, sometimes what users click on might not be what they're really looking to buy. Let's recall that example above. Say a user searches for "black shoes" and is presented a results page with eight pairs of black shoes, one pair of white shoes, and a risqué picture of a black woman's high heel. The white shoes might be especially stylish, might stand out on the page, and thus might get more clicks than you'd expect. On the other hand, you might have people clicking on that risqué picture and getting sidetracked. Meanwhile, your more accurate results (the eight pairs of black shoes) seem like less accurate results, even though they aren't.

Speaking of click data sometimes being misleading, consider the "next" button. We had a big ecommerce company that believed that when users clicked the "next" button at the bottom of a search page, it was actually a good thing.
This might seem counter-intuitive, as it suggests the user didn't find what they were looking for. But looking at it a different way, you could also argue that clicking a "next" button meant the user actually did think the results were high quality; they just didn't see the exact pair of, say, black shoes they wanted to buy, so they clicked next until they found that perfect match.

Metric 2: Per Result Relevance

Per result relevance, on the other hand, requires human scoring. Generally, it works like this: First, pull a random sample of search queries from your logs (say, 500). Then, pull some of the top results that appear for each query (say, for example, ten). That'd give you somewhere
around five thousand query-result pairings, depending on how many results you decide to analyze. You'd then take that set of pairings and score them on a relevance scale of your choosing (say, a four-point scale).

(Before we go much further, we want to stress that it's important the queries you select are in fact random. You want to test your algorithm against misspellings and the sort of long-tail searches that make up the bulk of searches on most sites. It's not terrible to cherry-pick a search here or there, especially if those items make up a sizable percentage of your queries or if your site makes an inordinate amount of its profit on certain kinds of search, but the vast majority should absolutely be random.)

How you score the query-result pairs is up to you. Some sites will do internal audits, others will hire interns or temps, and others will look to crowd-based solutions. What's important is that everyone scoring these pairs gets clear instructions. Just make sure you're asking graders to look at intent, not just plain matching. Which is to say that "black shoes" should return black shoes, not black shoe polish (which matches the words in the query decently but the intent poorly). Once each pairing is scored, you should have a good idea of where your search is performing and where it isn't.

The most common scoring technique? Discounted cumulative gain, or DCG. DCG aims to get you a single number that explains how accurate results are for a certain query. There's no upper limit (the math is partially based on how many levels you allow in your relevance scale above), but higher numbers are better than low ones. The way it works is best explained with examples.

Let's go back to black shoes. When you're gathering queries and search pairings, you should also be collecting the position (rank) of each result. So with black shoes, you'll have results on the page from, say, one through five.
DCG takes the position and human-scored relevance number for each pairing and combines them, so you're left with a single, easy-to-understand metric. If all the results for black shoes are highly relevant, that score will be high, which indicates your algorithm is performing well on that search. If the results for black shoes return non-shoes or red shoes or some other inaccurate result, the score will be low, which of course indicates the opposite. (Note: at the end of this How To Guide, we'll include a link to an Excel-based DCG calculator, complete with instructions.)

[Chart: Typical DCG weights, with search rank on the X axis]
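As a sketch, the DCG computation described here fits in a few lines. This uses the common rel / log2(position + 1) discount; some teams instead use a (2^rel - 1) numerator to reward highly relevant results more strongly, and the graded pages below are invented for illustration:

```python
import math

def dcg(relevance_scores):
    """Discounted cumulative gain: each result's graded relevance is
    discounted by the log of its position (1-indexed)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevance_scores, start=1))

# Human-graded relevance (4-point scale) for "black shoes", positions 1-5
good_page = [4, 4, 3, 4, 2]   # strong results near the top
weak_page = [1, 2, 1, 1, 4]   # best result buried at position 5

print(round(dcg(good_page), 2), round(dcg(weak_page), 2))
# the good page scores noticeably higher
```

Because the discount grows with position, the same set of grades scores lower when the best results sit further down the page, which is exactly the behavior the text describes.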
How to score per result relevance

There are a couple of decisions you'll need to make if you're looking at optimizing around per result relevance. First, you'll want to decide on a point scale which the people grading query-result pairings will use to score each pair. Many of our most successful customers use a four-point scale that looks something like this (examples based on "black Nike shoes" as a query):

1 - Off-Topic: The result is irrelevant and intent was not matched. Example: black gloves, black shoe polish, or something even less relevant.
2 - Acceptable: The result is somewhat related but poorly matched. Example: black shoes that aren't Nikes. In other words, the result matched part of the query and the intent, but didn't get very close.
3 - Good: The result is close but not perfect. Example: Nikes that are mostly, but not all, black.
4 - Excellent: The result is a precise match. Example: black Nike shoes.

Each and every query-result pairing should be graded on the same scale. Multiple judgments on each pairing get you a less subjective score in general, but multiple judgments are more important if you aren't doing this analysis in-house (in other words, if you're using temps, outsourcing, or a crowd-based approach, consider multiple judgments to normalize your scores).

The other thing you need to consider is what exactly you show your graders as a result. A query, of course, should just be text, but it's generally good practice to show scorers the page they'd land on if they clicked the particular result you're scoring (as opposed to a product title, a text description of the product, etc.). That's because this more closely mimics the experience a searcher has on your site (we'll get into scoring entire pages, which can be used in conjunction with the per result or click data metrics, in just a moment).
Again, the relevance score and the position (or rank) of multiple results can be combined to give you a single DCG score you can use to get a high-level idea of your search relevance algorithm's performance, as well as a number to try to beat during your later relevance testing.
Metric 3: Whole Page Relevance

This is another metric we frequently see ecommerce companies using to evaluate search relevance. Naturally, it has its own set of pros and cons, but it's important to note that whole page relevance is often used in tandem with one of the aforementioned metrics.

One of the major advantages of whole page relevance scoring? It accurately portrays what the search experience is on your site. After all, users who search don't simply land on a particular result; they land on a page of results. If that entire page returns quality matches, you're delivering a great experience.

Much like per result relevance, this is a human-scored metric, so you'll want to settle on a scoring system. And, much like with per-result relevance, we often see a four-point scale put to use here. For example (examples based on "black Nike shoes" as a query):

1 - Off-Topic: The results for a particular query are bad across the board; none or a small percentage of results match. Example: the page returns a host of black shoe laces, shoe polish, or shoes that are other colors, perhaps with a pair or two of black shoes mixed in.
2 - Acceptable: The results for a particular query are decent, returning somewhere around 50% matching results. Example: the page returns five pairs of black shoes, four pairs of brown shoes, and a black shoe polish kit.
3 - Good: The results for a particular query match well, but not everything on the page is applicable; anywhere between 50% and 90% of the results match intent well. Example: the page returns eight or nine pairs of black shoes and one or two pairs of gray or brown shoes.
4 - Excellent: All results for a particular query match intent. Example: the entire page of results is black shoes.

Keep in mind that judging your algorithm's accuracy based on this metric requires a good understanding of your site's result inventory.
For example, if you only sell seven pairs of black shoes, you'll top out at "good" for that query, since a typical page has ten results.
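The scale above can be sketched as a simple mapping from the fraction of intent-matching results to a score; the exact thresholds are an illustrative interpretation of the rubric, not a standard:

```python
def whole_page_score(n_matching, n_results=10):
    """Map the fraction of intent-matching results on a page to the
    4-point whole page relevance scale (thresholds are illustrative)."""
    frac = n_matching / n_results
    if frac >= 1.0:
        return 4  # Excellent: every result matches intent
    if frac > 0.5:
        return 3  # Good: most, but not all, results match
    if frac >= 0.4:
        return 2  # Acceptable: roughly half match
    return 1      # Off-topic: few or no matches

print(whole_page_score(7))   # 3: seven matching pairs caps out at "good"
print(whole_page_score(10))  # 4
```

This also makes the inventory caveat concrete: with only seven black shoes in stock, a ten-result page can never reach 4.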
Whole page relevance is a smart thing to optimize for, for a few reasons. One we've mentioned already: it most closely mirrors what your users see when they search. Another is that it can show you holes in your algorithm you didn't realize existed (certain product names being conflated with others, the accuracy of searches further down your results page). It can also help account for things like seasonality. You can ask your graders to, for example, grade pages based on searches made in the autumn or around Christmas. If a user searched "blower" in autumn, you'd want to show a page of leaf blowers; in the winter, you'd rather surface snow blowers.

That said, whole page relevance isn't perfect. You really do want the first result to be as close to the query intent as possible, and whole page relevance can't account for that. Moreover, scoring whole page relevance is a bit more subjective. Graders simply find it easier to look at query-result pairings than to critique an entire page. The cognitive load is higher, and the lines between "good" and "acceptable" or "good" and "excellent" are hazier when you're viewing an entire page.

More often than not, one or more of the above metrics (click data, per result relevance, and whole page relevance) are used to evaluate your search experience. Of course, evaluating it is just step one. Making your algorithm better is what you'll want to do next.
IMPROVING YOUR SEARCH RELEVANCE ALGORITHM

Now that you've set up an intuitive interface and found a baseline for your site's search relevance, you're ready to start tweaking the algorithm you've been working with thus far. Exactly what method you use to improve your search experience depends on the metrics you've chosen, your unique set of results, and what users do on your site, but a few general rules apply.

For starters, remember that you should be testing against the same queries. If you're using per result relevance (and we've found this is what most of our customers primarily test on), you want to test the same set of queries against a competing algorithm. In other words, if you scored your original results for "black shoes," you'll need to compare them to your new algorithm's results for "black shoes." The same applies for whole page relevance. Here are a few common ways to improve search relevance:

Hand-Selecting Better Results for Popular Queries

This is absolutely not a scalable solution, but it can be a decent quick fix, especially for products and queries you know make up an outsized percentage of your site's traffic.

Say, for example, you're aware that 15% of your site's searches are for "black shoes" but your result pairs and result pages are scoring low on your selected metric(s). You could manually tell your algorithm to surface black shoes one by one (or by category, if your database supports it). This ensures that those 15% of searches will result in quality products or pages of products for your users.

The problem with this, beyond the fact that you can't possibly match all searches on your site to results by hand, is that whenever your site gets a new item or set of items that fit that query (say, a whole line of new black shoes), you'll need to do this all over again. This is time-consuming and, on a practical level, people just forget.
You might not even be aware certain new products have come in that fit that hand-selected use case. Moreover, most searches fall in the long tail. You'll really only be accounting for a fraction of searches when you make hand-selected pairings and, if you later decide to include an additional category for each product, you'll run into even more trouble. It's a quick fix and it can pay short-term dividends, but it's not recommended.

Automatically or Manually Revising Result Weights

Every product or page in your database should have a set of traits or columns associated with it. Search software like Solr or Elasticsearch supports these attributes, and most proprietary search programs do as well. Sticking with our ongoing example, a pair of shoes may have any or all of the following:

• Clothing category
• Color
• Size
• Brand
• Price
• Additional color
• Shoelace options
• Part of a sale or deal
• Introductory price
• Etc.

By analyzing your DCG scores (or whatever other metric you're using), you should be able to make some hypotheses on which kinds of searches are less accurate than others. For example, say you've noticed via whole page relevance that most searches with a color are never scored as "excellent." You may want to increase the weight your algorithm gives to that particular attribute and test the new algorithm against your old one. You'd hope that searches like "black shoes" and "white gloves" and the like would perform better, but pay close attention to what performs worse as well. Certain searches that were previously returning excellent results could suffer. You'll need to adjust the balance of certain categories and attributes against each other to find the mix that performs best for the metrics you've been optimizing for.

Of course, you don't necessarily have to do this by hand. There are machine-learning strategies that will do this automatically and find the proper ratio of weights. These can be especially valuable if there is a ton of per-item data for each product or page on your site. The more data per item, the harder it is to find the most effective weighting across those attributes.

And while this can be a really effective way to improve your site's search relevance, keep in mind that using the same data set of queries can be a danger. You'll be optimizing based on a certain selection of random searches over and over while possibly hurting search performance elsewhere. In that case, you can always test both your original and newly improved algorithms against each other on an additional selection of random queries, results, and page pairings.
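The weight-revision loop just described can be sketched as a small grid search: rank a catalog under candidate attribute weights, score the ranking against human grades with DCG, and keep the best combination. The catalog rows, grades, and weight grid here are all invented for illustration:

```python
import math

def dcg(scores):
    """DCG over graded relevance, positions 1-indexed."""
    return sum(s / math.log2(i + 1) for i, s in enumerate(scores, start=1))

# Hypothetical catalog rows and human grades for the query "black shoes"
catalog = [
    {"id": 1, "color": "black", "category": "shoes", "grade": 4},
    {"id": 2, "color": "black", "category": "polish", "grade": 1},
    {"id": 3, "color": "brown", "category": "shoes", "grade": 2},
]

def ranked_dcg(weights):
    """Rank by weighted attribute matches against the query, then score
    the resulting order by the human grades."""
    def match_score(item):
        return (weights["color"] * (item["color"] == "black")
                + weights["category"] * (item["category"] == "shoes"))
    ranked = sorted(catalog, key=match_score, reverse=True)
    return dcg([item["grade"] for item in ranked])

# Try a small grid of weights and keep the best-scoring combination
grid = [0.5, 1.0, 2.0]
best = max(
    ({"color": c, "category": k} for c in grid for k in grid),
    key=ranked_dcg,
)
print(best, round(ranked_dcg(best), 2))
```

With real data you would evaluate over many queries (and, as the text warns, hold out a fresh random query set to avoid overfitting one sample), but the structure is the same: propose weights, re-rank, compare metric scores.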
Adding Extra Attributes or Tags to Your Library of Products or Pages

Another reason our example of "black shoes" might be returning subpar results is that you simply might not have a color attribute to weight at all. Alternatively, you may look through your search log (or certain poor performers from your original random sampling) and conclude that you're missing key
data for large swaths of products. If that's the case, you should consider adding tags or attributes to your product catalog. This can be a bit time-consuming (though solutions that run through the crowd, as we'll explain later, are both cost-effective and scalable), but the impact can be immense. Not only will you fill in holes in your existing algorithm, but you'll give it new attributes that you can weight to improve results that weren't quite performing up to par. Additionally, having new weights allows you to surface new facets on your results page. It's never a bad idea to have more good data to iterate on, after all.

You might consider preemptively adding tags instead of waiting to notice use cases where your search experience is lacking. Getting back to our seasonality example above, you might consider adding "christmas" as a tag to account for searches like "christmas gift" or "christmas lights" or something similar.

This isn't meant to be an exhaustive list of strategies to adjust and improve your search algorithm. In fact, new solutions, software, and processes are always on the horizon. The one we know most about, of course, is how people-powered search relevance tuning works.

"A bad model on good data is much better than a good model on bad data." (Lukas Biewald, CEO, CrowdFlower)
HOW CROWDFLOWER CAN HELP

Many of our most successful customers are companies like eBay, sites whose product catalog is constantly changing and who absolutely need their search results to be highly relevant. After all, product listings expire every second on eBay, so quick fixes like hand-matching are out the window. And since sellers write the listings themselves, you're more prone to see misspellings or bizarre characters in product descriptions or titles.

We'll go over some of the ways CrowdFlower can help improve your search relevance, complete with real-world use cases, explanations, and job examples below. We'll start with setting a baseline for your current algorithm.

Scoring Results for Search Relevance

Some search relevance metrics need human scoring to verify accuracy. Even if you're basing your success on click data (which can be gamed by clever titles and, again, can be prone to noise), having a human score query-result pairs can be a good level-set. But it's far more valuable if you're evaluating your search experience based on per result or whole page relevance.

To really evaluate your algorithm, you're going to want to look at thousands of random searches and their associated results. Say you choose a lower number of searches (for example, 1,000) and you want to evaluate the first ten items appearing on the results page (as commonly, each page will return ten results and many users simply won't click "next"). That alone is 10,000 data rows. You don't have time to hand-score each one. Conversely, hiring and training interns or outsourcing the task is troublesome and inefficient.

That's where the crowd comes in. By disseminating the work to a labor pool that's millions strong, you can get those 10,000 data rows scored quickly and, most importantly, accurately.
That's because you can run each row through multiple contributors and either take their most agreed-upon answer, or assign a value to each score on your relevance scale and simply average them.
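The two aggregation options just described (most agreed-upon answer, or an average over the scale) can be sketched as:

```python
from collections import Counter

def aggregate(judgments, mode="majority"):
    """Combine multiple contributors' scores for one query-result pair:
    either take the most agreed-upon answer, or average the scores."""
    if mode == "majority":
        return Counter(judgments).most_common(1)[0][0]
    return sum(judgments) / len(judgments)

scores = [4, 4, 3]  # three contributors grade the same pair
print(aggregate(scores))                       # 4 (most agreed-upon)
print(round(aggregate(scores, "average"), 2))  # 3.67
```

Majority voting gives you a clean label on your original scale; averaging preserves disagreement as a fractional score, which some teams prefer when feeding the numbers into DCG.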
Here's what a CrowdFlower contributor sees during a typical job: you present the search query and whatever you want to show as a result: a link to a page, an image of the product, the title of the product, or anything else. You can also include a generic search button so members of the crowd can look up the query if it's unfamiliar to them (something we generally recommend).

It's here we should note an important part of our process: quiz mode. Before you launch your relevance scoring job, you'll make what are called "test questions." Test questions are query-result pairings that you score yourself, against which you judge a contributor's understanding. For example, if your search query is "black shoes" and your result is actually a pair of black shoes, contributors who mark relevance scores you don't agree with will be flagged. If they fall under an accuracy threshold that you set, they won't be allowed to enter your job.

Further, we surface test questions randomly within each page of their workflow. Those test questions are not repeated, and if a contributor who passed quiz mode later falls below your accuracy threshold, they are ejected from your job and the units they've previously judged are put back into the pool of data rows available for judgment.

Once the crowd finishes scoring the relevance of those thousand queries, you'll have the data you need to make smart decisions about your algorithm. We even provide a tailor-made DCG calculator that works with our jobs to give you a high-level look at your site's search relevance.
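The accuracy-threshold gating described above can be sketched as follows; the threshold value, question IDs, and gold answers are illustrative, not CrowdFlower's actual implementation:

```python
def passes_threshold(answers, gold, threshold=0.8):
    """Check a contributor's test-question accuracy against hidden
    gold answers; below the threshold they're removed from the job."""
    correct = sum(answers.get(q) == a for q, a in gold.items())
    return correct / len(gold) >= threshold

# Job owner's own scores for five hidden test questions (4-point scale)
gold = {"q1": 4, "q2": 1, "q3": 3, "q4": 2, "q5": 4}

# One contributor's answers: four of five agree with gold
contributor = {"q1": 4, "q2": 1, "q3": 3, "q4": 2, "q5": 1}
print(passes_threshold(contributor, gold))  # True: 4/5 = 80% accuracy
```

In practice the check runs continuously as new test questions are seen, so a contributor's status can change mid-job, which is why previously judged rows go back into the pool.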
  19. We recommend that you score your current search algorithm with the crowd, make changes based on the metrics that are right for you and your site, then test the query-result pairings the new algorithm produces on the same random set of queries against your old one. You'll be able to understand, at a glance, if your new algorithm is an improvement or if you should make further changes. Additional Tagging Say your query for "black shoes" is doing well, but your algorithm doesn't return good results for something like the year those shoes were released, their Global Trade Item Number (GTIN), or any other attribute you've noticed in your search logs, because you simply never tagged those products with those characteristics. This is another occasion where the crowd can be really helpful. An example GTIN You can surface part or the entirety of your product catalog to the crowd and ask them to tag items for any use case. Say you wanted to include seasonal information. Just ask the crowd if a certain item is associated with spring or Halloween. Say you want to include additional product information, like available sizes. Just present a listing and have the crowd select which sizes the item comes in. These additional tags give you more options when you're improving your search algorithm. Whether you're adjusting weights manually or relying on a more automated, machine-learning approach, having additional categories to balance against each other can be highly valuable. And since CrowdFlower's contributors can run through thousands of these tagging assignments each hour, you can populate your entire product database with new tags quickly and efficiently, whenever you need to. Data Cleaning Product databases get messy.
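That old-versus-new comparison can be as simple as averaging the crowd's scores per algorithm over the same query set and counting how many queries improved. A toy sketch, where the queries and the per-query mean scores (on a 1-4 scale) are entirely invented:

```python
from statistics import mean

# Hypothetical mean crowd scores for the same random queries, judged
# against the old and the new algorithm's top-ten results.
old_scores = {"black shoes": 2.7, "red dress": 3.1, "gtin 00012345": 1.4}
new_scores = {"black shoes": 3.2, "red dress": 3.0, "gtin 00012345": 2.6}

# Which queries got better, and how did the overall average move?
improved = [q for q in old_scores if new_scores[q] > old_scores[q]]
print(f"mean old={mean(old_scores.values()):.2f} "
      f"new={mean(new_scores.values()):.2f}; "
      f"{len(improved)}/{len(old_scores)} queries improved")
```

Because both algorithms are judged on the identical query set, the difference in means reflects the algorithm change rather than a change in query mix.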
Manufacturers may use different wordings for similar products; different distributors can describe or title identical products in different ways; or sometimes, you may just have several images associated with one product with no real way of knowing which is best. The point is, the bigger your catalog of results, the more likely some of those results will be dirty. And though some of this cleaning can be automated (sometimes by simple deduplication functionality in your database), some of it simply cannot. This is where the crowd comes in. You can set up a job where you show contributors any variety of data you think needs cleaning and ask them for input. For example, you could show them a query for "black shoes" and ask if the two pairs returned are identical. You could show them two pictures associated with a certain product in your database and ask if they are the same. There are tons of options here. Just know that how you clean your data is up to you. Product Categorization Categorizing your results is another great use of the platform and, again, one that can be tackled in a few different ways, all depending on what problem you'd like to solve. For example, say there's a result for "black shoes" that's consistently being returned on your first page, but hardly anyone is clicking on it. Is that because that result is invalid or because people simply don't want those shoes? If you were looking at a metric like click data, you wouldn't be able to tell. What you can do is pull out query-result pairings that match the one described (supposedly good results without much traction in clicks) and simply ask the crowd: "Is this item categorized well?" A contributor would see the result (say, black shoe polish), the query (black shoes), and the category (in this case, perhaps just "shoes"). Since the result doesn't match intent, the crowd would mark "no." Questions like these across thousands of underperforming results can help you categorize products the right way. In fact, piggybacking on that idea, you could simply take all the results the crowd identified as poor and ask the crowd to re-categorize each of those. This works really well for building out a more robust categorization taxonomy as well. For example, say you applied a category to a whole swath of products and you've realized that the categories you've gone with just aren't specific enough. Maybe you created a category for "Video Games" but didn't include the game system for each title. That would be a pretty big hole in your search experience. You can also take products and simply provide categories to the crowd so they can categorize them for you.
Just show a product image, title, and/or description and present a few category options. They’ll do the rest. Here’s an example of what the crowd sees in a job like that: Do this and you’ll have a more specific ontology you can use to improve your search experience and your search algorithm.
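Once those categorization judgments come back, turning them into labels can be as simple as a majority vote per product. A hypothetical sketch, where the products, category options, and contributor answers are all invented for illustration:

```python
from collections import Counter

# Hypothetical crowd answers: for each product, the category that each
# contributor picked from the options presented in the job.
crowd_answers = {
    "black shoe polish": ["Shoe Care", "Shoe Care", "Shoes"],
    "halo 3": ["Video Games > Xbox 360", "Video Games > Xbox 360"],
}

# Assign each product its most frequently chosen category.
categories = {
    product: Counter(picks).most_common(1)[0][0]
    for product, picks in crowd_answers.items()
}
print(categories["black shoe polish"])  # → Shoe Care
```

Note how the majority vote moves the shoe polish out of "Shoes," exactly the miscategorization described in the "black shoes" example above.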
  21. Relevance Ranking Another way to tweak your algorithm is relevance ranking. This helps give you a holistic view of what users may expect when searching for certain products or items and can help you re-rank results based on actual feedback. Here, you present the crowd with a query and the top ten results your algorithm provides. Then, simply ask the crowd which results fit the intent of the original query the best. You can have the crowd rank the results in order, one through ten or one through five. This can help you fine-tune your search relevance. For example, say you're using click data as a metric you're currently optimizing for. You provide "black shoes" as a query and contributors rank the results your algorithm produces one through ten. Click data might suggest a certain result should be first or second on that list when, in fact, what users are reacting to is a compelling picture (not a compelling product) or the only pair of shoes on that list that actually isn't black. That result will get lower marks than you'd likely anticipate, and you can feed those new rankings into your search algorithm to surface better results, higher up.
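One simple way to combine those contributor rankings into a single re-ranking is to sort results by their average position across contributors. A sketch with invented result IDs and rankings (not CrowdFlower's aggregation logic):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rankings: each contributor orders the results for one
# query from best (first) to worst (last).
contributor_rankings = [
    ["shoe-a", "shoe-b", "shoe-c"],
    ["shoe-b", "shoe-a", "shoe-c"],
    ["shoe-a", "shoe-c", "shoe-b"],
]

# Collect the position each contributor gave to each result.
positions = defaultdict(list)
for ranking in contributor_rankings:
    for pos, result in enumerate(ranking, start=1):
        positions[result].append(pos)

# Re-rank by average position: lower means contributors placed it higher.
reranked = sorted(positions, key=lambda r: mean(positions[r]))
print(reranked)  # shoe-a ranks first on average
```

Feeding an aggregate ordering like this back into your algorithm is the "new rankings" step described above; it smooths out any one contributor's idiosyncratic ordering.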
  22. CONCLUSION 6 Search has an outsized impact on an ecommerce company's bottom line. That's why they invest heavily in making their search experience as accurate and intuitive as possible. By learning how smart ecommerce sites design their interfaces, surface their results, and improve them constantly, we hope you've learned how to improve your site's search relevance and how CrowdFlower can help along the way. We believe people-powered search relevance tuning is the best way to accurately and scalably improve search, and we would love to hear from you. You can request a demo or just try CrowdFlower on a trial basis. You can also download our discounted cumulative gain (DCG) calculator here. About CrowdFlower CrowdFlower's people-powered data enrichment platform helps data scientists train algorithms to consistently provide the most relevant search results for ecommerce websites. It fills in the gaps in your data by adding product descriptions, IDs, image tags, and other metadata to give you cleaner, more complete data. It can also handle the most intricate product categorization, giving you the kind of advanced taxonomies your product search needs. Leverage the power of the world's largest on-demand workforce, and take your ecommerce search experience to the next level.