CrowdFlower - Best Practices for Building a World-Class Search Engine
Table of Contents
Designing an intuitive search engine
Understanding and optimizing results
Improving your search relevance algorithm
How CrowdFlower can help
Internal search has a direct and tangible impact on revenue. Put simply, users who search are more
likely to convert than users who only browse. This is especially true for ecommerce companies, as
searchers on those sites are many times more likely to purchase than browsers.
It follows, then, that ecommerce sites would spend a lot of institutional capital to create the most accurate, most intuitive, and most powerful search experience they can. For the biggest companies–sites like Amazon and eBay and Walmart–this is a gigantic undertaking. They have massive, fluid product catalogs and, since they sell such an incredible breadth of goods, their algorithms need to be exceptional. A user can buy the same book in many places; they're likely to keep coming back if finding it at a competitive price is simple and painless.
In this How To Guide, we’ll be taking lessons and best practice from some of the biggest ecommerce
sites in the world and teaching you how to hone your search experience.
Here’s what we’ll cover:
1. Designing an intuitive search interface: More than just a search box, a great search interface includes autocorrect, faceting, quality product images, filterable results pages, and more.
2. Understanding and optimizing results: We'll cover the most common metrics data scientists use to grade and test their search algorithms, namely click data, per-result relevance, and whole page relevance.
3. Improving your search relevance algorithm: Learn how to score and test your search algorithms.
4. How CrowdFlower can help: We’ll explain how crowd-based approaches can score, test, and
improve your search relevance, regardless of the metric you use. We’ll present real jobs and real
use cases to show you how.
Let’s start with your search interface.
“Shoppers who use site search on ecommerce sites convert at two to
three times the rate of those who don’t use it.”
“Walmart has seen a 10% to 15% increase in shoppers completing
purchases for products found via search queries entered on its website”
DESIGNING AN INTUITIVE SEARCH INTERFACE
Essentially every website has search functionality, but when we're talking about your search interface, it's important to note that we're not simply referring to your site's search box. Instead, we're talking about everything from that box to the facets, categories, autocomplete, autosuggest, and all other functionality that helps users navigate to the products or pages they want to see. Here are six essential features you should include in your front-end search experience.
The autocomplete feature in a search box offers users query suggestions as they start typing into the search box. PrinterLand, a leading printer retailer in the UK, observed that customers who land on an autocomplete page suggestion are six times more likely to convert than those who don't.(3) And although most ecommerce websites today include autocomplete, some include messy data like duplicates or suggestions that have no products to display.
Amazon's autocomplete is an industry leader. It not only autocompletes popular search queries, but actually includes categories, so that broader searches can be narrowed before users even hit the search button. "Black shoes" on its own isn't the most specific search term and, because of that, a less advanced search interface might surface some women's shoes, some toddler's shoes, some men's dress shoes, and so on. Because Amazon populates a drop-down with autocompletes and suggested categories, users can navigate to more specific results without having to pare down their search on the results page.
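At its simplest, autocomplete is a prefix match over popular queries. Here is a minimal sketch; the query list is invented for illustration, and a production system would also rank suggestions and mix in categories:

```python
# Illustrative list of popular queries pulled from a search log.
popular_queries = ["black shoes", "black shoe polish", "black jacket", "blue jeans"]

def autocomplete(prefix, queries, limit=5):
    """Suggest stored queries that extend what the user has typed so far."""
    p = prefix.lower().strip()
    return [q for q in queries if q.startswith(p)][:limit]

print(autocomplete("black sho", popular_queries))
# ['black shoes', 'black shoe polish']
```

Real implementations typically back this with a trie or a search engine's suggester and order suggestions by query popularity.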
Of course, some users will either type past an autocomplete suggestion or search for something your autocomplete algorithm doesn't totally understand. This leads to misspellings, especially when searchers are typing brand names. In these cases, it's important to suggest product names and model numbers that may not exactly match the user's query, but are similar.
The idea is similar to the one we discussed with autocompletion: you’re removing another click or
another couple steps from the user’s experience.
For example, take a look at the screenshot from Walmart. Their site is smart enough to recognize that
a user typing “sandisc” actually means “Sandisk” and populates results based on that understanding.
There’s also a link to click if, for some reason, that misspelling happened to be accurate.
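A rough "did you mean" can be approximated with fuzzy string matching against known names from your catalog. This sketch uses Python's standard-library difflib; the brand vocabulary is made up, and real systems usually combine edit distance with query-log frequency:

```python
import difflib

# Illustrative brand vocabulary pulled from a product catalog.
brands = ["Sandisk", "Samsung", "Seagate", "Sony"]

def correct(term, vocabulary, cutoff=0.7):
    """Return the closest known term to a (possibly misspelled) query."""
    lowered = {v.lower(): v for v in vocabulary}
    match = difflib.get_close_matches(term.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[match[0]] if match else term

print(correct("sandisc", brands))  # Sandisk
```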
Even for searchers who spell queries correctly, there’s a chance they might use terminology that
may not exactly match the terms in your company’s product catalog. For example, if a user types in
“suitcase” instead of “luggage” and your database doesn’t account for synonyms, it’s possible that
the user might be shown a blank search results page. According to Baymard, an ecommerce usability
research ﬁrm, some “70% of ecommerce websites require users to search by the exact jargon for the
product type that the website uses.” That’s a problem you’ll have to solve for.
eBay does this really well. In the screenshot below, a user who types "suitcase" is shown not only multiple suitcase-related terms (like "suitcase set" and "vintage suitcase") but synonyms as well ("luggage" and "travel bag").
Mapping these categories together often requires some manual work but it’s well worth it. Setting
rules so that your algorithm surfaces these sorts of related searches keeps users from re-typing to ﬁt
your particular nomenclature or from simply bouncing from your site altogether to a competitor that
understands what they want in the ﬁrst place.
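One simple way to implement that mapping is a hand-maintained synonym table expanded at query time. The entries below are examples only; search engines such as Elasticsearch and Solr also support synonym files applied at index or query time:

```python
# Hand-maintained synonym table -- the manual mapping work described above.
synonyms = {
    "suitcase": ["luggage", "travel bag"],
    "luggage": ["suitcase"],
}

def expand_query(query):
    """Return the original query plus any mapped synonyms to search with."""
    return [query] + synonyms.get(query.lower(), [])

print(expand_query("suitcase"))  # ['suitcase', 'luggage', 'travel bag']
```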
Facets are the ﬁltering options you’ll often see on the left side of a search results page. They let users
better reﬁne their results without re-searching, surfacing similar products, color choices, add-ons, and
more. This gives buyers additional choices, showcases more of your product catalog at a glance, and
provides searchers with all the options they need to stop looking and start buying.
Still, less than half of ecommerce sites use faceting. It’s often a bigger engineering ask than most
of the other best practices on this list and it requires a really well-organized taxonomy to function
properly. For example, your products should have color data if you want to sort by color or brand data
to sort by brand. And while faceting alone might not be the reason to populate your taxonomy with
this information, having all that data for all your products will give you many more weights to test your
algorithm with in the future.
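Faceting only works if every product carries the relevant attributes, which is why the taxonomy matters. A minimal sketch of facet filtering, with an invented three-item catalog:

```python
# Illustrative catalog rows; filtering works only because each product
# carries the color and brand attributes discussed above.
catalog = [
    {"name": "Oxford", "color": "black", "brand": "Nike"},
    {"name": "Runner", "color": "white", "brand": "Nike"},
    {"name": "Loafer", "color": "black", "brand": "Acme"},
]

def apply_facets(products, **selected):
    """Keep only products matching every facet the user has selected."""
    return [p for p in products
            if all(p.get(field) == value for field, value in selected.items())]

print(apply_facets(catalog, color="black", brand="Nike"))
# [{'name': 'Oxford', 'color': 'black', 'brand': 'Nike'}]
```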
Product images are a necessity for any ecommerce site. According to an Internet Retailer Conference + Exhibition (IRCE) report, 75% of consumers cited image quality as a critical or very important feature on ecommerce sites. The ability to see color choices and alternate views was cited by 68% and 66%, respectively.
Larger pictures, higher quality pictures, and multiple angles on products simply drive more sales. Take a look at the screenshot from Amazon above. Note that even for an item as simple as a black shoe, there are five hi-res options to choose from. Additionally, there are multiple color choices to choose from on the page itself. This mirrors the experience at a brick-and-mortar store about as well as possible online and gives users additional options without further searching.
Snippets are the blocks of information listed under or to the side of each product on a search results page. They contain data like price, ratings, reviews, product summaries, colors, sizes, and more. The screenshot from Target does a nice job of illustrating product snippets. Users can see at a glance the price, the sizes, the available colors, the brand, and relevant reviews.
All this data should be present in your existing
database but, much like with faceting above,
if it doesn’t exist (or if your database is fairly
messy), consider building out that taxonomy so
your front-end experience can provide the right
information to prospective buyers. Conversely,
bad snippets can confuse users and drastically
increase bounce rate.
UNDERSTANDING AND OPTIMIZING RESULTS

The other half of the search experience is, of course, relevant results. After all, the best interface
in the world isn’t going to drive conversions if the results your algorithm surfaces are misleading,
jumbled, or inexact. In this section, we’ll discuss the best practices we’ve found our customers use
to iterate on their search experience, some metrics you should consider when improving and testing
your algorithms, and a few counter-intuitive lessons we’ve learned from real-world use-cases.
As most data scientists and search architects know, the process of improving search relevance is never truly done. Rather, it's an ongoing exercise. But a great search experience is absolutely worth the investment. We noted this above, but it's important to reiterate that shoppers who use site search on ecommerce domains are two to three times more likely to purchase than those who don't. Which is to say: browsers browse and searchers buy.
Past that, it's not simply returning a great page of results (though that, of course, is important), but rather making sure the best results are as high on the page as you can get them. That's because clicks aren't evenly distributed across all your results. In fact, the first result on your page will usually get twice the clicks of your second, which gets twice the clicks of your third, and so on down the page (though it's worth noting that the final result, which is generally before the "Next" button or the pagination links, does get more interaction than the few before it). You can see the average click distribution in the chart to the right.

Before you start really digging in, it's vital that you know exactly what metrics your site should optimize for. For example, are you worried about certain kinds of products over others? Are you concerned with pages on a holistic level? Do you want to account for seasonality? Surface more expensive products? The list goes on and on.
Popular Search Relevance Metrics Worth Optimizing
The reason you want to consider this up front is fairly practical: you’ll need a particular metric to show
to your CEO or other stakeholders to prove that your relevance tuning is making tangible improve-
ments. This will keep you from having to drop everything for ad-hoc requests. For example, say your
CEO’s son searches for “black iPad” on your website and half the results are for other black tablets in
addition to the iPad. Your CEO might ask why your search experience is returning “bad” results when
in fact you’d optimized your search for product categories (tablets) and colors before showing white
iPads. He might ask you to redo your strategy based on that single query. Reacting to something like
that can be disastrous long-term.
Decide which metrics are the most important for your site and stick to them. You’ll have clear goals
and a clear number to optimize toward.
Here are three commonly used metrics that smart ecommerce companies use to decide if their
search relevance efforts are bearing fruit.
Metric 1: Click Data
Click data is simply what your users click on after they make certain searches. In other words, if a
search for “black shoes” consistently results in searchers clicking on a certain link, you can assume
that link is a quality result for that query.
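As a sketch of what this looks like in practice, the aggregation below tallies which results capture clicks for each query; the log entries and SKU identifiers are invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical click log exported from site analytics: (query, clicked SKU).
click_log = [
    ("black shoes", "sku-101"), ("black shoes", "sku-101"),
    ("black shoes", "sku-205"), ("black shoes", "sku-101"),
]

# Tally clicks per result, grouped by query.
clicks_per_query = defaultdict(Counter)
for query, sku in click_log:
    clicks_per_query[query][sku] += 1

# Share of "black shoes" clicks each result captured.
total = sum(clicks_per_query["black shoes"].values())
for sku, n in clicks_per_query["black shoes"].most_common():
    print(sku, round(n / total, 2))  # sku-101 0.75, sku-205 0.25
```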
Click data is one of the easier ways to fine-tune your search relevance algorithm. It's cheap to collect a ton of it because all you're doing is keeping track of what users do on your site. Which is to say that while some of the metrics below will require human scoring, click data simply doesn't.
On the other hand, click data can sometimes be difficult to interpret. In fact, sometimes what users
click on might not be what they’re really looking to buy.
Let's recall that example above. Say a user searches for "black shoes" and is presented with a results page showing eight pairs of black shoes, one pair of white shoes, and a risqué picture of a black woman's high heel. The white shoes might be especially stylish, stand out on the page, and thus get more clicks than you'd expect. On the other hand, you might have people clicking on that risqué picture and getting sidetracked. Meanwhile, your more accurate results (the eight pairs of black shoes) seem like less accurate results, even though they aren't.
Speaking of click data sometimes being misleading, consider the "next" button. We worked with a big ecommerce company that believed that when users clicked the "next" button at the bottom of a search page, it was actually a good thing. This might seem counter-intuitive, as it suggests the user didn't find what they were looking for. But looking at it a different way, you could also argue that clicking a "next" button meant the user actually did think the results were high quality; they just didn't see the exact pair of, say, black shoes they wanted to buy, so they clicked next until they found that perfect match.
Metric 2: Per Result Relevance
Per result relevance, on the other hand, requires human scoring. Generally, it works like this: first, pull a random sample of search queries from your logs (say, 500). Then, pull the top results that appear for each query (say, ten). That would give you somewhere around five thousand query-result pairings, depending on how many results you decide to analyze. You'd then take that set of pairings and score them on a relevance scale of your choosing (say, a four-point scale).
(Before we go much further, we want to stress that it's important the queries you select are in fact random. You want to test your algorithm against misspellings and the sort of long-tail searches that make up the bulk of searches on most sites. It's not terrible to cherry-pick a search here or there–especially if those items make up a sizable percentage of your queries or if your site makes an inordinate amount of its profit on certain kinds of searches–but the vast majority should absolutely be random.)
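In code, drawing that random sample might look like the sketch below; the query log and the search() stand-in are placeholders you'd swap for your own log export and search backend:

```python
import random

# Placeholder query log; in practice, export this from your analytics.
query_log = ["black shoes", "sandisc 64gb", "suitcase", "nike air max"] * 200

def search(query, k=10):
    """Stand-in for your search backend: returns the top-k result ids."""
    return [f"{query}-result-{i}" for i in range(1, k + 1)]

random.seed(7)                           # reproducible sample
sampled = random.sample(query_log, 500)  # 500 random queries
pairs = [(q, r) for q in sampled for r in search(q)]
print(len(pairs))  # 5000 query-result pairings to send out for grading
```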
How you score the query-result pairs is up to you. Some sites will do internal audits, others will hire interns or temps, and others will look to crowd-based solutions. What's important is that everyone scoring these pairs gets clear instructions. Just make sure you're asking graders to look at intent, not just literal matching. Which is to say that "black shoes" should return black shoes, not black shoe polish (which matches the words in the query decently but the intent poorly).
Once each pairing is scored, you should have a good idea of where your search is performing and
where it isn’t. The most common scoring technique? Discounted cumulative gain or DCG.
DCG aims to give you a single number that explains how accurate results are for a certain query. There's no upper limit (the math is partially based on how many levels you allow in your relevance scale above) but higher numbers are better than lower ones. The way it works is best explained with an example.
Let's go back to black shoes. When you're gathering queries and result pairings, you should also be collecting the position (rank) of each result. So with black shoes, you'll have results on the page from, say, one through five. DCG takes the position and human-scored relevance number for each pairing and combines them, so you're left with a single, easy-to-understand metric. If all the results for black shoes are highly relevant, that score will be high; that indicates your algorithm is performing great on that search. If all the results for black shoes return non-shoes or red shoes or some other inaccurate result, the score will be low, which of course indicates the opposite.
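The standard formulation discounts each graded score by the log of its rank. A minimal sketch, using the four-point scale described below and invented grades:

```python
import math

def dcg(relevance_scores):
    """Discounted cumulative gain: sum of each graded score divided by
    log2(rank + 1), so results near the top of the page count most."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance_scores, start=1))

# Human grades (4-point scale) for "black shoes", in page order.
good_ordering = [4, 4, 3, 2, 1]  # best results ranked first
bad_ordering = [1, 2, 3, 4, 4]   # same grades, relevant items buried

print(round(dcg(good_ordering), 2))  # 9.27
print(round(dcg(bad_ordering), 2))   # 7.03
```

Note how the same set of grades produces a lower score when the relevant items sit further down the page, which is exactly the behavior you want from the metric.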
(Note: at the end of this How To Guide, we'll include a link to an Excel-based DCG calculator.)

[Chart: typical DCG weights, with search rank on the X axis]
How to score per result relevance
There are a couple decisions you'll need to make if you're looking at optimizing around per result relevance. First, you'll want to decide on a point scale that the people grading each query-result pairing will use to score it.
Many of our most successful customers use a four-point scale that looks something like this:
Each score below includes an example based on "black Nike shoes" as the query:

1 - Off-Topic: The result is irrelevant and intent was not matched. Example: black gloves, black shoe polish, or something even less relevant.
2 - Acceptable: The result is somewhat related but poorly matched. Example: black shoes that aren't Nikes; the result matched part of the query and the intent but didn't get very close.
3 - Good: The result is close but not perfect. Example: Nikes that are mostly but not all black.
4 - Excellent: The result is a precise match. Example: black Nike shoes.
Each and every query-result pairing should be graded on the same scale. Multiple judgements on each pairing get you a less subjective score in general, but multiple judgements are especially important if you aren't doing this analysis in-house (in other words, if you're using temps, outsourcing, or a crowd-based approach, consider multiple judgements to normalize your scores).
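Normalizing multiple judgements per pairing can be as simple as averaging the grades; the contributor scores below are invented:

```python
from statistics import mean

# Three contributors' grades (4-point scale) per query-result pairing.
judgements = {
    ("black nike shoes", "sku-101"): [4, 4, 3],
    ("black nike shoes", "sku-990"): [1, 2, 1],
}

# Average the grades to smooth out individual graders' subjectivity.
normalized = {pair: round(mean(grades), 2)
              for pair, grades in judgements.items()}
print(normalized)
# {('black nike shoes', 'sku-101'): 3.67, ('black nike shoes', 'sku-990'): 1.33}
```

An alternative to averaging is to keep the most-agreed-upon grade (a majority vote), which works better when graders must choose exactly one category.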
The other thing you need to consider is what exactly you show your graders as the result. A query, of course, should just be text, but it's generally good practice to show scorers the page they'd land on if they clicked the particular result you're scoring (as opposed to a product title or a text description of the product). That's because this more closely mimics the experience a searcher has on your site (we'll get into scoring entire pages, which can be used in conjunction with the per result or click data metrics, in just a moment).
Again, the relevance score and the position (or rank) of multiple results can be combined to give you
a single DCG score you can use to get a high-level idea of your search relevance algorithm as well as
a number to try and beat during your later relevance testing.
Metric 3: Whole Page Relevance
This is another metric we frequently see ecommerce companies using to evaluate search relevance.
Naturally, it has its own set of pros and cons, but it’s important to note that whole page relevance is
often used in tandem with one of the aforementioned metrics.
One of the major advantages of whole page relevance scoring? It accurately portrays what the search experience is on your site. After all, users who search don't simply land on a particular result; they land on a page of results. If that entire page returns quality matches, you're delivering a great experience.
Much like per result relevance, this is a human-scored metric, so you'll want to settle on a scoring system. And, much like with per-result relevance, we often see a four-point scale put to use here.
Each score below includes an example based on "black Nike shoes" as the query:

1 - Off-Topic: The results for a particular query are bad across the board; none or only a small percentage of results match. Example: the page returns a host of black shoelaces, shoe polish, or shoes in other colors, perhaps with a pair or two of black shoes mixed in.
2 - Acceptable: The results for a particular query are decent, returning somewhere around 50% matching results. Example: the page returns five pairs of black shoes, four pairs of brown shoes, and a black shoe polish kit.
3 - Good: The results for a particular query match well, but not everything on the page is applicable; anywhere between 50% and 90% of the results match intent well. Example: the page returns eight or nine pairs of black shoes and one or two pairs of gray or brown shoes.
4 - Excellent: All results for a particular query match intent. Example: the entire page of results is black Nike shoes.
Keep in mind that judging your algorithm's accuracy based on this metric requires a good understanding of your site's result inventory. For example, if you only sell seven pairs of black shoes, you'll top out at "good" for that query, since a typical page has ten results.
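One way to operationalize the scale is to map the share of intent-matching results on a page onto the four bands; the thresholds below are assumptions you'd tune to your own rubric, not part of any standard:

```python
def page_score(results, matches_intent):
    """Map the share of intent-matching results to the 4-point page scale.
    Thresholds are illustrative, not canonical."""
    share = sum(matches_intent(r) for r in results) / len(results)
    if share >= 0.9:
        return 4  # Excellent: essentially everything matches
    if share > 0.5:
        return 3  # Good: most of the page matches
    if share >= 0.4:
        return 2  # Acceptable: roughly half the page matches
    return 1      # Off-Topic

page = ["black shoes"] * 8 + ["brown shoes"] * 2
print(page_score(page, lambda r: r.startswith("black")))  # 3 (Good)
```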
Whole page relevance is a smart thing to optimize for, for a few reasons. One we've mentioned already: it most closely mirrors what your users see when they search. Another is that it can show you holes in your algorithm you didn't realize existed (certain product names being conflated with others, the accuracy of searches further down your results page). It can also help account for things like
seasonality. You can ask your graders to, for example, grade pages based on searches made in the
autumn or around Christmas. If you searched “blower” in autumn, you’d want to show a page of leaf
blowers; in the winter, you’d rather surface snow blowers.
That said, whole page relevance isn’t perfect. You really do want the ﬁrst result to be as close to the
query intent as possible and whole page relevance can't account for that. Moreover, scoring whole page relevance is a bit more subjective. Graders simply find it easier to evaluate simple query-result pairings than to critique an entire page. The cognitive load for a whole page is higher, and the lines between "good" and "acceptable" or "good" and "excellent" are hazier when you're viewing an entire page.
More often than not, one or more of the above metrics (click data, per result relevance, and whole page relevance) are used to evaluate your search experience. Of course, evaluating it is just step one. Making your algorithm better is what you'll want to do next.
IMPROVING YOUR SEARCH RELEVANCE ALGORITHM
Now that you've set up an intuitive interface and found a baseline for your site's search relevance, you're ready to start tweaking the algorithm you've been working with thus far. Exactly what method you use to improve your search experience depends on the metrics you've chosen, your unique set of results, and what users do on your site, but a few general rules apply.
For starters, remember that you should be testing against the same queries. If you're using per result relevance (and we've found this is what most of our customers test on primarily), you want to test the same set of queries against a competing algorithm. In other words, if you scored your original results for "black shoes," you'll need to compare them to your new algorithm's results for "black shoes." The same applies for whole page relevance.
Here are a few common ways to improve search relevance:
Hand-Selecting Better Results for Popular Queries
This is absolutely not a scalable solution but it can be a decent quick ﬁx, especially for products and
queries you know make up an outsized percentage of your site’s traffic.
Say, for example, you’re aware that 15% of your site’s searches are for “black shoes” but your result
pairs and result pages are scoring low on your selected metric(s). You could manually tell your algo-
rithm to surface black shoes one-by-one (or by category, if your database supports it). This ensures
that those 15% of searches will result in quality products or pages of products for your users.
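Mechanically, hand-selection often amounts to a pinned-results table consulted before the algorithm's own ranking; the queries and SKUs below are invented for illustration:

```python
# Hand-picked SKUs to surface first for high-traffic queries.
pinned = {
    "black shoes": ["sku-101", "sku-205", "sku-333"],
}

def rank(query, algorithm_results):
    """Put hand-selected items first, then the algorithm's own ranking."""
    pins = pinned.get(query, [])
    rest = [r for r in algorithm_results if r not in pins]
    return pins + rest

print(rank("black shoes", ["sku-900", "sku-101", "sku-777"]))
# ['sku-101', 'sku-205', 'sku-333', 'sku-900', 'sku-777']
```

The maintenance burden described below is visible right in the sketch: every new black-shoe SKU has to be added to the table by hand or it will never be pinned.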
The problem with this, beyond the fact that you can’t possibly match all searches on your site to
results by hand, is that whenever your site gets a new item or set of items that ﬁt that query (say, a
whole line of new black shoes), you’ll need to do this all over again. This is time-consuming and on a
practical level, people just forget. You might not even be aware certain new products have come in
that ﬁt that hand-selected use case.
Moreover, most searches fall in the long tail. You'll really only be accounting for a fraction of searches when you make hand-selected pairings and, if you later decide to include an additional category for each product, you'll run into even more trouble. It's a quick fix and it can pay short-term dividends, but it's not recommended.
Automatically or Manually Revising Result Weights
Every product or page in your database should have a set of traits or columns associated with it. Search software like Solr or Elasticsearch has these attributes, and most proprietary search programs do as well. Sticking with our ongoing example, a pair of shoes may have any or all of the following attributes:
• Clothing category
• Additional color
• Shoelace options
• Part of a sale or deal
• Introductory price
By analyzing your DCG scores or whatever other metric you’re using, you should be able to make
some hypotheses on which kinds of searches are less accurate than others. For example, say you’ve
noticed via whole page relevance that most searches with a color are never scored as “excellent.”
You may want to increase the weight your algorithm gives to that particular attribute and test the new
algorithm against your old one. You'd hope that searches like "black shoes" and "white gloves" and the like would perform better, but pay close attention to what performs worse as well. Certain searches that were previously returning excellent results could suffer. You'll need to adjust the balance of certain categories and attributes against each other to find the combination that performs best for the metrics you've been optimizing for.
Of course, you don't necessarily have to do this by hand. There are machine-learning strategies that will do this automatically and find the proper ratio of weights. These can be especially valuable if there is a ton of per-item data for each product or page on your site. The more data per item, the harder it is to find the most effective weighting across those attributes.
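A toy version of automated weight tuning is a grid search that re-ranks human-graded results under each weight combination and keeps whichever maximizes DCG. The products, attribute signals, and grades below are all invented; real systems use learning-to-rank methods rather than exhaustive search:

```python
import itertools
import math

def dcg(grades):
    """Discounted cumulative gain over grades listed in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, 1))

# Each product carries attribute-match signals in [0, 1] plus a human grade.
products = [
    {"title_match": 0.9, "color_match": 0.2, "grade": 2},
    {"title_match": 0.5, "color_match": 1.0, "grade": 4},
    {"title_match": 0.4, "color_match": 0.9, "grade": 3},
]

best = None
for w_title, w_color in itertools.product([0.0, 0.5, 1.0], repeat=2):
    # Rank by the weighted sum of signals, then score that ordering.
    ranked = sorted(products, reverse=True,
                    key=lambda p: w_title * p["title_match"] + w_color * p["color_match"])
    score = dcg([p["grade"] for p in ranked])
    if best is None or score > best[0]:
        best = (score, w_title, w_color)

print(best)  # the weight combination with the highest DCG on this toy data
```

On this toy data, weighting the color signal reorders the page so the highest-graded items rank first, which is exactly the effect the weight adjustment described above is after.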
And while this can be a really effective way to improve your site's search relevance, keep in mind that reusing the same data set of queries can be a danger. You'd be optimizing based on a certain selection of random searches over and over while possibly hurting search performance elsewhere. To guard against that, you can always test both your original and newly improved algorithms against each other on an additional selection of random queries, results, and page pairings.
Adding Extra Attributes or Tags to Your Library of Products or Pages
Another reason our example of "black shoes" might be returning subpar results is that you simply might not have a color attribute to weight at all. Conversely, you may look through your search log (or certain poor performers from your original random sampling) and conclude that you're missing key data for large swaths of products.
If that's the case, you should consider adding tags or attributes to your product catalog. This can be a bit time-consuming (though solutions that run through the crowd, as we'll explain later, are both cost-effective and scalable), but the impact can be immense. Not only will you fill in holes in your existing algorithm, but you'll give it new attributes that you can weight to improve results that weren't quite performing up to par. Additionally, having new weights allows you to surface new facets on your results page. It's never a bad idea to have more good data to iterate on, after all.
You might consider preemptively adding tags instead of waiting to notice use-cases where your search experience is lacking. Getting back to our seasonality example above, you might consider adding "christmas" as a tag to account for searches like "christmas gift" or "christmas lights" or something similar.
This isn’t meant to be an exhaustive list of strategies to adjust and improve your search algorithm.
In fact, new solutions, software, and processes are always on the horizon. The one we know most
about, of course, is how people-powered search relevance tuning works.
A bad model on good data is much better than a good
model on bad data.
Lukas Biewald, CEO, CrowdFlower
HOW CROWDFLOWER CAN HELP
Many of our most successful customers are companies like eBay: sites whose product catalogs are constantly changing and who absolutely need their search results to be highly relevant. After all, product listings expire every second on eBay, so quick fixes like hand-matching are out the window. And since sellers write the listings themselves, you're more likely to see misspellings or bizarre characters in product descriptions and titles.
We’ll go over some of the ways CrowdFlower can help improve your search relevance, complete with
real world use-cases, explanations, and job examples below. We’ll start with setting a baseline for
your current algorithm.
Scoring Results for Search Relevance
Some search relevance metrics need human scoring to verify accuracy. Even if you're basing your success on click data (which can be gamed by clever titles and, again, can be prone to noise), having a human score query-result pairs can be a good level-set. But it's far more valuable if you're evaluating your search experience based on per result or whole page relevance.
To really evaluate your algorithm, you're going to want to look at thousands of random searches and their associated results. Say you choose a relatively low number of searches (for example, 1,000) and you want to evaluate the first ten items appearing on the results page (as, commonly, each page will return ten results and many users simply won't click "next"). That alone is 10,000 data rows. You don't have time to hand-score each one. Meanwhile, hiring and training interns or outsourcing the task is troublesome and inefficient.
That’s where the crowd comes in.
By disseminating the work to a labor pool that's millions strong, you can get those 10,000 data rows scored quickly and, most importantly, accurately. That's because you can run each row through multiple contributors and either take their most agreed-upon answer or assign a value to each score on your relevance scale and simply average them.
Here’s what a CrowdFlower contributor sees during a typical job:
You present the search query and whatever you want to show as a result: a link to a page, an image of the product, the title of the product, or anything else. You can also include a generic search button so members of the crowd can look up the query if it's unfamiliar to them (something we generally recommend).
It's here we should note an important part of our process: quiz mode. Before you launch your relevance scoring job, you'll make what are called "test questions." Test questions are query-result pairings that you score yourself and judge a contributor's understanding against. For example, if your search query is "black shoes" and your result is actually a pair of black shoes, contributors who mark relevance scores you don't agree with will be flagged. If they fall under an accuracy threshold that you set, they won't be allowed to enter your job.
Further, we surface test questions randomly within each page of their workflow. Those test questions are not repeated, and if a contributor who passed quiz mode later falls below your accuracy threshold, they are ejected from your job and the units they've previously judged are put back into the pool of data rows available for judgement.
Once the crowd finishes scoring the relevance of those thousand queries, you'll have the data you need to make smart decisions about your algorithm. We even provide a tailor-made DCG calculator that works with our jobs to let you get a high-level look at your site's search relevance.
We recommend that you score your current search algorithm with the crowd, make changes based on the metrics that are right for you and your site, then test the query-result pairings the new algorithm produces on the same random set of queries against your old one. You'll be able to understand, at a glance, if your new algorithm is an improvement or if you should make further changes.
Say your query for “black shoes” is doing well but your algorithm doesn’t return good results for
something like the year those shoes were released, their Global Trade Item Number (GTIN), or any
other attribute you’ve noticed in your search logs because you simply never tagged those products
with those characteristics. This is another occasion where the crowd can be really helpful.
An example GTIN
You can surface part or the entirety of your product catalog to the crowd and ask them to tag items for any use case. Say you wanted to include seasonal information. Just ask the crowd if a certain item
is associated with spring or Halloween. Say you want to include additional product information, like
available sizes. Just present a listing and have the crowd select which sizes the item comes in.
These additional tags give you more options when you're improving your search algorithm. Whether you're adjusting weights manually or relying on a more automated, machine-learning approach, having additional categories to balance against each other can be highly valuable. And since CrowdFlower's contributors can run through thousands of these tagging assignments each hour, you can populate your entire product database with new tags quickly and efficiently, whenever you need to.
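Because several contributors see each item, the raw answers need one aggregation step before they land in your database. A common approach (sketched below with hypothetical names; the seasonal tags match the example above) is a simple majority vote, keeping the agreement level as a confidence signal:

```python
from collections import Counter

def aggregate_tags(answers):
    """Majority-vote a list of tag strings from different contributors.

    Returns the winning tag and the fraction of contributors who chose it.
    """
    counts = Counter(answers)
    tag, votes = counts.most_common(1)[0]
    return tag, votes / len(answers)

# e.g. five contributors asked which season a product is associated with
tag, confidence = aggregate_tags(
    ["spring", "spring", "halloween", "spring", "spring"]
)
# tag is "spring" with 80% agreement
```

Low-agreement items can be routed back out for more judgments instead of being tagged automatically.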
Product databases get messy. Manufacturers may use different wordings for similar products; different distributors can describe or title identical products in different ways; or sometimes, you may just have several images associated with one product with no real way of knowing which is best.
The point is, the bigger your catalog of results, the more likely some of those results will be dirty. And
though some of this cleaning can be automated (sometimes by simple deduplication functionality in
your database), some of it simply cannot. This is where the crowd comes in.
You can set up a job where you show contributors any variety of data you think needs cleaning and
ask them for input. For example, you could show them a query for “black shoes” and ask if the two
pairs returned are identical. You could show them two pictures associated with a certain product in
your database and ask if they are the same. There are tons of options here. Just know that how you
clean your data is up to you.
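To show where the automated pass ends and the crowd begins, here is a minimal deduplication sketch of the kind mentioned above: normalize titles and group products whose normalized titles collide. The catalog entries and function names are made up for illustration; anything this pass can't resolve is what you'd send to contributors.

```python
import re

def normalize(title):
    """Lowercase, strip punctuation, and collapse whitespace in a title."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]", "", title)  # drop punctuation
    return " ".join(title.split())            # collapse runs of spaces

def find_duplicates(products):
    """products: {product_id: title}. Returns groups of likely duplicates."""
    groups = {}
    for pid, title in products.items():
        groups.setdefault(normalize(title), []).append(pid)
    return [pids for pids in groups.values() if len(pids) > 1]

catalog = {
    "A1": "Black Shoes - Men's",
    "B2": "black shoes   mens",
    "C3": "Black Shoe Polish",
}
dupes = find_duplicates(catalog)  # A1 and B2 collapse to the same title
```

The near-misses this rule-based pass can't catch, like two genuinely different photos of the same shoe, are exactly the cases worth putting in front of the crowd.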
Categorizing your results is another great use of the platform and, again, one that can be tackled in a
few different ways, all depending on what problem you’d like to solve.
For example, say there's a result for "black shoes" that's consistently being returned on your first page, but hardly anyone is clicking on it. Is that because that result is invalid or because people simply don't want those shoes? If you're looking only at a metric like click data, you won't be able to tell.
What you can do is pull out query-result pairings that match the one described (supposedly good
results without much traction in clicks) and simply ask the crowd: “Is this item categorized well?” A
contributor would see the result (say, black shoe polish) and the query (black shoes) and the category
(in this case, perhaps just “shoes”). Since the result doesn’t match intent, the crowd would mark “no.”
Questions like these across thousands of under-performing results can help you categorize products
the right way.
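Pulling those query-result pairings out of your search logs is straightforward to automate. A hypothetical filter (field names, thresholds, and the log rows are all assumptions for the sketch) might flag anything that ranks near the top yet draws almost no clicks:

```python
def flag_for_review(log_rows, min_impressions=100, max_ctr=0.01, top_rank=3):
    """Flag highly ranked, rarely clicked results for crowd review.

    log_rows: dicts with query, result_id, rank, impressions, clicks.
    """
    flagged = []
    for row in log_rows:
        if row["impressions"] < min_impressions or row["rank"] > top_rank:
            continue  # too little data, or not ranked high enough to matter
        ctr = row["clicks"] / row["impressions"]
        if ctr <= max_ctr:
            flagged.append((row["query"], row["result_id"]))
    return flagged

logs = [
    {"query": "black shoes", "result_id": "polish-01", "rank": 1,
     "impressions": 500, "clicks": 2},    # 0.4% CTR at rank 1: suspicious
    {"query": "black shoes", "result_id": "oxford-42", "rank": 2,
     "impressions": 500, "clicks": 60},   # healthy CTR, nothing to review
]
flagged = flag_for_review(logs)
```

Each flagged pairing becomes one "Is this item categorized well?" question for the crowd.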
In fact, piggy-backing on that idea, you could simply take all the results the crowd identiﬁed as poor
and ask them to re-categorize each of those. This works really well for building out a more robust
categorization taxonomy as well. For example, say you applied a category to a whole swath of products and realized that the categories you went with just aren't specific enough. Maybe you
created a category for “Video Games” but didn’t include the game system for each title. That would
be a pretty big hole in your search experience.
You can also take products and simply provide categories to the crowd so they can categorize them
for you. Just show a product image, title, and/or description and present a few category options.
They'll do the rest. Here's an example of what the crowd sees in a job like that:
[Image: a sample product categorization job]
Do this and you’ll have a more speciﬁc ontology you can use to improve your search experience and
your search algorithm.
Another way to tweak your algorithm is relevance ranking. This helps give you a holistic view of what
users may expect when searching for certain products or items and can help you re-rank results
based on actual feedback.
Here, you present the crowd with a query and the top ten results your algorithm provides. Then, simply ask the crowd which results fit the intent of the original query the best. You can have the crowd rank the results in order, one through ten or one through five.
This can help you fine-tune your search relevance. For example, say click data is the metric you're currently optimizing for. You provide "black shoes" as a query and the crowd ranks the results your algorithm produces one through ten. Click data might suggest a certain result should be first or second on that list when, in fact, what users are reacting to is a compelling picture (not a compelling product) or the one pair of shoes on that list that actually isn't black. That result will get lower marks than you'd likely anticipate, and you can feed those new rankings into your search algorithm to surface better results, higher up.
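Since several contributors rank the same ten results, you need one combined crowd order to feed back into the algorithm. A simple, illustrative way (the result IDs and aggregation choice are assumptions, not a prescribed method) is to sort by each result's average assigned rank:

```python
def crowd_order(rankings):
    """Combine contributors' rankings into one order by mean position.

    rankings: list of lists, each an ordering of result IDs, best first.
    Lower mean position means the crowd ranked the result higher.
    """
    totals = {}
    for ranking in rankings:
        for position, result_id in enumerate(ranking, start=1):
            totals[result_id] = totals.get(result_id, 0) + position
    return sorted(totals, key=lambda rid: totals[rid] / len(rankings))

# Three contributors each rank the same three results, best first.
votes = [
    ["r2", "r1", "r3"],
    ["r2", "r3", "r1"],
    ["r1", "r2", "r3"],
]
order = crowd_order(votes)  # r2 wins on average position
```

This is where a result like the mislabeled "black" shoes above would sink in the combined order, even if click data alone had flattered it.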
Search has an outsized impact on an ecommerce company’s bottom line. That’s why they invest
heavily in making their search experience as accurate and intuitive as possible. By learning how smart
ecommerce sites design their interfaces, surface their results, and improve them constantly, we hope
you’ve learned how to both improve your site’s search relevance and how CrowdFlower can help
along the way. We believe people-powered search relevance tuning is the best way to accurately and
scalably improve search and would love to hear from you. You can request a demo or just try CrowdFlower on a trial basis. You can also download our discounted cumulative gain (DCG) calculator here.
CrowdFlower’s people-powered data enrichment platform helps data scientists train algorithms to
consistently provide the most relevant search results for ecommerce websites. It ﬁlls in the gaps in
your data by adding product descriptions, IDs, image tags, and other metadata to give you cleaner, more complete data. It can also handle the most intricate product categorization, giving you the
kind of advanced taxonomies your product search needs. Leverage the power of the world’s largest
on-demand workforce, and take your ecommerce search experience to the next level.