What is ShopWiki?
• ShopWiki is the retail division of Oversee.net.
• We run a collection of retail websites,
• Including the Comparison Shopping Engines (CSE)
How do we use Elasticsearch?
• You know, for search (not logging).
• We index millions of products, offered from
hundreds of thousands of stores, and allow
users to search them.
• ShopWiki was built using a proprietary search
server written in C++.
• Served us well for many years, but it needed
improvements, especially for non-English
• What about Lucene-based solutions?
• We tried out Solr3 when building
• Solr worked well (for English & French), but
the coupon dataset is small in comparison to
our product dataset.
• The setup was simple master-slave replication.
How do we scale?
• To use Solr for our product data we needed to
shard the data across multiple machines.
• But, Solr3’s sharding capabilities were clunky
and difficult to use.
• Enter Elasticsearch!
• Designed to scale out-of-the-box.
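To make "sharding the data across multiple machines" concrete, here is a minimal sketch of hash-based routing, where each document ID is mapped deterministically to one of a fixed number of shards. The function names, the `product-N` IDs, the shard count, and the use of CRC32 as the hash are all assumptions for illustration. This is not Elasticsearch's actual hash function.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard by hashing the document ID (CRC32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards):
    """Group document IDs by the shard they route to."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical product IDs spread over a hypothetical 6 shards.
    ids = [f"product-{i}" for i in range(1000)]
    for shard, docs in distribute(ids, 6).items():
        print(shard, len(docs))
```

Because the mapping depends only on the ID and the shard count, any node can work out where a document lives without a lookup table.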
• Compare.com was built using Elasticsearch
from the start.
• Allowed us to get up & running very quickly.
• Allowed us to scale up very quickly.
– 60 million products and growing.
• Allows us to iterate on new features quickly.
• ShopWiki search is being gradually ported to Elasticsearch.
• Allows us to have better non-English search
Our Elasticsearch Cluster
• 12 indices, one for each website.
• 3 replicas per shard.
• 3 master nodes (quorum of 2).
• 6 data nodes.
• Plan to add more data nodes as we proceed with
our migration of ShopWiki (500m products).
• Expect to need less hardware than the C++
cluster (uses 50+ machines).
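The "quorum of 2" for 3 master nodes follows the usual majority rule. The sketch below computes it and shows where the value would typically go in an Elasticsearch 1.x configuration (`discovery.zen.minimum_master_nodes`). The helper name and the dict layout are assumptions for illustration, not the actual ShopWiki configuration.

```python
def master_quorum(master_eligible_nodes: int) -> int:
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


# Numbers taken from the slide; the dict itself is only an illustration.
cluster = {"master_nodes": 3, "data_nodes": 6, "indices": 12}

# Setting a majority quorum guards against split-brain: two disjoint
# groups of master-eligible nodes cannot both reach a majority.
settings = {
    "discovery.zen.minimum_master_nodes": master_quorum(cluster["master_nodes"]),
}

if __name__ == "__main__":
    print(settings)
```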
• C++ search servers need to have the entire
dataset re-indexed and swapped out all at once.
• Could only do this once a day, at night (affects
• With Elasticsearch, we can update our data all
the time (it’s not even a limiting factor).
• Use TermsFacet to suggest filters to the user.
• E.g. filter by stores or brands.
• Using the 10 most frequent brands from a
search can produce bad results.
– A single brand may have lots of products that are
all weakly relevant.
• The solution in Solr is to limit facets to the
• Elasticsearch doesn’t have this feature (as
mentioned at last Meetup).
• Solution: TermsStatsFacet (AKA aggregations in 1.0)
• Allows us to get the brands/stores with the
most relevant results.
• E.g. Σ(score^n), where the exponent n allows us to tune facet results to our liking.
[Figure: TermsStatsFacet for Brands, query “mixing bowl”, shown for n = 0 (same as count) and n = 4]
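The effect of the exponent can be sketched in plain Python, outside Elasticsearch: sum score^n per brand over the hits, then rank brands by that total. With n = 0 every hit contributes 1, so the ranking is the plain count; a larger n favours brands whose hits score highly. The brand names and scores below are made up for illustration.

```python
from collections import defaultdict


def facet_weights(hits, n):
    """Sum score**n per brand; hits is an iterable of (brand, score)."""
    totals = defaultdict(float)
    for brand, score in hits:
        totals[brand] += score ** n
    return dict(totals)


def top_brands(hits, n, size=10):
    """Brands ordered by descending weight (ties broken by name)."""
    weights = facet_weights(hits, n)
    return sorted(weights, key=lambda b: (-weights[b], b))[:size]


# Hypothetical hits: one brand with many weak matches,
# another with a few strong matches.
hits = [("BowlCo", 0.2)] * 5 + [("ChefPro", 0.9), ("ChefPro", 0.8)]

if __name__ == "__main__":
    print(top_brands(hits, n=0, size=1))  # count-based ranking
    print(top_brands(hits, n=4, size=1))  # relevance-weighted ranking
```

In this toy data the count-based ranking (n = 0) puts the brand with five weak hits first, while n = 4 puts the brand with two strong hits first.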
• Use “more_like_this” query to find similar products.
• If result’s score is “high enough”, it’s likely the
same product from a different store.
• “High enough” is defined as a fraction of the
identity match’s score.
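The "high enough" rule can be sketched as a simple filter, assuming the identity score is the score the source product receives when matched against itself. The function name, the document IDs, and the 0.8 fraction are hypothetical; the slides do not state the actual value.

```python
def likely_same_product(identity_score, candidates, fraction=0.8):
    """Keep candidates whose score reaches a fraction of the identity score.

    candidates is an iterable of (doc_id, score) pairs returned by a
    similarity query; fraction is a tunable, hypothetical threshold.
    """
    threshold = fraction * identity_score
    return [(doc_id, score) for doc_id, score in candidates if score >= threshold]


if __name__ == "__main__":
    # Hypothetical scores for three similar-product hits.
    hits = [("store-a/123", 9.1), ("store-b/456", 4.0), ("store-c/789", 8.0)]
    print(likely_same_product(10.0, hits))
```

Expressing the threshold relative to the identity score keeps it comparable across products, since raw scores vary from one query document to the next.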
• Rob Stewart
• Lead Software Engineer