A high-performing search service requires both an effective search infrastructure and high search relevance.
Seeking a fault-tolerant, self-healing and cost-effective search infrastructure at scale, we built a platform based on the Apache Solr search engine with lightweight in-memory indexes, which avoids sharding and reduces overall infrastructure needs.
To populate the indexes, we use flexible ETL processes that keep our product catalog and search indexes updated in near real time and distributed across high-performance database engines.
We aim for high precision and recall in search relevance by applying query relaxation and boosting on top of the optimised platform.
https://www.activate-conf.com/speakers/detail/roger-rafanell
2. STAY CONNECTED
Twitter @activate_conf
Facebook @activateconf
#Activate19
6. Agenda
• Introduction to search in classifieds
• Search in the past
• Building a new search platform at scale
• Enabling data science
• The future of the search platform
13. Search in the past
Early 2015
[Diagram: Listings API, Search API]
14. Search in the past
Late 2017
[Diagram: Listings API, Search API; Shard 1 ... Shard 5, 8 replicas / shard; groups marked x 3, x 2 and x 1, each annotated (↑nodes)]
16. Business limitations
• No enrichment at import time
• Not easy to evolve schemas
• Not agile!
NO A/B TESTING
NO DATA SCIENCE
17. Search API limitations
• One API request -> One search query
• PHP + Solarium (↓ concurrency)
• High costs
[Diagram: Search API]
200 rps THROUGHPUT
400 ms RESPONSE TIME
60+ SERVICE INSTANCES
20. Building a new search platform
Analysis
• Spot oldest queries sent by Search API
• ↑Traffic for fresh listings
• All fields were stored
3 months CATALOG RETENTION
15 min HIGHLY REQUESTED LISTINGS
21. Building a new search platform
Looking for a strategy
• Keep only the last 3 months of listings
• Index only the queried fields
• Store only listing IDs
>100 GB OLD CATALOG SIZE
<4 GB NEW CATALOG SIZE
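The strategy above (retention window, indexing only queried fields, storing only IDs) can be sketched as a projection from a full listing to a slim index document. This is a minimal illustration, not the speakers' implementation: the field names in `QUERIED_FIELDS`, the `created_at` key and the 90-day approximation of "3 months" are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "3 months" is approximated as 90 days.
RETENTION = timedelta(days=90)
# Hypothetical set of fields that search queries actually use.
QUERIED_FIELDS = ("title", "category", "price")


def to_index_doc(listing, now):
    """Return a slim index document, or None if the listing is past retention."""
    if now - listing["created_at"] > RETENTION:
        return None
    # Only the ID and the queried fields reach the index; everything else
    # stays in the catalog storage.
    doc = {"id": listing["id"]}
    for field in QUERIED_FIELDS:
        if field in listing:
            doc[field] = listing[field]
    return doc


now = datetime(2019, 9, 1, tzinfo=timezone.utc)
fresh = {
    "id": "a1",
    "title": "Mountain bike",
    "description": "A long free-text description",
    "created_at": now - timedelta(days=10),
}
old = {"id": "b2", "title": "Old sofa", "created_at": now - timedelta(days=200)}
slim = to_index_doc(fresh, now)
expired = to_index_doc(old, now)
```

Dropping unqueried fields such as `description` from the index is what would shrink it enough to fit in memory on a single node; the full listing is then fetched by ID from a separate catalog.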
22. Solr was used as a key-value store,
NOT as a full-text search engine
23. Building a new search platform
Drawing a plan
THE BAD
• Where to store all listings fields?
• Need a catalog storage (database)
• Need also a fast serving layer
• Near real-time indexing constraints
THE GOOD
• No more sharding (↓index size)
• Standalone Solr instances
• High bump in performance
24. Building a new search platform
Big Data to the rescue
• NRT pipeline to keep the listings catalog up-to-date
• Batch pipeline to fully rebuild the catalog
25. Building a new search platform
The new architecture
Self-healing
26. Building a new search platform
The Search indexer ETL
• Fetch Listings
• Enrich Listings
• Fetch Verticals Features
• Normalize Attributes
• Anonymize PII
• Store to DB
• Store to Fast Layer
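The indexer stages named on this slide can be sketched as a chain of small functions. Everything below is illustrative: the record shape, the `vertical_features` lookup, the PII field names and the use of plain dictionaries for the database and fast layer are assumptions, and hashing is only a stand-in for whatever anonymization the real pipeline applies.

```python
import hashlib


def enrich(listing, vertical_features):
    # Attach the features of the listing's vertical (hypothetical lookup table).
    features = vertical_features.get(listing.get("vertical"), [])
    return {**listing, "features": features}


def normalize(listing):
    # Normalize attribute names to a trimmed, lower-case form.
    attrs = {k.strip().lower(): v for k, v in listing.get("attributes", {}).items()}
    return {**listing, "attributes": attrs}


def anonymize(listing, pii_fields=("email", "phone")):
    # Placeholder: replace assumed PII fields with a SHA-256 digest.
    out = dict(listing)
    for field in pii_fields:
        if field in out:
            out[field] = hashlib.sha256(str(out[field]).encode("utf-8")).hexdigest()
    return out


def run_indexer(listings, vertical_features, database, fast_layer):
    """Run each fetched listing through the stages and store it in both layers."""
    for raw in listings:
        doc = anonymize(normalize(enrich(raw, vertical_features)))
        database[doc["id"]] = doc
        fast_layer[doc["id"]] = doc


database, fast_layer = {}, {}
run_indexer(
    [{"id": "a1", "vertical": "bikes", "attributes": {" Color ": "blue"},
      "email": "seller@example.com"}],
    {"bikes": ["frame_size"]},
    database,
    fast_layer,
)
```

Keeping each stage a pure function of its input makes it straightforward to reuse the same code in a near real-time path and in a full batch rebuild.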
27. Building a new search platform
Search engine performance
Throughput: ↑12x
Recovery time: 12’
Latency: ↓8x
28. Building a new search platform
Catalog performance
Catalog (Fast layer): 16ms
Catalog (Database): 40ms
Worst Case Latency: 56ms
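One reading of these three figures, which the slide does not state explicitly, is a read-through lookup: try the fast layer first and fall back to the database on a miss, so the worst case is roughly the sum of both (16 ms + 40 ms = 56 ms). A minimal sketch under that assumption, with dictionaries standing in for both stores:

```python
def get_listing(listing_id, fast_layer, database):
    """Look up a listing in the fast layer, falling back to the database."""
    doc = fast_layer.get(listing_id)
    if doc is not None:
        return doc  # best case: served from the fast layer only
    doc = database.get(listing_id)  # worst case: both layers are consulted
    if doc is not None:
        # Assumption: a miss repopulates the fast layer for later requests.
        fast_layer[listing_id] = doc
    return doc


fast_layer = {"a1": {"id": "a1", "title": "Mountain bike"}}
database = {
    "a1": {"id": "a1", "title": "Mountain bike"},
    "b2": {"id": "b2", "title": "Road bike"},
}
hit = get_listing("a1", fast_layer, database)
fallback = get_listing("b2", fast_layer, database)
missing = get_listing("zz", fast_layer, database)
```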
30. Building a new search platform
Search API redesign
[Diagram: Listings API, Search Library, Search API, IDs; components marked x 1]
31. Building a new search platform
Search library - Scala to rule them all
• Wrap the search retrieval logic
• One request → Multiple parallel queries to Solr
• Non-blocking I/O with solrS, persistence drivers
• Seamless integration with Finagle framework
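The "one request → multiple parallel queries" idea can be illustrated with non-blocking I/O. The talk's library is written in Scala on Finagle with solrS; the sketch below uses Python's `asyncio` purely to show the pattern, and `fake_solr` with its `FAKE_INDEX` is a hypothetical stand-in for a real asynchronous Solr client.

```python
import asyncio

# Hypothetical query -> listing IDs mapping standing in for a Solr instance.
FAKE_INDEX = {
    "title:bike": ["l1", "l2"],
    "category:bikes": ["l2", "l3"],
}


async def fake_solr(query):
    # Simulate a non-blocking network call that returns only listing IDs.
    await asyncio.sleep(0.01)
    return FAKE_INDEX.get(query, [])


async def search(client, queries):
    """Fan one request out into concurrent queries and merge the IDs."""
    results = await asyncio.gather(*(client(q) for q in queries))
    merged, seen = [], set()
    for ids in results:  # gather preserves the order of the input queries
        for listing_id in ids:
            if listing_id not in seen:
                seen.add(listing_id)
                merged.append(listing_id)
    return merged


ids = asyncio.run(search(fake_solr, ["title:bike", "category:bikes"]))
```

Because the queries run concurrently, the request takes about as long as the slowest single query rather than the sum of all of them.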
32. Building a new search platform
Search API - Scala to rule them all
• Based on Finagle services framework
• Finatra/Finagle = ↑concurrency & ↓resources
• Enable backend driven A/B testing
• Personalized search
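Backend-driven A/B testing usually needs a stable way to assign each user to a variant on the server side. The slides do not describe the mechanism used, so the following is only a common approach shown as a sketch: hash the experiment name and user ID and map the digest onto the variant list. The function and variant names are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # The same (experiment, user) pair always lands in the same bucket.
    return variants[int(digest, 16) % len(variants)]


first = assign_variant("user-42", "query-expansion")
second = assign_variant("user-42", "query-expansion")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user is not always in "treatment" everywhere.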
33. Building a new search platform
Overall performance
↑Throughput & ↓Latency: 13x
Resources: ↓20x
Cost Reduction: 100x
36. Enabling data science
Improving recall - Query expansion
Searching for: ‘mountain bike’
Similar Queries → Cause
blue mountain bicycle → Synonyms
mountain and road bike → OK
mountain bike frame → Relevant?
bicicleta de montaña → Language
scout montain bike → Spelling
mountain bike lock → Relevant?
Similar Queries → Expected Behavior
[Expected-behavior marks for the same six queries are not preserved in this transcript]
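Query expansion for the synonym, spelling and language causes listed on this slide can be sketched as generating alternative query strings from small lookup tables. The tables below are hypothetical and tiny; a production system would derive them from data, and the slides do not say how the speakers built theirs.

```python
from itertools import product

# Hypothetical per-token alternatives (synonyms and a common misspelling).
TOKEN_ALTERNATIVES = {
    "mountain": ["mountain", "montain"],
    "bike": ["bike", "bicycle"],
}
# Hypothetical phrase-level alternatives (another language).
PHRASE_ALTERNATIVES = {
    "mountain bike": ["bicicleta de montaña"],
}


def expand_query(query):
    """Return the original query plus its expanded variants."""
    q = " ".join(query.lower().split())
    options = [TOKEN_ALTERNATIVES.get(tok, [tok]) for tok in q.split()]
    # Cartesian product of the per-token alternatives; the original comes first.
    expanded = [" ".join(combo) for combo in product(*options)]
    expanded += PHRASE_ALTERNATIVES.get(q, [])
    return expanded


variants = expand_query("Mountain  bike")
```

Expansion raises recall but does not decide relevance: queries such as "mountain bike frame" or "mountain bike lock" would still match and need separate handling, for example through boosting.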
48. Conclusions
Raising the bar
• Indexer pipeline enables data enrichment & transformations
• Simplified search architecture with lightweight in-memory indices
• Fault-tolerant and self-healing infrastructure and processes
• Unlock real data science in