SEARCH AND RELEVANCE AT SCALE FOR ONLINE CLASSIFIEDS
STAY CONNECTED
Twitter @activate_conf
Facebook @activateconf
#Activate19
ROGER RAFANELL
Senior Big Data Engineer | letgo
About me
letgo
• Second-hand marketplace app
• Founded 2015
• Main markets: US & Turkey
• 5M downloads/month & 20M MAU
Agenda
• Introduction to search in classifieds
• Search in the past
• Building a new search platform at scale
• Enabling data science
• The future of the search platform
Introduction
Introduction
Search in classifieds
• 100K items/day
• Listings are reposted, sold, or deleted
• We cannot cache results!
Introduction
Hyperlocality search
letgo catalog at (39.89, -77.08)
letgo catalog at (39.28, -76.69)
Results in Washington, D.C. ≠ Results in Baltimore
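A minimal sketch of why the same query gives different results at the two locations above. The catalog entries, the 30 km radius, and the function names are illustrative assumptions, not letgo's implementation; the two search points are the coordinates printed on the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical catalog: (listing_id, title, lat, lon)
CATALOG = [
    ("a1", "mountain bike", 39.90, -77.10),
    ("a2", "road bike", 39.88, -77.05),
    ("b1", "mountain bike", 39.30, -76.70),
    ("b2", "kids bike", 39.27, -76.65),
]

def search(term, lat, lon, radius_km=30.0):
    """Return IDs of listings matching `term` within `radius_km`, nearest first."""
    hits = []
    for listing_id, title, l_lat, l_lon in CATALOG:
        distance = haversine_km(lat, lon, l_lat, l_lon)
        if term in title and distance <= radius_km:
            hits.append((distance, listing_id))
    return [listing_id for _, listing_id in sorted(hits)]

first = search("bike", 39.89, -77.08)
second = search("bike", 39.28, -76.69)
```

The two result lists share no listing, which is why a result cache keyed only on the query text would not help here.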
Introduction
User input data example
• Typos
• Slang
• Poor pictures
• Wrong information
• Ambiguity
• Weirdness
Introduction
Explorers vs Deal Seekers (Marco Polo vs Hernán Cortés)
Explorers
• Like browsing
• Recall > Precision
• “Cars”
Deal Seekers
• Search, filter, haggle
• Precision > Recall
• “2015 honda civic lx”
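The recall/precision trade-off between the two user types can be made concrete with the standard definitions. This is an illustrative sketch; the listing IDs and result sets are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical catalog: listings 1..10 are cars, of which 1 and 2 are
# the exact model a deal seeker is looking for.
wanted = [1, 2]

# A broad result set finds everything wanted but is mostly noise for this user.
broad = precision_recall(retrieved=range(1, 11), relevant=wanted)

# A strict result set is all signal but misses one of the wanted listings.
strict = precision_recall(retrieved=[1], relevant=wanted)
```

Here `broad` has full recall and low precision, which suits a browsing explorer, while `strict` has full precision and partial recall, which suits a deal seeker.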
Search in the past
Search in the past
Early 2015
Listings API
Search API
Search in the past
Late 2017
[Diagram: Listings API and Search API in front of a sharded index, Shard 1 ... Shard 5, 8 replicas / shard; groups marked x 3, x 2, x 1 (↑nodes)]
24h FULL INDEXATION
8h INSTANCES RECOVERY
150ms RESPONSE TIME
Operation limitations
• Slow full catalog imports
• Slow reactivity to traffic spikes
• High costs
Business limitations
• No enrichment at import time
• Not easy to evolve schemas
• Not agile!
NO A/B TESTING
NO DATA SCIENCE
Search API limitations
• One API request -> One search query
• PHP + Solarium (↓ concurrency)
• High costs
Search API
200rps THROUGHPUT
400ms RESPONSE TIME
60+ SERVICE INSTANCES
The platform was not scaling
Building a new search platform
Building a new search platform
Analysis
• Spot the oldest queries sent by the search API
• ↑Traffic for fresh listings
• All fields were stored
3 months CATALOG RETENTION
15 min HIGHLY REQUESTED LISTINGS
Building a new search platform
Looking for a strategy
• Keep only the last 3 months of listings
• Index only the queried fields
• Store only listing IDs
>100GB OLD CATALOG SIZE
<4GB NEW CATALOG SIZE
Solr was used as a key-value store, NOT as a full-text search engine
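The slimming strategy (3-month retention, only queried fields, only IDs kept for retrieval) could look roughly like the projection below. The field names, the 90-day approximation of "3 months", and the function name are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)                        # "last 3 months", approximated as 90 days
QUERIED_FIELDS = ("title", "category", "lat", "lon")  # hypothetical set of queried fields

def to_index_doc(listing, now):
    """Project a full listing onto a slim index document, or None if it is too old."""
    if now - listing["created_at"] > RETENTION:
        return None
    doc = {"id": listing["id"]}
    # Copy only the fields that queries actually use; everything else
    # stays in the catalog storage and is fetched later by ID.
    doc.update({f: listing[f] for f in QUERIED_FIELDS if f in listing})
    return doc

now = datetime(2019, 9, 10, tzinfo=timezone.utc)
recent = {
    "id": "l1", "title": "mountain bike", "category": "sports",
    "lat": 39.28, "lon": -76.69, "description": "barely used",
    "created_at": now - timedelta(days=10),
}
old = {**recent, "id": "l0", "created_at": now - timedelta(days=200)}

slim = to_index_doc(recent, now)
dropped = to_index_doc(old, now)
```

The search engine then returns only IDs, and the full listing is resolved elsewhere, which is what makes a separate catalog storage necessary.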
Building a new search platform
Drawing a plan
THE BAD
• Where to store all listings fields?
• Need a catalog storage (database)
• Need also a fast serving layer
• Near real-time indexing constraints
THE GOOD
• No more sharding (↓index size)
• Standalone Solr instances
• High bump in performance
Building a new search platform
Big Data to the rescue
• NRT pipeline to keep the listings catalog up-to-date
• Batch pipeline to fully rebuild the catalog
Building a new search platform
The new architecture
Self-healing
Building a new search platform
The Search indexer ETL
• Fetch Listings
• Enrich Listings
• Fetch Verticals Features
• Normalize Attributes
• Anonymize PII
• Store to DB
• Store to Fast Layer
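The indexer steps can be sketched as a chain of small functions. Everything here is hypothetical: the field names, the in-memory dicts standing in for the database and the fast layer, and the use of a one-way hash as a stand-in for the PII step (the deck does not say how PII is anonymized).

```python
import hashlib

def enrich(listing, vertical_features):
    """Attach vertical-specific features, looked up by category (hypothetical)."""
    return {**listing, "features": vertical_features.get(listing["category"], {})}

def normalize(listing):
    """Lowercase the title and collapse repeated whitespace."""
    return {**listing, "title": " ".join(listing["title"].lower().split())}

def anonymize(listing):
    """Replace the seller e-mail with a one-way hash (a stand-in for real PII handling)."""
    out = dict(listing)
    email = out.pop("seller_email", None)
    if email is not None:
        out["seller_hash"] = hashlib.sha256(email.encode("utf-8")).hexdigest()
    return out

def run_indexer(listings, vertical_features, database, fast_layer):
    """Run every listing through the steps and write it to both stores."""
    for raw in listings:
        doc = anonymize(normalize(enrich(raw, vertical_features)))
        database[doc["id"]] = doc
        fast_layer[doc["id"]] = doc

database, fast_layer = {}, {}
run_indexer(
    [{"id": "l1", "title": "  Mountain   BIKE ", "category": "sports",
      "seller_email": "someone@example.com"}],
    {"sports": {"has_wheels": True}},
    database,
    fast_layer,
)
```

Keeping each step a pure function makes it possible to reuse the same chain in both a near real-time path and a full batch rebuild.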
Building a new search platform
Search engine performance
Throughput Recovery time Latency
↑12x ↓8x 12’
Building a new search platform
Catalog performance
Latency
• Catalog (Fast layer): 16ms
• Catalog (Database): 40ms
• Worst case: 56ms
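One reading of these figures is that the worst case is a fast-layer miss followed by a database read (16 ms + 40 ms = 56 ms). Under that assumption, the lookup could be sketched as a read-through with backfill; the dicts and function name below are illustrative, not the actual serving code.

```python
def hydrate(listing_ids, fast_layer, database):
    """Resolve listing IDs to full listings: fast layer first, database on a miss."""
    listings, backfilled = [], []
    for listing_id in listing_ids:
        listing = fast_layer.get(listing_id)
        if listing is None:
            # Miss: pay the extra database round trip, then backfill the fast layer.
            listing = database.get(listing_id)
            if listing is not None:
                fast_layer[listing_id] = listing
                backfilled.append(listing_id)
        if listing is not None:
            listings.append(listing)
    return listings, backfilled

fast_layer = {"l1": {"id": "l1", "title": "mountain bike"}}
database = {
    "l1": {"id": "l1", "title": "mountain bike"},
    "l2": {"id": "l2", "title": "road bike"},
}
found, backfilled = hydrate(["l1", "l2", "l9"], fast_layer, database)
```

IDs missing from both stores (such as a listing deleted after indexing) are simply skipped.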
Gluing all the pieces
Building a new search platform
Search API redesign
[Diagram: Listings API, Search Library, Search API; x 1, x 1, x 1; IDs]
Building a new search platform
Search library - Scala to rule them all
• Wrap the search retrieval logic
• One request → Multiple parallel queries to Solr
• Non-blocking I/O with solrS, persistence drivers
• Seamless integration with Finagle framework
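The "one request, multiple parallel queries" idea can be illustrated with non-blocking calls. The deck's library is written in Scala; the sketch below uses Python's asyncio purely for illustration, and `query_solr` is a stand-in that sleeps instead of talking to Solr. Query names and IDs are invented.

```python
import asyncio

async def query_solr(name, term, delay_s, ids):
    """Stand-in for a non-blocking Solr query; sleeps instead of doing I/O."""
    await asyncio.sleep(delay_s)
    return name, ids

async def search(request):
    """Fan one incoming request out into several concurrent queries and merge the IDs."""
    term = request["q"]
    results = await asyncio.gather(
        query_solr("nearby", term, 0.03, ["l1", "l2"]),
        query_solr("fresh", term, 0.02, ["l2", "l3"]),
        query_solr("popular", term, 0.01, ["l4"]),
    )
    # gather() returns results in call order, so the merge is deterministic.
    merged = []
    for _, ids in results:
        for listing_id in ids:
            if listing_id not in merged:
                merged.append(listing_id)
    return merged

ids = asyncio.run(search({"q": "bike"}))
```

Because the queries overlap in time, the total wait is close to the slowest query rather than the sum of all three.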
Building a new search platform
Search API - Scala to rule them all
• Based on Finagle services framework
• Finatra/Finagle = ↑concurrency & ↓resources
• Enable backend driven A/B testing
• Personalized search
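Backend-driven A/B testing needs a stable assignment of users to variants. A common approach, shown here as an assumption rather than as letgo's method, is to hash the experiment name together with the user ID; the names below are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

# The same user always lands in the same variant of a given experiment.
variant = assign_variant("user-42", "ltr-ranking")
same_again = assign_variant("user-42", "ltr-ranking")

# Across many users, both variants are populated.
seen = {assign_variant(f"user-{i}", "ltr-ranking") for i in range(1000)}
```

Including the experiment name in the hash keeps assignments independent between experiments, so a user in "treatment" for one test is not automatically in "treatment" for the next.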
Building a new search platform
Overall performance
↑Throughput & ↓Latency | Resources | Cost Reduction
13x 100x ↓20x
Enabling data science
SEARCH & RELEVANCE
Enabling data science
Unlocked data science projects
• Recall
– Query expansion
• Precision
– Learning to Rank
Enabling data science
Improving recall - Query expansion
Searching for: ‘mountain bike’
Similar Queries → Cause
blue mountain bicycle → Synonyms
mountain and road bike → OK
mountain bike frame → Relevant?
bicicleta de montaña → Language
scout montain bike → Spelling
mountain bike lock → Relevant?
[Slide graphic: Expected Behavior for each similar query]
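Two of the causes listed above, synonyms and spelling, can be handled with a simple dictionary-based expansion. This is a toy sketch: the dictionaries and function names are invented, and the deck does not describe how letgo builds its expansions.

```python
SYNONYMS = {"bike": ["bicycle"]}     # hypothetical synonym dictionary
SPELLING = {"montain": "mountain"}   # hypothetical misspelling map

def normalize_query(query):
    """Lowercase the query and fix known misspellings token by token."""
    return " ".join(SPELLING.get(tok, tok) for tok in query.lower().split())

def expand(query):
    """Return the normalized query plus variants with one token swapped for a synonym."""
    tokens = normalize_query(query).split()
    variants = [" ".join(tokens)]
    for i, tok in enumerate(tokens):
        for syn in SYNONYMS.get(tok, []):
            variants.append(" ".join(tokens[:i] + [syn] + tokens[i + 1:]))
    return variants

expanded = expand("montain bike")
```

Expanding this way raises recall; cases such as "mountain bike frame" or "mountain bike lock" show why the expanded set still needs a relevance step afterwards.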
Enabling data science
Improving precision - Learning to Rank
[Slide graphic: items retrieved for ‘mountain bike’]
Enabling data science
Improving precision - Learning to Rank
[Slide graphic: ‘mountain bike’ results annotated "2 months ago" and "30 miles"]
Enabling data science
Improving precision - Learning to Rank
Enabling data science
Improving precision - Learning to Rank
[Slide graphic: results 1 to 6 for the query ‘bike’, with labels y = 0, y = 0, y = 1]
Enabling data science
Improving precision - Learning to Rank
[Slide graphic: conversions on query ‘bike’]
Enabling data science
Improving precision - Learning to Rank
[Slide graphic: search conversions, before vs. after]
Text score = indicator of relevance.
Freshness and distance are key!
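A toy illustration of learning a ranking from conversion labels (y = 1 if the result converted). The deck does not say which model is used; the pointwise logistic regression, the feature transforms, and the training rows below are all assumptions chosen to keep the sketch small.

```python
import math

def features(text_score, age_days, distance_miles):
    """Map raw signals to [0, 1] features; the decay constants are arbitrary."""
    freshness = 1.0 / (1.0 + age_days / 30.0)
    proximity = 1.0 / (1.0 + distance_miles / 10.0)
    return [text_score, freshness, proximity]

# Invented training rows: (text_score, age_days, distance_miles, y)
TRAINING = [
    (0.80, 60, 30, 0), (0.80, 1, 2, 1), (0.70, 45, 25, 0),
    (0.90, 3, 1, 1), (0.75, 90, 40, 0), (0.70, 2, 5, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, epochs=2000, lr=0.1):
    """Fit a pointwise logistic-regression ranker with batch gradient ascent."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for text_score, age, dist, y in rows:
            x = features(text_score, age, dist)
            error = y - sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w + lr * g for w, g in zip(weights, grad_w)]
        bias += lr * grad_b
    return weights, bias

def score(model, text_score, age_days, distance_miles):
    """Predicted conversion probability, used as the ranking score."""
    weights, bias = model
    x = features(text_score, age_days, distance_miles)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

model = train(TRAINING)
old_far = score(model, 0.90, 60, 30)     # strong text match, 2 months old, 30 miles away
fresh_near = score(model, 0.70, 1, 3)    # weaker text match, posted yesterday, 3 miles away
```

On this invented data the fresh, nearby listing outranks the older, distant one despite its lower text score, which mirrors the slide's point that text score alone is not enough.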
The future of the search platform
Future of search platform
Work in progress
• Migration to Solr 8 (↓latency & better security)
• Iterate Learning to Rank
• Real-time personalization
• Visual categorization (Reveal)
Conclusions
Conclusions
Raising the bar
• Indexer pipeline enables data enrichment & transformations
• Simplified search architecture with lightweight in-memory indices
• Fault-tolerant and self-healing infrastructure and processes
• Unlock real data science in
Tech Stack
Airflow Redshift
THANK YOU
roger.rafanell@letgo.com
https://www.linkedin.com/in/rogerrafanell
https://we.letgo.com/careers

(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 

Activate 2019 - Search and relevance at scale for online classifieds

  • 1. SEARCH AND RELEVANCE AT SCALE FOR ONLINE CLASSIFIEDS
  • 3. ROGER RAFANELL Senior Big Data Engineer | letgo About me
  • 4. ?
  • 5. letgo • Second-hand marketplace app • Founded 2015 • Main markets: US & Turkey • 5M downloads/month & 20M MAU
  • 6. Agenda • Introduction to search in classifieds • Search in the past • Building a new search platform at scale • Enabling data science • The future of search platform
  • 8. Introduction Search in classifieds 100K items/day Reposted We cannot cache results!!! Sold / Deleted
  • 9. Introduction Hyperlocality search letgo catalog at (39.89,-77.08) letgo catalog at (39.28, -76.69) Results in Washington, D.C. ≠ Results in Baltimore
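Slide 9's point, that two nearby cities see different catalogs, can be illustrated with a distance filter. The following is a minimal sketch and not letgo's implementation: the function names, the sample listings and the 30 km radius are assumptions made for illustration, and only the two coordinates come from the slide.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(listings, lat, lon, radius_km):
    """Keep only the listings located within radius_km of the user."""
    return [item for item in listings
            if haversine_km(lat, lon, item["lat"], item["lon"]) <= radius_km]


# The two catalog positions shown on the slide.
POINT_A = (39.89, -77.08)
POINT_B = (39.28, -76.69)

# Hypothetical listings, one at each position.
LISTINGS = [
    {"id": "a", "lat": POINT_A[0], "lon": POINT_A[1]},
    {"id": "b", "lat": POINT_B[0], "lon": POINT_B[1]},
]

# Hypothetical 30 km search radius around each user position.
near_a = within_radius(LISTINGS, POINT_A[0], POINT_A[1], 30.0)
near_b = within_radius(LISTINGS, POINT_B[0], POINT_B[1], 30.0)
```

With these inputs the same query issued from each position returns a disjoint set of listings, which is why a result cached for one location cannot be served at the other.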
  • 10. Introduction User input data example • Typos • Slang • Poor pictures • Wrong information • Ambiguity • Weirdness
  • 11. Introduction Explorers vs Deal Seekers (Marco Polo vs Hernán Cortés) • Like browsing • Recall > Precision • “Cars” • Search, filter, haggle • Precision > Recall • “2015 honda civic lx”
  • 13. Search in the past Early 2015 Listings API Search API
  • 14. Search in the past Late 2017 [Architecture diagram: Listings API, Search API, Shard 1 ... Shard 5, 8 replicas / shard, ↑nodes]
  • 15. Operation limitations • Slow full catalog imports • Slow reactivity to traffic spikes • High costs. Key figures: full indexation 24 h, instance recovery 8 h, response time 150 ms
  • 16. Business limitations • No enrichment at import time • Not easy to evolve schemas • Not agile! No A/B testing, no data science
  • 17. Search API limitations • One API request -> One search query • PHP + Solarium (↓ concurrency) • High costs. Search API key figures: throughput 200 rps, response time 400 ms, 60+ service instances
  • 18. The platform was not scaling
  • 20. Building a new search platform Analysis • Spot oldest queries sent by search API • ↑Traffic for fresh listings • All fields were stored. Key figures: catalog retention 3 months, highly requested listings 15 min
  • 21. Building a new search platform Looking for a strategy • Keep only the last 3 months of listings • Index only the queried fields • Store only listing IDs. Old catalog size: >100 GB; new catalog size: <4 GB
  • 22. Solr was used as a key-value storage NOT as a full-text search engine
  • 23. Building a new search platform Drawing a plan THE BAD • Where to store all listings fields? • Need a catalog storage (database) • Need also a fast serving layer • Near real-time indexing constraints THE GOOD • No more sharding (↓index size) • Standalone Solr instances • High bump in performance
  • 24. Building a new search platform Big Data to the rescue • NRT pipeline to keep the listings catalog up-to-date • Batch pipeline to fully rebuild the catalog
  • 25. Building a new search platform The new architecture Self-healing
  • 26. Building a new search platform The Search indexer ETL • Fetch Listings • Enrich Listings • Fetch Verticals Features • Normalize Attributes • Anonymize PII • Store to DB • Store to Fast Layer
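The indexer steps on slide 26 can be read as a chain of small transformations followed by two writes. The sketch below is only an illustration under stated assumptions: the field names (`category`, `title`, `seller_email`), the vertical features table and the dict-backed stores are hypothetical, and hashing the e-mail is a placeholder for whatever PII handling the real pipeline applies.

```python
import hashlib


def enrich(listing, vertical_features):
    """Attach the features of the listing's vertical (hypothetical lookup table)."""
    extra = vertical_features.get(listing.get("category"), {})
    return {**listing, "features": extra}


def normalize(listing):
    """Lower-case the title and collapse repeated whitespace."""
    out = dict(listing)
    out["title"] = " ".join(listing["title"].lower().split())
    return out


def anonymize(listing):
    """Replace the seller e-mail with a hash (placeholder PII handling)."""
    out = dict(listing)
    if "seller_email" in out:
        raw = out["seller_email"].encode("utf-8")
        out["seller_email"] = hashlib.sha256(raw).hexdigest()
    return out


def run_indexer(listings, vertical_features, database, fast_layer):
    """Run every fetched listing through the steps and write it to both stores."""
    for raw in listings:
        doc = anonymize(normalize(enrich(raw, vertical_features)))
        database[doc["id"]] = doc    # Store to DB
        fast_layer[doc["id"]] = doc  # Store to Fast Layer
    return len(database)


# Hypothetical inputs standing in for the fetch steps.
fetched = [{"id": 1, "title": "  Mountain   BIKE ", "category": "bikes",
            "seller_email": "seller@example.com"}]
features = {"bikes": {"has_wheels": True}}
db, cache = {}, {}
count = run_indexer(fetched, features, db, cache)
```

Keeping each step a pure function of the listing makes it possible to reuse the same code in a near-real-time path and in a full batch rebuild.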
  • 27. Building a new search platform Search engine performance: throughput ↑12x, latency ↓8x, recovery time 12’
  • 28. Building a new search platform Catalog performance (latency): Catalog (Fast layer) 16ms, Catalog (Database) 40ms, Worst Case 56ms
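One plausible reading of the catalog numbers on slide 28 is a read-through lookup: the search engine returns listing IDs, the fast layer is asked first, and only the misses go to the database, so the worst case is roughly the sum of both lookups (16 ms + 40 ms = 56 ms). The sketch below assumes that reading. The function name, the dict-backed stores and the backfill of the fast layer on a miss are assumptions, not details confirmed by the slides.

```python
def fetch_listings(ids, fast_layer, database):
    """Resolve listing IDs to documents, preferring the fast layer."""
    found, misses = {}, []
    for listing_id in ids:
        doc = fast_layer.get(listing_id)
        if doc is None:
            misses.append(listing_id)
        else:
            found[listing_id] = doc
    # Worst case: a miss pays for the fast-layer lookup and the database lookup.
    for listing_id in misses:
        doc = database.get(listing_id)
        if doc is not None:
            found[listing_id] = doc
            fast_layer[listing_id] = doc  # assumption: backfill the fast layer
    # Preserve the ranking order returned by the search engine.
    return [found[i] for i in ids if i in found]


# Hypothetical stores: listing 2 is only in the database, listing 9 is gone.
cache = {1: {"id": 1, "title": "bike"}}
db = {1: {"id": 1, "title": "bike"}, 2: {"id": 2, "title": "lamp"}}
docs = fetch_listings([2, 9, 1], cache, db)
```

IDs that resolve nowhere are dropped silently here, which matches the situation where a listing was sold or deleted after it was indexed.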
  • 29. Gluing all the pieces
  • 30. Building a new search platform Search API redesign [Architecture diagram: Listings API, Search Library, Search API, IDs]
  • 31. Building a new search platform Search library - Scala to rule them all • Wrap the search retrieval logic • One request → Multiple parallel queries to Solr • Non-blocking I/O with solrS, persistence drivers • Seamless integration with Finagle framework
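The "one request → multiple parallel queries" idea from slide 31 can be sketched with non-blocking calls that are awaited together. The talk describes a Scala library on top of Finagle and solrS; the example below swaps in Python's `asyncio` purely for illustration, and `run_query`, the query names and the simulated delays are hypothetical stand-ins for real Solr calls.

```python
import asyncio


async def run_query(name, delay_s):
    """Stand-in for one non-blocking search query; returns listing IDs."""
    await asyncio.sleep(delay_s)  # simulates network and engine time
    return [f"{name}-1", f"{name}-2", "shared-1"]


async def search(queries):
    """Fan one API request out into several queries that run concurrently."""
    results = await asyncio.gather(
        *(run_query(name, delay) for name, delay in queries.items())
    )
    # gather keeps the submission order, so merging is deterministic.
    merged, seen = [], set()
    for ids in results:
        for listing_id in ids:
            if listing_id not in seen:
                seen.add(listing_id)
                merged.append(listing_id)
    return merged


# Hypothetical query mix for a single incoming request.
QUERIES = {"text": 0.02, "fresh": 0.01}
merged_ids = asyncio.run(search(QUERIES))
```

Because the queries wait concurrently, the request takes about as long as the slowest query rather than the sum of all of them.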
  • 32. Building a new search platform Search API - Scala to rule them all • Based on Finagle services framework • Finatra/Finagle = ↑concurrency & ↓resources • Enable backend driven A/B testing • Personalized search
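Backend-driven A/B testing, mentioned on slide 32, usually needs a stable way to assign each user to a variant on the server. The slides do not say how letgo does this; the sketch below shows one common approach, hashing the experiment name together with the user ID. The function name, the variant labels and the experiment name are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Hypothetical experiment name; the same user always lands in the same variant.
first = assign_variant("user-42", "ranking-v2")
second = assign_variant("user-42", "ranking-v2")
```

Salting the hash with the experiment name keeps assignments independent across experiments, so a user in the treatment group of one test is not systematically in the treatment group of the next.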
  • 33. Building a new search platform Overall performance ↑Throughput & ↓Latency, Resources, Cost Reduction: 13x, 100x, ↓20x
  • 35. Enabling data science Unlocked data science projects • Recall – Query expansion • Precision – Learning to Rank
  • 36. Enabling data science Improving recall - Query expansion Searching for: ‘mountain bike’ Table columns: Similar Queries, Cause, Expected Behavior • blue mountain bicycle → Synonyms • mountain and road bike → OK • mountain bike frame → Relevant? • bicicleta de montaña → Language • scout montain bike → Spelling • mountain bike lock → Relevant?
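The synonym case on slide 36 ("bike" also matching "bicycle") is the simplest form of query expansion. The sketch below builds a boolean query string from a synonym table. The table contents and the `expand_query` helper are hypothetical, and the talk does not describe how letgo derives or applies its expansions.

```python
def expand_query(query, synonyms):
    """Rewrite each term as an OR group of the term and its synonyms."""
    groups = []
    for term in query.lower().split():
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(alternatives) > 1:
            groups.append("(" + " OR ".join(alternatives) + ")")
        else:
            groups.append(term)
    return " AND ".join(groups)


# Hypothetical synonym table.
SYNONYMS = {"bike": ["bicycle"]}
expanded = expand_query("mountain bike", SYNONYMS)
# expanded == "mountain AND (bike OR bicycle)"
```

Requiring every original term (the `AND`) keeps precision from collapsing: "mountain bike lock" still matches, which is exactly the kind of result the slide marks as "Relevant?".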
  • 37. Enabling data science Improving precision - Learning to Rank ‘mountain bike’ Items retrieved
  • 38. Enabling data science Improving precision - Learning to Rank ‘mountain bike’: 2 months ago, 30 miles
  • 39. Enabling data science Improving precision - Learning to Rank
  • 40. Enabling data science Improving precision - Learning to Rank Query ‘bike’, results at positions 1 2 3 4 5 6, labels y = 0, y = 0, y = 1
  • 41. Enabling data science Improving precision - Learning to Rank Conversions on query ‘bike’
  • 42. Enabling data science Improving precision - Learning to Rank Search conversions: before vs. after
  • 43. Text score = indicator of relevance. Freshness and distance are key!
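Slide 43 names three ranking signals: text score, freshness and distance. A hand-written linear score over those signals gives a feel for what a Learning to Rank model works with. Everything numeric below is an assumption (the weights, the decay constants, the sample items). A real model would learn the weights from conversion labels such as the y = 0 / y = 1 values on slide 40.

```python
import math

# Hypothetical weights; a Learning to Rank model would learn these from data.
WEIGHTS = {"text": 1.0, "freshness": 2.0, "proximity": 2.0}


def features(item):
    """Turn raw listing attributes into bounded ranking features."""
    return {
        "text": item["text_score"],
        "freshness": math.exp(-item["age_days"] / 30.0),        # decays with age
        "proximity": math.exp(-item["distance_miles"] / 10.0),  # decays with distance
    }


def score(item, weights=WEIGHTS):
    """Weighted sum of the features."""
    feats = features(item)
    return sum(weights[name] * feats[name] for name in weights)


def rerank(items, weights=WEIGHTS):
    """Order retrieved items by descending score."""
    return sorted(items, key=lambda item: score(item, weights), reverse=True)


# Hypothetical results for 'mountain bike'.
ITEMS = [
    {"id": "old-far", "text_score": 0.9, "age_days": 60, "distance_miles": 30},
    {"id": "new-near", "text_score": 0.7, "age_days": 1, "distance_miles": 2},
]
ranked = rerank(ITEMS)
```

With these weights the fresh, nearby listing outranks the older, more distant one despite its lower text score, which is the behaviour the slide argues for.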
  • 45. Future of search platform Work in progress • Migration to Solr 8 (↓latency & better security) • Iterate Learning to Rank • Real-time personalization • Visual categorization (Reveal)
  • 46.
  • 48. Conclusions Raising the bar • Indexer pipeline enables data enrichment & transformations • Simplified search architecture with lightweight in-memory indices • Fault-tolerant and self-healing infrastructure and processes • Unlock real data science in