Airbnb Search Architecture 
Maxim Charkov, Engineering Manager 
maxim.charkov@airbnb.com, @mcharkov
Airbnb 
Total Guests 
20,000,000+ 
Countries 
190 
Cities 
34,000+ 
Castles 
600+ 
Listings Worldwide 
800,000+
Search 
www.airbnb.com
Booking Model 
Search → Contact → Accept → Book
Search Backend 
Technical Stack 
____________________________ 
DropWizard as a service framework (incl. Jetty, Jersey, Jackson) 
Guice dependency injection framework, Guava libraries, etc. 
ZooKeeper (via SmartStack) for service discovery 
Lucene for index storage and simple retrieval 
In-house-built real-time indexing, ranking, and advanced filtering
Search Backend 
~150 search threads 
4 indexing threads 
Data maintained by indexers: 
Inverted Lucene index for retrieval 
Forward index for ranking signals 
Relevance models 
JVM
Indexing 
What’s in the Lucene index? 
____________________________ 
Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy) 
Categorical and numerical properties like room type and maximum occupancy 
Calendar information 
Full text (descriptions, reviews, etc.) 
~40 fields per listing from a variety of data sources, all updated in real time
Indexing 
Challenges 
____________________________ 
Bootstrap (creating the index from scratch) 
Ensuring consistency of the index with ground truth data in real time
Indexing 
master / calendar / fraud (MySQL) → SpinalTap → Medusa + PersistentStorage → Search1, Search2, …, SearchN
Indexing 
SpinalTap 
____________________________ 
Responsible for detecting updates happening to the ground truth data 
(no need to maintain search index invalidation logic in application code) 
Tails binary update logs from MySQL servers (5.6+) 
Converts them into actionable data objects, called “Mutations” 
Broadcasts using a distributed queue, like Kafka or RabbitMQ
Indexing 
# sources for mysql binary logs 
sources: 
  - name    : airslave 
    host    : localhost 
    port    : 11 
    user    : spinaltap 
    password: spinaltap 
  - name    : calendar_db 
    host    : localhost 
    port    : 11 
    user    : spinaltap 
    password: spinaltap 

destinations: 
  - name      : kafka 
    clazzName : com.airbnb.spinaltap.destination.kafka.KafkaDestination 

pipes: 
  - name        : search 
    sources     : ["airslave", "calendar_db"] 
    tables      : ["production:listings", "calendar_db:schedule2s"] 
    destination : kafka 
SpinalTap Pipes 
____________________________ 
Each pipe connects one or more binlog sources (MySQL) with a 
destination (e.g. Kafka) 
Configured via YAML files
Indexing 
{ 
  "seq"       : 3, 
  "binlogpos" : "mysql-bin.000002:5217:5273", 
  "id"        : -1857589909002862756, 
  "type"      : 2, 
  "table"     : { 
    "id"      : 70, 
    "name"    : "users", 
    "db"      : "my_db", 
    "columns" : [ 
      { "name" : "name", "type" : 15, "ispk" : false }, 
      { "name" : "age",  "type" : 2,  "ispk" : false } 
    ] 
  }, 
  "rows" : [ { 
    "1" : { "name" : "eric", "age" : 31 }, 
    "2" : { "name" : "eric", "age" : 28 } 
  } ] 
} 
SpinalTap Mutations 
____________________________ 
Each binlog entry is parsed and converted into one of three 
event types: “Insert”, “Delete” or “Update” 
“Insert” and “Delete” carry the entire row to be inserted or 
deleted 
“Update” mutations contain both the old and the current row 
Additional information: unique id, sequence number, column 
and table metadata
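The mutation handling described above can be sketched in a few lines. This is an illustrative consumer-side model, not SpinalTap's actual API: the `Mutation` class, `Type` enum, and `apply` method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of applying the three mutation types to an
// in-memory copy of a table keyed by primary key.
public class MutationSketch {
    enum Type { INSERT, UPDATE, DELETE }

    static class Mutation {
        final Type type;
        final long rowId;
        final Map<String, Object> oldRow;  // present for UPDATE and DELETE
        final Map<String, Object> newRow;  // present for INSERT and UPDATE
        Mutation(Type type, long rowId, Map<String, Object> oldRow, Map<String, Object> newRow) {
            this.type = type; this.rowId = rowId; this.oldRow = oldRow; this.newRow = newRow;
        }
    }

    // Apply one mutation to a materialized view of the table.
    static void apply(Map<Long, Map<String, Object>> table, Mutation m) {
        switch (m.type) {
            case INSERT:
            case UPDATE:
                // The old row (for UPDATE) is still available to the caller
                // for invalidation logic; the view just keeps the new row.
                table.put(m.rowId, m.newRow);
                break;
            case DELETE:
                table.remove(m.rowId);
                break;
        }
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Object>> users = new HashMap<>();
        apply(users, new Mutation(Type.INSERT, 1L, null, Map.of("name", "eric", "age", 31)));
        apply(users, new Mutation(Type.UPDATE, 1L, Map.of("name", "eric", "age", 31),
                                                   Map.of("name", "eric", "age", 28)));
        System.out.println(users.get(1L).get("age")); // prints 28
    }
}
```

Because "Insert" and "Delete" carry the full row and "Update" carries both versions, a consumer like this never has to query MySQL to stay consistent.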
Indexing 
Medusa 
____________________________ 
Documents in index contain data from ~15 different source tables 
Lucene needs a copy of all fields (not just fields that changed) to update the index 
We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL
Indexing 
Reads from SpinalTap or directly from MySQL 
Data from multiple tables is joined into Thrift objects, 
which correspond to Lucene documents 
The intermediate Thrift objects are persisted in Redis 
As changes are detected, updated objects are pushed 
to the Search instances to update Lucene indexes 
Can bootstrap the entire index in 3 minutes via 
multithreaded streaming 
Leader election via ZooKeeper 
Medusa + PersistentStorage → Search1, Search2, …, SearchN
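The join-and-push step can be sketched as follows. This is a minimal illustration under stated assumptions, not Medusa's actual code: the class and method names are hypothetical, a plain map stands in for Redis, and Thrift serialization is elided.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of Medusa's join step: rows from several source
// tables are merged into one document per listing, and a document is
// pushed to the search instances only when its content actually changed.
public class JoinSketch {
    // Merge per-table rows for one listing into a single flat document.
    static Map<String, Object> buildDocument(long listingId, Map<String, Map<String, Object>> tableRows) {
        Map<String, Object> doc = new TreeMap<>();
        doc.put("listing_id", listingId);
        // Prefix each field with its source table to avoid name collisions.
        tableRows.forEach((table, row) ->
            row.forEach((field, value) -> doc.put(table + "." + field, value)));
        return doc;
    }

    // Store the new document; returns true (i.e. "push to search") only
    // if it differs from the cached copy. In Medusa the cache is Redis.
    static boolean upsert(Map<Long, Map<String, Object>> cache, long id, Map<String, Object> doc) {
        Map<String, Object> prev = cache.put(id, doc);
        return !doc.equals(prev);
    }
}
```

Persisting the joined intermediate objects is what makes the fast bootstrap possible: a rebuild streams them from the cache instead of re-joining ~15 tables in MySQL.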
Ranking 
Ranking Problem 
____________________________ 
Not a text search problem 
Users are almost never searching for a specific item, rather they’re looking to “Discover” 
The most common component of a query is location 
Highly personalized – the user is a part of the query 
Optimizing for conversion (Search -> Inquiry -> Booking) 
Evolution through continuous experimentation
Ranking 
Ranking Components 
____________________________ 
Relevance 
Quality 
Bookability 
Personalization 
Desirability of location 
New host promotion 
etc.
Ranking 
Several hundred signals determining search ranking: 
Properties of the listing (reviews, location, etc.) 
Behavioral signals (mined from request logs) 
Image quality and clickability (computer vision) 
Host behavior (response time/rate, cancellations, etc.) 
Host preferences model 
Signal sources: DB snapshots, logs
Ranking 
public void attemptLoadData() { 
  DateTime remoteTs = dataLoader.getModTime(pathToSignals); 

  if (currentTs == null || remoteTs.isAfter(currentTs)) { 
    Map<K, D> newSignals = loadData(); 
    if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) { 
      synchronized (this) { 
        signalsMap = newSignals; 
        currentTs = remoteTs; 
        this.notifyAll(); 
      } 
    } else { 
      LOG.severe("Failed to load the avro file: " + pathToSignals); 
    } 
  } 
} 

… 

ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader = 
    loaders.get(LoaderCollection.Loader.QualitySignals); 
final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true); 
Loading Signals 
____________________________ 
Storing signals in a separate data structure 
Pros: 
Good fit for this type of update pattern: not real-time, but 
almost everything changes on each load 
No need for costly Lucene index rebuild 
Greatly simplifies design 
Cons: 
Unable to use Lucene retrieval on such data
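The core of the pattern above — load a complete replacement, validate it, then swap it in atomically so readers never see a half-updated map — can be shown without the surrounding loader machinery. A minimal sketch with hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the swap-on-load pattern: signals are reloaded wholesale and
// published with a single atomic reference swap. Search threads read the
// current snapshot without locks; only publication is synchronized by
// the AtomicReference. Names here are illustrative.
public class SignalStore<K, V> {
    private final AtomicReference<Map<K, V>> signals = new AtomicReference<>(Map.of());

    // Readers always see a complete, immutable snapshot.
    public V get(K key) {
        return signals.get().get(key);
    }

    // Called by the loader thread after a successful, validated load.
    public void publish(Map<K, V> fresh) {
        signals.set(Map.copyOf(fresh));
    }
}
```

This fits the update pattern on the slide: since almost every value changes on each load, replacing the whole structure is simpler and cheaper than incremental updates or a Lucene index rebuild.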
Life of a Query 
Query Understanding: external calls (geocoding), configuring retrieval options, choosing ranking models 
Retrieval: Populator and Scorer → 2000 results 
Filtering and Reranking: Quality, Bookability, Relevance → 25 results 
Third Pass Ranking: Pricing Service, Social Connections → 25 results 
Result Generation, AirEvents Logging
Ranking 
Second Pass Ranking 
____________________________ 
Traditional ranking scores each result independently: 

    r_i = f(query, user, listing_i), then sort by r_i 

In contrast, second pass ranking operates on the entire list at once: 

    (r_1, …, r_n) = f(query, user, listing_1, …, listing_n) 

Makes it possible to implement features like result diversity, etc.
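As one concrete (hypothetical) example of what listwise scoring enables, a second pass can greedily re-rank so that one neighborhood does not dominate the page. The scores, penalty, and neighborhood field below are illustrative assumptions, not Airbnb's actual model:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative listwise re-rank for diversity: repeatedly pick the
// highest-scored remaining listing, discounting listings from
// neighborhoods that were already selected.
public class DiversityRerank {
    static class Listing {
        final String id; final String neighborhood; final double score;
        Listing(String id, String neighborhood, double score) {
            this.id = id; this.neighborhood = neighborhood; this.score = score;
        }
    }

    static List<Listing> rerank(List<Listing> candidates, double penalty) {
        List<Listing> remaining = new ArrayList<>(candidates);
        List<Listing> result = new ArrayList<>();
        Map<String, Integer> picked = new HashMap<>();  // neighborhood -> count selected
        while (!remaining.isEmpty()) {
            Listing best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Listing l : remaining) {
                // Discount by how many same-neighborhood listings were already chosen.
                double adjusted = l.score - penalty * picked.getOrDefault(l.neighborhood, 0);
                if (adjusted > bestScore) { bestScore = adjusted; best = l; }
            }
            remaining.remove(best);
            result.add(best);
            picked.merge(best.neighborhood, 1, Integer::sum);
        }
        return result;
    }
}
```

A pointwise sort by r_i cannot express this, because each listing's final position depends on which listings were ranked above it.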
Outside the scope of this talk 
____________________________ 
Ranking models 
Machine Learning infrastructure 
Tools (loadtest, deploy, etc.) 
Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.