Airbnb Search Architecture 
Maxim Charkov, Engineering Manager 
maxim.charkov@airbnb.com, @mcharkov
Airbnb 
Total Guests 
20,000,000+ 
Countries 
190 
Cities 
34,000+ 
Castles 
600+ 
Listings Worldwide 
800,000+
Search 
www.airbnb.com
Booking Model 
Search → Contact → Accept → Book
Search Backend 
Technical Stack 
____________________________ 
DropWizard as a service framework (incl. Jetty, Jersey, Jackson) 
Guice dependency injection framework, Guava libraries, etc. 
ZooKeeper (via SmartStack) for service discovery 
Lucene for index storage and simple retrieval 
In-house-built real-time indexing, ranking, and advanced filtering
Search Backend 
~150 search threads 
4 indexing threads 
Data maintained by indexers: 
Inverted Lucene index for retrieval 
Forward index for ranking signals 
Relevance models 
JVM
Indexing 
What’s in the Lucene index? 
____________________________ 
Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy) 
Categorical and numerical properties like room type and maximum occupancy 
Calendar information 
Full text (descriptions, reviews, etc.) 
~40 fields per listing from a variety of data sources, all updated in real time
Indexing 
Challenges 
____________________________ 
Bootstrap (creating the index from scratch) 
Ensuring consistency of the index with ground truth data in real time
Indexing 
master / calendar / fraud (MySQL) → SpinalTap → Medusa + PersistentStorage → Search1, Search2, …, SearchN
Indexing 
SpinalTap 
____________________________ 
Responsible for detecting updates happening to the ground truth data 
(no need to maintain search index invalidation logic in application code) 
Tails binary update logs from MySQL servers (5.6+) 
Converts them into actionable data objects, called “Mutations” 
Broadcasts using a distributed queue, like Kafka or RabbitMQ
Indexing 
# sources for mysql binary logs 
sources: 
  - name    : airslave 
    host    : localhost 
    port    : 11 
    user    : spinaltap 
    password: spinaltap 
  - name    : calendar_db 
    host    : localhost 
    port    : 11 
    user    : spinaltap 
    password: spinaltap 

destinations: 
  - name      : kafka 
    clazzName : com.airbnb.spinaltap.destination.kafka.KafkaDestination 

pipes: 
  - name        : search 
    sources     : ["airslave", "calendar_db"] 
    tables      : ["production:listings", "calendar_db:schedule2s"] 
    destination : kafka 
SpinalTap Pipes 
____________________________ 
Each pipe connects one or more binlog sources (MySQL) with a 
destination (e.g. Kafka) 
Configured via YAML files
Indexing 
{ 
  "seq"       : 3, 
  "binlogpos" : "mysql-bin.000002:5217:5273", 
  "id"        : -1857589909002862756, 
  "type"      : 2, 
  "table"     : { 
    "id"      : 70, 
    "name"    : "users", 
    "db"      : "my_db", 
    "columns" : [ 
      { "name" : "name", "type" : 15, "ispk" : false }, 
      { "name" : "age",  "type" : 2,  "ispk" : false } 
    ] 
  }, 
  "rows" : [ { 
    "1" : { "name" : "eric", "age" : 31 }, 
    "2" : { "name" : "eric", "age" : 28 } 
  } ] 
} 
SpinalTap Mutations 
____________________________ 
Each binlog entry is parsed and converted into one of three 
event types: “Insert”, “Delete” or “Update” 
“Insert” and “Delete” carry the entire row to be inserted or 
deleted 
“Update” mutations contain both the old and the current row 
Additional information: unique id, sequence number, column 
and table metadata
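The mutation handling described above can be sketched in a few lines. This is an illustrative consumer-side model, not SpinalTap's actual API: the `Mutation` class, `Type` enum, and `apply` method are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of applying the three mutation types to an
// in-memory copy of a table keyed by primary key.
public class MutationSketch {
    enum Type { INSERT, UPDATE, DELETE }

    static class Mutation {
        final Type type;
        final long rowId;
        final Map<String, Object> oldRow;  // present for UPDATE and DELETE
        final Map<String, Object> newRow;  // present for INSERT and UPDATE
        Mutation(Type type, long rowId, Map<String, Object> oldRow, Map<String, Object> newRow) {
            this.type = type; this.rowId = rowId; this.oldRow = oldRow; this.newRow = newRow;
        }
    }

    // Apply one mutation to a materialized view of the table.
    static void apply(Map<Long, Map<String, Object>> table, Mutation m) {
        switch (m.type) {
            case INSERT:
            case UPDATE:
                // The old row (for UPDATE) is still available to the caller
                // for invalidation logic; the view just keeps the new row.
                table.put(m.rowId, m.newRow);
                break;
            case DELETE:
                table.remove(m.rowId);
                break;
        }
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Object>> users = new HashMap<>();
        apply(users, new Mutation(Type.INSERT, 1L, null, Map.of("name", "eric", "age", 31)));
        apply(users, new Mutation(Type.UPDATE, 1L, Map.of("name", "eric", "age", 31),
                                                   Map.of("name", "eric", "age", 28)));
        System.out.println(users.get(1L).get("age")); // prints 28
    }
}
```

Because "Insert" and "Delete" carry the full row and "Update" carries both versions, a consumer like this never has to query MySQL to stay consistent.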
Indexing 
Medusa 
____________________________ 
Documents in index contain data from ~15 different source tables 
Lucene needs a copy of all fields (not just fields that changed) to update the index 
We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL
Indexing 
Reads from SpinalTap or directly from MySQL 
Data from multiple tables is joined into Thrift objects, 
which correspond to Lucene documents 
The intermediate Thrift objects are persisted in Redis 
As changes are detected, updated objects are pushed 
to the Search instances to update Lucene indexes 
Can bootstrap the entire index in 3 minutes via 
multithreaded streaming 
Leader election via ZooKeeper 
Medusa + PersistentStorage → Search1, Search2, …, SearchN
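The join-and-push step can be sketched as follows. This is a minimal illustration under stated assumptions, not Medusa's actual code: the class and method names are hypothetical, a plain map stands in for Redis, and Thrift serialization is elided.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of Medusa's join step: rows from several source
// tables are merged into one document per listing, and a document is
// pushed to the search instances only when its content actually changed.
public class JoinSketch {
    // Merge per-table rows for one listing into a single flat document.
    static Map<String, Object> buildDocument(long listingId, Map<String, Map<String, Object>> tableRows) {
        Map<String, Object> doc = new TreeMap<>();
        doc.put("listing_id", listingId);
        // Prefix each field with its source table to avoid name collisions.
        tableRows.forEach((table, row) ->
            row.forEach((field, value) -> doc.put(table + "." + field, value)));
        return doc;
    }

    // Store the new document; returns true (i.e. "push to search") only
    // if it differs from the cached copy. In Medusa the cache is Redis.
    static boolean upsert(Map<Long, Map<String, Object>> cache, long id, Map<String, Object> doc) {
        Map<String, Object> prev = cache.put(id, doc);
        return !doc.equals(prev);
    }
}
```

Persisting the joined intermediate objects is what makes the fast bootstrap possible: a rebuild streams them from the cache instead of re-joining ~15 tables in MySQL.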
Ranking 
Ranking Problem 
____________________________ 
Not a text search problem 
Users are almost never searching for a specific item, rather they’re looking to “Discover” 
The most common component of a query is location 
Highly personalized – the user is a part of the query 
Optimizing for conversion (Search -> Inquiry -> Booking) 
Evolution through continuous experimentation
Ranking 
Ranking Components 
____________________________ 
Relevance 
Quality 
Bookability 
Personalization 
Desirability of location 
New host promotion 
etc.
Ranking 
Several hundred signals determining search ranking: 
Properties of the listing (reviews, location, etc.) 
Behavioral signals (mined from request logs) 
Image quality and clickability (computer vision) 
Host behavior (response time/rate, cancellations, etc.) 
Host preferences model 
Signal sources: DB snapshots, logs
Ranking 
public void attemptLoadData() { 
  DateTime remoteTs = dataLoader.getModTime(pathToSignals); 

  if (currentTs == null || remoteTs.isAfter(currentTs)) { 
    Map<K, D> newSignals = loadData(); 
    if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) { 
      synchronized (this) { 
        signalsMap = newSignals; 
        currentTs = remoteTs; 
        this.notifyAll(); 
      } 
    } else { 
      LOG.severe("Failed to load the avro file: " + pathToSignals); 
    } 
  } 
} 

… 

ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader = 
    loaders.get(LoaderCollection.Loader.QualitySignals); 
final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true); 
Loading Signals 
____________________________ 
Storing signals in a separate data structure 
Pros: 
Good fit for this type of update pattern: not real-time, but 
almost everything changes on each load 
No need for costly Lucene index rebuild 
Greatly simplifies design 
Cons: 
Unable to use Lucene retrieval on such data
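The core of the pattern above — load a complete replacement, validate it, then swap it in atomically so readers never see a half-updated map — can be shown without the surrounding loader machinery. A minimal sketch with hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the swap-on-load pattern: signals are reloaded wholesale and
// published with a single atomic reference swap. Search threads read the
// current snapshot without locks; only publication is synchronized by
// the AtomicReference. Names here are illustrative.
public class SignalStore<K, V> {
    private final AtomicReference<Map<K, V>> signals = new AtomicReference<>(Map.of());

    // Readers always see a complete, immutable snapshot.
    public V get(K key) {
        return signals.get().get(key);
    }

    // Called by the loader thread after a successful, validated load.
    public void publish(Map<K, V> fresh) {
        signals.set(Map.copyOf(fresh));
    }
}
```

This fits the update pattern on the slide: since almost every value changes on each load, replacing the whole structure is simpler and cheaper than incremental updates or a Lucene index rebuild.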
Life of a Query 
Query Understanding: external calls (geocoding), configuring retrieval options, choosing ranking models 
Retrieval: Populator and Scorer → 2000 results 
Filtering and Reranking: Quality, Bookability, Relevance → 25 results 
Third Pass Ranking: Pricing Service, Social Connections → 25 results 
Result Generation, AirEvents Logging
Ranking 
Second Pass Ranking 
____________________________ 
Traditional ranking scores each result independently: 

    r_i = f(query, user, listing_i), then sort by r_i 

In contrast, second pass ranking operates on the entire list at once: 

    (r_1, …, r_n) = f(query, user, listing_1, …, listing_n) 

Makes it possible to implement features like result diversity, etc.
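As one concrete (hypothetical) example of what listwise scoring enables, a second pass can greedily re-rank so that one neighborhood does not dominate the page. The scores, penalty, and neighborhood field below are illustrative assumptions, not Airbnb's actual model:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative listwise re-rank for diversity: repeatedly pick the
// highest-scored remaining listing, discounting listings from
// neighborhoods that were already selected.
public class DiversityRerank {
    static class Listing {
        final String id; final String neighborhood; final double score;
        Listing(String id, String neighborhood, double score) {
            this.id = id; this.neighborhood = neighborhood; this.score = score;
        }
    }

    static List<Listing> rerank(List<Listing> candidates, double penalty) {
        List<Listing> remaining = new ArrayList<>(candidates);
        List<Listing> result = new ArrayList<>();
        Map<String, Integer> picked = new HashMap<>();  // neighborhood -> count selected
        while (!remaining.isEmpty()) {
            Listing best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Listing l : remaining) {
                // Discount by how many same-neighborhood listings were already chosen.
                double adjusted = l.score - penalty * picked.getOrDefault(l.neighborhood, 0);
                if (adjusted > bestScore) { bestScore = adjusted; best = l; }
            }
            remaining.remove(best);
            result.add(best);
            picked.merge(best.neighborhood, 1, Integer::sum);
        }
        return result;
    }
}
```

A pointwise sort by r_i cannot express this, because each listing's final position depends on which listings were ranked above it.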
Outside the scope of this talk 
____________________________ 
Ranking models 
Machine Learning infrastructure 
Tools (loadtest, deploy, etc.) 
Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.