Airbnb Search Architecture: Presented by Maxim Charkov, Airbnb


  1. Airbnb Search Architecture
     Maxim Charkov, Engineering Manager
     maxim.charkov@airbnb.com, @mcharkov
  2. Airbnb
     Total Guests: 20,000,000+
     Countries: 190
     Cities: 34,000+
     Castles: 600+
     Listings Worldwide: 800,000+
  3. Search (www.airbnb.com)
  4. Booking Model: Search -> Contact -> Accept -> Book
  5. Search Backend Technical Stack
     - Dropwizard as a service framework (incl. Jetty, Jersey, Jackson)
     - Guice dependency injection framework, Guava libraries, etc.
     - ZooKeeper (via SmartStack) for service discovery
     - Lucene for index storage and simple retrieval
     - In-house real-time indexing, ranking, and advanced filtering
  6. Search Backend
     Each JVM runs ~150 search threads and 4 indexing threads.
     Data maintained by the indexers:
     - Inverted Lucene index for retrieval
     - Forward index for ranking signals
     - Relevance models
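The split described on the slide above, an inverted index for retrieval next to a forward index for ranking signals, can be sketched in plain Java. This is an illustration only, not Airbnb's code; all class and method names are invented:

```java
import java.util.*;

// Minimal sketch: an inverted index answers "which listings match this term?",
// while a forward index answers "what are the ranking signals for this listing?".
public class SearchIndexes {
    // term -> posting list of listing ids (retrieval)
    private final Map<String, Set<Long>> inverted = new HashMap<>();
    // listing id -> ranking signals (scoring)
    private final Map<Long, double[]> forward = new HashMap<>();

    public void add(long listingId, List<String> terms, double[] signals) {
        for (String t : terms) {
            inverted.computeIfAbsent(t, k -> new TreeSet<>()).add(listingId);
        }
        forward.put(listingId, signals);
    }

    // Retrieval: intersect the posting lists of all query terms.
    public Set<Long> retrieve(List<String> queryTerms) {
        Set<Long> result = null;
        for (String t : queryTerms) {
            Set<Long> postings = inverted.getOrDefault(t, Collections.emptySet());
            if (result == null) result = new TreeSet<>(postings);
            else result.retainAll(postings);
        }
        return result == null ? Collections.emptySet() : result;
    }

    // Ranking: score a retrieved id from the forward index (here: sum of signals).
    public double score(long listingId) {
        double s = 0;
        for (double v : forward.getOrDefault(listingId, new double[0])) s += v;
        return s;
    }
}
```

The point of the split is the update pattern: posting lists change rarely per listing, while ranking signals can be swapped wholesale without touching the retrieval structures.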
  7. Indexing: What's in the Lucene index?
     - Positions of listings, indexed using Lucene's spatial module (RecursivePrefixTreeStrategy)
     - Categorical and numerical properties, like room type and maximum occupancy
     - Calendar information
     - Full text (descriptions, reviews, etc.)
     ~40 fields per listing from a variety of data sources, all updated in real time
  8. Indexing Challenges
     - Bootstrap (creating the index from scratch)
     - Ensuring consistency of the index with ground-truth data in real time
  9. Indexing (architecture diagram): master / calendar / fraud databases -> SpinalTap -> Medusa -> PersistentStorage -> Search1, Search2, ..., SearchN
 10. Indexing (same architecture diagram, repeated)
 11. Indexing: SpinalTap
     - Responsible for detecting updates to the ground-truth data (no need to maintain search-index invalidation logic in application code)
     - Tails binary update logs from MySQL servers (5.6+)
     - Converts them into actionable data objects, called "Mutations"
     - Broadcasts them using a distributed queue, like Kafka or RabbitMQ
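The SpinalTap flow on this slide, a row-level binlog event in and a Mutation out on a queue, can be sketched in plain Java. This is a toy version, not SpinalTap's actual API: an in-memory BlockingQueue stands in for Kafka, and the "binlog event" is handed in directly instead of being tailed from MySQL.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the SpinalTap idea: turn row-level binlog events into
// self-contained Mutation objects and broadcast them on a queue.
public class MutationPipe {
    public enum Type { INSERT, UPDATE, DELETE }

    // A Mutation carries everything a consumer needs: for UPDATE both the
    // old and the new row, for INSERT/DELETE the full affected row.
    public static final class Mutation {
        public final long seq;
        public final Type type;
        public final String table;
        public final Map<String, Object> oldRow; // null except for UPDATE/DELETE
        public final Map<String, Object> newRow; // null for DELETE
        public Mutation(long seq, Type type, String table,
                        Map<String, Object> oldRow, Map<String, Object> newRow) {
            this.seq = seq; this.type = type; this.table = table;
            this.oldRow = oldRow; this.newRow = newRow;
        }
    }

    private final BlockingQueue<Mutation> destination = new LinkedBlockingQueue<>();
    private long seq = 0;

    // In the real system this is driven by tailing the MySQL binlog.
    public void onRowChange(String table, Map<String, Object> oldRow,
                            Map<String, Object> newRow) {
        Type type = oldRow == null ? Type.INSERT
                  : newRow == null ? Type.DELETE : Type.UPDATE;
        destination.add(new Mutation(++seq, type, table, oldRow, newRow));
    }

    public Mutation poll() { return destination.poll(); }
}
```

Because each Mutation is self-contained, downstream consumers such as the indexer never have to query MySQL to interpret an event.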
 12. Indexing: SpinalTap Pipes
     Each pipe connects one or more binlog sources (MySQL) with a destination (e.g. Kafka). Pipes are configured via YAML files:

     # sources for mysql binary logs
     sources:
       - name    : airslave
         host    : localhost
         port    : 11
         user    : spinaltap
         password: spinaltap
       - name    : calendar_db
         host    : localhost
         port    : 11
         user    : spinaltap
         password: spinaltap

     destinations:
       - name      : kafka
         clazzName : com.airbnb.spinaltap.destination.kafka.KafkaDestination

     pipes:
       - name        : search
         sources     : ["airslave", "calendar_db"]
         tables      : ["production:listings,calendar_db:schedule2s"]
         destination : kafka
 13. Indexing: SpinalTap Mutations
     Each binlog entry is parsed and converted into one of three event types: "Insert", "Delete", or "Update".
     - "Insert" and "Delete" carry the entire row to be inserted or deleted
     - "Update" mutations contain both the old and the current row
     - Additional information: unique id, sequence number, column and table metadata

     {
       "seq": 3,
       "binlogpos": "mysql-bin.000002:5217:5273",
       "id": -1857589909002862756,
       "type": 2,
       "table": {
         "id": 70,
         "name": "users",
         "db": "my_db",
         "columns": [
           { "name": "name", "type": 15, "ispk": false },
           { "name": "age",  "type": 2,  "ispk": false }
         ]
       },
       "rows": [
         {
           "1": { "name": "eric", "age": 31 },
           "2": { "name": "eric", "age": 28 }
         }
       ]
     }
 14. Indexing: Medusa
     - Documents in the index contain data from ~15 different source tables
     - Lucene needs a copy of all fields (not just the fields that changed) to update the index
     - We also need a mechanism to build the entire index from scratch, without putting too much strain on MySQL
 15. Indexing: Medusa (continued)
     - Reads from SpinalTap, or directly from MySQL
     - Data from multiple tables is joined into Thrift objects, which correspond to Lucene documents
     - The intermediate Thrift objects are persisted in Redis
     - As changes are detected, updated objects are pushed to the Search instances to update their Lucene indexes
     - Can bootstrap the entire index in 3 minutes via multithreaded streaming
     - Leader election via ZooKeeper
     (Diagram: Medusa -> PersistentStorage -> Search1, Search2, ..., SearchN)
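Medusa's join-and-push step can be sketched as follows. This is a toy version under stated assumptions: a HashMap stands in for Redis, the "document" is a plain map rather than a Thrift object, and all names are invented.

```java
import java.util.*;

// Toy sketch of the Medusa idea: join rows from several source tables into
// one denormalized document per listing, cache the result, and ship the
// FULL document downstream whenever any contributing table changes
// (Lucene needs every field, not a delta).
public class DocumentJoiner {
    // table name -> (listingId -> latest row)
    private final Map<String, Map<Long, Map<String, Object>>> tables = new HashMap<>();
    // stands in for the Redis cache of intermediate documents
    private final Map<Long, Map<String, Object>> cache = new HashMap<>();
    // stands in for pushes to the Search instances
    private final List<Map<String, Object>> pushedDownstream = new ArrayList<>();

    public void onRowChanged(String table, long listingId, Map<String, Object> row) {
        tables.computeIfAbsent(table, k -> new HashMap<>()).put(listingId, row);
        // Rebuild the complete document from every table's latest row.
        Map<String, Object> doc = new HashMap<>();
        doc.put("listingId", listingId);
        for (Map.Entry<String, Map<Long, Map<String, Object>>> e : tables.entrySet()) {
            Map<String, Object> r = e.getValue().get(listingId);
            if (r != null) r.forEach((col, val) -> doc.put(e.getKey() + "." + col, val));
        }
        cache.put(listingId, doc);
        pushedDownstream.add(doc);
    }

    public Map<String, Object> cachedDocument(long listingId) { return cache.get(listingId); }
    public int pushCount() { return pushedDownstream.size(); }
}
```

Caching the joined document is what makes the real-time path cheap: when one table changes, the other ~14 tables' fields come from the cache instead of fresh MySQL queries, and the same cache can be streamed to bootstrap a new index from scratch.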
 16. Ranking: The Ranking Problem
     - Not a text-search problem: users are almost never searching for a specific item; rather, they are looking to "discover"
     - The most common component of a query is location
     - Highly personalized: the user is part of the query
     - Optimizing for conversion (Search -> Inquiry -> Booking)
     - Evolution through continuous experimentation
 17. Ranking Components
     - Relevance
     - Quality
     - Bookability
     - Personalization
     - Desirability of location
     - New-host promotion
     - etc.
 18. Ranking Signals
     Several hundred signals determine search ranking, derived from DB snapshots and logs:
     - Properties of the listing (reviews, location, etc.)
     - Behavioral signals (mined from request logs)
     - Image quality and clickability (computer vision)
     - Host behavior (response time/rate, cancellations, etc.)
     - Host preferences model
 19. Ranking: Loading Signals
     Signals are stored in a separate data structure, outside the Lucene index.
     Pros:
     - Good fit for this type of update pattern: not real time, but almost everything changes on each load
     - No need for a costly Lucene index rebuild
     - Greatly simplifies the design
     Cons:
     - Unable to use Lucene retrieval on such data

     public void attemptLoadData() {
       DateTime remoteTs = dataLoader.getModTime(pathToSignals);
       if (currentTs == null || remoteTs.isAfter(currentTs)) {
         Map<K, D> newSignals = loadData();
         if (newSignals != null && (signalsMap == null || isHealthy(newSignals))) {
           synchronized (this) {
             signalsMap = newSignals;
             currentTs = remoteTs;
             this.notifyAll();
           }
         } else {
           LOG.severe("Failed to load the avro file: " + pathToSignals);
         }
       }
     }

     // ...

     ThreadedLoader<Integer, QualitySignalsAvro> qualitySignalsLoader =
         loaders.get(LoaderCollection.Loader.QualitySignals);
     final QualitySignalsAvro qs = qualitySignalsLoader.get(hostingId, true);
 20. Life of a Query (diagram)
     Query Understanding (geocoding, configuring retrieval options, choosing ranking models) -> Retrieval (2000 results) -> Quality Populator and Scorer (Relevance, Bookability; 2000 results) -> Filtering and Reranking (25 results) -> Third Pass Ranking with external calls (Pricing Service, Social Connections; 25 results) -> Result Generation, AirEvents Logging
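The funnel in this diagram, cheap scoring over many candidates followed by expensive scoring (with external calls) over a short list, can be sketched as a staged pipeline. The stage names and the 2000/25 cut-offs mirror the slide, but the code itself is illustrative only:

```java
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

// Sketch of a multi-stage ranking funnel: score all retrieved candidates
// with cheap signals first, then re-rank only the survivors with
// expensive signals (the stage where calls to e.g. a pricing service fit).
public class QueryFunnel {
    public static List<Long> run(List<Long> retrieved,
                                 ToDoubleFunction<Long> cheapScore,
                                 ToDoubleFunction<Long> expensiveScore) {
        // Stage 1: cheap scoring over the full candidate set (~2000 ids),
        // keeping only the top 25 for later stages.
        List<Long> shortList = retrieved.stream()
                .sorted(Comparator.comparingDouble(cheapScore).reversed())
                .limit(25)
                .collect(Collectors.toList());
        // Stage 2: expensive re-ranking over the short list only.
        shortList.sort(Comparator.comparingDouble(expensiveScore).reversed());
        return shortList;
    }
}
```

The design point is cost control: the expensive scorer (and any network calls it makes) runs 25 times per query instead of 2000.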
 21. Ranking: Second Pass Ranking
     Traditional ranking scores each result independently:
       r_i = f(query, listing_i), then sort by r_i
     In contrast, the second pass operates on the entire list at once:
       (r_1, ..., r_n) = f(query, listing_1, ..., listing_n)
     This makes it possible to implement features like result diversity, etc.
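One concrete feature a listwise second pass enables is result diversity: demoting an item that is too similar to items already placed above it. A minimal sketch follows; this is not Airbnb's model, and the neighborhood-based penalty is an invented stand-in for whatever similarity the real system uses:

```java
import java.util.*;

// Sketch of listwise second-pass ranking: instead of sorting by a
// per-item score, pick results greedily so that each position also
// accounts for the items already chosen (here: a fixed penalty per
// already-placed listing from the same neighborhood).
public class SecondPassRanker {
    public static final class Listing {
        public final long id;
        public final String neighborhood;
        public final double relevance; // first-pass score
        public Listing(long id, String neighborhood, double relevance) {
            this.id = id; this.neighborhood = neighborhood; this.relevance = relevance;
        }
    }

    public static List<Listing> rerank(List<Listing> candidates, double diversityPenalty) {
        List<Listing> remaining = new ArrayList<>(candidates);
        List<Listing> ranked = new ArrayList<>();
        Map<String, Integer> placedPerNeighborhood = new HashMap<>();
        while (!remaining.isEmpty()) {
            Listing best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Listing l : remaining) {
                // Listwise score: base relevance minus a penalty for each
                // already-placed listing from the same neighborhood.
                double s = l.relevance
                        - diversityPenalty * placedPerNeighborhood.getOrDefault(l.neighborhood, 0);
                if (s > bestScore) { bestScore = s; best = l; }
            }
            ranked.add(best);
            remaining.remove(best);
            placedPerNeighborhood.merge(best.neighborhood, 1, Integer::sum);
        }
        return ranked;
    }
}
```

A pointwise sort could never produce this ordering, because an item's effective score here depends on which items were ranked before it.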
 22. Life of a Query (same diagram, repeated)
 23.-26. Ranking (image-only slides; no transcript content)
 27. Outside of the Scope of This Talk
     - Ranking models
     - Machine Learning infrastructure
     - Tools (loadtest, deploy, etc.)
     - Other Search Infrastructure services: UserProfiler, Pricing, Social, Hoods, etc.
