
Apache Spark at Airbnb


Airbnb primarily uses Spark to power mission-critical data applications. In this talk, we share our major production use cases, covering both streaming and batch processing applications. We also describe the optimizations that improved the throughput of the Spark Kafka connector by 10X, and the journey and lessons learned while upgrading Spark 1.x applications to Spark 2.x. Key takeaways include best practices learned from building and scaling production Spark applications, as well as the tips for and benefits of migrating to Spark 2.x. We hope to share our experience of making Spark successful at Airbnb with a broader audience of Spark users.

Published in: Data & Analytics

  1. Spark@Airbnb • HAO WANG • APRIL 24, 2019 • SPARK SUMMIT
  2. Outline
      • What is Airbnb
      • Spark Use Cases at Airbnb
      • Upgrade to Spark 2.3
      • Near Real-Time Data Ingestion with Spark Streaming
  3. [Map legend: 50-100 listings, 101-300 listings, 301-1000 listings, 1001+ listings; 2009]
  4. 6 Million total homes on Airbnb
  5. 191+ countries, 81K cities
  6. Listing types: Shared room, Private room, Entire home, Bed & breakfast, Boutique, Unique, Vacation home. Example: "Stunning at old city center", hosted by Magdalena · Madrid, Spain, from $96 per night
  7. Listing types: Shared room, Private room, Entire home, Vacation home, Bed & breakfast, Boutique, Unique. Example: "Romantic Chalet with fantastic view", hosted by Remy · Telfs, Austria, from $165 per night
  8. Listing types: Shared room, Private room, Entire home, Bed & breakfast, Boutique, Unique, Vacation home. Example: "Dog Bark Park Inn", hosted by Frances and Dennis · Cottonwood, Idaho, from $132 per night
  9. Airbnb Experiences
  13. Data Warehouse Storage (2017-2019)
  14. Airbnb Spark Use Cases
      • Search
      • Pricing
      • Machine Learning
      • Data Ingestion
      • Near real-time applications
      • …
  15. Search Ranking
      • More powerful models (DNNs) require more training data!
      • To process large amounts of data, we need to leverage tools like Spark
      • Allows for more complex error handling and unit tests
      • Can re-use code between online system (Java) and offline data pipeline (Scala)
      • Use broadcast-join to optimize data pipelines to only extract raw logs for searches of interest
      arXiv paper: Applying Deep Learning To Airbnb Search
  16. Smart Pricing
      Blog Post: Learning Market Dynamics for Optimal Pricing
  17. Financial Intelligence
      • Support Finance and Accounting Functions
      • Process all the data in the company
        - Yes, ALL of it
        - Varying quality and “cleanliness”
        - Immense Scale
        - 10+ years of transactional data
      • Output clean financial data to power various business functions
        - Treasury
        - Revenue Ops
        - Financial Systems and Technologies Group
  18. Spark Upgrade to 2.3
  19. Spark Upgrade from 1.x to 2.3
      • Up to 2.5X performance improvement
        - 60% reduction of batch processing time in a production job
      • Better SQL support. Spark 2.x can run all 99 TPC-DS queries, which require many of the SQL:2003 features
      • Vectorized reader for ORC and Parquet
      • Reduced cost due to better performance
      • Better support and integration with Spark and Hadoop ecosystem
      • Numerous improvements and bug fixes have gone into Spark since the 1.6 release in 2016
  20. Scaling Near Real-Time Data Ingestion
  21. Architecture of Logging Data Flow
      • Bridge between online and offline data
      • Mission critical
      • Near real-time
      • High throughput
      • SLA & recovery
      • Efficiency & cost
  22. Challenges and Pain Points
      • Fast growth
        - Topics from dozens to thousands
        - Bytes grew 6x in 2018
      • Bottlenecks (e.g., Spark parallelism determined by Kafka partitions)
      • Skew in event size and QPS
  23. Logging Events Size Skew
  24. Current 1-to-1 Kafka Reader
  25. Spark Task Running Time Skew
      Image from
  26. More Challenges and Pain Points
      • Stability & SLA depend on many systems
        - Near Real-time Ingestion
        - Headroom for Catch Up
        - SLA suffers
        - Operational nightmare
        - Oncall burnout
      • Efficiency & cost
  27. Solution From Spark Community
      • Outstanding issue and PR
      • Does not handle data skew among topics
  28. Balanced Kafka Reader for Spark
  29. 1-to-N Kafka Reader
  30. Balanced Partitioning Algorithm
      • Pre-compute average event size (bytes) per topic
      • Compute the ideal bytes per split
      • For new topics, use the average size of all topics
      Blog Post: Scaling Spark Streaming for Logging Event Ingestion
  31. Balanced Partitioning Algorithm
      • Shuffle the list of offset ranges
      • Starting from split 1, for each offset range:
        - Assign it to the current split if the total weight is less than weight-per-split
        - If it doesn’t fit, break it apart and assign the subset of the offset range that fits
        - If the current split is more than bytes-per-split, move to the next split
      Blog Post: Scaling Spark Streaming for Logging Event Ingestion
  32. Balanced Kafka Reader Performance
  33. Finally
      • Support 20X higher throughput with imbalanced topics
      • Better SLA (happy customers! happy engineers!)
      • Faster recovery and less hand holding
      • Ever-increasing throughput
  34. Open Source
      • Balanced Kafka Reader for Spark will be available on and Airbnb GitHub soon!
  36. DEREK & IMAN
  37. Airbnb Celebrates Half A Billion Guest Arrivals
  38. Architecture of Logging Data Flow
  39. We are hiring!
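
The two "Balanced Partitioning Algorithm" slides can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Airbnb's implementation: `OffsetRange` and `balanced_splits` are hypothetical names, a range's weight is assumed to be its event count times the topic's average event size, and every average size is assumed to be positive. One deliberate simplification: when a range does not fit, the sketch rounds the piece it takes up to a whole event, so a split can exceed the target by less than one event before the next split starts.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical stand-in for a run of Kafka offsets [start, end) in one topic partition."""
    topic: str
    partition: int
    start: int  # inclusive
    end: int    # exclusive

    @property
    def count(self) -> int:
        return self.end - self.start


def balanced_splits(
    ranges: List[OffsetRange],
    avg_event_bytes: Dict[str, float],
    num_splits: int,
    seed: Optional[int] = None,
) -> List[List[OffsetRange]]:
    """Assign offset ranges to splits so that each split holds roughly the same bytes."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")

    # New topics have no measured size yet: fall back to the mean over known topics.
    known = list(avg_event_bytes.values())
    default_size = sum(known) / len(known) if known else 1.0

    def size_of(topic: str) -> float:
        return avg_event_bytes.get(topic, default_size)

    # Ideal bytes per split = estimated total bytes / number of splits.
    total_bytes = sum(r.count * size_of(r.topic) for r in ranges)
    bytes_per_split = total_bytes / num_splits

    # Shuffle so that heavy topics do not always land in the same splits.
    shuffled = list(ranges)
    random.Random(seed).shuffle(shuffled)

    splits: List[List[OffsetRange]] = [[] for _ in range(num_splits)]
    current, used = 0, 0.0
    for r in shuffled:
        size = size_of(r.topic)
        start = r.start
        while start < r.end:
            remaining = r.end - start
            is_last = current == num_splits - 1
            if is_last:
                # The last split absorbs whatever is left.
                take = remaining
            else:
                # Take only as many events as still fit (rounded up to one event).
                room = bytes_per_split - used
                take = min(remaining, max(1, math.ceil(room / size)))
            splits[current].append(OffsetRange(r.topic, r.partition, start, start + take))
            used += take * size
            start += take
            if not is_last and used >= bytes_per_split:
                current, used = current + 1, 0.0
    return splits


if __name__ == "__main__":
    # Illustrative numbers only: one topic with large events, one with small events.
    demo = [OffsetRange("big", 0, 0, 100), OffsetRange("small", 0, 0, 1000)]
    sizes = {"big": 1000.0, "small": 10.0}
    for i, split in enumerate(balanced_splits(demo, sizes, num_splits=4, seed=1)):
        print(i, sum(r.count * sizes[r.topic] for r in split), split)
```

Each inner list would correspond to one Spark task in a 1-to-N reader, so a single heavy Kafka partition can be spread over several tasks while many light partitions share one.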