
Sputnik: Airbnb’s Apache Spark Framework for Data Engineering

Apache Spark is a general-purpose big data execution engine. You can work with different data sources through the same set of APIs, in both batch and streaming mode.



  1. Sputnik: Airbnb’s Apache Spark framework for data engineering. Egor Pakhomov, Software Engineer, Airbnb
  2. Typical Spark job [diagram: a Spark application reads daily partitions (ds=2010-01-01, ds=2010-01-02, …) from an input table and writes daily partitions to an output table]
  3. Typical Spark job
  4. [diagram of the Spark stack, from https://ogirardot.files.wordpress.com/2015/05/future-of-spark.png: the DataFrame API over Spark Core, with Packages, GraphX, MLlib, Spark Streaming, Spark SQL, and the DataSource API]
  5. [the same Spark stack diagram, shown again]
  6. A data engineer should write only the job logic
  7. Job logic vs run logic
     ▪ Job logic
       ▪ The job does some business logic (for example, counting unique visits for every URL)
       ▪ The job specifies
         ▪ Input tables and output tables
         ▪ Partitioning schema
         ▪ Validation for the result data
     ▪ Run logic
       ▪ Running the job for a specific date retrieves input only for that date from the input table
       ▪ The job tries to write to a table that does not exist, so the table needs to be created
       ▪ The job runs in testing mode, so all result tables are created with a “_testing” suffix
  8. Job logic vs run logic [diagram: console parameters (--ds 2020-01-01, --writeEnv DEV) go to Sputnik; Sputnik’s TableReader gets data from the Hive input table, the Sputnik job runs the business logic, and the TableWriter writes the result to the result table]
  9. Sputnik job
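
A minimal sketch of what such a job can look like, assuming a SputnikJob entry point that exposes hiveTableReader and hiveTableWriter helpers. These names are illustrative, not verbatim from the library; the repo linked on the last slide has the real API.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.countDistinct

    // SputnikJob, hiveTableReader and hiveTableWriter are assumed stand-ins
    // for the framework's entry point, not verbatim Sputnik names.
    object UniqueVisitsJob extends SputnikJob {
      def run(): Unit = {
        // Job logic only: what to compute. Which date to read, whether the
        // output table exists, and any "_testing" suffix are run logic that
        // the framework resolves from console parameters.
        val visits: DataFrame = hiveTableReader.getDataframe("database.input_table")
        val result = visits
          .groupBy("url")
          .agg(countDistinct("visitor_id").as("unique_visits"))
        hiveTableWriter.saveAsHiveTable(result, "database.output_table")
      }
    }
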
  10. Running the Sputnik job
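
An invocation might look like the following; --ds and --writeEnv are the console parameters from slide 8, while the class and jar names are placeholders.

    # Run the job for one day, writing to the developer environment.
    # Class and jar names are placeholders, not from the slides.
    spark-submit --class com.example.UniqueVisitsJob sputnik-jobs.jar \
      --ds 2020-01-01 --writeEnv DEV
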
  11. Writing data
  12. Writing data. What Sputnik’s HiveTableWriter does (sketched in plain Spark after this list):
     ▪ creates the table with a “CREATE TABLE” Hive statement if it does not exist
     ▪ updates the table’s metainformation
     ▪ normalizes the dataframe schema according to the output Hive table
     ▪ repartitions, trying to reduce the number of result files on disk
     ▪ runs checks on the result before writing it
     ▪ changes the output table name (staging/testing mode)
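
A rough sketch of those steps expressed with plain Spark APIs; the function, schema, and DDL below are illustrative, not the writer’s actual code.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def writeResult(spark: SparkSession, df: DataFrame,
                    table: String, writeEnv: String): Unit = {
      // Staging/testing mode changes the output table name.
      val target = if (writeEnv == "PROD") table else s"${table}_${writeEnv.toLowerCase}"
      // Create the table if it does not exist (schema here is a placeholder).
      spark.sql(
        s"""CREATE TABLE IF NOT EXISTS $target (url STRING, unique_visits BIGINT)
           |PARTITIONED BY (ds STRING)""".stripMargin)
      // Normalize the dataframe schema to the table's column order, then
      // reduce the number of result files on disk before writing.
      val ordered = df.select(spark.table(target).columns.map(df.col): _*)
      // (The real writer also runs checks on the result before this point.)
      ordered.repartition(1).write.mode("overwrite").insertInto(target)
    }
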
  13. Reading data: reading a dataframe vs. reading a dataset
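
A minimal sketch of the two read styles in plain Spark; the table name, column names, and the Visit case class are assumptions for illustration.

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    // Assumed shape of an input row, for the typed read.
    case class Visit(url: String, visitorId: String, ds: String)

    def readBothWays(spark: SparkSession): Unit = {
      import spark.implicits._
      // DataFrame: untyped rows, schema checked only at runtime.
      val asDataframe: DataFrame =
        spark.table("database.input_table").where($"ds" === "2020-01-01")
      // Dataset: the same rows mapped onto a case class, typed at compile time.
      val asDataset: Dataset[Visit] = asDataframe.as[Visit]
    }
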
  14. Testing
     ▪ Singleton Spark session
     ▪ DataFrame comparison
     ▪ Loading data from CSV/JSON
     ▪ Cleaning Hive between runs
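
A sketch of how those four pieces can fit together in a ScalaTest suite; the fixture paths and table names are assumptions, not Sputnik’s actual test API.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.countDistinct
    import org.scalatest.funsuite.AnyFunSuite

    // Singleton session shared across suites, so each test does not
    // pay Spark startup cost.
    object TestSpark {
      lazy val spark: SparkSession =
        SparkSession.builder().master("local[2]").appName("test").getOrCreate()
    }

    class UniqueVisitsJobTest extends AnyFunSuite {
      test("counts unique visits per url") {
        val spark = TestSpark.spark
        // Load fixture data from json (csv works the same way).
        val input = spark.read.json("src/test/resources/visits.json")
        val expected = spark.read.json("src/test/resources/expected.json")
        val actual = input.groupBy("url")
          .agg(countDistinct("visitor_id").as("unique_visits"))
        // DataFrame comparison, ignoring row order.
        assert(actual.exceptAll(expected).isEmpty)
        assert(expected.exceptAll(actual).isEmpty)
        // Clean catalog state between runs so tests do not leak tables.
        spark.sql("DROP TABLE IF EXISTS test_output")
      }
    }
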
  15. Managing configuration [slide shows a job and its config side by side]
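
One common Scala approach to this job/config split is Typesafe Config (HOCON); this is a sketch of the pattern, not necessarily Sputnik’s exact mechanism, and the key names are invented.

    import com.typesafe.config.{Config, ConfigFactory}

    // src/main/resources/application.conf (contents shown as a comment):
    //   unique-visits-job {
    //     input-table  = "database.input_table"
    //     output-table = "database.output_table"
    //   }
    object JobConfig {
      val config: Config = ConfigFactory.load().getConfig("unique-visits-job")
      val inputTable: String  = config.getString("input-table")
      val outputTable: String = config.getString("output-table")
    }
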
  16. Checks on output
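
A sketch of what an output check can look like: a predicate over the result dataframe that fails the run before anything is written. The trait, check names, and column name here are illustrative, not Sputnik’s actual API.

    import org.apache.spark.sql.DataFrame

    // A check throws if the result looks wrong, so bad data is never written.
    trait ResultCheck {
      def check(df: DataFrame): Unit
    }

    // Refuse to write an empty result.
    object NotEmptyCheck extends ResultCheck {
      def check(df: DataFrame): Unit =
        require(!df.isEmpty, "result dataframe is empty, refusing to write")
    }

    // Refuse to write negative counts (column name is an assumption).
    object NonNegativeCountsCheck extends ResultCheck {
      def check(df: DataFrame): Unit =
        require(df.where(df("unique_visits") < 0).isEmpty,
          "negative unique_visits in result, refusing to write")
    }
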
  17. Backfilling [diagram: a Hive table with daily partitions 2020-01-01 through 2020-01-08; the daily job fills --ds 2020-01-08 and --ds 2020-01-07]
  18. Backfilling [same diagram; a backfill job with --startDate 2020-01-01 --endDate 2020-01-06 fills the earlier partitions]
  19. Backfilling [same diagram; the backfill job adds --stepSize 3]
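
Putting the three slides together, a backfill invocation might look like this; the date flags are the ones shown on the slides, while the class and jar names are placeholders.

    # Fill 2020-01-01 through 2020-01-06 in one backfill run.
    # Class and jar names are placeholders, not from the slides.
    spark-submit --class com.example.UniqueVisitsJob sputnik-jobs.jar \
      --startDate 2020-01-01 --endDate 2020-01-06 --stepSize 3

With --stepSize 3, the six-day range presumably runs as two Spark applications of three daily partitions each, rather than six separate runs.
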
  20. Managing environments [diagram: in the production environment, Sputnik reads database.input_table and writes database.output_table; in the developer environment, it reads database.input_table and writes database.output_table_dev]
  21. Managing environments [diagram: --writeEnv PROD, DEV, and STAGE map to database.input_table, database.input_table_dev, and database.input_table_staging]
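
A sketch of the suffixing rule the slide implies; the mapping values come from the slide, while the function itself is illustrative, not Sputnik’s actual code.

    object Environments {
      // --writeEnv maps to a table-name suffix, per slide 21.
      def tableNameFor(table: String, writeEnv: String): String = writeEnv match {
        case "PROD"  => table               // database.input_table
        case "DEV"   => table + "_dev"      // database.input_table_dev
        case "STAGE" => table + "_staging"  // database.input_table_staging
      }
    }
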
  22. Other parameters
  23. https://github.com/airbnb/sputnik, egor.pakhomov@airbnb.com
