From HDFS to S3: Migrate Pinterest Apache Spark Clusters

  1. From HDFS to S3: Migrate Pinterest Apache Spark Clusters. Xin Yao, Daniel Dai (Pinterest)
  2. About us ▪ Xin Yao (xyao@pinterest.com): Tech Lead at Pinterest (Ads team); previously on the Data Warehouse teams at Facebook and Hulu ▪ Daniel Dai (jdai@pinterest.com): Tech Lead at Pinterest (Data team); PMC member for Apache Hive and Pig; previously worked at Cloudera/Hortonworks and Yahoo
  3. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  4. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  5. Big Data Platform: Spark, Hive, Presto, Mesos/Aurora, HDFS, Kafka ▪ Use cases: Ads, Machine Learning, Recommendations, ...
  6. Old vs. New Cluster ▪ Old cluster: Spark, Hive, Presto on Mesos/Aurora, with HDFS and Kafka ▪ New cluster: Spark, Hive, Presto on YARN, with S3 and Kafka
  7. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  8. Identify the Bottleneck of the Old Cluster
  9. Old Cluster: Performance Bottleneck ▪ Low local disk IOPS -> slow shuffle -> slow jobs -> slow workflows
  10. Why Local Disk IO Matters for Spark ▪ Mappers write shuffle data to local disk ▪ Mappers read local disk to serve shuffle data to reducers ▪ Spark spills to local disk when data is bigger than memory
  11. A Simple Aggregation Query: SELECT id, max(value) FROM table GROUP BY id
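To make the shuffle concrete, here is the same query in Spark's Scala API; a minimal sketch, assuming a table named table is registered (the deck shows only the SQL):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.max

    val spark = SparkSession.builder.appName("agg-example").getOrCreate()

    // groupBy("id") hash-partitions every row by id: each mapper writes
    // its shuffle output to local disk and serves it to reducers from
    // there, which is exactly the IO pattern profiled on the next slides.
    val result = spark.table("table")
      .groupBy("id")
      .agg(max("value").as("max_value"))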
  12. 9k Mappers * 9k Reducers [diagram: 9k mappers write shuffle data to mapper local disk; 9k reducers fetch it over the network]
  13. 9k * 9k | One Mapper Machine [diagram: with 30 mappers per machine, each serving 9k reducers from local disk, one machine handles 270k IO ops, too many for our machines]
  14. How to Optimize Jobs in the Old Cluster
  15. Optimization: Reduce the Number of Mappers/Reducers [diagram: more input files per mapper -> 3k mappers and 3k reducers]
  16. Optimization [diagram: with 10 mappers per machine serving 3k reducers, one machine handles 30k IO ops, 9x better]
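The deck does not show which knobs were used; as a hedged sketch, two standard Spark settings that produce "more files per mapper" and fewer reducers (the values are illustrative, not Pinterest's numbers):

    // Pack more input bytes (and therefore more small files) into each
    // map task, so 9k mappers become ~3k. 1 GB is an illustrative value.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 1024L * 1024 * 1024)

    // Shrink the reduce side as well: fewer reducers means fewer shuffle
    // blocks served from each mapper machine's local disk.
    spark.conf.set("spark.sql.shuffle.partitions", 3000)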
  17. Result
  18. Building the New Cluster
  19. New Cluster: Choose the Right EC2 Instance ▪ Local disk IOPS cap: 3k (old) vs. 40k (new) ▪ Instance type: r4.16xlarge (old) vs. r5d.12xlarge (new) ▪ CPU: 64 vcores (old) vs. 48 vcores (new) ▪ Memory: 480 GB (old) vs. 372 GB (new)
  20. Production Result ▪ After migration, production jobs improved 25% on average, with no extra resources and no tuning ▪ One typical heavy job improved 35%, from 90 minutes to 57 minutes
  21. Key Takeaways ▪ Measure before you optimize ▪ Premature optimization is the root of all evil
  22. Key Takeaways ▪ Optimization can happen at different levels ▪ Cluster level: new EC2 instance type ▪ Spark level: mapper count, CPU, and memory tuning ▪ Job level: simplify logic
  23. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  24. S3 != HDFS ▪ HDFS is a strongly consistent filesystem: changes are immediately visible ▪ S3 is an eventually consistent object store: there is no guarantee that a change is immediately visible to clients, which can cause missing files without the reader even knowing it
  25. Read-after-write consistency
  26. A Spark job can read fewer files from S3 than were written
  27. How often does this happen? ▪ Numbers from AWS: fewer than 1 in 10 million ▪ Numbers from our test: fewer than 1 in 50 million
  28. Solution Considerations ▪ Write consistency: whether the job writes its output without partial or corrupted data as long as it succeeds, even when some tasks fail or are retried ▪ Read consistency: whether the job reads exactly the files in a folder it is supposed to read, no more and no fewer ▪ Monitor consistency ▪ Whether reader- or writer-side changes are required ▪ Query performance
  29. Solution Considerations ▪ Storage ▪ Isolation ▪ Transactions ▪ Spark support ▪ Hive/Presto support ▪ Project origin ▪ Adoption effort
  30. Solutions, sorted by complexity (simple -> complex); all of them support Spark:
      ▪ Raw S3 (effort: none): no write, read, or monitor consistency; no reader/writer change
      ▪ Data Quality (in house; effort: M): partial monitor consistency only; no reader/writer change
      ▪ Read Monitor (does not exist yet; effort: M): partial monitor and read consistency; no reader/writer change
      ▪ Write Waiting (does not exist yet; effort: M): partial read consistency; writer-side change; Hive/Presto support with caveats
      ▪ Write Listing (does not exist yet; effort: M): partial read consistency; writer-side change; Hive/Presto support with caveats
      ▪ S3Committer (Netflix OSS; effort: L): write consistency; writer-side change
      ▪ Consistent Listing / S3Guard (Hadoop 3.0; effort: XL): write and read consistency; reader- and writer-side changes
      ▪ Iceberg (Apache Incubator; effort: XL): write and read consistency; reader- and writer-side changes; good query performance and storage; strong snapshot isolation; table-level transactions; Hive/Presto support WIP
      ▪ Delta Lake (Databricks OSS; effort: XL): same profile as Iceberg
  31. Our Approach ▪ Short term: S3Committer, a number-of-files monitor, and a data quality tool ▪ Long term: a systematic solution
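A minimal sketch of the number-of-files monitor idea, using the AWS SDK for Java v1; the bucket, prefix, and expected count are assumptions, not Pinterest's actual tool:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.ListObjectsV2Request

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // Count the objects currently visible under an output prefix.
    def countVisibleFiles(bucket: String, prefix: String): Int = {
      val req = new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix)
      var count = 0
      var result = s3.listObjectsV2(req)
      count += result.getObjectSummaries.size
      while (result.isTruncated) {
        req.setContinuationToken(result.getNextContinuationToken)
        result = s3.listObjectsV2(req)
        count += result.getObjectSummaries.size
      }
      count
    }

    // The expected count would come from the writer side, e.g. the
    // number of committed tasks. Names and threshold are hypothetical.
    val expected = 9000
    val visible = countVisibleFiles("my-bucket", "warehouse/table/ds=20200626/")
    if (visible < expected)
      println(s"Consistency alert: only $visible of $expected files visible")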
  32. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  33. Performance Comparison: S3 vs. HDFS ▪ Similar throughput ▪ Metadata operations are slow, especially moves ▪ Our Spark streaming job was heavily impacted, spending most of its time moving output files around (3 times) ▪ Microbatch runtime: 13s (HDFS) vs. 55s (S3)
  34. Dealing with Metadata Operations ▪ A Spark application moves each output file at least twice (commitTask, then commitJob), and may move it once more to the Hive table location (Hive MoveTask) ▪ For df.write.mode(SaveMode.Append).insertInto(partitionedTable), one file travels: output/_temporary/taskId/_temporary/taskAttemptID/part-xxxxxx -> (commitTask) output/_temporary/taskId/part-xxxxxx -> (commitJob) output/part-xxxxxx -> (Hive MoveTask) /warehouse/pinterest.db/table/date=20200626
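Spelled out as a runnable sketch (table names hypothetical), the write on the slide is an ordinary Hive insert; all of the moves happen inside the committer, not in user code:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder
      .appName("insert-example")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source table.
    val df = spark.table("staging_events")

    // Each task writes under output/_temporary/.../_temporary/...;
    // commitTask renames one level up, commitJob renames into the job
    // output, and Hive's MoveTask may move files once more into the
    // partition location. On S3 every "rename" is a copy plus a delete.
    df.write.mode(SaveMode.Append).insertInto("partitioned_table")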
  35. Reduce Move Operations ▪ FileOutputCommitter algorithm 2 (spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2): skips the job-level move, keeping only the task-level one ▪ DirectOutputCommitter: further skips the task-level move, but corrupts files when a job fails ▪ Netflix s3committer (spark.sql.sources.outputCommitterClass=com.netflix.bdp.s3.S3PartitionedOutputCommitter): uses the multipart upload API, so there is no move operation ▪ Other solutions: Iceberg, Hadoop s3a committer
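These options are ordinary Spark configs; a sketch of wiring one in at session creation (the two settings are alternatives, not a combination):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("committer-config")
      // Option A: FileOutputCommitter algorithm 2 keeps the task-level
      // move but skips the job-level one.
      .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
      // Option B: the Netflix s3committer, which replaces moves with
      // multipart uploads (shown on the next slides):
      // .config("spark.sql.sources.outputCommitterClass",
      //   "com.netflix.bdp.s3.S3PartitionedOutputCommitter")
      .getOrCreate()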
  36. Multipart Upload API ▪ For every file: initiateMultipartUpload, one or more uploadPart calls, and finally completeMultipartUpload or abortMultipartUpload ▪ AWS keeps all uploaded parts until completeMultipartUpload/abortMultipartUpload, so set up a lifecycle policy to clean up abandoned uploads ▪ Use a separate S3 permission for abortMultipartUpload
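A bare-bones sketch of that per-file lifecycle with the AWS SDK for Java v1; bucket, key, and file path are hypothetical, and production code would upload parts in parallel and call abortMultipartUpload on failure:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model._
    import java.io.File
    import scala.collection.JavaConverters._

    val s3 = AmazonS3ClientBuilder.defaultClient()
    val bucket = "my-bucket"
    val key = "output/part-00000"

    // 1. Initiate: S3 hands back an upload id that ties the parts together.
    val init = s3.initiateMultipartUpload(
      new InitiateMultipartUploadRequest(bucket, key))

    // 2. Upload parts (every part except the last must be at least 5 MB).
    val part = s3.uploadPart(new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withUploadId(init.getUploadId)
      .withPartNumber(1)
      .withFile(new File("/tmp/part-00000")))

    // 3. Complete atomically. Until this (or an abort) runs, S3 keeps
    // the uploaded parts around, which is why the lifecycle policy is needed.
    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
      bucket, key, init.getUploadId, List(part.getPartETag).asJava))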
  37. S3committer ▪ Uploads files directly to the output location using the multipart upload API ▪ The atomic completeMultipartUpload leaves no corrupt output ▪ Uploads the parts of a file in parallel to increase throughput [diagram: uploadPart during commitTask, completeMultipartUpload during commitJob]
  38. The Last Move Operation ▪ Before: use a staging directory to figure out the new partitions ▪ After: a table-level tracking file (.s3_dyn_parts) for the new partitions [diagram: a table with partitions ds=20200101 through ds=20200112; the staging directory holding ds=20200112 is replaced by the .s3_dyn_parts tracking file]
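A sketch of the tracking-file idea. The .s3_dyn_parts name is from the slide, but the format assumed here (one partition spec per line, e.g. ds='20200112') and the table name are illustrative:

    // Hypothetical location and format: one spec per line, e.g. ds='20200112'.
    val specs = spark.read
      .textFile("s3://my-bucket/warehouse/my_table/.s3_dyn_parts")
      .collect()

    // Register each new partition in place; the data files never move.
    specs.foreach { spec =>
      spark.sql(s"ALTER TABLE my_table ADD IF NOT EXISTS PARTITION ($spec)")
    }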
  39. The Result ▪ Microbatch runtime before: 13s (HDFS) vs. 55s (S3) ▪ Microbatch runtime after: 13s (HDFS) vs. 11s (S3)
  40. Fix the Bucket Rate Limit Issue (503) ▪ S3 bucket partitioning ▪ Task- and job-level retry (see the sketch below) ▪ Tune the parallelism of part-file uploads
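The retry item as a generic sketch; the backoff values are illustrative, and S3 signals throttling with HTTP 503:

    import com.amazonaws.services.s3.model.AmazonS3Exception

    // Retry an S3 call on 503 (rate limit) with exponential backoff.
    def withRetry[T](maxAttempts: Int = 5, attempt: Int = 0)(op: => T): T =
      try op
      catch {
        case e: AmazonS3Exception if e.getStatusCode == 503 && attempt < maxAttempts =>
          Thread.sleep((1L << attempt) * 100) // 100ms, 200ms, 400ms, ...
          withRetry(maxAttempts, attempt + 1)(op)
      }

    // Usage (hypothetical): val part = withRetry() { s3.uploadPart(req) }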
  41. Improving S3Committer ▪ Fix the bucket rate limit issue (503) ▪ Upload the parts of a file in parallel to increase throughput ▪ Integrity-check S3 multipart upload ETags ▪ Fix thread pool leaks in long-running applications ▪ Remove local output early
  42. S3 Benefits Compared to HDFS ▪ 80% lower storage cost ▪ S3: 99.99% availability, 99.999999999% durability ▪ HDFS: 99.9% target availability, a NameNode single point of failure, and potential data loss
  43. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  44. Things We Miss from Mesos ▪ Managing services inside Mesos ▪ Simple workflows, long-running jobs, and cron jobs via Aurora ▪ Rolling restarts ▪ Built-in health checks
  45. Things We Like in YARN ▪ A global view of all running applications ▪ Better queue management for organizational isolation ▪ Consolidation with the rest of our clusters
  46. Cost Savings ▪ We achieve cost savings with YARN through queue isolation and preemption
  47. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  48. Spark at Pinterest ▪ We are still in the early stages ▪ Spark represents 12% of all compute resource usage ▪ Batch use cases ▪ Mostly Scala, with some PySpark
  49. We Are Working On ▪ Automatic migration from Hive -> Spark SQL and Cascading/Scalding -> Spark ▪ Adopting Dr. Elephant for Spark: used for code review, integrated with our internal metrics system, and including features from Sparklens ▪ Spark History Server performance
  50. Contact: xyao@pinterest.com, jdai@pinterest.com
  51. Feedback ▪ Your feedback is important to us. Don't forget to rate and review the sessions.
