From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Xin Yao, Daniel Dai
Pinterest
About us
Xin Yao xyao@pinterest.com
▪ Tech Lead at Pinterest (Ads Team)
▪ Previously on Data Warehouse team at Facebook and Hulu
Daniel Dai jdai@pinterest.com
▪ Tech Lead at Pinterest (Data Team)
▪ PMC member for Apache Hive and Pig
▪ Previously worked at Cloudera/Hortonworks and Yahoo
Agenda
▪ NextGen Big Data Platform
▪ Performance
▪ S3 Consistency
▪ Storage Difference
▪ Scheduling
▪ Spark at Pinterest
Big Data Platform
▪ Stack: Spark, Hive, and Presto on Mesos/Aurora, with HDFS and Kafka
▪ Use Cases
▪ Ads
▪ Machine Learning
▪ Recommendations
▪ ...
Old vs New cluster
▪ Old Cluster: Spark, Hive, and Presto on Mesos/Aurora, with HDFS and Kafka
▪ New Cluster: Spark, Hive, and Presto on YARN, with S3 and Kafka
Identify Bottleneck of Old Cluster
Low local disk IOPS → Slow Shuffle → Slow Job → Slow Workflow
Old Cluster: Performance Bottleneck
Why Local Disk IO is important for Spark
▪ Spark mappers write shuffle data to local disk
▪ Spark mappers read local disk to serve shuffle data to reducers
▪ Spark spills data to local disk when it does not fit in memory
A Simple Aggregation Query
SELECT id, max(value)
FROM table
GROUP BY id
9k Mappers * 9k Reducers
▪ 9k mappers write shuffle data to mapper local disk
▪ 9k reducers fetch it over the network
9k * 9k | One Mapper Machine
▪ One mapper machine runs 30 mappers
▪ Local disk: 30 mappers * 9k reducers = 270k IO ops
▪ Too many for our machine
How to optimize jobs in old Cluster
Optimization. Reduce # of Mapper/Reducer
▪ 3k mappers, 3k reducers
▪ More files per Mapper
Optimization | One Mapper Machine
▪ One mapper machine runs 10 mappers
▪ Mapper local disk: 30k IO ops
▪ 9X better
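The per-machine numbers on these slides follow from a simple product: every mapper on a machine serves one shuffle block to every reducer. A minimal sketch of that arithmetic, assuming one local disk operation per (mapper, reducer) pair; the function name is illustrative only:

```python
def shuffle_io_ops_per_machine(mappers_per_machine: int, reducers: int) -> int:
    """Local disk operations needed to serve shuffle data from one mapper machine.

    Assumption: each mapper serves exactly one shuffle block per reducer.
    """
    return mappers_per_machine * reducers


# Before: 30 mappers on one machine, 9k reducers in the job.
before = shuffle_io_ops_per_machine(mappers_per_machine=30, reducers=9_000)

# After: 10 mappers on one machine, 3k reducers in the job.
after = shuffle_io_ops_per_machine(mappers_per_machine=10, reducers=3_000)

print(before)           # 270000
print(after)            # 30000
print(before // after)  # 9
```

Cutting mappers and reducers by 3x each reduces the per-machine operation count by 9x, which is the "9X better" on the slide.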
Result
Building NEW cluster
New Cluster: Choose the right EC2 instance
                               Old Cluster    New Cluster
EC2 Node Local Disk IOPS Cap   3k             40k
EC2 Node Type                  r4.16xlarge    r5d.12xlarge
EC2 Node CPU                   64 vcores      48 vcores
EC2 Node Mem                   480 GB         372 GB
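To see why the IOPS cap matters more than CPU or memory here, a back-of-envelope sketch can estimate how long one mapper machine needs to serve its shuffle requests at each cap. It assumes one disk operation per shuffle block, a disk that runs exactly at its cap, and no other bottleneck; these are simplifications, not measurements from the talk.

```python
def seconds_to_serve(io_ops: int, iops_cap: int) -> float:
    """Lower bound on the time to complete io_ops when the disk is capped at iops_cap."""
    return io_ops / iops_cap


# 30 mappers on one machine, each serving 9k reducers (figures from the earlier slides).
ops = 30 * 9_000

old_cluster_s = seconds_to_serve(ops, 3_000)   # old node: 3k IOPS cap
new_cluster_s = seconds_to_serve(ops, 40_000)  # new node: 40k IOPS cap

print(old_cluster_s)  # 90.0
print(new_cluster_s)  # 6.75
```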
Production Result
▪ After migration, production jobs improved by 25% on average,
without any extra resources or tuning
▪ One typical heavy job improved by 35%, from 90
minutes to 57 minutes
Key Takeaways
▪ Measure before Optimize
▪ Premature optimization is the root of all evil
▪ Optimization could happen at different levels
▪ Cluster level
▪ New EC2 instance type
▪ Spark level
▪ Mapper number/cpu/mem tuning
▪ Job level
▪ Simplify logic
S3 != HDFS
▪ HDFS is a filesystem that is strongly consistent. Changes are
immediately visible
▪ S3 is an object storage service that is eventually consistent. There
is no guarantee that a change is immediately visible to the client. A
reader can miss files without even knowing it.
Read-after-write consistency
Spark job reads fewer files from S3
How often does this happen
▪ Numbers from AWS: less than 1 in 10 million
▪ Numbers from our test: less than 1 in 50 million
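A per-file rate this small still adds up when a job lists many files. The sketch below estimates the chance that at least one file is missed, assuming each file is missed independently at the quoted rate. Independence is an assumption made for illustration, and because the slide says "less than", the results are upper bounds.

```python
def prob_any_missing(files_read: int, per_file_miss_rate: float) -> float:
    """Probability that at least one of files_read files is not visible.

    Assumption: misses are independent, each with probability per_file_miss_rate.
    """
    return 1.0 - (1.0 - per_file_miss_rate) ** files_read


aws_rate = 1 / 10_000_000   # "less than 1 in 10 million"
test_rate = 1 / 50_000_000  # "less than 1 in 50 million"

for n in (1_000, 100_000, 10_000_000):
    print(n, prob_any_missing(n, aws_rate), prob_any_missing(n, test_rate))
```

Under these assumptions, a job that reads ten million files at the AWS rate has roughly a 63% chance of missing at least one.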
Solution. Considerations
▪ Write Consistency
▪ Whether the job can write its output consistently, without partial or
corrupted data, as long as the job succeeds, even when some tasks
fail or are retried.
▪ Read Consistency
▪ Whether the job reads exactly the files in a folder that it is supposed
to read, no more and no fewer.
▪ Monitor Consistency
▪ Requires Reader or Writer side change
▪ Query Performance
Solution. Considerations
▪ Storage
▪ Isolation
▪ Transaction
▪ Supports Spark
▪ Supports Hive/Presto
▪ Project Origin
▪ Adoption Effort
Solutions sorted by complexity, simple => complex
                     | Raw S3   | Data Quality | Read Monitor | Write Waiting | Write Listing | S3Committer | Consistent Listing (S3Guard) | Iceberg          | Delta Lake
Monitor Consistency  | No       | Partial      | Partial      | No            | No            | No          | N/A                          | N/A              | N/A
Write Consistency    | No       | No           | No           | No            | No            | Yes         | Yes                          | Yes              | Yes
Read Consistency     | No       | No           | Partial      | Partial       | Partial       | No          | Yes                          | Yes              | Yes
Reader/Writer change | No       | No           | No           | Writer        | Writer        | Writer      | R/W                          | R/W              | R/W
Query Performance    | Normal   | Normal       | Normal       | Normal        | Normal        | Normal      | Normal                       | Good             | Good
Storage              | Normal   | Normal       | Normal       | Normal        | Normal        | Normal      | Normal                       | Good             | Good
Isolation            | No       | No           | No           | No            | No            | No          | No                           | Strong Snapshot  | Strong Snapshot
Transaction          | No       | No           | No           | No            | No            | No          | No                           | Table            | Table
Supports Spark       | Yes      | Yes          | Yes          | Yes           | Yes           | Yes         | Yes                          | Yes              | Yes
Supports Hive/Presto | Yes      | Yes          | Yes          | Yes*          | Yes*          | Yes         | Yes                          | WIP              | WIP
Project Origin       | In House | In House     | Not Exist    | Not Exist     | Not Exist     | Netflix OSS | Hadoop 3.0                   | Apache Incubator | Databricks OSS
Effort (values as listed in the source): None, M, M, M, M, L, XL, XL
Our Approach
▪ Short Term
▪ S3 Committer
▪ File count monitor
▪ Data quality tool
▪ Long Term
▪ Systematic solutions
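One way to picture a file count monitor: the writer records how many part files it produced, and a reader checks its listing against that number before trusting it. This is a minimal sketch on a local temp directory standing in for an S3 prefix. The manifest name `_file_count.json` and both helper functions are hypothetical, not the tool described in the talk.

```python
import json
import os
import tempfile
from typing import Iterable

MANIFEST = "_file_count.json"  # hypothetical manifest name


def record_expected(output_dir: str, written_files: Iterable[str]) -> None:
    """Writer side: store the number of part files the job knows it wrote."""
    count = sum(1 for _ in written_files)
    with open(os.path.join(output_dir, MANIFEST), "w") as fh:
        json.dump({"expected_files": count}, fh)


def listing_is_complete(output_dir: str, listed_files: Iterable[str]) -> bool:
    """Reader side: compare a directory listing with the recorded count."""
    with open(os.path.join(output_dir, MANIFEST)) as fh:
        expected = json.load(fh)["expected_files"]
    seen = [name for name in listed_files if name.startswith("part-")]
    return len(seen) == expected


with tempfile.TemporaryDirectory() as out:
    written = [f"part-{i:05d}" for i in range(3)]
    for name in written:
        with open(os.path.join(out, name), "w") as fh:
            fh.write("data")
    record_expected(out, written)

    # A complete listing passes; a listing that lost one file is flagged.
    full_ok = listing_is_complete(out, os.listdir(out))
    partial_ok = listing_is_complete(out, written[:-1])

print(full_ok, partial_ok)
```

The check only detects a short or long listing. It does not repair it, so a reader would still need to retry or fail the job.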
Performance Comparison: S3 vs HDFS
▪ Similar throughput
▪ Metadata operations are slow, especially move operations
▪ Our Spark streaming job is heavily impacted
▪ Spending most of its time moving output files around (3 times)
Microbatch runtime: 13s (HDFS) vs 55s (S3)
Dealing with Metadata Operation
▪ Files move at least twice in a Spark application
▪ commitTask
▪ commitJob
▪ May also move to the Hive table location
df.write.mode(SaveMode.Append).insertInto(partitionedTable)
output/_temporary/taskId/_temporary/taskAttemptID/part-xxxxxx
→ commitTask →
output/_temporary/taskId/part-xxxxxx
→ commitJob →
output/part-xxxxxx
→ Hive MoveTask →
/warehouse/pinterest.db/table/date=20200626
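The three moves can be replayed on a local filesystem to make the sequence concrete. This is only an analogy: locally a rename is a cheap metadata change, while S3 has no native rename, so each move becomes a copy plus a delete. The `move` helper and the `part-00000` file name are illustrative.

```python
import os
import tempfile
from typing import List, Tuple


def move(src: str, dst: str, log: List[Tuple[str, str]]) -> None:
    """Rename src to dst, creating the target directory, and record the move."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(src, dst)
    log.append((src, dst))


with tempfile.TemporaryDirectory() as root:
    moves: List[Tuple[str, str]] = []
    attempt = os.path.join(
        root, "output/_temporary/taskId/_temporary/taskAttemptID/part-00000")
    task = os.path.join(root, "output/_temporary/taskId/part-00000")
    job = os.path.join(root, "output/part-00000")
    table = os.path.join(
        root, "warehouse/pinterest.db/table/date=20200626/part-00000")

    # The task attempt writes its output under its own temporary directory.
    os.makedirs(os.path.dirname(attempt))
    with open(attempt, "w") as fh:
        fh.write("row\n")

    move(attempt, task, moves)  # commitTask
    move(task, job, moves)      # commitJob
    move(job, table, moves)     # Hive MoveTask

    move_count = len(moves)
    landed = os.path.exists(table)

print(move_count, landed)
```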
Reduce Move Operations
▪ FileOutputCommitter algorithm 2
▪ spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
▪ Skips the job-level move; only the task-level move remains
▪ DirectOutputCommitter
▪ Also skips the task-level move
▪ Problem: file corruption when the job fails
▪ Netflix s3committer
▪ spark.sql.sources.outputCommitterClass=com.netflix.bdp.s3.S3PartitionedOutputCommitter
▪ Uses the multipart upload API, no move operation
▪ Other solutions
▪ Iceberg
▪ Hadoop s3acommitter
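The two settings quoted above are alternatives, and either can be passed to `spark-submit` with `--conf key=value`. A small sketch that renders each option as command-line arguments; the dictionary names and the helper are illustrative, while the keys and values are the ones on the slide:

```python
from typing import Dict, List

# Option 1: FileOutputCommitter algorithm 2 (skips the job-level move).
ALGORITHM_V2: Dict[str, str] = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}

# Option 2: Netflix s3committer (multipart upload, no move).
NETFLIX_S3COMMITTER: Dict[str, str] = {
    "spark.sql.sources.outputCommitterClass":
        "com.netflix.bdp.s3.S3PartitionedOutputCommitter",
}


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_spark_submit_args(ALGORITHM_V2)))
print(" ".join(to_spark_submit_args(NETFLIX_S3COMMITTER)))
```

The Netflix committer also needs its jar on the classpath, which this sketch does not cover.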
Multipart Upload API
▪ For every file
▪ initiateMultipartUpload
▪ Multiple uploadPart
▪ Finally completeMultipartUpload/abortMultipartUpload
▪ AWS keeps all parts until completeMultipartUpload/abortMultipartUpload
▪ Set up a lifecycle policy
▪ Separate S3 permission for abortMultipartUpload
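Because parts of an unfinished upload are kept (and billed) until the upload is completed or aborted, a bucket lifecycle rule is the usual safety net. The sketch below builds such a rule as a plain dictionary in the shape the S3 bucket lifecycle configuration accepts. The rule ID and the 7-day window are assumptions chosen for illustration, not values from the talk.

```python
import json


def abort_incomplete_uploads_rule(days: int, prefix: str = "") -> dict:
    """Lifecycle configuration that aborts multipart uploads left incomplete."""
    return {
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
            }
        ]
    }


policy = abort_incomplete_uploads_rule(days=7)  # 7 days is an assumed window
print(json.dumps(policy, indent=2))
```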
S3committer
▪ Uploads files directly to the output location using the multipart upload API
▪ Atomic completeMultipartUpload leaves no corrupt output
▪ Uploads parts of a file in parallel to increase throughput
▪ commitTask: uploadPart
▪ commitJob: completeMultipartUpload
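The commit protocol can be modelled with a toy in-memory store. Tasks upload their parts but leave the upload pending; the job commit either completes every pending upload or aborts them all. The class and function names below are hypothetical and only mimic the multipart calls named on the slides. The sketch also assumes parts are uploaded at task commit and completed at job commit.

```python
import uuid
from typing import Dict, List


class InMemoryObjectStore:
    """Toy stand-in for S3: parts stay invisible until the upload completes."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self._pending: Dict[str, dict] = {}

    def initiate_multipart_upload(self, key: str) -> str:
        upload_id = uuid.uuid4().hex
        self._pending[upload_id] = {"key": key, "parts": {}}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> None:
        self._pending[upload_id]["parts"][part_number] = data

    def complete_multipart_upload(self, upload_id: str) -> None:
        upload = self._pending.pop(upload_id)
        ordered = [upload["parts"][n] for n in sorted(upload["parts"])]
        self.objects[upload["key"]] = b"".join(ordered)  # object appears in one step

    def abort_multipart_upload(self, upload_id: str) -> None:
        self._pending.pop(upload_id, None)


def commit_task(store: InMemoryObjectStore, key: str, chunks: List[bytes]) -> str:
    """Task side: upload every part, but leave the upload pending."""
    upload_id = store.initiate_multipart_upload(key)
    for number, chunk in enumerate(chunks, start=1):
        store.upload_part(upload_id, number, chunk)
    return upload_id


def commit_job(store: InMemoryObjectStore, pending: List[str], success: bool) -> None:
    """Job side: complete all pending uploads on success, abort them on failure."""
    for upload_id in pending:
        if success:
            store.complete_multipart_upload(upload_id)
        else:
            store.abort_multipart_upload(upload_id)


store = InMemoryObjectStore()
pending = [
    commit_task(store, "output/part-00000", [b"ab", b"cd"]),
    commit_task(store, "output/part-00001", [b"ef"]),
]
visible_before_commit = sorted(store.objects)
commit_job(store, pending, success=True)
visible_after_commit = sorted(store.objects)

failed = InMemoryObjectStore()
failed_pending = [commit_task(failed, "output/part-00000", [b"ab"])]
commit_job(failed, failed_pending, success=False)
visible_after_failure = sorted(failed.objects)

print(visible_before_commit, visible_after_commit, visible_after_failure)
```

Nothing is visible before the job commit, and a failed job leaves no partial output. That is the property the move-based committers obtain through renames.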
The Last Move Operation
▪ Before: Use staging directory to figure out the new partitions
▪ Table: ds=20200101 through ds=20200111
▪ Staging Directory: ds=20200112
▪ After: A table level tracking file for the new partitions
▪ Table: ds=20200101 through ds=20200112
▪ .s3_dyn_parts: ds=20200112
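A tracking file lets the writer put new partitions straight into the table location and still tell the next step which partitions are new, without listing or moving a staging directory. The sketch below uses a local temp directory as the table location. The file name `.s3_dyn_parts` comes from the slide; the one-partition-per-line format and both helpers are assumptions.

```python
import os
import tempfile
from typing import List

TRACKING_FILE = ".s3_dyn_parts"


def record_new_partitions(table_dir: str, partitions: List[str]) -> None:
    """Writer side: list the partitions this job just wrote (assumed format)."""
    with open(os.path.join(table_dir, TRACKING_FILE), "w") as fh:
        fh.write("\n".join(partitions) + "\n")


def read_new_partitions(table_dir: str) -> List[str]:
    """Next step: learn the new partitions without scanning the table directory."""
    path = os.path.join(table_dir, TRACKING_FILE)
    if not os.path.exists(path):
        return []
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as table_dir:
    # The table holds ds=20200101 .. ds=20200112; only the last one is new.
    for day in range(1, 13):
        os.makedirs(os.path.join(table_dir, f"ds=202001{day:02d}"))
    record_new_partitions(table_dir, ["ds=20200112"])
    new_partitions = read_new_partitions(table_dir)

print(new_partitions)  # ['ds=20200112']
```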
The Result
▪ Microbatch runtime before: 13s (HDFS) vs 55s (S3)
▪ Microbatch runtime after: 13s (HDFS) vs 11s (S3)
Fix Bucket Rate Limit Issue (503)
▪ S3 bucket partition
▪ Task and Job level retry
▪ Tune the parallelism in part file uploads
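The slides do not say how the retries are implemented. A common generic pattern for throttling errors is exponential backoff with jitter, sketched here with a hypothetical `SlowDownError` standing in for a 503 response. It illustrates the pattern only and is not the fix used in the talk.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 503 (rate limit) response."""


def with_backoff(call: Callable[[], T],
                 max_attempts: int = 5,
                 base_delay_s: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry call() on SlowDownError, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except SlowDownError:
            if attempt == max_attempts:
                raise
            # Full jitter: wait a random time up to base * 2^(attempt - 1).
            sleep(random.uniform(0.0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_upload() -> str:
    """Fails twice with a rate-limit error, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise SlowDownError("503 Slow Down")
    return "uploaded"


# sleep is replaced with a no-op so the example runs instantly.
result = with_backoff(flaky_upload, sleep=lambda _seconds: None)
print(result, attempts["count"])
```

Lowering upload parallelism, as the last bullet suggests, reduces how often this retry path is taken at all.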
Improving S3Committer
▪ Fix Bucket Rate Limit (503)
▪ Parallel upload parts of a file to increase throughput
▪ Integrity check of S3 multipart upload ETags
▪ Fix thread pool leaking for long-running application
▪ Remove local output early
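An ETag integrity check can be sketched as follows. For multipart uploads without SSE-KMS or SSE-C, the ETag is commonly observed to be the MD5 of the concatenated per-part MD5 digests, followed by a dash and the part count. That format is an assumption here rather than something the slides state, and both function names are illustrative.

```python
import hashlib
from typing import List


def multipart_etag(parts: List[bytes]) -> str:
    """ETag in the commonly observed multipart format: md5(md5(p1)+md5(p2)+...)-N."""
    digests = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


def upload_is_intact(local_parts: List[bytes], reported_etag: str) -> bool:
    """Compare the locally computed ETag with the one the service reported."""
    return multipart_etag(local_parts) == reported_etag.strip('"')


parts = [b"a" * 1024, b"b" * 512]
reported = '"' + multipart_etag(parts) + '"'  # services usually quote the ETag

intact = upload_is_intact(parts, reported)
corrupted = upload_is_intact([b"a" * 1024, b"x" * 512], reported)
print(intact, corrupted)
```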
S3 Benefits Compared to HDFS
▪ 80% lower storage cost
▪ S3: 99.99% availability, 99.999999999% durability
▪ HDFS: 99.9% target availability
▪ NameNode single point of failure
▪ Potential data loss
Things We Miss in Mesos
▪ Manage services inside Mesos
▪ Simple workflow, long running job and cron job via Aurora
▪ Rolling restart
▪ Built-in health check
Things We Like in YARN
▪ Global view of all running applications
▪ Better queue management for organization isolation
▪ Consolidate with the rest of our clusters
Cost Saving
▪ We achieve cost savings with YARN
▪ Queue isolation
▪ Preemption
Spark at Pinterest
▪ We are still in the early stages
▪ Spark represents 12% of all compute resource usage
▪ Batch use case
▪ Mostly Scala, also PySpark
We Are Working On
▪ Automatic migration from Hive -> Spark SQL
▪ Cascading/Scalding -> Spark
▪ Adopting Dr Elephant for Spark
▪ Used for code review
▪ Integrate with internal metrics system
▪ Include features from Sparklens
▪ Spark history server performance
xyao@pinterest.com
jdai@pinterest.com
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.

More Related Content

What's hot

Building and running cloud native cassandra
Building and running cloud native cassandraBuilding and running cloud native cassandra
Building and running cloud native cassandra
Vinay Kumar Chella
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
EDB
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
Amazon Web Services
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Amazon Web Services
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
Databricks
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
Amazon Web Services
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
greggulrich
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
Info Alchemy Corporation
 
Deep Dive: AWS Command Line Interface
Deep Dive: AWS Command Line InterfaceDeep Dive: AWS Command Line Interface
Deep Dive: AWS Command Line Interface
Amazon Web Services
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
Amazon Web Services
 
A deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
A deep dive into Amazon MSK - ADB206 - Chicago AWS SummitA deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
A deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
Amazon Web Services
 
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Amazon Web Services
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
Databricks
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planning
JamieAlquiza
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
Amazon Web Services
 
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
Amazon Web Services
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
Amazon Web Services
 

What's hot (20)

Building and running cloud native cassandra
Building and running cloud native cassandraBuilding and running cloud native cassandra
Building and running cloud native cassandra
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Cassandra Operations at Netflix
Cassandra Operations at NetflixCassandra Operations at Netflix
Cassandra Operations at Netflix
 
AWS glue technical enablement training
AWS glue technical enablement trainingAWS glue technical enablement training
AWS glue technical enablement training
 
Deep Dive: AWS Command Line Interface
Deep Dive: AWS Command Line InterfaceDeep Dive: AWS Command Line Interface
Deep Dive: AWS Command Line Interface
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
A deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
A deep dive into Amazon MSK - ADB206 - Chicago AWS SummitA deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
A deep dive into Amazon MSK - ADB206 - Chicago AWS Summit
 
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
Deep Dive on Amazon Aurora with PostgreSQL Compatibility (DAT305-R1) - AWS re...
 
Moving to Databricks & Delta
Moving to Databricks & DeltaMoving to Databricks & Delta
Moving to Databricks & Delta
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planning
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
[NEW LAUNCH!] Deep Dive on Amazon FSx for Windows File Server (STG322-R) - AW...
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 

Similar to From HDFS to S3: Migrate Pinterest Apache Spark Clusters

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark Pipelines
Rose Toomey
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelines
Rose Toomey
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
Codemotion Tel Aviv
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine Data
Databricks
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
Yaroslav Tkachenko
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
Databricks
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
Dongwon Kim
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
Flink Forward
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
HostedbyConfluent
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Databricks
 

Similar to From HDFS to S3: Migrate Pinterest Apache Spark Clusters (20)

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark Pipelines
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelines
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Revealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine DataRevealing the Power of Legacy Machine Data
Revealing the Power of Legacy Machine Data
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 




From HDFS to S3: Migrate Pinterest Apache Spark Clusters

  • 2. From HDFS to S3: Migrate Pinterest Apache Spark Clusters Xin Yao, Daniel Dai Pinterest
  • 3. About us Xin Yao xyao@pinterest.com ▪ Tech Lead at Pinterest (Ads Team) ▪ Previously on Data Warehouse team at Facebook and Hulu Daniel Dai jdai@pinterest.com ▪ Tech Lead at Pinterest (Data Team) ▪ PMC member for Apache Hive and Pig ▪ Previously worked at Cloudera/Hortonworks and Yahoo
  • 4. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 5. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 6. Big Data Platform Spark Hive Mesos/Aurora HDFS Kafka Presto ▪ Use Cases ▪ Ads ▪ Machine Learning ▪ Recommendations ▪ ...
  • 7. Old vs New cluster Spark Hive Mesos/Aurora HDFS Kafka Old Cluster New Cluster Presto Spark Hive YARN S3 Kafka Presto
  • 8. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 10. Old Cluster: Performance Bottleneck ▪ Low local disk IOPS => Slow Shuffle => Slow Job => Slow Workflow
  • 11. Why Local Disk IO is important for Spark ▪ Spark mappers write shuffle data to local disk ▪ Spark mappers read local disk to serve shuffle data to reducers ▪ Spark spills data to local disk when the data does not fit in memory
  • 12. A Simple Aggregation Query SELECT id, max(value) FROM table GROUP BY id
  • 13. 9k Mappers * 9k Reducers ▪ [Diagram: 9k mappers write to mapper local disk; 9k reducers read from it over the network]
  • 14. 9k * 9k | One Mapper Machine ▪ One mapper machine runs 30 mappers ▪ Local disk: 270k IO ops ▪ Too many for our machine
  • 15. How to optimize jobs in old Cluster
  • 16. Optimization. Reduce # of Mapper/Reducer ▪ More input files per mapper ▪ [Diagram: 3k mappers write to mapper local disk; 3k reducers read from it over the network]
  • 17. Optimization ▪ One mapper machine runs 10 mappers ▪ Mapper local disk: 30k IO ops ▪ 9X better
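The 270k and 30k figures on slides 14 and 17 can be reproduced with a short calculation. This is a minimal sketch that assumes one local-disk operation per (mapper, reducer) pair on a machine, which is the simplification the slides appear to use; the function name is hypothetical.

```python
def shuffle_io_ops_per_machine(mappers_per_machine: int, reducers: int) -> int:
    """Estimate local-disk shuffle operations served by one mapper machine.

    Assumption: every reducer fetches one shuffle block from every mapper,
    so the machine serves mappers_per_machine * reducers reads.
    """
    return mappers_per_machine * reducers


# Before: 9k mappers and 9k reducers, 30 mappers on one machine.
before = shuffle_io_ops_per_machine(mappers_per_machine=30, reducers=9_000)
# After: 3k mappers and 3k reducers, 10 mappers on one machine.
after = shuffle_io_ops_per_machine(mappers_per_machine=10, reducers=3_000)

print(before)           # 270000
print(after)            # 30000
print(before // after)  # 9
```

Cutting both the mapper and reducer counts by 3 reduces the per-machine operation count by 3 * 3 = 9, which matches the "9X better" label.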
  • 20. New Cluster: Choose the right EC2 instance ▪ EC2 Node Local Disk IOPS Cap: 3k (old cluster), 40k (new cluster) ▪ EC2 Node Type: r4.16xlarge (old cluster), r5d.12xlarge (new cluster) ▪ EC2 Node CPU: 64 vcores (old cluster), 48 vcores (new cluster) ▪ EC2 Node Mem: 480 GB (old cluster), 372 GB (new cluster)
  • 21. Production Result ▪ After migration, prod jobs improved by 25% on average, without any extra resources or tuning ▪ One typical heavy job improved by 35%, from 90 minutes to 57 minutes ▪ [Chart: Old Cluster vs New Cluster]
  • 22. Key Takeaways ▪ Measure before Optimize ▪ Premature optimization is the root of all evil
  • 23. Key Takeaways ▪ Optimization could happen at different levels ▪ Cluster level ▪ New EC2 instance type ▪ Spark level ▪ Mapper number/cpu/mem tuning ▪ Job level ▪ Simplify logic
  • 24. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 25. S3 != HDFS ▪ HDFS is a filesystem that is strongly consistent. Changes are immediately visible ▪ S3 is an object storage service that is eventually consistent. There is no guarantee that a change is immediately visible to the client. This can cause missing-file issues without the reader even knowing it.
  • 26. Read after write consistency
  • 27. Spark job reads fewer files from S3
  • 28. How often does this happen ▪ Numbers from AWS: less than 1 in 10 million ▪ Numbers from our test: less than 1 in 50 million
  • 29. Solution. Considerations ▪ Write Consistency ▪ Whether the job can write its output consistently, without partial or corrupted data, as long as the job succeeds, even when some tasks fail or are retried. ▪ Read Consistency ▪ Whether the job can read the files in a folder, no more and no fewer than it is supposed to read. ▪ Monitor Consistency ▪ Requires Reader or Writer side change ▪ Query Performance
  • 30. Solution. Considerations ▪ Storage ▪ Isolation ▪ Transaction ▪ Supports Spark ▪ Supports Hive/Presto ▪ Project Origin ▪ Adoption Effort
  • 31. Solutions sorted by complexity, simple => complex
      Columns, in order: Raw S3 | Data Quality | Read Monitor | Write Waiting | Write Listing | S3Committer | Consistent Listing S3Guard | Iceberg | Delta Lake
      ▪ Monitor Consistency: No | Partial | Partial | No | No | No | N/A | N/A | N/A
      ▪ Write Consistency: No | No | No | No | No | Yes | Yes | Yes | Yes
      ▪ Read Consistency: No | No | Partial | Partial | Partial | No | Yes | Yes | Yes
      ▪ Reader/Writer change: No | No | No | Writer | Writer | Writer | R/W | R/W | R/W
      ▪ Query Performance: Normal | Normal | Normal | Normal | Normal | Normal | Normal | Good | Good
      ▪ Storage: Normal | Normal | Normal | Normal | Normal | Normal | Normal | Good | Good
      ▪ Isolation: No | No | No | No | No | No | No | Strong Snapshot | Strong Snapshot
      ▪ Transaction: No | No | No | No | No | No | No | Table | Table
      ▪ Supports Spark: Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
      ▪ Supports Hive/Presto: Yes | Yes | Yes | Yes* | Yes* | Yes | Yes | WIP | WIP
      ▪ Project Origin: In House | In House | Not Exist | Not Exist | Not Exist | Netflix OSS | Hadoop 3.0 | Apache Incubator | Databricks OSS
      ▪ Effort: None | M | M | M | M | L | XL | XL
  • 32. Our Approach ▪ Short Term ▪ S3 Committer ▪ Number-of-files monitor ▪ Data quality tool ▪ Long Term ▪ Systematic solutions
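The deck does not describe how the number-of-files monitor is implemented. As a rough illustration of the idea only, the sketch below re-lists a location until the listing shows the file count the writer reported, and fails loudly otherwise. The function, the `LaggingListing` class, and all parameters are hypothetical and are not Pinterest's tool.

```python
import time
from typing import Callable, List


def wait_for_expected_files(
    list_files: Callable[[], List[str]],
    expected_count: int,
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> List[str]:
    """Re-list until at least expected_count files are visible, else raise."""
    seen: List[str] = []
    for _ in range(attempts):
        seen = list_files()
        if len(seen) >= expected_count:
            return seen
        time.sleep(delay_seconds)
    raise RuntimeError(
        f"expected {expected_count} files, listing shows {len(seen)}"
    )


class LaggingListing:
    """Hypothetical stand-in for an eventually consistent listing.

    The last file only becomes visible from the third call onward.
    """

    def __init__(self, files: List[str]) -> None:
        self.files = files
        self.calls = 0

    def __call__(self) -> List[str]:
        self.calls += 1
        return self.files if self.calls >= 3 else self.files[:-1]


listing = LaggingListing(["part-00000", "part-00001", "part-00002"])
files = wait_for_expected_files(listing, expected_count=3)
print(len(files), listing.calls)  # 3 3
```

A check like this only detects missing files; it does not make the listing consistent, which is in line with the "Partial" ratings the comparison gives the monitoring approaches.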
  • 33. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 34. Performance Comparison: S3 vs HDFS ▪ Similar throughput ▪ Metadata operations are slow, especially move operations ▪ Our Spark streaming job is heavily impacted ▪ It spends most of its time moving output files around (3 times) ▪ [Chart: Microbatch Runtime, 13s vs 55s]
  • 35. Dealing with Metadata Operation ▪ Files are moved at least twice in a Spark application ▪ commitTask ▪ commitJob ▪ May also be moved to the Hive table location ▪ df.write.mode(SaveMode.Append).insertInto(partitionedTable) ▪ output/_temporary/taskId/_temporary/taskAttemptID/part-xxxxxx => (commitTask) => output/_temporary/taskId/part-xxxxxx => (commitJob) => output/part-xxxxxx => (Hive MoveTask) => /warehouse/pinterest.db/table/date=20200626
  • 36. Reduce Move Operations ▪ FileOutputCommitter algorithm 2 ▪ spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ▪ Skips the move operation at job level; only the task-level move remains ▪ DirectOutputCommitter ▪ Also skips the move operation at task level ▪ Problem: file corruption when the job fails ▪ Netflix s3committer ▪ spark.sql.sources.outputCommitterClass=com.netflix.bdp.s3.S3PartitionedOutputCommitter ▪ Uses the multipart upload API, no move operation ▪ Other solutions ▪ Iceberg ▪ Hadoop s3acommitter
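To make the move counts on slides 35 and 36 concrete, here is a small local-filesystem simulation. It is a simplified sketch and not the Hadoop FileOutputCommitter code: the directory layout follows the paths on slide 35, `run_job` and the `attempt_0` directory name are hypothetical, and cleanup of `_temporary` is omitted. On S3 each move is far more expensive than here, because a rename is carried out as a copy followed by a delete.

```python
import os
import tempfile
from typing import List


def run_job(output_dir: str, task_ids: List[str], algorithm_version: int) -> int:
    """Write one part file per task, commit, and return the number of moves."""
    moves = 0
    for task_id in task_ids:
        part = f"part-{task_id}"
        task_dir = os.path.join(output_dir, "_temporary", task_id)
        attempt_dir = os.path.join(task_dir, "_temporary", "attempt_0")
        os.makedirs(attempt_dir)
        with open(os.path.join(attempt_dir, part), "w") as handle:
            handle.write("data")

        # commitTask: version 1 moves the file to the task directory,
        # version 2 moves it straight to the final output directory.
        if algorithm_version == 1:
            target = os.path.join(task_dir, part)
        else:
            target = os.path.join(output_dir, part)
        os.rename(os.path.join(attempt_dir, part), target)
        moves += 1

    # commitJob: only version 1 still has files left to move.
    if algorithm_version == 1:
        for task_id in task_ids:
            part = f"part-{task_id}"
            os.rename(
                os.path.join(output_dir, "_temporary", task_id, part),
                os.path.join(output_dir, part),
            )
            moves += 1
    return moves


tasks = ["000000", "000001", "000002"]
with tempfile.TemporaryDirectory() as v1_dir, tempfile.TemporaryDirectory() as v2_dir:
    v1_moves = run_job(v1_dir, tasks, algorithm_version=1)
    v2_moves = run_job(v2_dir, tasks, algorithm_version=2)
    v1_parts = sorted(n for n in os.listdir(v1_dir) if n.startswith("part-"))
    v2_parts = sorted(n for n in os.listdir(v2_dir) if n.startswith("part-"))

print(v1_moves, v2_moves)  # 6 3
```

Both versions end with the same part files in the output directory; version 2 gets there with one move per file instead of two.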
  • 37. Multipart Upload API ▪ For every file ▪ initiateMultipartUpload ▪ Multiple uploadPart ▪ Finally completeMultipartUpload/abortMultipartUpload ▪ AWS will save all parts until completeMultipartUpload/abortMultipartUpload ▪ Set up a lifecycle policy ▪ Separate S3 permission for abortMultipartUpload
  • 38. S3committer ▪ Uploads files directly to the output location using the multipart upload API ▪ Atomic completeMultipartUpload leaves no corrupt output ▪ Uploads the parts of a file in parallel to increase throughput ▪ [Diagram: uploadPart at commitTask; completeMultipartUpload at commitJob]
  • 39. The Last Move Operation ▪ Before: Use staging directory to figure out the new partitions ▪ After: A table level tracking file for the new partitions ▪ [Diagram, before: Table holds ds=20200101 through ds=20200111; Staging Directory holds ds=20200112] ▪ [Diagram, after: Table holds ds=20200101 through ds=20200112; .s3_dyn_parts lists ds=20200112]
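Because parts are kept until the upload is completed or aborted, the lifecycle policy mentioned above is presumably there to clean up uploads that were never finished. The sketch below builds such a rule as a Python dictionary. `AbortIncompleteMultipartUpload` and `DaysAfterInitiation` are part of the S3 lifecycle configuration format; the rule ID and the 7-day window are assumptions, and the deck does not give Pinterest's actual settings.

```python
import json

# Lifecycle rule that aborts multipart uploads still incomplete after N days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
            "AbortIncompleteMultipartUpload": {
                "DaysAfterInitiation": 7  # assumed retention window
            },
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

With boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.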
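The reason a multipart upload leaves no corrupt output is that nothing becomes visible under the target key until the upload is completed. The sketch below models that with an in-memory class. `FakeObjectStore` is a hypothetical stand-in used for illustration, not the S3 API or the s3committer code; only the four operation names mirror the calls listed on slide 37.

```python
from typing import Dict


class FakeObjectStore:
    """In-memory model of multipart upload visibility (not the real S3 API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}              # visible objects
        self._pending: Dict[str, Dict[int, bytes]] = {}  # parts per upload
        self._keys: Dict[str, str] = {}                  # upload id -> key
        self._next_id = 0

    def initiate_multipart_upload(self, key: str) -> str:
        self._next_id += 1
        upload_id = f"upload-{self._next_id}"
        self._pending[upload_id] = {}
        self._keys[upload_id] = key
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> None:
        # Parts are stored but stay invisible to readers.
        self._pending[upload_id][part_number] = data

    def complete_multipart_upload(self, upload_id: str) -> None:
        # The object appears in one step, assembled in part-number order.
        parts = self._pending.pop(upload_id)
        key = self._keys.pop(upload_id)
        self.objects[key] = b"".join(parts[n] for n in sorted(parts))

    def abort_multipart_upload(self, upload_id: str) -> None:
        # Discard the parts; nothing was ever visible.
        self._pending.pop(upload_id)
        self._keys.pop(upload_id)


store = FakeObjectStore()

ok = store.initiate_multipart_upload("output/part-000000")
store.upload_part(ok, 2, b"world")   # parts may arrive out of order
store.upload_part(ok, 1, b"hello ")
visible_before_complete = "output/part-000000" in store.objects
store.complete_multipart_upload(ok)

failed = store.initiate_multipart_upload("output/part-000001")
store.upload_part(failed, 1, b"partial")
store.abort_multipart_upload(failed)  # e.g. the job failed

print(visible_before_complete)  # False
print(store.objects)            # {'output/part-000000': b'hello world'}
```

The aborted upload never produces an object, which is the property that lets a committer drop the move step without risking partial files in the output location.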
  • 41. Fix Bucket Rate Limit Issue (503) ▪ S3 bucket partition ▪ Task and Job level retry ▪ Tune the parallelism in part file uploads
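The slide only says that retries were added at task and job level. As one common way to retry rate-limited calls, the sketch below wraps an operation in exponential backoff with jitter. Backoff with jitter is a choice made for this example rather than something the deck states; `SlowDownError`, `with_retries`, and `flaky_upload` are hypothetical names.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class SlowDownError(Exception):
    """Hypothetical stand-in for an S3 503 Slow Down response."""


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on SlowDownError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SlowDownError:
            if attempt == max_attempts:
                raise
            # Random delay up to base_delay * 2^(attempt - 1), so that
            # concurrent tasks do not all retry at the same moment.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


calls = {"n": 0}


def flaky_upload() -> str:
    """Fails twice with a simulated 503, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise SlowDownError("503 Slow Down")
    return "uploaded"


delays: List[float] = []
result = with_retries(flaky_upload, sleep=delays.append)
print(result, calls["n"], len(delays))  # uploaded 3 2
```

Passing `sleep` as a parameter keeps the example fast to run; real code would use the default `time.sleep`. Lowering upload parallelism, the third bullet on the slide, reduces how often such retries are needed in the first place.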
  • 42. Improving S3Committer ▪ Fix Bucket Rate Limit (503) ▪ Parallel upload parts of a file to increase throughput ▪ Integrity check of S3 multipart upload ETags ▪ Fix thread pool leaking for long-running application ▪ Remove local output early
  • 43. S3 Benefits Compared to HDFS ▪ Reduces storage cost by 80% ▪ S3: 99.99% availability, 99.999999999% durability ▪ HDFS: 99.9% target availability ▪ NameNode is a single point of failure ▪ Potential data loss
  • 44. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 45. Things We Miss in Mesos ▪ Manage services inside Mesos ▪ Simple workflow, long running job and cron job via Aurora ▪ Rolling restart ▪ Built-in health check
  • 46. Things We Like in YARN ▪ Global view of all running applications ▪ Better queue management for organization isolation ▪ Consolidate with the rest of the clusters
  • 47. Cost Saving ▪ We achieve cost savings with YARN ▪ Queue isolation ▪ Preemption
  • 48. Agenda ▪ NextGen Big Data Platform ▪ Performance ▪ S3 Consistency ▪ Storage Difference ▪ Scheduling ▪ Spark at Pinterest
  • 49. Spark at Pinterest ▪ We are still in the early stages ▪ Spark represents 12% of all compute resource usage ▪ Batch use case ▪ Mostly Scala, also PySpark
  • 50. We Are Working On ▪ Automatic migration from Hive -> Spark SQL ▪ Cascading/Scalding -> Spark ▪ Adopting Dr Elephant for Spark ▪ Used for code review ▪ Integrate with internal metrics system ▪ Include features from Sparklens ▪ Spark history server performance
  • 52. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.