Migrating ETL workflow to Spark at scale in Pinterest
Daniel Dai, Zirui Li
Pinterest Inc
About Us
• Daniel Dai
• Tech Lead at Pinterest
• PMC member for Apache Hive and Pig
• Zirui Li
• Software Engineer at Pinterest Spark Platform Team
• Focus on building Pinterest in-house Spark platform & functionalities
Agenda
▪ Spark @ Pinterest
▪ Cascading/Scalding to Spark Conversion
▪ Technical Challenges
▪ Migration Process
▪ Result and Future Plan
We Are on Cloud
• We use AWS
• However, we build our own clusters
• Avoid vendor lock-in
• Timely support by our own team
• We store everything on S3
• Costs less than HDFS
• HDFS is for temporary storage
[Diagram: S3 shared across clusters; each cluster runs on EC2 with HDFS and Yarn]
Spark Clusters
• We have a couple of Spark clusters
• From several hundred nodes to 1000+ nodes
• Spark only cluster and mixed use cluster
• Cross cluster routing
• R5D instance type for Spark only cluster
• Faster local disk
• High memory to cpu ratio
Spark Versions and Use Cases
• We are running Spark 2.4
• With quite a few internal fixes
• Will migrate to 3.1 this year
• Use cases
• Production use cases
• SparkSQL, PySpark, Spark Native via Airflow
• Ad hoc use cases
• SparkSQL via Querybook, PySpark via Jupyter
Migration Plan
• 40% of workloads are already on Spark
• The number was 12% one year ago
• Migration in progress
• Hive to SparkSQL
• Cascading/Scalding to Spark
• Hadoop streaming to Spark pipe
[Chart labels: Hive, Cascading/Scalding, Hadoop Streaming]
Where are we?
Migration Plan
• Half of the workloads are still on Cascading/Scalding
• ETL use cases
• Spark Future
• Query engine: Presto/SparkSQL
• ETL: Spark native
• Machine learning: PySpark
Agenda
• Spark in Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Cascading
• Simple DAG
• Only 6 different pipes
• Most logic in UDF
• Each – UDF in map
• Every – UDF in reduce
• Java API
Pattern 1: Source → Each → GroupBy → Every → Sink
Pattern 2: Source → Each → CoGroup → Every → Sink, with a second Source → Each branch feeding the CoGroup
Scalding
• Rich set of operators on top of Cascading
• Operators are very similar to Spark RDD
• Scala API
Migration Path
SparkSQL
+ SQL is easy to migrate to any engine
− UDF interface is private
Recommended if there are not many UDFs
PySpark
+ Rich Python libraries available to use
− Suboptimal performance, especially for Python UDFs
Recommended for machine learning only
Native Spark
+ Most structured path to enjoy rich Spark syntax
+ Works for almost all Cascading/Scalding applications
Default and recommended for general cases
Spark API
Spark Dataframe/Dataset
+ Newer and recommended API
− Most inputs are thrift sequence files
− Encoding/decoding thrift objects to/from a dataframe is slow
Recommended only for non-thrift-sequence files
RDD
+ More flexible in handling thrift object serialization / deserialization
+ Semantically close to Scalding
− Older API
− Less performant than Dataframe
Default choice for the conversion
Approach
▪ Rewrite the application manually
▪ Reuse most of the Cascading/Scalding library code
▪ However, avoid Cascading-specific structures
▪ Automatic tool to help with result validation & performance tuning
Translate Cascading
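• The DAG is usually simple
• Most Cascading pipes have a one-to-one mapping to a Spark transformation
// HashJoin simulated with a broadcast variable; tokenFreq must be small enough to collect to the driver
// val processedInput: RDD[(String, Token)]
// val tokenFreq: RDD[(String, Double)]
val tokenFreqVar = spark.sparkContext.broadcast(tokenFreq.collectAsMap())
val joined = processedInput.map {
  t => (t._1, (t._2, tokenFreqVar.value.get(t._1))) // get returns an Option: None when the key has no match
}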
Cascading Pipe | Spark RDD Operator | Note
Each | Map side UDF |
Every | Reduce side UDF |
Merge | union |
CoGroup | join / leftOuterJoin / rightOuterJoin / fullOuterJoin |
GroupBy | groupBy / groupByKey | secondary sort might be needed
HashJoin | Broadcast join | no native support in RDD; simulate via broadcast variable
• Complexity is in UDF
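As a rough illustration of the mapping, Pattern 1 (Source → Each → GroupBy → Every → Sink) can be sketched with plain Scala collections standing in for RDDs. This is only a sketch under assumptions: `parseEvent`, `totalsByUser`, and the comma-separated record layout are hypothetical, and on a real RDD `groupBy` returns `RDD[(K, Iterable[T])]` rather than a `Map`.

```scala
object Pattern1Sketch {
  // "Each": a map-side UDF that filters and transforms in one step.
  // Returning an Option lets flatMap drop malformed records.
  def parseEvent(line: String): Option[(String, Int)] =
    line.split(",") match {
      case Array(user, n) if n.nonEmpty && n.forall(_.isDigit) => Some((user, n.toInt))
      case _ => None
    }

  // "GroupBy" followed by "Every": group by key, then aggregate each group.
  def totalsByUser(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(parseEvent) // Each
      .groupBy(_._1) // GroupBy
      .map { case (user, recs) => user -> recs.map(_._2).sum } // Every
}
```

The same `flatMap` call shape is available on an RDD, which is one way to express a Cascading function that both filters and transforms.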
UDF Translation: Cascading UDF vs Spark
Semantic difference
▪ Cascading UDF: does both filtering & transformation; Java
▪ Spark: map + filter; Scala
Multi-threading
▪ Cascading UDF: single-thread model
▪ Spark: multi-thread model; worst case, set executor-cores=1
UDF initialization and cleanup
▪ Cascading UDF: class with initialization & cleanup
▪ Spark: no init / cleanup hook; use mapPartitions to simulate
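.mapPartitions { iter =>
  // init block: expensive initialization, runs once per partition
  val out = scala.collection.mutable.ArrayBuffer.empty[Any]
  while (iter.hasNext) {
    val event = iter.next()
    out += process(event)
  }
  // cleanup block: runs after the whole partition has been processed
  out.iterator // mapPartitions must return an iterator; note this buffers the partition in memory
}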
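A lazy variant of the init/cleanup pattern can avoid buffering a whole partition. The sketch below is plain Scala with the same `Iterator[A] => Iterator[B]` shape that `mapPartitions` expects. The `Lookup` class and `processPartition` are hypothetical names used for illustration, not part of Spark or of the original talk.

```scala
object PartitionLifecycle {
  // Hypothetical resource standing in for an expensive UDF dependency.
  final class Lookup {
    var closed = false
    def resolve(id: Int): String = s"item-$id"
    def close(): Unit = closed = true
  }

  // Wraps the input iterator; cleanup fires once the partition is exhausted.
  def processPartition(events: Iterator[Int], lookup: Lookup): Iterator[String] =
    new Iterator[String] {
      def hasNext: Boolean = {
        val more = events.hasNext
        if (!more && !lookup.closed) lookup.close() // cleanup block
        more
      }
      def next(): String = lookup.resolve(events.next())
    }
}
```

With Spark this might be used as `rdd.mapPartitions(it => PartitionLifecycle.processPartition(it, new PartitionLifecycle.Lookup))`. One caveat of this design: if the consumer stops before exhausting the iterator, or the task fails midway, the cleanup does not run.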
Translate Scalding
• Most operators have a one-to-one mapping to an RDD operator
• UDFs can be used in Spark without change
Scalding Operator | Spark RDD Operator | Note
map | map |
flatMap | flatMap |
filter | filter |
filterNot | filter | Spark does not have filterNot; use filter with a negated condition
groupBy | groupBy |
group | groupByKey |
groupAll | groupBy(t => 1) |
...
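Two of the less obvious rows, filterNot and groupAll, can be sketched as follows. Scala collections stand in for RDDs here (RDD exposes `filter` and `groupBy` with analogous shapes), and `isBot`, `humans`, and the sample data are hypothetical.

```scala
object ScaldingOperatorSketch {
  // Hypothetical predicate used only for illustration.
  def isBot(userAgent: String): Boolean = userAgent.toLowerCase.contains("bot")

  // Scalding filterNot(isBot) becomes filter with the negated condition.
  def humans(userAgents: Seq[String]): Seq[String] =
    userAgents.filter(ua => !isBot(ua))

  // Scalding groupAll becomes groupBy with a constant key, i.e. a single group.
  def groupAll[T](records: Seq[T]): Map[Int, Seq[T]] =
    records.groupBy(_ => 1)
}
```

Because a constant key sends every record to one group, the Spark equivalent of groupAll puts all data on a single reducer, so it is probably best kept for small inputs.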
Agenda
• Spark in Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Secondary Sort
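• Use “repartitionAndSortWithinPartitions” in Spark
• There is a gap in semantics: Cascading hands the UDF one sorted iterator per group key, while Spark yields a single sorted iterator per partition; use GroupSortedIterator to fill the gap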
output = new GroupBy(output, new Fields("user_id"), new Fields("sec_key"));
// first Fields argument: group key; second Fields argument: sort key
Input, as (group key, sort key), value:
(2, 2), "apple"
(1, 3), "facebook"
(1, 1), "pinterest"
(1, 2), "twitter"
(3, 2), "google"
Cascading:
iterator for key 1:
(1, 1), "pinterest"
(1, 2), "twitter"
(1, 3), "facebook"
iterator for key 2:
(2, 2), "apple"
iterator for key 3:
(3, 2), "google"
Spark:
(1, 1), "pinterest"
(1, 2), "twitter"
(1, 3), "facebook"
(2, 2), "apple"
(3, 2), "google"
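The internals of GroupSortedIterator are not shown in the slides. A minimal sketch of the idea, assuming the input is already sorted by (group key, sort key) as repartitionAndSortWithinPartitions would leave it within a partition, might look like this. The `GroupSorted` object and `groupSorted` function are hypothetical names, and each group is materialized into a `List` for simplicity.

```scala
object GroupSorted {
  /** Splits a sorted stream of ((groupKey, sortKey), value) into one entry per group key. */
  def groupSorted[K, S, V](sorted: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
    val it = sorted.buffered // buffered lets us peek at the next key without consuming it
    new Iterator[(K, List[(S, V)])] {
      def hasNext: Boolean = it.hasNext
      def next(): (K, List[(S, V)]) = {
        val key = it.head._1._1
        val group = List.newBuilder[(S, V)]
        // Consume consecutive records while the group key stays the same.
        while (it.hasNext && it.head._1._1 == key) {
          val ((_, s), v) = it.next()
          group += ((s, v))
        }
        (key, group.result())
      }
    }
  }
}
```

Applied to the sorted example data, this yields one entry for key 1 (pinterest, twitter, facebook), one for key 2 (apple), and one for key 3 (google), which matches the per-key iterators Cascading provides.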
Accumulators
• Spark accumulators are not accurate
• Stage retry
• Same code runs multiple times in different stages
• Solution
• Deduplicate with stage+partition
• persist
val sc = new SparkContext(conf)
val inputRecords = sc.longAccumulator("Input")
val a = sc.textFile("studenttab10k")
val b = a.map(line => line.split("\t"))
val c = b.map { t =>
  inputRecords.add(1L)
  (t(0), t(1).toInt, t(2).toDouble)
}
// c.persist() // uncomment so the second action can reuse c instead of recomputing it
// First action: evaluates c and increments inputRecords once per record
val sumScore = c.map(t => t._3).sum()
// Second action: without persist, c is recomputed and every record is counted again
c.map { t =>
  (t._1, t._3 / sumScore)
}.saveAsTextFile("output")
Accumulator Continue
• Retrieve the accumulator of the earliest stage
• Exception: the user intentionally uses the same accumulator in different stages
NUM_OUTPUT_TOKENS
Stage 14: 168006868318
Stage 21: 336013736636
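val sc = new SparkContext(conf)
val inputRecords = sc.longAccumulator("Input")
val input1 = sc.textFile("input1")
// split each line first (tab-separated input assumed) so t(0), t(1), t(2) are fields, not characters
val input1_processed = input1.map(_.split("\t")).map { t =>
  inputRecords.add(1L) // counted in the stage that reads input1
  (t(0), (t(1).toInt, t(2).toDouble))
}
val input2 = sc.textFile("input2")
val input2_processed = input2.map(_.split("\t")).map { t =>
  inputRecords.add(1L) // same accumulator, intentionally updated again in the stage that reads input2
  (t(0), (t(1).toInt, t(2).toDouble))
}
input1_processed.join(input2_processed)
  .saveAsTextFile("output")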
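The two ideas, deduplicating by stage and partition and then reading the earliest stage, can be sketched in plain Scala over a list of per-task updates. Everything here is an assumption made for illustration: the `TaskUpdate` record, the choice to keep the highest attempt per (stage, partition), and the function names are hypothetical and do not reflect Spark's internal accumulator bookkeeping.

```scala
object AccumulatorDedup {
  // Hypothetical record of one task's contribution to an accumulator.
  final case class TaskUpdate(stageId: Int, partitionId: Int, attempt: Int, value: Long)

  /** Keeps one update per (stage, partition) so retried tasks are not double counted. */
  def perStageTotals(updates: Seq[TaskUpdate]): Map[Int, Long] =
    updates
      .groupBy(u => (u.stageId, u.partitionId))
      .values
      .map(_.maxBy(_.attempt)) // assumption: the latest attempt is the one that counts
      .groupBy(_.stageId)
      .map { case (stage, us) => stage -> us.map(_.value).sum }

  /** Reports the total of the earliest stage that touched the accumulator. */
  def earliestStageTotal(updates: Seq[TaskUpdate]): Option[Long] = {
    val totals = perStageTotals(updates)
    if (totals.isEmpty) None else Some(totals(totals.keys.min))
  }
}
```

As the slide notes, picking the earliest stage gives the wrong answer when the same accumulator is intentionally updated in several stages, so a real implementation would need a way to opt out.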
Accumulator Tab in Spark UI
• SPARK-35197
Profiling
• Visualize flame graph using Nebula
• Realtime
• Ability to segment into stage/task
• Focus on only useful threads
OutputCommitter
• Issue with OutputCommitter
• slow metadata operation
• 503 errors
• Netflix s3committer
• Wrapper for Spark RDD
• s3committer only supports the old API
Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Automatic Migration Service (AMS)
• A tool to automate the majority of the migration process
Data Validation
▪ Row counts
▪ Checksum
▪ Comparison
▪ Create a table around output
▪ SparkSQL UDF: CountAndChecksumUdaf
Limitations (−)
▪ Doesn’t work for double/float
▪ Doesn’t work for arrays if the order is different
Source of Uncertainty
▪ Input depends on current timestamp
▪ There's a random number generator in the code
▪ Rounding differences that change the outcome of a filter condition test
▪ Unstable top result if there's a tie
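The slides do not show how CountAndChecksumUdaf is implemented. One plausible sketch of a count-plus-checksum comparison, assuming an order-independent checksum built by summing per-row hashes, is below. The `CountAndChecksum` object and the row-to-string encoding are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object CountAndChecksum {
  /** Returns (row count, checksum); the checksum is a wrapping sum of row hashes, so row order does not matter. */
  def summarize(rows: Iterable[Seq[Any]]): (Long, Long) =
    rows.foldLeft((0L, 0L)) { case ((count, sum), row) =>
      // Encode the row as a delimited string, then hash it (encoding chosen for illustration only).
      val rowHash = MurmurHash3.stringHash(row.mkString("\u0001")).toLong
      (count + 1, sum + rowHash)
    }
}
```

Two outputs would be considered equal when both the count and the checksum match. The sketch also shows where the listed limitations come from: a double that differs in the last digit, or an array column with the same elements in a different order, produces a different row string and therefore a different checksum.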
Performance Tuning
▪ Collect runtime memory/vcore usage
▪ Tuning passes if these criteria are met:
▪ Runtime reduced
▪ Vcore*sec reduced 20%+
▪ Memory increase less than 100%
▪ Retry with tuned memory / vcore if necessary
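The pass criteria can be written as a small predicate. This is a sketch under assumptions: the `RunStats` fields and their units are hypothetical, since the slides do not say how memory usage is measured.

```scala
object TuningCheck {
  // Hypothetical per-run measurements; the memory unit is an assumption.
  final case class RunStats(runtimeSec: Double, vcoreSec: Double, memoryMbSec: Double)

  /** True when the Spark run meets all three criteria relative to the baseline run. */
  def tuningPassed(baseline: RunStats, spark: RunStats): Boolean =
    spark.runtimeSec < baseline.runtimeSec && // runtime reduced
      spark.vcoreSec <= baseline.vcoreSec * 0.8 && // vcore*sec reduced by at least 20%
      spark.memoryMbSec < baseline.memoryMbSec * 2.0 // memory increase below 100%
}
```

A run that fails the check would be retried with adjusted memory or vcore settings, as the last bullet describes.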
Balancing Performance
• Trade-offs
• More executors
• Better performance, but costs more
• Use more cores per executor
• Saves memory, but costs more CPU
• Dynamic allocation usually saves cost
• Skew won’t cost more with dynamic allocation
• Control parallelism
• spark.default.parallelism for RDD
• spark.sql.shuffle.partitions for dataframe/dataset/SparkSQL
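The knobs mentioned above correspond to standard Spark configuration keys. The sketch below collects them and renders `spark-submit` arguments. The values are illustrative only and are not the settings used at Pinterest; `SparkTuningConf` and `toSubmitArgs` are hypothetical helper names.

```scala
object SparkTuningConf {
  // Illustrative values only; real settings depend on the job and the cluster.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    "spark.shuffle.service.enabled" -> "true", // commonly required with dynamic allocation on YARN
    "spark.dynamicAllocation.maxExecutors" -> "200",
    "spark.executor.cores" -> "4",
    "spark.executor.memory" -> "8g",
    "spark.default.parallelism" -> "2000", // RDD shuffles
    "spark.sql.shuffle.partitions" -> "2000" // dataframe/dataset/SparkSQL shuffles
  )

  /** Renders the settings as spark-submit arguments. */
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Raising `spark.executor.cores` while holding `spark.executor.memory` fixed is one way to express the "more cores per executor" trade-off, since more tasks then share the same executor memory.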
Automatic Migration & Failure Handling
Automatic Migration
▪ Automatically pick Spark over Cascading/Scalding during runtime if the conditions are met
▪ Data Validation Pass
▪ Performance Optimization Pass
Failure handling
▪ Automatically handle failure with handlers if applicable
▪ Configuration incorrectness
▪ OutOfMemory
▪ ...
▪ Manual troubleshooting is needed for other uncaught failures
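The handler idea can be sketched as a function from a failure message to a follow-up action. This is purely illustrative: the action types, the string matching on exception names, and the memory-doubling policy are all assumptions, not the behavior of the actual migration service.

```scala
object FailureHandling {
  sealed trait Action
  final case class RetryWithMemory(executorMemoryGb: Int) extends Action
  case object FixConfigAndRetry extends Action
  case object ManualTroubleshooting extends Action

  /** Picks a follow-up action for a failed Spark run (hypothetical heuristics). */
  def handle(errorMessage: String, executorMemoryGb: Int): Action =
    if (errorMessage.contains("OutOfMemoryError")) RetryWithMemory(executorMemoryGb * 2)
    else if (errorMessage.contains("IllegalArgumentException")) FixConfigAndRetry
    else ManualTroubleshooting // anything uncaught goes to a human
}
```

Modeling actions as a sealed trait keeps the set of automated responses explicit, so a new handler has to be added deliberately rather than by accident.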
Agenda
• Spark @ Pinterest
• Cascading/Scalding to Spark Conversion
• Technical Challenges
• Migration Process
• Result and Future Plan
Result
• 40% performance improvement
• 47% cost saving on cpu
• Use 33% more memory
Future Plan
• Manual conversion for applications that are still evolving
• Spark backend for legacy application

More Related Content

What's hot

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFs
Databricks
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
Databricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
Databricks
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
Databricks
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Li Ming Tsai
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Databricks
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Databricks
 

What's hot (20)

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Optimizing Apache Spark UDFs
Optimizing Apache Spark UDFsOptimizing Apache Spark UDFs
Optimizing Apache Spark UDFs
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing ShuffleBucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
 

Similar to Migrating ETL Workflow to Apache Spark at Scale in Pinterest

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
Databricks
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
Juan Pedro Moreno
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
Girish Khanzode
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
Spark Summit
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
Evan Chan
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
Spark Summit
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
shareddatamsft
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
DataStax Academy
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
Juan Pedro Moreno
 

Similar to Migrating ETL Workflow to Apache Spark at Scale in Pinterest (20)

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service700 Queries Per Second with Updates: Spark As A Real-Time Web Service
700 Queries Per Second with Updates: Spark As A Real-Time Web Service
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Spark Worshop
Spark WorshopSpark Worshop
Spark Worshop
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 

Migrating ETL Workflow to Apache Spark at Scale in Pinterest

  • 1. Migrating ETL workflow to Spark at scale in Pinterest Daniel Dai, Zirui Li Pinterest Inc
  • 2. About Us • Daniel Dai • Tech Lead at Pinterest • PMC member for Apache Hive and Pig • Zirui Li • Software Engineer at Pinterest Spark Platform Team • Focus on building Pinterest in-house Spark platform & functionalities
  • 3. Agenda ▪ Spark @ Pinterest ▪ Cascading/Scalding to Spark Conversion ▪ Technical Challenges ▪ Migration Process ▪ Result and Future Plan
  • 4. Agenda • Spark @ Pinterest • Cascading/Scalding to Spark Conversion • Technical Challenges • Migration Process • Result and Future Plan
  • 5. We Are on Cloud • We use AWS • However, we build our own clusters • Avoid vendor lock-in • Timely support by our own team • We store everything on S3 • Costs less than HDFS • HDFS is for temporary storage [Diagram: S3 alongside three EC2 nodes, each running HDFS and Yarn]
  • 6. Spark Clusters • We have a couple of Spark clusters • From several hundred nodes to 1000+ nodes • Spark only cluster and mixed use cluster • Cross cluster routing • R5D instance type for Spark only cluster • Faster local disk • High memory to cpu ratio
  • 7. Spark Versions and Use Cases • We are running Spark 2.4 • With quite a few internal fixes • Will migrate to 3.1 this year • Use cases • Production use cases • SparkSQL, PySpark, Spark Native via Airflow • Ad hoc use cases • SparkSQL via Querybook, PySpark via Jupyter
  • 8. Migration Plan • 40% of workloads are already on Spark • The number was 12% one year ago • Migration in progress • Hive to SparkSQL • Cascading/Scalding to Spark • Hadoop streaming to Spark pipe [Chart "Where are we?": workload share of Hive, Cascading/Scalding, Hadoop Streaming]
  • 9. Migration Plan • Half of the workloads are still on Cascading/Scalding • ETL use cases • Spark Future • Query engine: Presto/SparkSQL • ETL: Spark native • Machine learning: PySpark
  • 10. Agenda • Spark in Pinterest • Cascading/Scalding to Spark Conversion • Technical Challenges • Migration Process • Result and Future Plan
  • 11. Cascading • Simple DAG • Only 6 different pipes • Most logic in UDF • Each: UDF in map • Every: UDF in reduce • Java API
      Pattern 1: Source → Each → GroupBy → Every → Sink
      Pattern 2: Source → Each → CoGroup → Every → Sink, with a second Source → Each branch
  • 12. Scalding • Rich set of operators on top of Cascading • Operators are very similar to Spark RDD • Scala API
  • 13. Migration Path
      SparkSQL (recommended if there are not many UDFs)
        + SQL is easy to migrate to any engine
        − UDF interface is private
      PySpark (recommended for machine learning only)
        + Rich Python libraries available to use
        − Suboptimal performance, especially for Python UDFs
      Native Spark (default and recommended for general cases)
        + Most structured path to enjoy rich Spark syntax
        + Works for almost all Cascading/Scalding applications
  • 14. Spark API
      Spark Dataframe/Dataset (recommended only for non-thrift-sequence files)
        + Newer and recommended API
        − Most inputs are thrift sequence files
        − Encoding/decoding thrift objects to/from a dataframe is slow
      RDD (default choice for the conversion)
        + More flexible in handling thrift object serialization / deserialization
        + Semantically close to Scalding
        − Older API
        − Less performant than Dataframe
  • 15. Approach • Rewrite the application manually • Reuse most of the Cascading/Scalding library code • However, avoid Cascading-specific structure • Automatic tool to help with result validation & performance tuning
  • 16. Translate Cascading
      • DAG is usually simple
      • Most Cascading pipes have a one-to-one mapping to a Spark transformation
      • Complexity is in the UDF

      Cascading Pipe   Spark RDD Operator                                 Note
      Each             Map side UDF
      Every            Reduce side UDF
      Merge            union
      CoGroup          join/leftOuterJoin/rightOuterJoin/fullOuterJoin
      GroupBy          groupBy/groupByKey                                 secondary sort might be needed
      HashJoin         Broadcast join                                     no native support in RDD, simulate via broadcast variable

      HashJoin simulated with a broadcast variable:

      // val processedInput: RDD[(String, Token)]
      // val tokenFreq: RDD[(String, Double)]
      val tokenFreqVar = spark.sparkContext.broadcast(tokenFreq.collectAsMap())
      val joined = processedInput.map { t =>
        (t._1, (t._2, tokenFreqVar.value.get(t._1)))
      }
  • 17. UDF Translation

                                      Cascading UDF                              Spark
      Semantic difference             Does both filtering & transformation;      map + filter; Scala
                                      Java
      Multi-threading                 Single-thread model                        Multi-thread model; worst case set executor-cores=1
      UDF initialization and cleanup  Class with initialization & cleanup        No init / cleanup hook; use mapPartitions to simulate

      .mapPartitions { iter =>
        // Expensive initialization block
        // init block
        while (iter.hasNext) {
          val event = iter.next()
          process(event)
        }
        // cleanup block
      }
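A caveat with the mapPartitions pattern on slide 17: when the function returns a lazy iterator, the cleanup block must run only after the last record has been consumed. Below is a minimal sketch of that idea in plain Scala, with no Spark dependency. `Resource`, `process` and `processPartition` are hypothetical names used for illustration, not part of the talk or of Spark; `processPartition` has the shape of a function one could call from inside `rdd.mapPartitions`.

```scala
object PartitionUdf {
  // Hypothetical stand-in for an expensive, per-partition resource.
  final class Resource {
    var closed = false
    def process(event: String): String = event.toUpperCase
    def close(): Unit = closed = true
  }

  // Wraps the input iterator so that records are processed lazily and
  // the resource is closed once, when the input is exhausted.
  def processPartition(iter: Iterator[String], res: Resource): Iterator[String] =
    new Iterator[String] {
      def hasNext: Boolean = {
        val more = iter.hasNext
        if (!more && !res.closed) res.close() // cleanup block
        more
      }
      def next(): String = res.process(iter.next())
    }
}
```

In a real job the resource would be created at the top of the mapPartitions closure (the init block), so each partition gets its own instance.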
  • 18. Translate Scalding
      • Most operators have a one-to-one mapping to an RDD operator
      • UDFs can be used in Spark without change

      Scalding Operator   Spark RDD Operator   Note
      map                 map
      flatMap             flatMap
      filter              filter
      filterNot           filter               Spark does not have filterNot; use filter with the negated condition
      groupBy             groupBy
      group               groupByKey
      groupAll            groupBy(t=>1)
      ...
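The two less obvious rows of the slide 18 table (filterNot and groupAll) can be illustrated as follows. This is only a sketch: a plain Scala `Seq` stands in for an RDD, since `filter` and `groupBy` have the same shape on both, and the sample data is invented.

```scala
object ScaldingToSpark {
  // Invented sample records; a Seq stands in for an RDD here.
  val events: Seq[(String, Int)] = Seq(("a", 1), ("b", -2), ("c", 3))

  // Scalding: filterNot { case (_, n) => n < 0 }
  // Spark-style: filter with the negated condition.
  val nonNegative: Seq[(String, Int)] =
    events.filter { case (_, n) => !(n < 0) }

  // Scalding: groupAll
  // Spark-style: group every record under one constant key.
  val all: Map[Int, Seq[(String, Int)]] = events.groupBy(_ => 1)
}
```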
  • 19. Agenda • Spark in Pinterest • Cascading/Scalding to Spark Conversion • Technical Challenges • Migration Process • Result and Future Plan
  • 20. Secondary Sort
      • Use “repartitionAndSortWithinPartitions” in Spark
      • There is a gap in semantics: use GroupSortedIterator to fill the gap

      output = new GroupBy(output, new Fields("user_id"), new Fields("sec_key"));
      // first Fields argument: group key; second: sort key

      Input:
        (2, 2), "apple"
        (1, 3), "facebook"
        (1, 1), "pinterest"
        (1, 2), "twitter"
        (3, 2), "google"

      Cascading (one iterator per group key):
        iterator for key 1: (1, 1), "pinterest"; (1, 2), "twitter"; (1, 3), "facebook"
        iterator for key 2: (2, 2), "apple"
        iterator for key 3: (3, 2), "google"

      Spark (one sorted iterator per partition):
        (1, 1), "pinterest"
        (1, 2), "twitter"
        (1, 3), "facebook"
        (2, 2), "apple"
        (3, 2), "google"
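The gap on slide 20 is that Spark hands back one flat, sorted iterator, while Cascading hands the UDF one iterator per group key. The talk does not show how GroupSortedIterator is implemented; the sketch below is only one possible way to regroup a sorted stream, written in plain Scala. The name `groupSorted` is hypothetical, and for brevity each group is materialized as a `List` rather than streamed.

```scala
object GroupSorted {
  // Regroups a stream that is already sorted by (groupKey, sortKey)
  // into one (groupKey, values) pair per run of equal group keys.
  def groupSorted[K, S, V](
      sorted: Iterator[((K, S), V)]): Iterator[(K, List[(S, V)])] = {
    val it = sorted.buffered
    new Iterator[(K, List[(S, V)])] {
      def hasNext: Boolean = it.hasNext
      def next(): (K, List[(S, V)]) = {
        val key = it.head._1._1
        val group = List.newBuilder[(S, V)]
        // Consume records while they still belong to the current key.
        while (it.hasNext && it.head._1._1 == key) {
          val ((_, s), v) = it.next()
          group += ((s, v))
        }
        (key, group.result())
      }
    }
  }
}
```

Applied to the sorted Spark output from the slide, this yields the three per-key groups that Cascading would have produced.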
  • 21. Accumulators
      • Spark accumulators are not accurate
          • Stage retry
          • Same code runs multiple times in different stages
      • Solution
          • Deduplicate with stage+partition
          • persist

      val sc = new SparkContext(conf)
      val inputRecords = sc.longAccumulator("Input")
      val a = sc.textFile("studenttab10k")
      val b = a.map(line => line.split("\t"))
      val c = b.map { t =>
        inputRecords.add(1L)
        (t(0), t(1).toInt, t(2).toDouble)
      }
      val sumScore = c.map(t => t._3).sum()
      // c.persist()
      c.map { t => (t._1, t._3 / sumScore) }.saveAsTextFile("output")
  • 22. Accumulator (continued)
      • Retrieve the accumulator value of the earliest stage
      • Exception: the user intentionally uses the same accumulator in different stages

      NUM_OUTPUT_TOKENS
        Stage 14: 168006868318
        Stage 21: 336013736636

      val sc = new SparkContext(conf)
      val inputRecords = sc.longAccumulator("Input")
      val input1 = sc.textFile("input1")
      val input1_processed = input1.map(_.split("\t")).map { t =>
        inputRecords.add(1L)
        (t(0), (t(1).toInt, t(2).toDouble))
      }
      val input2 = sc.textFile("input2")
      val input2_processed = input2.map(_.split("\t")).map { t =>
        inputRecords.add(1L)
        (t(0), (t(1).toInt, t(2).toDouble))
      }
      input1_processed.join(input2_processed)
        .saveAsTextFile("output")
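The two accumulator fixes on slides 21 and 22 (deduplicate by stage+partition, then read the earliest stage) can be sketched in plain Scala as below. This is an illustration under assumptions, not the implementation from the talk: `DedupCounter` and its methods are hypothetical, and in a real job the stage and partition ids would come from `TaskContext.get().stageId()` and `TaskContext.get().partitionId()`.

```scala
import scala.collection.mutable

// Hypothetical driver-side counter keyed by (stageId, partitionId).
final class DedupCounter {
  private val updates = mutable.Map.empty[(Int, Int), Long]

  // A retried task overwrites its earlier report instead of adding to it.
  def report(stageId: Int, partitionId: Int, count: Long): Unit =
    updates((stageId, partitionId)) = count

  // Sum over all stages: counts recomputed lineage more than once.
  def total: Long = updates.values.sum

  // Sum over the earliest stage only, ignoring later recomputation.
  def earliestStageTotal: Long =
    if (updates.isEmpty) 0L
    else {
      val first = updates.keys.map(_._1).min
      updates.collect { case ((stage, _), n) if stage == first => n }.sum
    }
}
```

As the slide notes, reading only the earliest stage is wrong when the same accumulator is deliberately updated in several stages, as in the two-input join example, so that case needs to be treated as an exception.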
  • 23. Accumulator Tab in Spark UI • SPARK-35197
  • 24. Profiling • Visualize flame graph using Nebula • Realtime • Ability to segment into stage/task • Focus on only useful threads
  • 25. OutputCommitter • Issues with OutputCommitter • Slow metadata operations • 503 errors • Netflix s3committer • Wrapper for Spark RDD • s3committer supports only the old API
  • 26. Agenda • Spark @ Pinterest • Cascading/Scalding to Spark Conversion • Technical Challenges • Migration Process • Result and Future Plan
  • 27. Automatic Migration Service (AMS) • A tool to automate the majority of the migration process
  • 28. Data Validation
      • Row counts
      • Checksum
      • Comparison
          • Create a table around the output
          • SparkSQL UDF: CountAndChecksumUdaf
      − Doesn’t work for double/float
      − Doesn’t work for arrays if the order is different
  • 29. Source of Uncertainty
      • Input depends on the current timestamp
      • There is a random number generator in the code
      • Rounding differences that change the outcome of a filter condition test
      • Unstable top result if there is a tie
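The row count plus checksum comparison on slide 28 can be sketched as follows. The talk names CountAndChecksumUdaf but does not show it, so this plain Scala version is only an assumed approximation: each row is hashed, and the hashes are summed so that the result does not depend on row order. `Validation` and `countAndChecksum` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object Validation {
  // Returns (row count, order-insensitive checksum) for a set of rows.
  // Summing per-row hashes makes the result independent of row order,
  // but the hash of a single row still depends on the order and exact
  // value of its fields.
  def countAndChecksum(rows: Seq[Seq[Any]]): (Long, Long) =
    rows.foldLeft((0L, 0L)) { case ((count, sum), row) =>
      (count + 1L, sum + MurmurHash3.seqHash(row).toLong)
    }
}
```

This also shows where the listed limitations come from: a double that differs in its last digit, or an array column whose elements arrive in a different order, hashes differently even when the outputs are logically equivalent.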
  • 30. Performance Tuning • Collect runtime memory/vcore usage • Tuning passes if these criteria are met: ▪ Runtime reduced ▪ Vcore*sec reduced by 20%+ ▪ Memory increase less than 100% • Retry with tuned memory / vcore if necessary
  • 31. Balancing Performance • Trade-offs • More executors • Better performance, but costs more • More cores per executor • Saves memory, but costs more CPU • Dynamic allocation usually saves cost • Skew won’t cost more with dynamic allocation • Control parallelism • spark.default.parallelism for RDD • spark.sql.shuffle.partitions for dataframe/dataset/SparkSQL
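The pass criteria on slide 30 amount to a simple predicate. The sketch below assumes that all three criteria must hold together, which the slide does not state explicitly; `RunStats` and `tuningPassed` are hypothetical names, and the units are placeholders.

```scala
object Tuning {
  // Hypothetical summary of one run of a job.
  final case class RunStats(runtimeSec: Double, vcoreSec: Double, memoryMbSec: Double)

  // Assumption: a Spark run passes only if every criterion is met.
  def tuningPassed(before: RunStats, spark: RunStats): Boolean =
    spark.runtimeSec < before.runtimeSec &&        // runtime reduced
      spark.vcoreSec <= before.vcoreSec * 0.8 &&   // vcore*sec reduced by 20%+
      spark.memoryMbSec < before.memoryMbSec * 2.0 // memory increase under 100%
}
```

A run that fails the check would be retried with adjusted memory or vcore settings, as the slide describes.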
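The knobs named on slide 31 map onto standard Spark configuration keys. The sketch below collects them in plain Scala and renders them as `--conf` arguments for spark-submit. Every value is a placeholder chosen for illustration, not a recommendation from the talk, and `TuningConf` is a hypothetical name.

```scala
object TuningConf {
  // Standard Spark keys; all values are illustrative placeholders.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // Typically needed alongside dynamic allocation on YARN in Spark 2.4.
    "spark.shuffle.service.enabled" -> "true",
    "spark.executor.cores" -> "4",
    "spark.default.parallelism" -> "2000",   // RDD shuffles
    "spark.sql.shuffle.partitions" -> "2000" // dataframe/dataset/SparkSQL
  )

  // Renders the settings as spark-submit command-line arguments.
  def toSubmitArgs(s: Seq[(String, String)]): Seq[String] =
    s.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```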
  • 32. Automatic Migration & Failure Handling
      Automatic migration
      • Automatically pick Spark over Cascading/Scalding at runtime if the conditions are met
          • Data validation pass
          • Performance optimization pass
      Failure handling
      • Automatically handle failures with handlers if applicable
          • Configuration incorrectness
          • OutOfMemory
          • ...
      • Manual troubleshooting is needed for other uncaught failures
  • 33. Agenda • Spark @ Pinterest • Cascading/Scalding to Spark Conversion • Technical Challenges • Migration Process • Result and Future Plan
  • 34. Result • 40% performance improvement • 47% cost saving on cpu • Use 33% more memory
  • 35. Future Plan • Manual conversion for application still evolving • Spark backend for legacy application
  • 36. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.