SlideShare a Scribd company logo
1 of 38
Download to read offline
Migration from Redshift to
Spark
Sky Yin
Data scientist @ Stitch Fix
What’s Stitch Fix?
• An online personal clothes styling service
since 2011
A loop of recommendation
• We recommend clothes through a
combination of human stylists and
algorithms
Inventory
Clients
Humans
Algorithms
What do I do?
• As part of a ~80 people data team, I work on
inventory data infrastructureand analysis
• The biggest table I manage adds ~500-800M
rows per day
Inventory
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
– Gap between data scientist daily work flow and
production deployment
– Not every job requires big data
Data infrastructure: computation
• Data science jobs are mostly done in Python
or R, and later deployed through Docker
• As business grows, Redshift was brought in
to help extract relevant data
Redshift: the good parts
• Can be very fast
• Familiar interface: SQL
• Managed by AWS
• Can scale up and down on demand
• Cost effective*
Redshift: the troubles
• Congestion: too many people running
queries in the morning (~80 people data
team)
Redshift: the troubles
• Congestion: too many people running
queries in the morning
– Scale up and down on demand?
Redshift: the troubles
• Congestion: too many people running
queries in the morning
– Scale up and down on demand?
– Scalable =/= easy to scale
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and exploratory queries
share the same cluster
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and ad-hoc queries
share the same cluster
– Yet another cluster to isolate dev from prod?
Redshift: the troubles
• Congestion: too many people running
queries in the morning
• Production pipelines and dev queries share
the same cluster
– Yet another cluster to isolate dev from prod?
• Sync between two clusters
• Will hit scalability problem again (even on prod)
Redshift: the troubles
• Congestion: too many people running queries in
the morning
• Production pipelines and ad-hoc queries share
the same cluster
• There’s no isolation between computationand
storage
– Can’t scale computation without paying for storage
– Can’t scale storage without paying for unneeded CPUs
Data Infrastructure: storage
• Single source of truth: S3, not Redshift
• Compatible interface: Hive metastore
• With the foundation in storage, the transition
from Redshift to Spark is much easier
S3
Journey to Spark
• Netflix Genie as a “job server” on top of a
group of Spark clusters in EMR
Internal Python/R
libs to return DF
Job execution
service: Flotilla
Dev
Prod
On-demand
Manual job
submission
Journey to Spark: SQL
• Heavy on SparkSQL with a mix of PySpark
• Lower migration cost and gentle learning
curve
• Data storage layout is transparent to users
– Every write we create a new folder with timestamp
as batch_id, to avoid consistency problem in S3
Journey to Spark: SQL
• Difference in functions and syntax
– Redshift
– SparkSQL
Journey to Spark: SQL
• Difference in functions and syntax
• Partition
– In Redshift, it’s defined by dist_key in schema
– In Spark, the term is a bit confusing
• There’s another kind of partition on storage level
• In our case, it’s defined in Hive metastore
– S3://bucket/table/country=US/year=2012/
Journey to Spark: SQL
• Functions and syntax
• Partition
• SparkSQL
– DataFrame API
Journey to Spark: SQL
• Functions and syntax
• Partition
• Sorting
– Redshift defines sort_key in schema
– Spark can specify that in runtime
OR != ORDER BY
Journey to Spark: Performance
• There’s a price in performance if we naively
treat Spark as yet another SQL data
warehouse
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
– Aggressive caching
Journey to Spark: Performance
• Key difference: Spark is an in-memory
computing platform while Redshift is not
• FAQ #2: “Why is my Spark job so slow?”
– Aggressive cache
– Use partition and filtering
– Detect data skew
– Small file problem in S3
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? Hmm, the log
says some executors were killed by
YARN???”
Journey to Spark: Performance
• FAQ #1: “Why did my job fail?”
– Executor memory and GC tuning
• Adding more memory doesn’t always work
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
• Repartition and salting
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
• Repartition and salting
• spark.sql.shuffle.partitions
– On big data, 200 is too small: 2G limit on shuffle partition
– On small data, 200 is too big: small file problem in S3
Journey to Spark: Performance
• FAQ #1: “Why did my job fail? “
– Executor memory and GC tuning
– Data skew in shuffling
– Broadcasting a large table unfortunately
• Missing table statistics in Hive metastore
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
– Nested structure in sub-queries
– Lack of tools to manage complexity
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
• CTE: one trick pony
Journey to Spark: Productivity
• SQL, as a language, doesn’t scale well
• CTE: one trick pony
=> Spark temp tables
Journey to Spark: Productivity
• Inspiration from data flow languages
– Spark DataFrames API
– dplyr in R
– Pandas pipe
• Need for a higher level abstraction
Journey to Spark: Productivity
• Dataset.transform() to chain functions
Thank You.
Contact me @piggybox or syin@stitchfix.com

More Related Content

What's hot

What is a PRINCE2 Project Management
What is a PRINCE2 Project ManagementWhat is a PRINCE2 Project Management
What is a PRINCE2 Project ManagementprojectingIT
 
Project Procurement Management
 Project Procurement Management  Project Procurement Management
Project Procurement Management Serdar Temiz
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptxHarissh16
 
Introduce Project Management
Introduce Project ManagementIntroduce Project Management
Introduce Project Managementguest90bddb
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQueryPradeep Bhadani
 
05 project scope management
05  project scope management05  project scope management
05 project scope managementAla Ibrahim
 
PMP Exam Prep Module 1.pptx
PMP Exam Prep Module 1.pptxPMP Exam Prep Module 1.pptx
PMP Exam Prep Module 1.pptxAmjadAlk
 
Resource Scheduling
Resource SchedulingResource Scheduling
Resource SchedulingNicola2903
 
Chap 7.2 Estimate Cost
Chap 7.2 Estimate CostChap 7.2 Estimate Cost
Chap 7.2 Estimate CostAnand Bobade
 
Project scope management
Project scope managementProject scope management
Project scope managementANOOP S NAIR
 
Construindo APIs com Amazon API Gateway e AWS Lambda
Construindo APIs com Amazon API Gateway e AWS LambdaConstruindo APIs com Amazon API Gateway e AWS Lambda
Construindo APIs com Amazon API Gateway e AWS LambdaAmazon Web Services LATAM
 
Project Schedule Management - PMBOK6
Project Schedule Management - PMBOK6Project Schedule Management - PMBOK6
Project Schedule Management - PMBOK6Agus Suhanto
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud Alithya
 
Creating the project charter
Creating the project charterCreating the project charter
Creating the project charterbadrux
 
The Triple Constraint of Project Management - Lou Bergner
The Triple Constraint of Project Management - Lou BergnerThe Triple Constraint of Project Management - Lou Bergner
The Triple Constraint of Project Management - Lou BergnerMAX Technical Training
 
ProjectLibre1.5 - Lesson 5 - Reports
ProjectLibre1.5 - Lesson 5 - ReportsProjectLibre1.5 - Lesson 5 - Reports
ProjectLibre1.5 - Lesson 5 - ReportsHezequias Vasconcelos
 
2.project lifecycle
2.project lifecycle2.project lifecycle
2.project lifecyclerlabsza
 
Programmable logic-control
Programmable logic-controlProgrammable logic-control
Programmable logic-controlakashganesan
 

What's hot (20)

What is a PRINCE2 Project Management
What is a PRINCE2 Project ManagementWhat is a PRINCE2 Project Management
What is a PRINCE2 Project Management
 
Project Procurement Management
 Project Procurement Management  Project Procurement Management
Project Procurement Management
 
bigquery.pptx
bigquery.pptxbigquery.pptx
bigquery.pptx
 
Introduce Project Management
Introduce Project ManagementIntroduce Project Management
Introduce Project Management
 
Getting started with BigQuery
Getting started with BigQueryGetting started with BigQuery
Getting started with BigQuery
 
05 project scope management
05  project scope management05  project scope management
05 project scope management
 
PMP Exam Prep Module 1.pptx
PMP Exam Prep Module 1.pptxPMP Exam Prep Module 1.pptx
PMP Exam Prep Module 1.pptx
 
Resource Scheduling
Resource SchedulingResource Scheduling
Resource Scheduling
 
Project management
Project managementProject management
Project management
 
Chap 7.2 Estimate Cost
Chap 7.2 Estimate CostChap 7.2 Estimate Cost
Chap 7.2 Estimate Cost
 
Project scope management
Project scope managementProject scope management
Project scope management
 
Kalite Yönetim Planının 7 Özelliği
Kalite Yönetim Planının 7 Özelliği Kalite Yönetim Planının 7 Özelliği
Kalite Yönetim Planının 7 Özelliği
 
Construindo APIs com Amazon API Gateway e AWS Lambda
Construindo APIs com Amazon API Gateway e AWS LambdaConstruindo APIs com Amazon API Gateway e AWS Lambda
Construindo APIs com Amazon API Gateway e AWS Lambda
 
Project Schedule Management - PMBOK6
Project Schedule Management - PMBOK6Project Schedule Management - PMBOK6
Project Schedule Management - PMBOK6
 
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
nter-pod Revolutions: Connected Enterprise Solution in Oracle EPM Cloud
 
Creating the project charter
Creating the project charterCreating the project charter
Creating the project charter
 
The Triple Constraint of Project Management - Lou Bergner
The Triple Constraint of Project Management - Lou BergnerThe Triple Constraint of Project Management - Lou Bergner
The Triple Constraint of Project Management - Lou Bergner
 
ProjectLibre1.5 - Lesson 5 - Reports
ProjectLibre1.5 - Lesson 5 - ReportsProjectLibre1.5 - Lesson 5 - Reports
ProjectLibre1.5 - Lesson 5 - Reports
 
2.project lifecycle
2.project lifecycle2.project lifecycle
2.project lifecycle
 
Programmable logic-control
Programmable logic-controlProgrammable logic-control
Programmable logic-control
 

Similar to Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky Yin

Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesDatabricks
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and MRégis Baccaro
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planningasya999
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerSarah Dutkiewicz
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsIke Ellis
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedInkbajda
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestDatabricks
 
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Sascha Wenninger
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 DistilledGrig Gheorghiu
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core ConceptsJon Haddad
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]shuwutong
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...Lucas Jellema
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

Similar to Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky Yin (20)

Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
Creating Reusable Geospatial Pipelines
Creating Reusable Geospatial PipelinesCreating Reusable Geospatial Pipelines
Creating Reusable Geospatial Pipelines
 
ETL for the masses with Power Query and M
ETL for the masses with Power Query and METL for the masses with Power Query and M
ETL for the masses with Power Query and M
 
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
2013 CPM Conference, Nov 6th, NoSQL Capacity Planning
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
 
Relational data modeling trends for transactional applications
Relational data modeling trends for transactional applicationsRelational data modeling trends for transactional applications
Relational data modeling trends for transactional applications
 
Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
Navigating SAP’s Integration Options (Mastering SAP Technologies 2013)
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]Kb 40 kevin_klineukug_reading20070717[1]
Kb 40 kevin_klineukug_reading20070717[1]
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
The Evolution of the Oracle Database - Then, Now and Later (Fontys Hogeschool...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 

Recently uploaded (20)

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 

Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky Yin

  • 1. Migration from Redshift to Spark Sky Yin Data scientist @ Stitch Fix
  • 2. What’s Stitch Fix? • An online personal clothes styling service since 2011
  • 3. A loop of recommendation • We recommend clothes through a combination of human stylists and algorithms Inventory Clients Humans Algorithms
  • 4. What do I do? • As part of a ~80 people data team, I work on inventory data infrastructureand analysis • The biggest table I manage adds ~500-800M rows per day Inventory
  • 5. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker
  • 6. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker – Gap between data scientist daily work flow and production deployment – Not every job requires big data
  • 7. Data infrastructure: computation • Data science jobs are mostly done in Python or R, and later deployed through Docker • As business grows, Redshift was brought in to help extract relevant data
  • 8. Redshift: the good parts • Can be very fast • Familiar interface: SQL • Managed by AWS • Can scale up and down on demand • Cost effective*
  • 9. Redshift: the troubles • Congestion: too many people running queries in the morning (~80 people data team)
  • 10. Redshift: the troubles • Congestion: too many people running queries in the morning – Scale up and down on demand?
  • 11. Redshift: the troubles • Congestion: too many people running queries in the morning – Scale up and down on demand? – Scalable =/= easy to scale
  • 12. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and exploratory queries share the same cluster
  • 13. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and ad-hoc queries share the same cluster – Yet another cluster to isolate dev from prod?
  • 14. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and dev queries share the same cluster – Yet another cluster to isolate dev from prod? • Sync between two clusters • Will hit scalability problem again (even on prod)
  • 15. Redshift: the troubles • Congestion: too many people running queries in the morning • Production pipelines and ad-hoc queries share the same cluster • There’s no isolation between computationand storage – Can’t scale computation without paying for storage – Can’t scale storage without paying for unneeded CPUs
  • 16. Data Infrastructure: storage • Single source of truth: S3, not Redshift • Compatible interface: Hive metastore • With the foundation in storage, the transition from Redshift to Spark is much easier S3
  • 17. Journey to Spark • Netflix Genie as a “job server” on top of a group of Spark clusters in EMR Internal Python/R libs to return DF Job execution service: Flotilla Dev Prod On-demand Manual job submission
  • 18. Journey to Spark: SQL • Heavy on SparkSQL with a mix of PySpark • Lower migration cost and gentle learning curve • Data storage layout is transparent to users – Every write we create a new folder with timestamp as batch_id, to avoid consistency problem in S3
  • 19. Journey to Spark: SQL • Difference in functions and syntax – Redshift – SparkSQL
  • 20. Journey to Spark: SQL • Difference in functions and syntax • Partition – In Redshift, it’s defined by dist_key in schema – In Spark, the term is a bit confusing • There’s another kind of partition on storage level • In our case, it’s defined in Hive metastore – S3://bucket/table/country=US/year=2012/
  • 21. Journey to Spark: SQL • Functions and syntax • Partition • SparkSQL – DataFrame API
  • 22. Journey to Spark: SQL • Functions and syntax • Partition • Sorting – Redshift defines sort_key in schema – Spark can specify that in runtime OR != ORDER BY
  • 23. Journey to Spark: Performance • There’s a price in performance if we naively treat Spark as yet another SQL data warehouse
  • 24. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not
  • 25. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?”
  • 26. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?” – Aggressive caching
  • 27. Journey to Spark: Performance • Key difference: Spark is an in-memory computing platform while Redshift is not • FAQ #2: “Why is my Spark job so slow?” – Aggressive cache – Use partition and filtering – Detect data skew – Small file problem in S3
  • 28. Journey to Spark: Performance • FAQ #1: “Why did my job fail? Hmm, the log says some executors were killed by YARN???”
  • 29. Journey to Spark: Performance • FAQ #1: “Why did my job fail?” – Executor memory and GC tuning • Adding more memory doesn’t always work
  • 30. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling • Repartition and salting
  • 31. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling • Repartition and salting • spark.sql.shuffle.partitions – On big data, 200 is too small: 2G limit on shuffle partition – On small data, 200 is too big: small file problem in S3
  • 32. Journey to Spark: Performance • FAQ #1: “Why did my job fail? “ – Executor memory and GC tuning – Data skew in shuffling – Broadcasting a large table unfortunately • Missing table statistics in Hive metastore
  • 33. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well – Nested structure in sub-queries – Lack of tools to manage complexity
  • 34. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well • CTE: one trick pony
  • 35. Journey to Spark: Productivity • SQL, as a language, doesn’t scale well • CTE: one trick pony => Spark temp tables
  • 36. Journey to Spark: Productivity • Inspiration from data flow languages – Spark DataFrames API – dplyr in R – Pandas pipe • Need for a higher level abstraction
  • 37. Journey to Spark: Productivity • Dataset.transform() to chain functions
  • 38. Thank You. Contact me @piggybox or syin@stitchfix.com