9. ● Processing time, time to fresh data
● Development/Iteration time
● Cost overhead
10. ● Columnar format
● Python simplicity, JVM performance
● Unlocks SQL on HDFS
11.
• Zero downtime for analysts / users
• Fully backwards compatible
• No data loss / corruption
• Incremental rollout, to mitigate risk, limit downtime, and allow reverting
Goals
12. • Keep old code
• Introduce Parquet + Dataframes
• Rewrite slow jobs as Dataframe pipelines
Adoption Plan
18. Job Types
• Full Drop: Jobs that rebuild their output from scratch every time.
[Diagram: several inputs feeding one fully rebuilt output]
• Incremental Drop: Jobs that build their output incrementally.
Job Types
[Diagram: input partitions Part 1, Part 2, Part 3 mapped one-to-one to matching output partitions]
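As an illustrative sketch (not from the deck), the two job types can be contrasted in plain Python, where a hypothetical `build` function produces one output partition per input partition:

```python
def full_drop(build, input_parts):
    # Full Drop: rebuild every output partition from scratch on each run.
    return {part: build(part) for part in input_parts}

def incremental_drop(existing_output, build, new_part):
    # Incremental Drop: keep prior output partitions, build only the new one.
    updated = dict(existing_output)
    updated[new_part] = build(new_part)
    return updated
```

The trade-off the deck implies: Full Drop jobs are simple and self-healing but reprocess everything; Incremental Drop jobs touch less data per run but must preserve old output correctly across migrations.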
20. Forwards with Dataframes
• Is the dataset Dataframe-able?
• Checking data integrity
• Regression blocking
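One way to make "regression blocking" concrete (a sketch under assumed names, not the deck's actual tooling) is to diff the legacy job's output against the rewritten DataFrame pipeline's output before allowing rollout:

```python
from collections import Counter

def regression_check(old_rows, new_rows):
    # Hypothetical helper: block rollout unless the rewritten pipeline
    # reproduces the old output. Rows are compared as multisets, so
    # ordering differences between RDD and DataFrame runs do not fail it.
    old_counts = Counter(map(tuple, old_rows))
    new_counts = Counter(map(tuple, new_rows))
    return old_counts == new_counts
```

A duplicate row appearing or disappearing fails the check, which is exactly the kind of silent corruption the "no data loss / corruption" goal is guarding against.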
21. Will it Dataframe?
Python                               Dataframe
long (any size)                      sql.LongType: (-2^63, 2^63-1)
Decimal (default 28-digit precision) sql.DecimalType(38, 18): [20 digits].[18 digits]
22. Long:
if v < -9223372036854775808 or v > 9223372036854775807:
    logger.warning('Field: {}, Value: {}. long will not fit inside sql.LongType'.format(k, v))

Decimal:
from decimal import Context, Inexact, InvalidOperation

# Emin must be <= 0 in Python's decimal module; -18 mirrors the 18-digit scale
DECIMAL_CONTEXT = Context(prec=38, Emax=19, Emin=-18, traps=[Inexact])
try:
    DECIMAL_CONTEXT.create_decimal(v)
except (Inexact, InvalidOperation):
    logger.warning('Field: {}, Value: {}. Decimal will not fit inside sql.DecimalType(38,18)'.format(k, v))
23. Record based format
[{'a': 1},
 {'a': 2, 'b': True} ...]

if 'b' in data:
    # DataFrame path
else:
    # RDD path

Column based format
a    b
1    None
2    True
Will it Dataframe?
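The record-to-column reshaping the slide illustrates can be sketched in plain Python (DataFrame libraries perform this schema merge for you; the helper below only shows the idea):

```python
def to_columns(records):
    # Union the keys across all records, then fill missing fields with None,
    # mirroring how a DataFrame infers one schema over heterogeneous records.
    columns = sorted({key for record in records for key in record})
    return {col: [record.get(col) for record in records] for col in columns}

to_columns([{'a': 1}, {'a': 2, 'b': True}])
# -> {'a': [1, 2], 'b': [None, True]}
```

This is why the optional-field check (`if 'b' in data`) matters: in the record-based RDD world a key can simply be absent, while the columnar DataFrame world forces every row into the same schema with explicit nulls.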