© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark + AI Summit Europe, London, Oct 4, 2018
Pitfalls of Apache Spark at Scale
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
Apple Siri Open Source Team
• We’re Spark, Hadoop, and HBase PMC members / committers / contributors
• We’re advocates for open source
• Pushing our internal changes back upstream
• Working with the communities to review pull requests and to develop new features and bug fixes
Apple Siri
The world’s largest virtual assistant service powering every
iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
Apple Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
Apple Siri Scale
• A large volume of requests, with data centers all over the world
• Hadoop / YARN cluster with thousands of nodes
• HDFS holds hundreds of PB
• Hundreds of TB of raw event data per day
Siri Data Pipeline
• Downstream data consumers were running the same expensive queries with expensive joins
• Different teams had their own repos and built their own jars, making data lineage hard to track
• Raw client request data is tricky to process because it requires a deep understanding of the business logic
Unified Pipeline
• A single repo for Spark applications across Siri
• Shared business logic code to avoid discrepancies
• Raw data is cleaned, joined, and transformed into one standardized data model for data consumers to query
Technical Details about Strongly Typed Data
• The data schema is checked in as a case class, and CI ensures schema changes won’t break the jobs (a minimal sketch follows this list)
• Deeply nested relational model with 5 top-level columns
• Around 2,000 fields in total
• Stored in Parquet format, partitioned by UTC day
• Data consumers query subsets of the data
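
As a hedged illustration of the schema-as-code idea, a minimal sketch follows; the case classes, fields, and path are hypothetical stand-ins, not the actual ~2k-field Siri model.

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical, simplified stand-in for the checked-in data model.
case class Device(kind: String, osVersion: String)
case class Request(id: Long, utcDay: String, device: Device, text: String)

object SchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("schema-check").getOrCreate()

    // Schema derived from the case class: the source of truth in the repo.
    val expected = Encoders.product[Request].schema

    // Schema of the Parquet data actually on disk (path is illustrative).
    val actual = spark.read.parquet("/data/requests/utc_day=2018-10-04").schema

    // A CI job can fail the build when the two drift apart.
    require(actual == expected, s"Schema drift:\n$actual\nvs\n$expected")
    spark.stop()
  }
}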
Review of Spark APIs
DataFrame: relational untyped API introduced in Spark 1.3. From Spark 2.0,

type DataFrame = Dataset[Row]

Dataset: supports all the untyped APIs of DataFrame, plus typed functional APIs
Review of DataFrame
val df: DataFrame = Seq(
(1, "iPhone", 5),
(2, "MacBook", 4),
(3, "iPhone", 3),
(4, "iPad", 2),
(5, "appleWatch", 1)
).toDF("userId", "device", "counts")
df.printSchema()
"""
|root
||-- userId: integer (nullable = false)
||-- device: string (nullable = true)
||-- counts: integer (nullable = false)
|
""".stripMargin
Review of DataFrame
df.withColumn("counts", $"counts" + 1)
.filter($"device" === "iPhone").show()
"""
|+------+------+------+
||userId|device|counts|
|+------+------+------+
|| 1|iPhone| 6|
|| 3|iPhone| 4|
|+------+------+------+
""".stripMargin
// $"counts" == df("counts")
// via implicit conversion
Execution Plan - Dataframe Untyped APIs
df.withColumn("counts", $"counts" + 1).filter($"device" === "iPhone").explain(true)
"""
|== Parsed Logical Plan ==
|'Filter ('device = iPhone)
|+- Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
| +- Relation[userId#3,device#4,counts#5] parquet
|
|== Physical Plan ==
|*Project [userId#3, device#4, (counts#5 + 1) AS counts#10]
|+- *Filter (isnotnull(device#4) && (device#4 = iPhone))
| +- *FileScan parquet [userId#3,device#4,counts#5]
| Batched: true, Format: Parquet,
| PartitionFilters: [],
| PushedFilters: [IsNotNull(device), EqualTo(device,iPhone)],
| ReadSchema: struct<userId:int,device:string,counts:int>
|
""".stripMargin
Review of Dataset
case class ErrorEvent(userId: Long, device: String, counts: Long)
val ds = df.as[ErrorEvent]
ds.map(row => ErrorEvent(row.userId, row.device, row.counts + 1))
.filter(row => row.device == "iPhone").show()
"""
|+------+------+------+
||userId|device|counts|
|+------+------+------+
|| 1|iPhone| 6|
|| 3|iPhone| 4|
|+------+------+------+
“"".stripMargin
// It’s really easy to put existing Java / Scala code here.
Execution Plan - Dataset Typed APIs
ds.map { row => ErrorEvent(row.userId, row.device, row.counts + 1) }.filter { row =>
row.device == "iPhone"
}.explain(true)
"""
|== Physical Plan ==
|*SerializeFromObject [
| assertnotnull(input[0, com.apple.ErrorEvent, true]).userId AS userId#27L,
| assertnotnull(input[0, com.apple.ErrorEvent, true]).device AS device#28,
| assertnotnull(input[0, com.apple.ErrorEvent, true]).counts AS counts#29L]
|+- *Filter <function1>.apply
| +- *MapElements <function1>, obj#26: com.apple.ErrorEvent
| +- *DeserializeToObject newInstance(class com.apple.ErrorEvent), obj#25:
|    com.apple.ErrorEvent
| +- *FileScan parquet [userId#3,device#4,counts#5]
| Batched: true, Format: Parquet,
| PartitionFilters: [], PushedFilters: [],
| ReadSchema: struct<userId:int,device:string,counts:int>
""".stripMargin
Strongly Typed Pipeline
• A typed Dataset is used to guarantee schema consistency (see the sketch after this list)
• Enables Java/Scala interoperability between systems
• Increases data scientist productivity
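
A minimal sketch of how this plays out (names and path are hypothetical): reading with .as[T] validates the on-disk schema against the shared case class at analysis time, and existing Spark-agnostic Scala code plugs straight into the typed API.

import org.apache.spark.sql.SparkSession

case class ErrorEvent(userId: Long, device: String, counts: Long)

// Plain Scala business logic, shared across teams and systems.
object ErrorLogic {
  def bump(e: ErrorEvent): ErrorEvent = e.copy(counts = e.counts + 1)
}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// .as[ErrorEvent] fails at analysis time if the Parquet schema no
// longer lines up with the checked-in case class.
val ds = spark.read.parquet("/data/error_events").as[ErrorEvent]
ds.map(ErrorLogic.bump _).show()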
Drawbacks of Strongly Typed Pipeline
• Datasets are slower than DataFrames: https://tinyurl.com/dataset-vs-dataframe
• In a Dataset, many POJOs are created for each row, resulting in high GC pressure
• Data consumers typically query subsets of the data, but schema pruning and predicate pushdown do not work well on nested fields (a sketch of the pushdown issue follows this list)
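
To make the pushdown drawback concrete, a small sketch reusing the ErrorEvent Dataset from the earlier slides: the typed filter hides the predicate inside a closure, while the untyped filter keeps it visible to Catalyst.

// Typed filter: the predicate is an opaque Scala closure, so every row
// is deserialized into an ErrorEvent object and PushedFilters stays
// empty, as in the Dataset plan a few slides back.
ds.filter(_.device == "iPhone").explain(true)

// Untyped filter on the same Dataset: the predicate is a Catalyst
// expression, so it lands in PushedFilters and Parquet can skip data.
ds.filter($"device" === "iPhone").explain(true)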
In Spark 2.3.1
case class FullName(first: String, middle: String, last: String)
case class Contact(id: Int, name: FullName, address: String)
sql("select name.first from contacts").where("name.first = 'Jane'").explain(true)
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema:
struct<name:struct<first:string,middle:string,last:string>>
|
""".stripMargin
In Spark 2.4 with Schema Pruning
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema:
struct<name:struct<first:string>>
|
""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
In Spark 2.4 with Schema Pruning + Predicate Pushdown
"""
|== Physical Plan ==
|*(1) Project [name#10.first AS first#23]
|+- *(1) Filter (isnotnull(name#10) && (name#10.first = Jane))
| +- *(1) FileScan parquet [name#10] Batched: false, Format: Parquet,
PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name.first,Jane)],
ReadSchema: struct<name:struct<first:string>>
|
""".stripMargin
• [SPARK-4502], [SPARK-25363] Parquet nested column pruning
• [SPARK-17636] Parquet nested Predicate Pushdown
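
A note on reproducing this: in stock Spark 2.4 the nested column pruning rule sits behind a flag that is off by default, sketched below; the nested predicate pushdown of [SPARK-17636] was, to our knowledge, not in the 2.4 release itself, so the EqualTo(name.first,Jane) pushed filter above reflects a build carrying that patch.

// Internal/experimental flag in Spark 2.4; nested column pruning is
// disabled by default there and enabled by default in later releases.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

sql("select name.first from contacts").where("name.first = 'Jane'").explain(true)
// ReadSchema should now shrink to struct<name:struct<first:string>>.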
Production Query - Finding a Needle in a Haystack
Spark 2.3.1
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-17636]
• 21x faster in wall-clock time
• 8x less data read
• Lower electricity bills across many data centers
Future Work
• Use Spark Streaming to aggregate request data at the edge first, reducing data movement
• Enhance Dataset performance by analyzing JVM bytecode and turning closures into Catalyst expressions
• Build a table format on top of Parquet using Spark's DataSource V2 API that tracks individual data files with richer metadata, to manage versioning and enhance query performance
Conclusions
With engineering rigor and the right optimizations, Spark can run at very large scale at lightning speed
Thank you