SlideShare a Scribd company logo
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark+AI SF
•
Making Nested Columns as First
Citizens in Apache Spark SQL
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri
The world’s most popular intelligent assistant service powering
every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Open Source Team
• We’re Spark, Hadoop, HBase PMCs / Committers / Contributors
• We’re the advocate for Open Source
• Pushing our internal changes back to the upstreams
• Working with the communities to review pull requests, develop
new features and bug fixes
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Scale
• Large amounts of requests, Data Centers all over the world
• Hadoop / Yarn Cluster has thousands of nodes
• HDFS has hundred of PB
• 100's TB of raw event data per day
• More than 90% of jobs are Spark
• Less than 10% are legacy Pig and MapReduce jobs
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Details about our data
• Deeply nested relational model data with couple top level columns
• The total nested fields are more than 2k
• Stored in Parquet format partitioned by UTC day
• Most queries are only for a small subset of the data
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
An Example of Hierarchically Organized Table
Real estate information can be naturally modeled by
case class Address(houseNumber: Int,
streetAddress: String,
city: String,
state: String,
zipCode: String)
case class Facts(price: Int,
size: Int,
yearBuilt: Int)
case class School(name: String)
case class Home(address: Address,
facts: Facts,
schools: List[School])
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
root
|-- address: struct (nullable = true)
| |-- houseNumber: integer (nullable = true)
| |-- streetAddress: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- zipCode: string (nullable = true)
|-- facts: struct (nullable = true)
| |-- price: integer (nullable = true)
| |-- size: integer (nullable = true)
| |-- yearBuilt: integer (nullable = true)
|-- schools: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
sql("select * from homes”).printSchema()
Nested SQL Schema
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
sql("select address.city from homes where facts.price > 2000000”)
.explain(true)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#75]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56],
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:strin…,
facts:struct(address:int…)>
• We only need two nested columns, address.city and facts.prices
• But entire address and facts are read
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
[SPARK-4502], [SPARK-25363] Parquet with Nested Columns
• Parquet is a columnar storage format with complex nested data
structures in mind
• Support very efficient compression and encoding schemes
• As a columnar format, each nested column is stored separately as if it's a
flattened table
• No easy way to cherry pick couple nested columns in Spark
• Foundation - Allow reading subset of nested columns right after Parquet
FileScan
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
sql("select address.city from homes where facts.price > 2000000”)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Only two nested columns are read!
• With [SPARK-4502], [SPARK-25363]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Parquet predicate pushdown are not working for nested fields in Spark
sql("select address.city from homes where facts.price > 2000000”)
• With [SPARK-4502], [SPARK-25363]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Predicate Pushdown in Parquet for nested fields provides significant
performance gain by eliminating non-matches earlier to read less data
and save the cost of processing them
sql("select address.city from homes where facts.price > 2000000”)
• With [SPARK-25556]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
val areaUdf = udf{ (city: String, state: String, zipCode: String) =>
s"$city, $state $zipCode"
}
val query = sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying an UDF after repartition
== Physical Plan ==
*(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS
UDF(address.city, address.state, address.zipCode)#70]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#58]
+- *(1) FileScan parquet [address#58] Format: Parquet,
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:string>>
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Problems in Supporting Nested Structures in Spark
• Root level columns are represented by Attribute which is base of
leaf named expressions
• To get a nested field from a root level column, a GetStructField
expression with child of Attribute has to be used
• All column pruning logics are done in Attribute levels, resulting
either the entire root column is taken or pruned
• No easy way to cherry pick couple nested columns in this model
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
[SPARK-25603] Generalize Nested Column Pruning
• [SPARK-4502], [SPARK-25363] are foundation to support nested
structures better with Parquet in Spark
• If an operator such as Repartition, Sample, or Limit are applied after
Parquet FileScan, nested column pruning will not work
• We address this by flattening the nested fields using Alias right after data
read
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
val query = sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying an UDF after repartition
== Physical Plan ==
*(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city,
address.state, address.zipCode)#64]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85,
address#55.zipCode AS _gen_alias_86#86]
+- *(1) FileScan parquet [address#55]
ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>>
• Nested fields are replaced by Alias with flatten structures
• Only three used nested fields are read from Parquet files
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Production Query - Finding a Needle in a Haystack
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark 2.3.1
1.2h 7.1TB
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556]
3.3min 840GB
7.1TB1.2h
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
• 21x faster in wall clock time

• 8x less data being read

• More power efficient
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Other work
• Enhance the Dataset performance by analyzing JVM bytecode
and turn closures into Catalyst expressions
• Please check our other presentation tomorrow at 11am for more
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Conclusions
With some work, engineering rigor and some optimizations
Spark can run at very large scale in lightning speed
• [SPARK-4502]
• [SPARK-25363]
• [SPARK-25556]
• [SPARK-25603]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Thank you

More Related Content

What's hot

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
Taro L. Saito
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
Databricks
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Julian Hyde
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
DataWorks Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
Spark
SparkSpark
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Edureka!
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 

What's hot (20)

Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Spark
SparkSpark
Spark
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similar to Making Nested Columns as First Citizen in Apache Spark SQL

Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB TsaiPitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
Databricks
 
MySQL 8.0 Released Update
MySQL 8.0 Released UpdateMySQL 8.0 Released Update
MySQL 8.0 Released Update
Keith Hollman
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
Chris Bailey
 
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
Cisco Canada
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on Read
Kent Graziano
 
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
InfluxData
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
Paulo Gutierrez
 
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge GapUsing a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Precisely
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
Databricks
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco Canada
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data Driven
Tammy Bednar
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
HostedbyConfluent
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Precisely
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
Iulia Emanuela Iancuta
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
Torsten Steinbach
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
eddiebaggott
 
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
Tammy Bednar
 

Similar to Making Nested Columns as First Citizen in Apache Spark SQL (20)

Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB TsaiPitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
 
MySQL 8.0 Released Update
MySQL 8.0 Released UpdateMySQL 8.0 Released Update
MySQL 8.0 Released Update
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on Read
 
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge GapUsing a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data Driven
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
74nqk8xf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 

Recently uploaded (20)

Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
一比一原版(Chester毕业证书)切斯特大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 

Making Nested Columns as First Citizen in Apache Spark SQL

  • 1. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark+AI SF • Making Nested Columns as First Citizens in Apache Spark SQL Cesar Delgado @hpcfarmer DB Tsai @dbtsai
  • 2. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri The world’s most popular intelligent assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
  • 3. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Open Source Team • We’re Spark, Hadoop, HBase PMCs / Committers / Contributors • We’re the advocate for Open Source • Pushing our internal changes back to the upstreams • Working with the communities to review pull requests, develop new features and bug fixes
  • 4. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Data • Machine learning is used to personalize your experience throughout your day • We believe privacy is a fundamental human right
  • 5. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Scale • Large amounts of requests, Data Centers all over the world • Hadoop / Yarn Cluster has thousands of nodes • HDFS has hundred of PB • 100's TB of raw event data per day • More than 90% of jobs are Spark • Less than 10% are legacy Pig and MapReduce jobs
  • 6. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Details about our data • Deeply nested relational model data with couple top level columns • The total nested fields are more than 2k • Stored in Parquet format partitioned by UTC day • Most queries are only for a small subset of the data
  • 7. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. An Example of Hierarchically Organized Table Real estate information can be naturally modeled by case class Address(houseNumber: Int, streetAddress: String, city: String, state: String, zipCode: String) case class Facts(price: Int, size: Int, yearBuilt: Int) case class School(name: String) case class Home(address: Address, facts: Facts, schools: List[School])
  • 8. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. root |-- address: struct (nullable = true) | |-- houseNumber: integer (nullable = true) | |-- streetAddress: string (nullable = true) | |-- city: string (nullable = true) | |-- state: string (nullable = true) | |-- zipCode: string (nullable = true) |-- facts: struct (nullable = true) | |-- price: integer (nullable = true) | |-- size: integer (nullable = true) | |-- yearBuilt: integer (nullable = true) |-- schools: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- name: string (nullable = true) sql("select * from homes”).printSchema() Nested SQL Schema
  • 9. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. sql("select address.city from homes where facts.price > 2000000”) .explain(true) Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#75] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56], DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string, city:string,state:string,zipCode:strin…, facts:struct(address:int…)> • We only need two nested columns, address.city and facts.prices • But entire address and facts are read
  • 10. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. [SPARK-4502], [SPARK-25363] Parquet with Nested Columns • Parquet is a columnar storage format with complex nested data structures in mind • Support very efficient compression and encoding schemes • As a columnar format, each nested column is stored separately as if it's a flattened table • No easy way to cherry pick couple nested columns in Spark • Foundation - Allow reading subset of nested columns right after Parquet FileScan
  • 11. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. sql("select address.city from homes where facts.price > 2000000”) Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Only two nested columns are read! • With [SPARK-4502], [SPARK-25363]
  • 12. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Parquet predicate pushdown are not working for nested fields in Spark sql("select address.city from homes where facts.price > 2000000”) • With [SPARK-4502], [SPARK-25363]
  • 13. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Predicate Pushdown in Parquet for nested fields provides significant performance gain by eliminating non-matches earlier to read less data and save the cost of processing them sql("select address.city from homes where facts.price > 2000000”) • With [SPARK-25556]
  • 14. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. val areaUdf = udf{ (city: String, state: String, zipCode: String) => s"$city, $state $zipCode" } val query = sql("select * from homes").repartition(1).select( areaUdf(col("address.city"), col("address.state"), col("address.zipCode")) ).explain() Applying an UDF after repartition == Physical Plan == *(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS UDF(address.city, address.state, address.zipCode)#70] +- Exchange RoundRobinPartitioning(1) +- *(1) Project [address#58] +- *(1) FileScan parquet [address#58] Format: Parquet, ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string, city:string,state:string,zipCode:string>>
  • 15. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Problems in Supporting Nested Structures in Spark • Root level columns are represented by Attribute which is base of leaf named expressions • To get a nested field from a root level column, a GetStructField expression with child of Attribute has to be used • All column pruning logics are done in Attribute levels, resulting either the entire root column is taken or pruned • No easy way to cherry pick couple nested columns in this model
  • 16. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. [SPARK-25603] Generalize Nested Column Pruning • [SPARK-4502], [SPARK-25363] are foundation to support nested structures better with Parquet in Spark • If an operator such as Repartition, Sample, or Limit are applied after Parquet FileScan, nested column pruning will not work • We address this by flattening the nested fields using Alias right after data read
  • 17. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. val query = sql("select * from homes").repartition(1).select( areaUdf(col("address.city"), col("address.state"), col("address.zipCode")) ).explain() Applying an UDF after repartition == Physical Plan == *(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city, address.state, address.zipCode)#64] +- Exchange RoundRobinPartitioning(1) +- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85, address#55.zipCode AS _gen_alias_86#86] +- *(1) FileScan parquet [address#55] ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>> • Nested fields are replaced by Alias with flatten structures • Only three used nested fields are read from Parquet files
  • 18. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Production Query - Finding a Needle in a Haystack
  • 19. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark 2.3.1 1.2h 7.1TB
  • 20. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556] 3.3min 840GB 7.1TB1.2h
  • 21. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. • 21x faster in wall clock time
 • 8x less data being read
 • More power efficient
  • 22. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Other work • Enhance the Dataset performance by analyzing JVM bytecode and turn closures into Catalyst expressions • Please check our other presentation tomorrow at 11am for more
  • 23. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Conclusions With some work, engineering rigor and some optimizations Spark can run at very large scale in lightning speed
  • 24. • [SPARK-4502] • [SPARK-25363] • [SPARK-25556] • [SPARK-25603]
  • 25. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Thank you