SlideShare a Scribd company logo
1 of 25
Download to read offline
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark+AI SF
•
Making Nested Columns as First
Citizens in Apache Spark SQL
Cesar Delgado @hpcfarmer
DB Tsai @dbtsai
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri
The world’s most popular intelligent assistant service powering
every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Open Source Team
• We’re Spark, Hadoop, HBase PMCs / Committers / Contributors
• We’re the advocate for Open Source
• Pushing our internal changes back to the upstreams
• Working with the communities to review pull requests, develop
new features and bug fixes
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Data
• Machine learning is used to personalize your experience
throughout your day
• We believe privacy is a fundamental human right
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Siri Scale
• Large amounts of requests, Data Centers all over the world
• Hadoop / Yarn Cluster has thousands of nodes
• HDFS has hundred of PB
• 100's TB of raw event data per day
• More than 90% of jobs are Spark
• Less than 10% are legacy Pig and MapReduce jobs
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Details about our data
• Deeply nested relational model data with couple top level columns
• The total nested fields are more than 2k
• Stored in Parquet format partitioned by UTC day
• Most queries are only for a small subset of the data
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
An Example of Hierarchically Organized Table
Real estate information can be naturally modeled by
case class Address(houseNumber: Int,
streetAddress: String,
city: String,
state: String,
zipCode: String)
case class Facts(price: Int,
size: Int,
yearBuilt: Int)
case class School(name: String)
case class Home(address: Address,
facts: Facts,
schools: List[School])
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
root
|-- address: struct (nullable = true)
| |-- houseNumber: integer (nullable = true)
| |-- streetAddress: string (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- zipCode: string (nullable = true)
|-- facts: struct (nullable = true)
| |-- price: integer (nullable = true)
| |-- size: integer (nullable = true)
| |-- yearBuilt: integer (nullable = true)
|-- schools: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
sql("select * from homes”).printSchema()
Nested SQL Schema
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
sql("select address.city from homes where facts.price > 2000000”)
.explain(true)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#75]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56],
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:strin…,
facts:struct(address:int…)>
• We only need two nested columns, address.city and facts.prices
• But entire address and facts are read
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
[SPARK-4502], [SPARK-25363] Parquet with Nested Columns
• Parquet is a columnar storage format with complex nested data
structures in mind
• Support very efficient compression and encoding schemes
• As a columnar format, each nested column is stored separately as if it's a
flattened table
• No easy way to cherry pick couple nested columns in Spark
• Foundation - Allow reading subset of nested columns right after Parquet
FileScan
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
sql("select address.city from homes where facts.price > 2000000”)
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Only two nested columns are read!
• With [SPARK-4502], [SPARK-25363]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Parquet predicate pushdown are not working for nested fields in Spark
sql("select address.city from homes where facts.price > 2000000”)
• With [SPARK-4502], [SPARK-25363]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Find cities with houses worth more than 2M
== Physical Plan ==
*(1) Project [address#55.city AS city#77]
+- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000))
+- *(1) FileScan parquet [address#55,facts#56]
DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)],
Format: Parquet,
PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)],
ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>>
• Predicate Pushdown in Parquet for nested fields provides significant
performance gain by eliminating non-matches earlier to read less data
and save the cost of processing them
sql("select address.city from homes where facts.price > 2000000”)
• With [SPARK-25556]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
val areaUdf = udf{ (city: String, state: String, zipCode: String) =>
s"$city, $state $zipCode"
}
val query = sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying an UDF after repartition
== Physical Plan ==
*(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS
UDF(address.city, address.state, address.zipCode)#70]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#58]
+- *(1) FileScan parquet [address#58] Format: Parquet,
ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string,
city:string,state:string,zipCode:string>>
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Problems in Supporting Nested Structures in Spark
• Root level columns are represented by Attribute which is base of
leaf named expressions
• To get a nested field from a root level column, a GetStructField
expression with child of Attribute has to be used
• All column pruning logics are done in Attribute levels, resulting
either the entire root column is taken or pruned
• No easy way to cherry pick couple nested columns in this model
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
[SPARK-25603] Generalize Nested Column Pruning
• [SPARK-4502], [SPARK-25363] are foundation to support nested
structures better with Parquet in Spark
• If an operator such as Repartition, Sample, or Limit are applied after
Parquet FileScan, nested column pruning will not work
• We address this by flattening the nested fields using Alias right after data
read
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
val query = sql("select * from homes").repartition(1).select(
areaUdf(col("address.city"),
col("address.state"),
col("address.zipCode"))
).explain()
Applying an UDF after repartition
== Physical Plan ==
*(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city,
address.state, address.zipCode)#64]
+- Exchange RoundRobinPartitioning(1)
+- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85,
address#55.zipCode AS _gen_alias_86#86]
+- *(1) FileScan parquet [address#55]
ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>>
• Nested fields are replaced by Alias with flatten structures
• Only three used nested fields are read from Parquet files
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Production Query - Finding a Needle in a Haystack
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark 2.3.1
1.2h 7.1TB
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556]
3.3min 840GB
7.1TB1.2h
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
• 21x faster in wall clock time

• 8x less data being read

• More power efficient
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Other work
• Enhance the Dataset performance by analyzing JVM bytecode
and turn closures into Catalyst expressions
• Please check our other presentation tomorrow at 11am for more
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Conclusions
With some work, engineering rigor and some optimizations
Spark can run at very large scale in lightning speed
• [SPARK-4502]
• [SPARK-25363]
• [SPARK-25556]
• [SPARK-25603]
© 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple.
Thank you

More Related Content

What's hot

Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...Andrew Lamb
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsDatabricks
 
The MySQL Query Optimizer Explained Through Optimizer Trace
The MySQL Query Optimizer Explained Through Optimizer TraceThe MySQL Query Optimizer Explained Through Optimizer Trace
The MySQL Query Optimizer Explained Through Optimizer Traceoysteing
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
 

What's hot (20)

Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Apache Flink Deep Dive
Apache Flink Deep DiveApache Flink Deep Dive
Apache Flink Deep Dive
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
A Rusty introduction to Apache Arrow and how it applies to a  time series dat...A Rusty introduction to Apache Arrow and how it applies to a  time series dat...
A Rusty introduction to Apache Arrow and how it applies to a time series dat...
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
The MySQL Query Optimizer Explained Through Optimizer Trace
The MySQL Query Optimizer Explained Through Optimizer TraceThe MySQL Query Optimizer Explained Through Optimizer Trace
The MySQL Query Optimizer Explained Through Optimizer Trace
 
Introduction to Apache Calcite
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache Calcite
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
 
Time-Travel.pptx
Time-Travel.pptxTime-Travel.pptx
Time-Travel.pptx
 

Similar to Making Nested Columns as First Citizen in Apache Spark SQL

Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB TsaiPitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB TsaiDatabricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesDatabricks
 
MySQL 8.0 Released Update
MySQL 8.0 Released UpdateMySQL 8.0 Released Update
MySQL 8.0 Released UpdateKeith Hollman
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaChris Bailey
 
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1Cisco Canada
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on ReadKent Graziano
 
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData InfluxData
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge GapUsing a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge GapPrecisely
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuDatabricks
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco Canada
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenTammy Bednar
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamPrecisely
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionTorsten Steinbach
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015eddiebaggott
 
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the MapTammy Bednar
 

Similar to Making Nested Columns as First Citizen in Apache Spark SQL (20)

Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB TsaiPitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
Pitfalls of Apache Spark at Scale with Cesar Delgado and DB Tsai
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Bridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFramesBridging the Gap Between Datasets and DataFrames
Bridging the Gap Between Datasets and DataFrames
 
MySQL 8.0 Released Update
MySQL 8.0 Released UpdateMySQL 8.0 Released Update
MySQL 8.0 Released Update
 
JavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient JavaJavaOne 2013: Memory Efficient Java
JavaOne 2013: Memory Efficient Java
 
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1Cisco Connect Toronto 2018   model-driven programmability for cisco ios xr-v1
Cisco Connect Toronto 2018 model-driven programmability for cisco ios xr-v1
 
Making Sense of Schema on Read
Making Sense of Schema on ReadMaking Sense of Schema on Read
Making Sense of Schema on Read
 
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
Creating and Using the Flux SQL Datasource | Katy Farmer | InfluxData
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge GapUsing a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
Using a Data Model to Bridge the Mainframe-Splunk Knowledge Gap
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai YuAn Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
 
Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2Cisco connect montreal 2018 saalvare md-program-xr-v2
Cisco connect montreal 2018 saalvare md-program-xr-v2
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data Driven
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of IronstreamData Model for Mainframe in Splunk: The Newest Feature of Ironstream
Data Model for Mainframe in Splunk: The Newest Feature of Ironstream
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015Dublin Ireland Spark Meetup October 15, 2015
Dublin Ireland Spark Meetup October 15, 2015
 
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
#dbhouseparty - Spatial Technologies - @Home and Everywhere Else on the Map
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...gajnagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...gajnagarg
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...amitlee9823
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Pooja Nehwal
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...karishmasinghjnh
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...amitlee9823
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...gajnagarg
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...amitlee9823
 

Recently uploaded (20)

Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
Just Call Vip call girls Mysore Escorts ☎️9352988975 Two shot with one girl (...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men  🔝Ongole🔝   Escorts S...
➥🔝 7737669865 🔝▻ Ongole Call-girls in Women Seeking Men 🔝Ongole🔝 Escorts S...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls Palakkad Escorts ☎️9352988975 Two shot with one girl...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 

Making Nested Columns as First Citizen in Apache Spark SQL

  • 1. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark+AI SF • Making Nested Columns as First Citizens in Apache Spark SQL Cesar Delgado @hpcfarmer DB Tsai @dbtsai
  • 2. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri The world’s most popular intelligent assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod
  • 3. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Open Source Team • We’re Spark, Hadoop, HBase PMCs / Committers / Contributors • We’re the advocate for Open Source • Pushing our internal changes back to the upstreams • Working with the communities to review pull requests, develop new features and bug fixes
  • 4. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Data • Machine learning is used to personalize your experience throughout your day • We believe privacy is a fundamental human right
  • 5. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Siri Scale • Large amounts of requests, Data Centers all over the world • Hadoop / Yarn Cluster has thousands of nodes • HDFS has hundred of PB • 100's TB of raw event data per day • More than 90% of jobs are Spark • Less than 10% are legacy Pig and MapReduce jobs
  • 6. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Details about our data • Deeply nested relational model data with couple top level columns • The total nested fields are more than 2k • Stored in Parquet format partitioned by UTC day • Most queries are only for a small subset of the data
  • 7. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. An Example of Hierarchically Organized Table Real estate information can be naturally modeled by case class Address(houseNumber: Int, streetAddress: String, city: String, state: String, zipCode: String) case class Facts(price: Int, size: Int, yearBuilt: Int) case class School(name: String) case class Home(address: Address, facts: Facts, schools: List[School])
  • 8. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. root |-- address: struct (nullable = true) | |-- houseNumber: integer (nullable = true) | |-- streetAddress: string (nullable = true) | |-- city: string (nullable = true) | |-- state: string (nullable = true) | |-- zipCode: string (nullable = true) |-- facts: struct (nullable = true) | |-- price: integer (nullable = true) | |-- size: integer (nullable = true) | |-- yearBuilt: integer (nullable = true) |-- schools: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- name: string (nullable = true) sql("select * from homes”).printSchema() Nested SQL Schema
  • 9. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. sql("select address.city from homes where facts.price > 2000000”) .explain(true) Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#75] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56], DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string, city:string,state:string,zipCode:strin…, facts:struct(address:int…)> • We only need two nested columns, address.city and facts.prices • But entire address and facts are read
  • 10. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. [SPARK-4502], [SPARK-25363] Parquet with Nested Columns • Parquet is a columnar storage format with complex nested data structures in mind • Support very efficient compression and encoding schemes • As a columnar format, each nested column is stored separately as if it's a flattened table • No easy way to cherry pick couple nested columns in Spark • Foundation - Allow reading subset of nested columns right after Parquet FileScan
  • 11. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. sql("select address.city from homes where facts.price > 2000000”) Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Only two nested columns are read! • With [SPARK-4502], [SPARK-25363]
  • 12. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Parquet predicate pushdown are not working for nested fields in Spark sql("select address.city from homes where facts.price > 2000000”) • With [SPARK-4502], [SPARK-25363]
  • 13. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Find cities with houses worth more than 2M == Physical Plan == *(1) Project [address#55.city AS city#77] +- *(1) Filter (isnotnull(facts#56) && (facts#56.price > 2000000)) +- *(1) FileScan parquet [address#55,facts#56] DataFilters: [isnotnull(facts#56), (facts#56.price > 2000000)], Format: Parquet, PushedFilters: [IsNotNull(facts), GreaterThan(facts.price,2000000)], ReadSchema: struct<address:struct<city:string>,facts:struct<price:int>> • Predicate Pushdown in Parquet for nested fields provides significant performance gain by eliminating non-matches earlier to read less data and save the cost of processing them sql("select address.city from homes where facts.price > 2000000”) • With [SPARK-25556]
  • 14. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. val areaUdf = udf{ (city: String, state: String, zipCode: String) => s"$city, $state $zipCode" } val query = sql("select * from homes").repartition(1).select( areaUdf(col("address.city"), col("address.state"), col("address.zipCode")) ).explain() Applying an UDF after repartition == Physical Plan == *(2) Project [UDF(address#58.city, address#58.state, address#58.zipCode) AS UDF(address.city, address.state, address.zipCode)#70] +- Exchange RoundRobinPartitioning(1) +- *(1) Project [address#58] +- *(1) FileScan parquet [address#58] Format: Parquet, ReadSchema: struct<address:struct<houseNumber:int,streetAddress:string, city:string,state:string,zipCode:string>>
  • 15. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Problems in Supporting Nested Structures in Spark • Root level columns are represented by Attribute which is base of leaf named expressions • To get a nested field from a root level column, a GetStructField expression with child of Attribute has to be used • All column pruning logics are done in Attribute levels, resulting either the entire root column is taken or pruned • No easy way to cherry pick couple nested columns in this model
  • 16. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. [SPARK-25603] Generalize Nested Column Pruning • [SPARK-4502], [SPARK-25363] are foundation to support nested structures better with Parquet in Spark • If an operator such as Repartition, Sample, or Limit are applied after Parquet FileScan, nested column pruning will not work • We address this by flattening the nested fields using Alias right after data read
  • 17. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. val query = sql("select * from homes").repartition(1).select( areaUdf(col("address.city"), col("address.state"), col("address.zipCode")) ).explain() Applying an UDF after repartition == Physical Plan == *(2) Project [UDF(_gen_alias_84#84, _gen_alias_85#85, _gen_alias_86#86) AS UDF(address.city, address.state, address.zipCode)#64] +- Exchange RoundRobinPartitioning(1) +- *(1) Project [address#55.city AS _gen_alias_84#84, address#55.state AS _gen_alias_85#85, address#55.zipCode AS _gen_alias_86#86] +- *(1) FileScan parquet [address#55] ReadSchema: struct<address:struct<city:string,state:string,zipCode:string>> • Nested fields are replaced by Alias with flatten structures • Only three used nested fields are read from Parquet files
  • 18. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Production Query - Finding a Needle in a Haystack
  • 19. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark 2.3.1 1.2h 7.1TB
  • 20. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Spark 2.4 with [SPARK-4502], [SPARK-25363], and [SPARK-25556] 3.3min 840GB 7.1TB1.2h
  • 21. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. • 21x faster in wall clock time
 • 8x less data being read
 • More power efficient
  • 22. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Other work • Enhance the Dataset performance by analyzing JVM bytecode and turn closures into Catalyst expressions • Please check our other presentation tomorrow at 11am for more
  • 23. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Conclusions With some work, engineering rigor and some optimizations Spark can run at very large scale in lightning speed
  • 24. • [SPARK-4502] • [SPARK-25363] • [SPARK-25556] • [SPARK-25603]
  • 25. © 2018 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission from Apple. Thank you