SlideShare a Scribd company logo
1 of 31
Download to read offline
Higher Order Functions
Herman van Hövell @westerflyer
2018-10-03, London Spark Summit EU 2018
2
About Me
- Software Engineer @Databricks
Amsterdam office
- Apache Spark Committer and
PMC member
- In a previous life Data Engineer &
Data Analyst.
3
Complex Data
Complex data types in Spark SQL
- Struct. For example: struct(a: Int, b: String)
- Array. For example: array(a: Int)
- Map. For example: map(key: String, value: Int)
This provides primitives to build tree-based data models
- High expressiveness. Often alleviates the need for ‘flat-earth’ multi-table
designs.
- More natural, reality like data models
4
Complex Data - Tweet JSON
{
"created_at": "Wed Oct 03 11:41:57 +0000 2018",
"id_str": "994633657141813248",
"text": "Looky nested data #spark #sseu",
"display_text_range": [0, 140],
"user": {
"id_str": "12343453",
"screen_name": "Westerflyer"
},
"extended_tweet": {
"full_text": "Looky nested data #spark #sseu",
"display_text_range": [0, 249],
"entities": {
"hashtags": [{
"text": "spark",
"indices": [211, 225]
}, {
"text": "sseu",
"indices": [239, 249]
}]
}
}
}
adapted from: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html
root
|-- created_at: string (nullable = true)
|-- id_str: string (nullable = true)
|-- text: string (nullable = true)
|-- user: struct (nullable = true)
| |-- id_str: string (nullable = true)
| |-- screen_name: string (nullable = true)
|-- display_text_range: array (nullable = true)
| |-- element: long (containsNull = true)
|-- extended_tweet: struct (nullable = true)
| |-- full_text: string (nullable = true)
| |-- display_text_range: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- entities: struct (nullable = true)
| | |-- hashtags: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- indices: array (nullable = true)
| | | | | |-- element: long (containsNull = true)
| | | | |-- text: string (nullable = true)
5
Manipulating Complex Data
Structs are easy :)
Maps/Arrays not so much...
- Easy to read single values/retrieve keys
- Hard to transform or to summarize
6
Transforming an Array
Let’s say we want to add 1 to every element of the vals field of
every row in an input table.
Id Vals
1 [1, 2, 3]
2 [4, 5, 6]
Id Vals
1 [2, 3, 4]
2 [5, 6, 7]
How would we do this?
7
Transforming an Array
Option 1 - Explode and Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
8
Transforming an Array
Option 1 - Explode and Collect - Explode
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
9
Transforming an Array
Option 1 - Explode and Collect - Explode
Id Vals
1 [1, 2, 3]
2 [4, 5, 6]
Id Val
1 1
1 2
1 3
2 4
2 5
2 6
10
Transforming an Array
Option 1 - Explode and Collect - Collect
select id,
collect_list(val + 1) as vals
from (select id,
explode(vals) as val
from input_tbl) x
group by id
1. Explode
2. Collect
11
Transforming an Array
Option 1 - Explode and Collect - Collect
Id Val
1 1 + 1
1 2 + 1
1 3 + 1
2 4 + 1
2 5 + 1
2 6 + 1
Id Vals
1 [2, 3, 4]
2 [5, 6, 7]
12
Transforming an Array
Option 1 - Explode and Collect - Complexity
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
13
Transforming an Array
Option 1 - Explode and Collect - Complexity
• Shuffles the data around, which is very expensive
• collect_list does not respect pre-existing ordering
== Physical Plan ==
ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Exchange hashpartitioning(id, 200)
+- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)])
+- Generate explode(vals), [id], false, [val]
+- FileScan parquet default.input_tbl
14
Transforming an Array
Option 1 - Explode and Collect - Pitfalls
Id Vals
1 [1, 2, 3]
1 [4, 5, 6]
Id Vals
1 [5, 6, 7, 2, 3, 4]
Id Vals
1 null
2 [4, 5, 6]
Id Vals
2 [5, 6, 7]
Values need to have data
Keys need to be unique
15
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
16
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
17
Transforming an Array
Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
18
Transforming an Array
Pros
- Is faster than Explode & Collect
- Does not suffer from correctness pitfalls
Cons
- Is relatively slow, we need to do a lot serialization
- You need to register UDFs per type
- Does not work for SQL
- Clunky
Option 2 - Scala UDF
19
When are you going to talk
about
Higher Order Functions?
20
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
21
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function
22
Higher Order Functions
Let’s take another look at Option 2 - Scala UDF
Can we do the same for Spark SQL?
def addOne(values: Seq[Int]): Seq[Int] = {
values.map(value => value + 1)
}
val plusOneInt = spark.udf.register("plusOneInt",
addOne(_:Seq[Int]):Seq[Int])
val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
Higher Order Function
Anonymous ‘Lambda’ Function
23
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
- Spark SQL native code: fast & no serialization needed
- Works for SQL
24
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl
Higher Order Function
transform is the Higher Order Function. It takes an input array
and an expression, it applies this expression to each element in
the array
25
Higher Order Functions in Spark SQL
select id, transform(vals, val -> val + 1) as vals
from input_tbl Anonymous ‘Lambda’ Function
val -> val + 1 is the lambda function. It is the operation
that is applied to each value in the array. This function is divided
into two components separated by a -> symbol:
1. The Argument list.
2. The expression used to calculate the new value.
26
Higher Order Functions in Spark SQL
Nesting
select id,
transform(vals, val ->
transform(val, e -> e + 1)) as vals
from nested_input_tbl
Capture
select id,
ref_value,
transform(vals, val -> ref_value + val) as vals
from nested_input_tbl
27
Didn’t you say these were
faster?
28
Performance
29
Higher Order Functions in Spark SQL
Spark 2.4 will ship with following higher order functions:
Array
- transform
- filter
- exists
- aggregate/reduce
- zip_with
Map
- transform_keys
- transform_values
- map_filter
- map_zip_with
A lot of new collection based expression were also added...
30
Future work
Disclaimer: All of this is speculative and has not been discussed on the Dev list!
Arrays and Maps have received a lot of love. However working
with wide structs fields is still non-trivial (a lot of typing). We can
do better here:
- The following dataset functions should work for nested fields:
- withColumn()
- withColumnRenamed()
- The following functions should be added for struct fields:
- select()
- withColumn()
- withColumnRenamed()
Questions?

More Related Content

What's hot

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsJulien Le Dem
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInDatabricks
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsSpark Summit
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code GenerationDatabricks
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 

What's hot (20)

Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Improving Spark SQL at LinkedIn
Improving Spark SQL at LinkedInImproving Spark SQL at LinkedIn
Improving Spark SQL at LinkedIn
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 

Similar to An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell

Scala introduction
Scala introductionScala introduction
Scala introductionvito jeng
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghStuart Roebuck
 
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperSF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperChester Chen
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scalashinolajla
 
(How) can we benefit from adopting scala?
(How) can we benefit from adopting scala?(How) can we benefit from adopting scala?
(How) can we benefit from adopting scala?Tomasz Wrobel
 
Functional Programming With Scala
Functional Programming With ScalaFunctional Programming With Scala
Functional Programming With ScalaKnoldus Inc.
 
The Scala Programming Language
The Scala Programming LanguageThe Scala Programming Language
The Scala Programming Languageleague
 
Scala for Java Programmers
Scala for Java ProgrammersScala for Java Programmers
Scala for Java ProgrammersEric Pederson
 
Functional programming with Scala
Functional programming with ScalaFunctional programming with Scala
Functional programming with ScalaNeelkanth Sachdeva
 
Erlang for data ops
Erlang for data opsErlang for data ops
Erlang for data opsmnacos
 
Introduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with sparkIntroduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with sparkAngelo Leto
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraJim Hatcher
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...DataStax
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like SparkAlpine Data
 
Extractors & Implicit conversions
Extractors & Implicit conversionsExtractors & Implicit conversions
Extractors & Implicit conversionsKnoldus Inc.
 
Kotlin boost yourproductivity
Kotlin boost yourproductivityKotlin boost yourproductivity
Kotlin boost yourproductivitynklmish
 
Scala training workshop 02
Scala training workshop 02Scala training workshop 02
Scala training workshop 02Nguyen Tuan
 

Similar to An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (20)

Spark workshop
Spark workshopSpark workshop
Spark workshop
 
Scala in Places API
Scala in Places APIScala in Places API
Scala in Places API
 
Scala introduction
Scala introductionScala introduction
Scala introduction
 
Scala @ TechMeetup Edinburgh
Scala @ TechMeetup EdinburghScala @ TechMeetup Edinburgh
Scala @ TechMeetup Edinburgh
 
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapperSF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
SF Scala meet up, lighting talk: SPA -- Scala JDBC wrapper
 
Taxonomy of Scala
Taxonomy of ScalaTaxonomy of Scala
Taxonomy of Scala
 
(How) can we benefit from adopting scala?
(How) can we benefit from adopting scala?(How) can we benefit from adopting scala?
(How) can we benefit from adopting scala?
 
Functional Programming With Scala
Functional Programming With ScalaFunctional Programming With Scala
Functional Programming With Scala
 
The Scala Programming Language
The Scala Programming LanguageThe Scala Programming Language
The Scala Programming Language
 
Scala for Java Programmers
Scala for Java ProgrammersScala for Java Programmers
Scala for Java Programmers
 
Functional programming with Scala
Functional programming with ScalaFunctional programming with Scala
Functional programming with Scala
 
Scala for curious
Scala for curiousScala for curious
Scala for curious
 
Erlang for data ops
Erlang for data opsErlang for data ops
Erlang for data ops
 
Introduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with sparkIntroduction to parallel and distributed computation with spark
Introduction to parallel and distributed computation with spark
 
Using Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into CassandraUsing Spark to Load Oracle Data into Cassandra
Using Spark to Load Oracle Data into Cassandra
 
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*...
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
Extractors & Implicit conversions
Extractors & Implicit conversionsExtractors & Implicit conversions
Extractors & Implicit conversions
 
Kotlin boost yourproductivity
Kotlin boost yourproductivityKotlin boost yourproductivity
Kotlin boost yourproductivity
 
Scala training workshop 02
Scala training workshop 02Scala training workshop 02
Scala training workshop 02
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Recently uploaded (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell

  • 1. Higher Order Functions Herman van Hövell @westerflyer 2018-10-03, London Spark Summit EU 2018
  • 2. 2 About Me - Software Engineer @Databricks Amsterdam office - Apache Spark Committer and PMC member - In a previous life Data Engineer & Data Analyst.
  • 3. 3 Complex Data Complex data types in Spark SQL - Struct. For example: struct(a: Int, b: String) - Array. For example: array(a: Int) - Map. For example: map(key: String, value: Int) This provides primitives to build tree-based data models - High expressiveness. Often alleviates the need for ‘flat-earth’ multi-table designs. - More natural, reality like data models
  • 4. 4 Complex Data - Tweet JSON { "created_at": "Wed Oct 03 11:41:57 +0000 2018", "id_str": "994633657141813248", "text": "Looky nested data #spark #sseu", "display_text_range": [0, 140], "user": { "id_str": "12343453", "screen_name": "Westerflyer" }, "extended_tweet": { "full_text": "Looky nested data #spark #sseu", "display_text_range": [0, 249], "entities": { "hashtags": [{ "text": "spark", "indices": [211, 225] }, { "text": "sseu", "indices": [239, 249] }] } } } adapted from: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json.html root |-- created_at: string (nullable = true) |-- id_str: string (nullable = true) |-- text: string (nullable = true) |-- user: struct (nullable = true) | |-- id_str: string (nullable = true) | |-- screen_name: string (nullable = true) |-- display_text_range: array (nullable = true) | |-- element: long (containsNull = true) |-- extended_tweet: struct (nullable = true) | |-- full_text: string (nullable = true) | |-- display_text_range: array (nullable = true) | | |-- element: long (containsNull = true) | |-- entities: struct (nullable = true) | | |-- hashtags: array (nullable = true) | | | |-- element: struct (containsNull = true) | | | | |-- indices: array (nullable = true) | | | | | |-- element: long (containsNull = true) | | | | |-- text: string (nullable = true)
  • 5. 5 Manipulating Complex Data Structs are easy :) Maps/Arrays not so much... - Easy to read single values/retrieve keys - Hard to transform or to summarize
  • 6. 6 Transforming an Array Let’s say we want to add 1 to every element of the vals field of every row in an input table. Id Vals 1 [1, 2, 3] 2 [4, 5, 6] Id Vals 1 [2, 3, 4] 2 [5, 6, 7] How would we do this?
  • 7. 7 Transforming an Array Option 1 - Explode and Collect select id, collect_list(val + 1) as vals from (select id, explode(vals) as val from input_tbl) x group by id
  • 8. 8 Transforming an Array Option 1 - Explode and Collect - Explode select id, collect_list(val + 1) as vals from (select id, explode(vals) as val from input_tbl) x group by id 1. Explode
  • 9. 9 Transforming an Array Option 1 - Explode and Collect - Explode Id Vals 1 [1, 2, 3] 2 [4, 5, 6] Id Val 1 1 1 2 1 3 2 4 2 5 2 6
  • 10. 10 Transforming an Array Option 1 - Explode and Collect - Collect select id, collect_list(val + 1) as vals from (select id, explode(vals) as val from input_tbl) x group by id 1. Explode 2. Collect
  • 11. 11 Transforming an Array Option 1 - Explode and Collect - Collect Id Val 1 1 + 1 1 2 + 1 1 3 + 1 2 4 + 1 2 5 + 1 2 6 + 1 Id Vals 1 [2, 3, 4] 2 [5, 6, 7]
  • 12. 12 Transforming an Array Option 1 - Explode and Collect - Complexity == Physical Plan == ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)]) +- Exchange hashpartitioning(id, 200) +- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)]) +- Generate explode(vals), [id], false, [val] +- FileScan parquet default.input_tbl
  • 13. 13 Transforming an Array Option 1 - Explode and Collect - Complexity • Shuffles the data around, which is very expensive • collect_list does not respect pre-existing ordering == Physical Plan == ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)]) +- Exchange hashpartitioning(id, 200) +- ObjectHashAggregate(keys=[id], functions=[collect_list(val + 1)]) +- Generate explode(vals), [id], false, [val] +- FileScan parquet default.input_tbl
  • 14. 14 Transforming an Array Option 1 - Explode and Collect - Pitfalls Id Vals 1 [1, 2, 3] 1 [4, 5, 6] Id Vals 1 [5, 6, 7, 2, 3, 4] Id Vals 1 null 2 [4, 5, 6] Id Vals 2 [5, 6, 7] Values need to have data Keys need to be unique
  • 15. 15 Transforming an Array Option 2 - Scala UDF def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) }
  • 16. 16 Transforming an Array Option 2 - Scala UDF def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) } val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int])
  • 17. 17 Transforming an Array Option 2 - Scala UDF def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) } val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int]) val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
  • 18. 18 Transforming an Array Pros - Is faster than Explode & Collect - Does not suffer from correctness pitfalls Cons - Is relatively slow, we need to do a lot serialization - You need to register UDFs per type - Does not work for SQL - Clunky Option 2 - Scala UDF
  • 19. 19 When are you going to talk about Higher Order Functions?
  • 20. 20 Higher Order Functions Let’s take another look at Option 2 - Scala UDF def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) } val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int]) val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals"))
  • 21. 21 Higher Order Functions Let’s take another look at Option 2 - Scala UDF def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) } val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int]) val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals")) Higher Order Function
  • 22. 22 Higher Order Functions Let’s take another look at Option 2 - Scala UDF Can we do the same for Spark SQL? def addOne(values: Seq[Int]): Seq[Int] = { values.map(value => value + 1) } val plusOneInt = spark.udf.register("plusOneInt", addOne(_:Seq[Int]):Seq[Int]) val newDf = spark.table("input_tbl").select($"id", plusOneInt($"vals")) Higher Order Function Anonymous ‘Lambda’ Function
  • 23. 23 Higher Order Functions in Spark SQL select id, transform(vals, val -> val + 1) as vals from input_tbl - Spark SQL native code: fast & no serialization needed - Works for SQL
  • 24. 24 Higher Order Functions in Spark SQL select id, transform(vals, val -> val + 1) as vals from input_tbl Higher Order Function transform is the Higher Order Function. It takes an input array and an expression, it applies this expression to each element in the array
  • 25. 25 Higher Order Functions in Spark SQL select id, transform(vals, val -> val + 1) as vals from input_tbl Anonymous ‘Lambda’ Function val -> val + 1 is the lambda function. It is the operation that is applied to each value in the array. This function is divided into two components separated by a -> symbol: 1. The Argument list. 2. The expression used to calculate the new value.
  • 26. 26 Higher Order Functions in Spark SQL Nesting select id, transform(vals, val -> transform(val, e -> e + 1)) as vals from nested_input_tbl Capture select id, ref_value, transform(vals, val -> ref_value + val) as vals from nested_input_tbl
  • 27. 27 Didn’t you say these were faster?
  • 29. 29 Higher Order Functions in Spark SQL Spark 2.4 will ship with following higher order functions: Array - transform - filter - exists - aggregate/reduce - zip_with Map - transform_keys - transform_values - map_filter - map_zip_with A lot of new collection based expression were also added...
  • 30. 30 Future work Disclaimer: All of this is speculative and has not been discussed on the Dev list! Arrays and Maps have received a lot of love. However working with wide structs fields is still non-trivial (a lot of typing). We can do better here: - The following dataset functions should work for nested fields: - withColumn() - withColumnRenamed() - The following functions should be added for struct fields: - select() - withColumn() - withColumnRenamed()