Alfonso Roa, Habla Computing
Working with Complex
Types in DataFrames:
Optics to the Rescue
#UnifiedDataAnalytics #SparkAISummit
Who am I
Alfonso Roa
● Scala 👍
● Spark 👍
● Functional Programming 👍
● Open source (what I can) 👍
● Big data 👍
Where I work
info@hablapps.com
Agenda
(Live code session)
• The problem of working with complex types
• How to solve it in a non-Spark world
• How to solve it in the Spark world
• …
• Profits
Notebook used
Spark optics
https://github.com/hablapps/sparkOptics
Binder
Complex types are complex
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)
Our example for the talk
val employee =
Employee("john",
Company("awesome inc",
Address("london",
Street(23, "high street")
)))
How we see it in DFs
import sparkSession.implicits._
val df = List(employee).toDF
df.show
df.printSchema
+----+--------------------+
|name| company|
+----+--------------------+
|john|[awesome inc, [lo...|
+----+--------------------+
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = true)
| | | |-- number: integer (nullable = false)
| | | |-- name: string (nullable = true)
Changes in DF
val employeeNameChanged = df.select(
  concat(df("name"), lit("!!!")).as("name"),
  df("company")
)
employeeNameChanged.show
employeeNameChanged.printSchema
+-------+--------------------+
| name| company|
+-------+--------------------+
|john!!!|[awesome inc, [lo...|
+-------+--------------------+
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| ...
Changes in complex structs
val companyNameChanged = df.select(
df("name"),
struct(
concat(df("company.name"),lit("!!!")).as("name"),
df("company.address")
).as("company")
)
Even more complex structs
df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
How this is done with case classes
employee.copy(name = employee.name + "!!!")

employee.copy(company =
  employee.company.copy(name = employee.company.name + "!!!")
)
Employee(
"john!!!",
Company("awesome inc", Address("london", Street(23,
"high street")))
)
Employee(
"john",
Company("awesome inc!!!", Address("london",
Street(23, "high street")))
)
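The nested copy calls above can be abstracted into a minimal hand-rolled lens: a getter plus an immutable setter. This is only an illustrative sketch of the idea (MyLens is a hypothetical name, not Monocle's API, and the case classes are simplified to two fields):

```scala
// A minimal hand-rolled lens: a getter and an immutable setter.
case class MyLens[S, A](get: S => A, set: A => S => S) {
  // Apply a function to the focused element, rebuilding the whole value.
  def modify(f: A => A): S => S = s => set(f(get(s)))(s)
  // Compose with a lens that focuses deeper inside A.
  def andThen[B](that: MyLens[A, B]): MyLens[S, B] =
    MyLens(
      get = s => that.get(get(s)),
      set = b => s => set(that.set(b)(get(s)))(s)
    )
}

case class Company(name: String)
case class Employee(name: String, company: Company)

val companyL = MyLens[Employee, Company](_.company, c => e => e.copy(company = c))
val cNameL   = MyLens[Company, String](_.name, n => c => c.copy(name = n))

// Composition hides the nested copy calls behind a single modify.
val bangCompany = (companyL andThen cNameL).modify(_ + "!!!")
// bangCompany(Employee("john", Company("awesome inc")))
//   == Employee("john", Company("awesome inc!!!"))
```

This is exactly what Monocle generalizes and generates for us, as the following slides show.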
Immutability is hard
Very similar...
BUT WE HAVE OPTICS!
Monocle
Scala optics library
https://julien-truffaut.github.io/Monocle/
Lenses are used to focus on an element
import monocle.Lens
import monocle.macros.GenLens
val employeeName : Lens[Employee, String] = GenLens[Employee](_.name)
Employee is the context, _.name is the element to focus on, and GenLens is the macro generator for the lens.
Lenses are used to focus on an element
employeeName.get(employee)
// returns "john"

val f: Employee => Employee =
  employeeName.set("James")
f(employee)
Employee(
"James",
Company("awesome inc", Address("london",
Street(23, "high street")))
)
val f: Employee => Employee =
  employeeName.modify(a => a + "!!!")
f(employee)
Employee(
"john!!!",
Company("awesome inc", Address("london",
Street(23, "high street")))
)
Optics can be merged
import monocle.Lens
import monocle.macros.GenLens
val company : Lens[Employee, Company] = GenLens[Employee](_.company)
val address : Lens[Company , Address] = GenLens[Company](_.address)
val street : Lens[Address , Street] = GenLens[Address](_.street)
val streetName: Lens[Street , String] = GenLens[Street](_.name)
val employeeStreet: Lens[Employee, String] = company composeLens address composeLens street composeLens streetName
They are composable
Functionality
val streetChanger:Employee => Employee = employeeStreet.modify(_ + "!!!")
streetChanger(employee)
Employee(
"john",
Company("awesome inc", Address("london", Street(23, "high street!!!")))
)
How lucky they are
So easy
Wish there was something like this for Spark DataFrames…
Spark optics!
https://github.com/hablapps/sparkOptics
Similar to typed optics
import org.hablapps.sparkOptics.Lens
import org.hablapps.sparkOptics.syntax._
val lens = Lens("name")(df.schema)
"name" is the element to focus on; df.schema is the context.
Same methods, including modify
val lens = Lens("name")(df.schema)
val column: Column = lens.get(df)
val transformedDF = df.select(lens.modify(c => concat(c, lit("!!!"))): _*)
transformedDF.printSchema
transformedDF.as[Employee].head
(Column => Column) => Array[Column]
Same methods, including modify
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
Employee(
"john!!!",
Company("awesome inc", Address("london", Street(23, "high street")))
)
Creating the lenses
But getting the context for inner elements is not as easy as with typed optics.
import org.apache.spark.sql.types.StructType
val companyL: Lens = Lens("company")(df.schema)
val companySchema = df.schema.fields.find(_.name == "company").get.dataType.asInstanceOf[StructType]
val addressL = Lens("address")(companySchema)
val addressSchema = companySchema.fields.find(_.name == "address").get.dataType.asInstanceOf[StructType]
val streetL = Lens("street")(addressSchema)
val streetSchema = addressSchema.fields.find(_.name == "street").get.dataType.asInstanceOf[StructType]
val streetNameL = Lens("name")(streetSchema)
Get the schema of the inner element
And again and again… 😔
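The repeated find/asInstanceOf boilerplate above can be factored into a small helper that walks a dot-separated path through nested StructTypes. This is a hypothetical sketch (schemaAt is not part of spark-optics), shown here only to make the pattern explicit:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: resolve the StructType of a nested element by
// following a dot-separated path, so each Lens can get its context.
def schemaAt(schema: StructType, path: String): StructType =
  path.split('.').foldLeft(schema) { (st, fieldName) =>
    st.fields.find(_.name == fieldName).map(_.dataType) match {
      case Some(s: StructType) => s
      case Some(other)         => sys.error(s"'$fieldName' is not a struct: $other")
      case None                => sys.error(s"no field '$fieldName' in $st")
    }
  }

// val streetSchema = schemaAt(df.schema, "company.address.street")
// val streetNameL  = Lens("name")(streetSchema)
```

As the next slides show, the library solves the same problem more directly with ProtoLens composition.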
Composable
But they are still composable
val employeeCompanyStreetName =
companyL composeLens addressL composeLens streetL composeLens streetNameL
val modifiedDF = df.select(employeeCompanyStreetName.set(lit("new street name")): _*)
modifiedDF.as[Employee].head
Employee(
"john",
Company("awesome inc", Address("london", Street(23, "new street name")))
)
Creating easier lenses
Introducing the ProtoLens: a lens without a context (yet)
val companyL: Lens = Lens("company")(df.schema)
val addressProtolens: ProtoLens = Lens("address")
val composedLens: Lens = companyL composeProtoLens addressProtolens
val composedProto: ProtoLens = Lens("a") composeProtoLens Lens("b")
Checks that companyL's schema contains the address element; otherwise it throws an error.
No schema in either element? Still a valid ProtoLens.
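Conceptually, a ProtoLens can be thought of as a lens constructor still waiting for its schema; the following type alias is a sketch of the idea (the library's actual definition may differ):

```scala
import org.apache.spark.sql.types.StructType
import org.hablapps.sparkOptics.Lens

// Sketch: a ProtoLens is a function from a schema to a Lens.
// When a Lens (which already knows its schema) composes with a
// ProtoLens, it can resolve the inner element's schema and apply the
// ProtoLens immediately — which is why it can fail fast on a missing
// element, while ProtoLens-to-ProtoLens composition stays deferred.
type ProtoLens = StructType => Lens
```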
Sugar in composition
Similar syntax to Spark SQL
val sweetLens = Lens("company.address.street.name")(df.schema)
val sourLens = Lens("company")(df.schema) composeProtoLens
Lens("address") composeProtoLens
Lens("street") composeProtoLens
Lens("name")
Comparison
val flashLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(flashLens.modify(upper):_*)
Much better than
val mDF = df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
And lens functions are reusable
Extra functionality
Schema changing functions
Prune
Deletes elements inside a struct
val flashLens = Lens("company.address.street.name")(df.schema)
df.select(flashLens.prune(Vector.empty):_*).printSchema
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
(the focused street.name field has been deleted)
Rename
Renames the focused element inside a struct
val flashLens = Lens("company.address.street.name")(df.schema)
df.select(flashLens.rename("newName"):_*).printSchema
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- newName: string (nullable = true)
Future Work
New types of optics (traversals)
Implement them with Spark's internal model rather than the public API (if it's worth it)
Compatibility with other APIs (Frameless)
Thanks for your interest
Links:
Monocle
https://julien-truffaut.github.io/Monocle/
Spark optics
https://github.com/hablapps/sparkOptics
Social networks
Habla computing:
www.hablapps.com
@hablapps
Alfonso Roa
https://linkedin.com/in/roaalfonso
@saco_pepe
QUESTIONS?
Thanks for attending
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 

Recently uploaded (20)

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 

Working with Complex Types in DataFrames: Optics to the Rescue

  • 1. WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
  • 2. Alfonso Roa, Habla Computing Working with Complex Types in DataFrames: Optics to the Rescue #UnifiedDataAnalytics #SparkAISummit
  • 3. Who am I 3#UnifiedDataAnalytics #SparkAISummit Alfonso Roa ● Scala 👍 ● Spark 👍 ● Functional Programming 👍 ● Open source (what I can) 👍 ● Big data 👍
  • 4. Where I work 4#UnifiedDataAnalytics #SparkAISummit info@hablapps.com
  • 5. Agenda (Live code session) • The problem of working with complex types • How to solve it in a non-Spark world • How to solve it in a Spark world • … • Profits 5
  • 8. Complex types are complex case class Street(number: Int, name: String) case class Address(city: String, street: Street) case class Company(name: String, address: Address) case class Employee(name: String, company: Company) 8#UnifiedDataAnalytics #SparkAISummit
  • 9. Our example for the talk val employee = Employee("john", Company("awesome inc", Address("london", Street(23, "high street") ))) 9#UnifiedDataAnalytics #SparkAISummit
  • 10. How we see it in DFs import sparkSession.implicits._ val df = List(employee).toDF df.show df.printSchema 10#UnifiedDataAnalytics #SparkAISummit +----+--------------------+ |name| company| +----+--------------------+ |john|[awesome inc, [lo...| +----+--------------------+ root |-- name: string (nullable = true) |-- company: struct (nullable = true) | |-- name: string (nullable = true) | |-- address: struct (nullable = true) | | |-- city: string (nullable = true) | | |-- street: struct (nullable = true) | | | |-- number: integer (nullable = false) | | | |-- name: string (nullable = true)
  • 11. Changes in DF val employeeNameChanged = df.select( concat(df("name"),lit("!!!")).as("name") , df("company") ) employeeNameChanged.show employeeNameChanged.printSchema 11#UnifiedDataAnalytics #SparkAISummit +-------+--------------------+ | name| company| +-------+--------------------+ |john!!!|[awesome inc, [lo...| +-------+--------------------+ root |-- name: string (nullable = true) |-- company: struct (nullable = true) | ...
  • 12. Changes in complex structs val companyNameChanged = df.select( df("name"), struct( concat(df("company.name"),lit("!!!")).as("name"), df("company.address") ).as("company") ) 12#UnifiedDataAnalytics #SparkAISummit
  • 13. Even more complex structs df.select(df("name"),struct( df("company.name").as("name"), struct( df("company.address.city").as("city"), struct( df("company.address.street.number").as("number"), upper(df("company.address.street.name")).as("name") ).as("street") ).as("address") ).as("company")) 13#UnifiedDataAnalytics #SparkAISummit
  • 14. How this is made with case class employee.copy(name = employee.name+"!!!") employee.copy(company = employee.company.copy(name = employee.company.name+"!!!") ) 14#UnifiedDataAnalytics #SparkAISummit Employee( "john!!!", Company("awesome inc", Address("london", Street(23, "high street"))) ) Employee( "john", Company("awesome inc!!!", Address("london", Street(23, "high street"))) )
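Updating the street name, three levels deep, shows how fast nested `copy` calls pile up without optics (plain Scala, using only the case classes defined above):

```scala
// Every enclosing case class must be copied by hand to change one leaf field.
val renamedStreet: Employee = employee.copy(
  company = employee.company.copy(
    address = employee.company.address.copy(
      street = employee.company.address.street.copy(name = "new street name")
    )
  )
)
```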
  • 15. Immutability is hard Very similar... BUT WE HAVE OPTICS! Monocle Scala optics library https://julien-truffaut.github.io/Monocle/ 15#UnifiedDataAnalytics #SparkAISummit
  • 16. Lenses used to focus on an element import monocle.Lens import monocle.macros.GenLens val employeeName : Lens[Employee, String] = GenLens[Employee](_.name) 16#UnifiedDataAnalytics #SparkAISummit The context The element to focus on Macro generator for the lens
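The macro is a convenience, not a requirement. A hand-written equivalent, assuming Monocle's `Lens.apply(get)(set)` constructor, is just a getter paired with a copy-based setter:

```scala
import monocle.Lens

// Hand-written version of GenLens[Employee](_.name):
// get reads the field, set rebuilds the employee immutably.
val employeeNameManual: Lens[Employee, String] =
  Lens[Employee, String](_.name)(newName => emp => emp.copy(name = newName))
```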
  • 17. Lenses used to focus on an element employeeName.get(employee) returns "john" 17#UnifiedDataAnalytics #SparkAISummit val f: Employee => Employee = employeeName.set("James") f(employee) Employee( "James", Company("awesome inc", Address("london", Street(23, "high street"))) ) val f: Employee => Employee = employeeName.modify(a => a + "!!!") f(employee) Employee( "john!!!", Company("awesome inc", Address("london", Street(23, "high street"))) )
  • 18. Optics can be merged import monocle.Lens import monocle.macros.GenLens val company : Lens[Employee, Company] = GenLens[Employee](_.company) val address : Lens[Company , Address] = GenLens[Company](_.address) val street : Lens[Address , Street] = GenLens[Address](_.street) val streetName: Lens[Street , String] = GenLens[Street](_.name) val employeeStreet: Lens[Employee, String] = company composeLens address composeLens street composeLens streetName 18#UnifiedDataAnalytics #SparkAISummit They are composable
  • 19. Functionality val streetChanger:Employee => Employee = employeeStreet.modify(_ + "!!!") streetChanger(employee) Employee( "john", Company("awesome inc", Address("london", Street(23, "high street!!!"))) ) 19#UnifiedDataAnalytics #SparkAISummit
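There is no magic under the hood. A minimal, dependency-free sketch of the lens idea (not Monocle's actual implementation, just the core shape) fits in a few lines:

```scala
// A lens is a getter plus an immutable setter; modify and
// composition fall out of those two functions.
final case class MiniLens[S, A](get: S => A, set: A => S => S) {
  def modify(f: A => A): S => S = s => set(f(get(s)))(s)
  def andThenLens[B](inner: MiniLens[A, B]): MiniLens[S, B] =
    MiniLens(
      s => inner.get(get(s)),
      b => s => set(inner.set(b)(get(s)))(s)
    )
}
```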
  • 20. How lucky they are. So easy! Wish there was something like this for Spark DataFrames… Spark optics! https://github.com/hablapps/sparkOptics 20#UnifiedDataAnalytics #SparkAISummit
  • 21. Similar to typed optics import org.hablapps.sparkOptics.Lens import org.hablapps.sparkOptics.syntax._ val lens = Lens("name")(df.schema) 21#UnifiedDataAnalytics #SparkAISummit The context / The element to focus on
  • 22. Same methods, including modify val lens = Lens("name")(df.schema) val column: Column = lens.get(df) val transformedDF = df.select(lens.modify(c => concat(c,lit("!!!"))):_*) transformedDF.printSchema transformedDF.as[Employee].head 22#UnifiedDataAnalytics #SparkAISummit (Column => Column) => Array[Column]
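`modify` returns every column of the DataFrame, not just the focused one, which is why it is spliced into `select` with `:_*`. A sketch of `set` with the same API (assuming the spark-optics `Lens` shown on these slides):

```scala
import org.apache.spark.sql.functions.upper
import org.hablapps.sparkOptics.Lens
import org.hablapps.sparkOptics.syntax._

// set replaces the focused column while returning the full column list,
// so the surrounding select preserves the rest of the schema.
val shouting = df.select(Lens("name")(df.schema).set(upper(df("name"))): _*)
```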
  • 23. Same methods, including modify root |-- name: string (nullable = true) |-- company: struct (nullable = false) | |-- name: string (nullable = true) | |-- address: struct (nullable = false) | | |-- city: string (nullable = true) | | |-- street: struct (nullable = false) | | | |-- number: integer (nullable = true) | | | |-- name: string (nullable = true) Employee( "john!!!", Company("awesome inc", Address("london", Street(23, "high street"))) ) 23
  • 24. Creating the lenses But not as easy as typed optics: to get the context of inner elements you must extract each nested schema. import org.apache.spark.sql.types.StructType val companyL: Lens = Lens("company")(df.schema) val companySchema = df.schema.fields.find(_.name == "company").get.dataType.asInstanceOf[StructType] val addressL = Lens("address")(companySchema) val addressSchema = companySchema.fields.find(_.name == "address").get.dataType.asInstanceOf[StructType] val streetL = Lens("street")(addressSchema) val streetSchema = addressSchema.fields.find(_.name == "street").get.dataType.asInstanceOf[StructType] val streetNameL = Lens("name")(streetSchema) 24#UnifiedDataAnalytics #SparkAISummit Get the schema of the inner element, and again and again… 😔
  • 25. Composable But they are still composable val employeeCompanyStreetName = companyL composeLens addressL composeLens streetL composeLens streetNameL val modifiedDF = df.select(employeeCompanyStreetName.set(lit("new street name")):_*) modifiedDF.as[Employee].head Employee( "john", Company("awesome inc", Address("london", Street(23, "new street name"))) ) 25#UnifiedDataAnalytics #SparkAISummit
  • 26. Creating easier lenses Introducing the proto lens: a lens without a context (yet) val companyL: Lens = Lens("company")(df.schema) val addressProtolens: ProtoLens = Lens("address") val composedLens: Lens = companyL composeProtoLens addressProtolens val composedLens: ProtoLens = Lens("a") composeProtoLens Lens("b") 26#UnifiedDataAnalytics #SparkAISummit Checks that the schema of companyL has the address element, or it will throw an error. No schema in any element? Still a valid protolens
  • 27. Sugar in composition Similar syntax to spark sql val sweetLens = Lens("company.address.street.name")(df.schema) val sourLens = Lens("company")(df.schema) composeProtoLens Lens("address") composeProtoLens Lens("street") composeProtoLens Lens("name") 27#UnifiedDataAnalytics #SparkAISummit
  • 28. Comparison val flashLens = Lens("company.address.street.name")(df.schema) val modifiedDF = df.select(flashLens.modify(upper):_*) Much better than val mDF = df.select(df("name"),struct( df("company.name").as("name"), struct( df("company.address.city").as("city"), struct( df("company.address.street.number").as("number"), upper(df("company.address.street.name")).as("name") ).as("street") ).as("address") ).as("company")) And lens functions are reusable 28#UnifiedDataAnalytics #SparkAISummit
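Because a lens is built from a schema rather than tied to one DataFrame, the transformation can be packaged and reused on any DataFrame that shares the schema (a sketch, assuming the spark-optics API shown on these slides):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper
import org.hablapps.sparkOptics.Lens
import org.hablapps.sparkOptics.syntax._

// Build the lens from whatever schema the incoming DataFrame has,
// then upper-case the focused street name.
val upperStreetName: DataFrame => DataFrame = d => {
  val l = Lens("company.address.street.name")(d.schema)
  d.select(l.modify(upper): _*)
}
```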
  • 29. Extra functionality Schema changing functions 29#UnifiedDataAnalytics #SparkAISummit
  • 30. Prune Deletes elements inside of a struct val flashLens = Lens("company.address.street.name")(df.schema) df.select(flashLens.prune(Vector.empty):_*).printSchema root |-- name: string (nullable = true) |-- company: struct (nullable = false) | |-- name: string (nullable = true) | |-- address: struct (nullable = false) | | |-- city: string (nullable = true) | | |-- street: struct (nullable = false) | | | |-- number: integer (nullable = true) 30#UnifiedDataAnalytics #SparkAISummit (the focused name field is deleted)
  • 31. Rename Renames an element inside of a struct val flashLens = Lens("company.address.street.name")(df.schema) df.select(flashLens.rename("newName"):_*).printSchema root |-- name: string (nullable = true) |-- company: struct (nullable = false) | |-- name: string (nullable = true) | |-- address: struct (nullable = false) | | |-- city: string (nullable = true) | | |-- street: struct (nullable = false) | | | |-- number: integer (nullable = true) | | | |-- newName: string (nullable = true) 31#UnifiedDataAnalytics #SparkAISummit
  • 32. Future Work New types of optics (traversals) Implement them with Spark's inner model, not with the public API (if it is worth it). Compatibility with other APIs (Frameless) 32
  • 33. Thanks for your interest Links: Monocle https://julien-truffaut.github.io/Monocle/ Spark optics https://github.com/hablapps/sparkOptics 33#UnifiedDataAnalytics #SparkAISummit
  • 34. Social networks Habla computing: www.hablapps.com @hablapps Alfonso Roa https://linkedin.com/in/roaalfonso @saco_pepe 34#UnifiedDataAnalytics #SparkAISummit
  • 36. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT