This document summarizes Alfonso Roa's presentation on using optics to work with complex types in Spark DataFrames. The presentation introduces the problem of manipulating nested structures in DataFrames and demonstrates how optics libraries like Monocle can be used to focus on specific elements. It then shows how Spark optics provides a similar lens-based API for DataFrames, allowing changes to nested elements to be made easily through composition of lenses. The presentation concludes by discussing additional lens functionality for schema changes and future work to improve Spark optics.
5. Agenda
(Live code session)
• The problem of working with complex types
• How to solve it in a non-Spark world
• How to solve it in a Spark world
• …
• Profit
8. Complex types are complex
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)
#UnifiedDataAnalytics #SparkAISummit
9. Our example for the talk
val employee =
  Employee("john",
    Company("awesome inc",
      Address("london",
        Street(23, "high street"))))
12. Changes in complex structs
val companyNameChanged = df.select(
df("name"),
struct(
concat(df("company.name"),lit("!!!")).as("name"),
df("company.address")
).as("company")
)
13. Even more complex structs
df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
14. How this is done with case classes
employee.copy(name = employee.name+"!!!")
employee.copy(company =
employee.company.copy(name =
employee.company.name+"!!!")
)
Employee(
"john!!!",
Company("awesome inc", Address("london", Street(23,
"high street")))
)
Employee(
"john",
Company("awesome inc!!!", Address("london",
Street(23, "high street")))
)
15. Immutability is hard
Very similar...
BUT WE HAVE OPTICS!
Monocle
Scala optics library
https://julien-truffaut.github.io/Monocle/
16. Lenses are used to focus on an element
import monocle.Lens
import monocle.macros.GenLens
val employeeName : Lens[Employee, String] = GenLens[Employee](_.name)
In Lens[Employee, String], Employee is the context and String is the element to focus on; GenLens is a macro that generates the lens.
17. Lenses are used to focus on an element
employeeName.get(employee)
// returns "john"

val f: Employee => Employee =
  employeeName.set("James")
f(employee)
// Employee(
//   "James",
//   Company("awesome inc", Address("london", Street(23, "high street")))
// )

val g: Employee => Employee =
  employeeName.modify(a => a + "!!!")
g(employee)
// Employee(
//   "john!!!",
//   Company("awesome inc", Address("london", Street(23, "high street")))
// )
18. Optics can be merged
import monocle.Lens
import monocle.macros.GenLens
val company : Lens[Employee, Company] = GenLens[Employee](_.company)
val address : Lens[Company , Address] = GenLens[Company](_.address)
val street : Lens[Address , Street] = GenLens[Address](_.street)
val streetName: Lens[Street , String] = GenLens[Street](_.name)
val employeeStreet: Lens[Employee, String] =
  company composeLens address composeLens street composeLens streetName
They are composable
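Why composition works can be seen in a minimal hand-rolled lens. The sketch below is illustrative only (the names `MyLens` and `andThen` are not Monocle's API): a lens is just a getter plus a copy-on-write setter, and composing two lenses chains both.

```scala
// Minimal lens: a getter plus a copy-on-write setter
final case class MyLens[S, A](get: S => A, set: A => S => S) {
  // Apply a function to the focused element, rebuilding the whole structure
  def modify(f: A => A): S => S = s => set(f(get(s)))(s)
  // Compose with a lens that focuses deeper inside A
  def andThen[B](other: MyLens[A, B]): MyLens[S, B] =
    MyLens[S, B](
      s => other.get(get(s)),
      b => s => set(other.set(b)(get(s)))(s)
    )
}

case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)

val company    = MyLens[Employee, Company](_.company, c => e => e.copy(company = c))
val address    = MyLens[Company, Address](_.address,  a => c => c.copy(address = a))
val street     = MyLens[Address, Street](_.street,    s => a => a.copy(street = s))
val streetName = MyLens[Street, String](_.name,       n => s => s.copy(name = n))

// One expression drills four levels deep
val employeeStreet = company andThen address andThen street andThen streetName

val employee = Employee("john",
  Company("awesome inc", Address("london", Street(23, "high street"))))

val moved = employeeStreet.set("low street")(employee)
```

The nested `copy` calls from slide 14 have not disappeared; they are written once, inside each small lens, and composition reuses them.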
20. How lucky they are
So easy
Wish there was something like this for Spark DataFrames…
Spark optics!
https://github.com/hablapps/sparkOptics
21. Similar to typed optics
import org.hablapps.sparkOptics.Lens
import org.hablapps.sparkOptics.syntax._
val lens = Lens("name")(df.schema)
In Lens("name")(df.schema), "name" is the element to focus on and df.schema is the context.
22. Same methods, including modify
val lens = Lens("name")(df.schema)
val column: Column = lens.get(df)
val transformedDF =
  df.select(lens.modify(c => concat(c, lit("!!!"))): _*)
transformedDF.printSchema
transformedDF.as[Employee].head
modify has type (Column => Column) => Array[Column]
24. Creating the lenses
Not as easy as with typed optics, though: to focus on inner elements you have to extract each nested schema by hand.
import org.apache.spark.sql.types.StructType
val companyL: Lens = Lens("company")(df.schema)
val companySchema = df.schema.fields.find(_.name == "company").get.dataType.asInstanceOf[StructType]
val addressL = Lens("address")(companySchema)
val addressSchema = companySchema.fields.find(_.name == "address").get.dataType.asInstanceOf[StructType]
val streetL = Lens("street")(addressSchema)
val streetSchema = addressSchema.fields.find(_.name == "street").get.dataType.asInstanceOf[StructType]
val streetNameL = Lens("name")(streetSchema)
Each step requires digging out the schema of the inner element… and again, and again 😔
25. Composable
But they are still composable
val employeeCompanyStreetName =
  companyL composeLens addressL composeLens streetL composeLens streetNameL
val modifiedDF =
  df.select(employeeCompanyStreetName.set(lit("new street name")): _*)
modifiedDF.as[Employee].head
Employee(
"john",
Company("awesome inc", Address("london", Street(23, "new street name")))
)
26. Creating easier lenses
Introducing the ProtoLens: a lens without a context (yet)
val companyL: Lens = Lens("company")(df.schema)
val addressProtolens: ProtoLens = Lens("address")

// Checks that the schema of companyL has the "address" element,
// or it will throw an error:
val composedLens: Lens = companyL composeProtoLens addressProtolens

// No schema in either element? Still a valid ProtoLens:
val composedProto: ProtoLens = Lens("a") composeProtoLens Lens("b")
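The ProtoLens idea can be pictured as a lens constructor that is still waiting for its schema. Below is a toy model over nested maps; all names here (`Schema`, `SLens`, `protoLens`, `composeProto`) are illustrative, not sparkOptics internals. It shows why composing a concrete lens with a ProtoLens can validate the schema eagerly:

```scala
// Toy schema: field name -> nested schema (Map) or a leaf type name (String)
type Schema = Map[String, Any]

// A lens that has already been checked against its schema
final case class SLens(field: String, schema: Schema) {
  require(schema.contains(field), s"field '$field' not found in schema")
}

// A ProtoLens is a lens still waiting for its schema
type ProtoLens = Schema => SLens
def protoLens(field: String): ProtoLens = schema => SLens(field, schema)

// Composing a concrete lens with a ProtoLens supplies the inner schema
// immediately, so a missing field fails here rather than at query time
def composeProto(outer: SLens, inner: ProtoLens): SLens =
  inner(outer.schema(outer.field).asInstanceOf[Schema])

val schema: Schema =
  Map(
    "name" -> "string",
    "company" -> Map(
      "name" -> "string",
      "address" -> Map("city" -> "string")))

val companyL = SLens("company", schema)
val addressL = composeProto(companyL, protoLens("address")) // ok: "address" exists
```

In this model, `composeProto(companyL, protoLens("street"))` would throw, mirroring the eager schema check described above.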
27. Sugar in composition
Similar syntax to Spark SQL
val sweetLens = Lens("company.address.street.name")(df.schema)
val sourLens =
  Lens("company")(df.schema) composeProtoLens
    Lens("address") composeProtoLens
    Lens("street") composeProtoLens
    Lens("name")
28. Comparison
val flashLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(flashLens.modify(upper):_*)
Much better than
val mDF = df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
And lens functions are reusable
32. Future Work
New types of optics (traversable)
Implement them with Spark's internal model rather than the public API (if it is worth it).
Compatibility with other APIs
(Frameless)
33. Thanks for your interest
Links:
Monocle
https://julien-truffaut.github.io/Monocle/
Spark optics
https://github.com/hablapps/sparkOptics