
Working with Complex Types in DataFrames: Optics to the Rescue



Working with complex types shouldn’t be a complex job. DataFrames provide a great SQL-oriented API for data transformation, but it doesn’t help much when the time comes to update elements of complex types like structs or arrays. In such cases, your program quickly turns into a humongous tangle of struct calls and parentheses as you transform inner elements and reconstruct your column. This is exactly the same problem that we encounter when working with immutable data structures in functional programming, and optics were invented to solve it. Couldn’t we use something similar to optics in the DataFrame realm?

In this talk, we will show how we can enrich the DataFrame API with the design patterns that lenses, one of the most common types of optics, put forward to manipulate immutable data structures. We will show how these patterns are implemented in the spark-optics library, an analogue of the Scala Monocle library, and will illustrate its use with several examples. Last but not least, we will take advantage of the dynamic type system of DataFrames to do more than transform sub-columns, such as pruning and renaming elements.





  1. WIFI SSID: Spark+AISummit | Password: UnifiedDataAnalytics
  2. Alfonso Roa, Habla Computing. Working with Complex Types in DataFrames: Optics to the Rescue. #UnifiedDataAnalytics #SparkAISummit
  3. Who am I? Alfonso Roa:
     • Scala 👍
     • Spark 👍
     • Functional Programming 👍
     • Open source (what I can) 👍
     • Big data 👍
  4. Where I work: info@hablapps.com
  5. Agenda (live code session)
     • The problem of working with complex types
     • How to solve it in a non-Spark world
     • How to solve it in a Spark world
     • …
     • Profits
  6. Notebook used: Spark optics, https://github.com/hablapps/sparkOptics
  7. Binder
  8. Complex types are complex
     case class Street(number: Int, name: String)
     case class Address(city: String, street: Street)
     case class Company(name: String, address: Address)
     case class Employee(name: String, company: Company)
  9. Our example for the talk
     val employee = Employee("john", Company("awesome inc",
       Address("london", Street(23, "high street"))))
  10. How we see it in DFs
      import sparkSession.implicits._
      val df = List(employee).toDF
      df.show
      df.printSchema

      +----+--------------------+
      |name|             company|
      +----+--------------------+
      |john|[awesome inc, [lo...|
      +----+--------------------+

      root
       |-- name: string (nullable = true)
       |-- company: struct (nullable = true)
       |    |-- name: string (nullable = true)
       |    |-- address: struct (nullable = true)
       |    |    |-- city: string (nullable = true)
       |    |    |-- street: struct (nullable = true)
       |    |    |    |-- number: integer (nullable = false)
       |    |    |    |-- name: string (nullable = true)
  11. Changes in DFs
      val employeeNameChanged = df.select(
        concat(df("name"), lit("!!!")).as("name"),
        df("company")
      )
      employeeNameChanged.show
      employeeNameChanged.printSchema

      +-------+--------------------+
      |   name|             company|
      +-------+--------------------+
      |john!!!|[awesome inc, [lo...|
      +-------+--------------------+

      root
       |-- name: string (nullable = true)
       |-- company: struct (nullable = true)
       |    ...
  12. Changes in complex structs
      val companyNameChanged = df.select(
        df("name"),
        struct(
          concat(df("company.name"), lit("!!!")).as("name"),
          df("company.address")
        ).as("company")
      )
  13. Even more complex structs
      df.select(df("name"), struct(
        df("company.name").as("name"),
        struct(
          df("company.address.city").as("city"),
          struct(
            df("company.address.street.number").as("number"),
            upper(df("company.address.street.name")).as("name")
          ).as("street")
        ).as("address")
      ).as("company"))
  14. How this is done with case classes
      employee.copy(name = employee.name + "!!!")
      // Employee("john!!!", Company("awesome inc", Address("london", Street(23, "high street"))))

      employee.copy(company =
        employee.company.copy(name = employee.company.name + "!!!")
      )
      // Employee("john", Company("awesome inc!!!", Address("london", Street(23, "high street"))))
  15. Immutability is hard. Very similar… BUT WE HAVE OPTICS! Monocle, the Scala optics library: https://julien-truffaut.github.io/Monocle/
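Before looking at Monocle's API, the core idea behind a lens can be sketched in a few lines of plain Scala. This SimpleLens is a hypothetical toy for intuition only (Monocle's real encoding differs): a lens pairs a getter with an immutable setter, and composition lets two lenses zoom through nested structures in one step.

```scala
// Toy lens sketch (NOT Monocle's encoding): a getter plus an immutable
// setter that rebuilds the whole structure around the updated focus.
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)

final case class SimpleLens[S, A](get: S => A, set: A => S => S) {
  // modify: read the focus, transform it, write it back
  def modify(f: A => A): S => S = s => set(f(get(s)))(s)
  // composition: zoom through `this`, then through `other`
  def composeLens[B](other: SimpleLens[A, B]): SimpleLens[S, B] =
    SimpleLens(
      s => other.get(get(s)),
      b => s => set(other.set(b)(get(s)))(s)
    )
}

val streetL = SimpleLens[Address, Street](_.street, st => a => a.copy(street = st))
val nameL   = SimpleLens[Street, String](_.name, n => st => st.copy(name = n))
val addressStreetNameL = streetL composeLens nameL

val addr    = Address("london", Street(23, "high street"))
val shouted = addressStreetNameL.modify(_ + "!!!")(addr)
```

With this in hand, addressStreetNameL.modify reaches two levels deep without any nested .copy calls, and the original addr is left untouched, which is exactly the ergonomics the next slides get from Monocle.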
  16. Lenses are used to focus on an element
      import monocle.Lens
      import monocle.macros.GenLens
      // GenLens is a macro generator for the lens:
      // Employee is the context, _.name the element to focus on
      val employeeName: Lens[Employee, String] = GenLens[Employee](_.name)
  17. Lenses are used to focus on an element
      employeeName.get(employee)
      // returns "john"

      val f: Employee => Employee = employeeName.set("James")
      f(employee)
      // Employee("James", Company("awesome inc", Address("london", Street(23, "high street"))))

      val g: Employee => Employee = employeeName.modify(a => a + "!!!")
      g(employee)
      // Employee("john!!!", Company("awesome inc", Address("london", Street(23, "high street"))))
  18. Optics can be merged: they are composable
      import monocle.Lens
      import monocle.macros.GenLens
      val company   : Lens[Employee, Company] = GenLens[Employee](_.company)
      val address   : Lens[Company,  Address] = GenLens[Company](_.address)
      val street    : Lens[Address,  Street]  = GenLens[Address](_.street)
      val streetName: Lens[Street,   String]  = GenLens[Street](_.name)

      val employeeStreet: Lens[Employee, String] =
        company composeLens address composeLens street composeLens streetName
  19. Functionality
      val streetChanger: Employee => Employee = employeeStreet.modify(_ + "!!!")
      streetChanger(employee)
      // Employee("john", Company("awesome inc", Address("london", Street(23, "high street!!!"))))
  20. How lucky they are. So easy! Wish there was something like this for Spark DataFrames… Spark optics! https://github.com/hablapps/sparkOptics
  21. Similar to typed optics
      import org.hablapps.sparkOptics.Lens
      import org.hablapps.sparkOptics.syntax._
      // "name" is the element to focus on; df.schema is the context
      val lens = Lens("name")(df.schema)
  22. Same methods, including modify
      val lens = Lens("name")(df.schema)
      val column: Column = lens.get(df)
      // modify: (Column => Column) => Array[Column]
      val transformedDF = df.select(lens.modify(c => concat(c, lit("!!!"))): _*)
      transformedDF.printSchema
      transformedDF.as[Employee].head
  23. Same methods, including modify
      root
       |-- name: string (nullable = true)
       |-- company: struct (nullable = false)
       |    |-- name: string (nullable = true)
       |    |-- address: struct (nullable = false)
       |    |    |-- city: string (nullable = true)
       |    |    |-- street: struct (nullable = false)
       |    |    |    |-- number: integer (nullable = true)
       |    |    |    |-- name: string (nullable = true)

      Employee("john!!!", Company("awesome inc", Address("london", Street(23, "high street"))))
  24. Creating the lenses. Not as easy as typed optics: to focus on inner elements you must fetch the schema of each inner element, again and again… 😔
      import org.apache.spark.sql.types.StructType
      val companyL: Lens = Lens("company")(df.schema)
      val companySchema = df.schema.fields
        .find(_.name == "company").get.dataType.asInstanceOf[StructType]
      val addressL = Lens("address")(companySchema)
      val addressSchema = companySchema.fields
        .find(_.name == "address").get.dataType.asInstanceOf[StructType]
      val streetL = Lens("street")(addressSchema)
      val streetSchema = addressSchema.fields
        .find(_.name == "street").get.dataType.asInstanceOf[StructType]
      val streetNameL = Lens("name")(streetSchema)
  25. Composable. But they are still composable:
      val employeeCompanyStreetName =
        companyL composeLens addressL composeLens streetL composeLens streetNameL
      val modifiedDF = df.select(employeeCompanyStreetName.set(lit("new street name")): _*)
      modifiedDF.as[Employee].head
      // Employee("john", Company("awesome inc", Address("london", Street(23, "new street name"))))
  26. Creating easier lenses. Enter the proto-lens: a lens without a context (yet).
      val companyL: Lens = Lens("company")(df.schema)
      val addressProtolens: ProtoLens = Lens("address")
      // Composing checks that the schema of companyL has the address element,
      // or it will throw an error
      val composedLens: Lens = companyL composeProtoLens addressProtolens
      // No schema in any element? Still a valid proto-lens
      val composedProtoLens: ProtoLens = Lens("a") composeProtoLens Lens("b")
  27. Sugar in composition: similar syntax to Spark SQL
      val sweetLens = Lens("company.address.street.name")(df.schema)
      val sourLens = Lens("company")(df.schema) composeProtoLens
        Lens("address") composeProtoLens Lens("street") composeProtoLens Lens("name")
  28. Comparison
      val flashLens = Lens("company.address.street.name")(df.schema)
      val modifiedDF = df.select(flashLens.modify(upper): _*)
      Much better than:
      val mDF = df.select(df("name"), struct(
        df("company.name").as("name"),
        struct(
          df("company.address.city").as("city"),
          struct(
            df("company.address.street.number").as("number"),
            upper(df("company.address.street.name")).as("name")
          ).as("street")
        ).as("address")
      ).as("company"))
      And lens functions are reusable.
  29. Extra functionality: schema-changing functions
  30. Prune: deletes elements inside of a struct
      val flashLens = Lens("company.address.street.name")(df.schema)
      df.select(flashLens.prune(Vector.empty): _*).printSchema
      root
       |-- name: string (nullable = true)
       |-- company: struct (nullable = false)
       |    |-- name: string (nullable = true)
       |    |-- address: struct (nullable = false)
       |    |    |-- city: string (nullable = true)
       |    |    |-- street: struct (nullable = false)
       |    |    |    |-- number: integer (nullable = true)
       |    |    |    |-- name: string (nullable = true)   <- deleted
  31. Rename: renames elements inside of a struct
      val flashLens = Lens("company.address.street.name")(df.schema)
      df.select(flashLens.rename("newName"): _*).printSchema
      root
       |-- name: string (nullable = true)
       |-- company: struct (nullable = false)
       |    |-- name: string (nullable = true)
       |    |-- address: struct (nullable = false)
       |    |    |-- city: string (nullable = true)
       |    |    |-- street: struct (nullable = false)
       |    |    |    |-- number: integer (nullable = true)
       |    |    |    |-- newName: string (nullable = true)
  32. Future work
      • New types of optics (traversals)
      • Implement them with Spark's internal model, not the public API (if it is worth it)
      • Compatibility with other APIs (Frameless)
  33. Thanks for your interest. Links:
      Monocle: https://julien-truffaut.github.io/Monocle/
      Spark optics: https://github.com/hablapps/sparkOptics
  34. Social networks
      Habla Computing: www.hablapps.com, @hablapps
      Alfonso Roa: https://linkedin.com/in/roaalfonso, @saco_pepe
  35. QUESTIONS? Thanks for attending
  36. DON'T FORGET TO RATE AND REVIEW THE SESSIONS. SEARCH SPARK + AI SUMMIT
