
Type Checking Scala Spark Datasets: Dataset Transforms

A library to do compile-time type checking of Scala code that uses Spark Datasets while also enabling Spark query optimization.


  1. Type Checking Scala Spark Datasets: Dataset Transforms
 John Nestor, 47 Degrees (www.47deg.com), Seattle Spark Meetup, September 22, 2016
  2. 47deg.com © Copyright 2016 47 Degrees. Outline • Introduction • Transforms • Demos • Implementation • Getting the Code
  3. Introduction
  4. Spark Scala APIs
 • RDD (pass closures): functional programming model; types checked at compile time
 • DataFrame (pass SQL): SQL programming model (can be optimized); types checked at run time
 • Dataset (pass closures or SQL): combines the best of RDDs and DataFrames; some (not all) types checked at compile time
  5. Run-Time Scala Checking
 • Field/column names: specified as strings; run-time error if no such field
 • Field/column types: specified via casting to the expected type; run-time error if the value is not of that type
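The difference between the two checking styles can be illustrated without Spark at all. The sketch below is a plain-Scala analogy (not Spark itself): a `Map[String, Any]` stands in for a DataFrame-style row, where field lookup and casting can only fail at run time, while a case class fails at compile time.

```scala
import scala.util.Try

// Plain-Scala analogy of DataFrame-style run-time checking (hypothetical,
// not Spark code): fields are looked up by string name and types are
// recovered by casting, so mistakes surface only when the code runs.
object RuntimeVsCompileTime {
  val row: Map[String, Any] = Map("a" -> 3, "b" -> "foo")

  val ok: Int = row("a").asInstanceOf[Int] + 1      // works, but only checked at run time
  val badName = Try(row("nope"))                    // NoSuchElementException at run time
  val badType = Try(row("b").asInstanceOf[Int])     // ClassCastException at run time

  // Case-class style: a misspelled field or wrong type would not compile at all.
  case class ABC(a: Int, b: String)
  val abc = ABC(3, "foo")
  val checked: Int = abc.a + 1                      // abc.nope or abc.b + 1 is a compile error
}
```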
  6. Dataset Example

 case class ABC(a: Int, b: String, c: String)
 case class CA(c: String, a: Int)

 val abc = ABC(3, "foo", "test")
 val abc1 = ABC(5, "xxx", "alpha")
 val abc3 = ABC(10, "aaa", "aaa")
 val abcs = Seq(abc, abc1, abc3)
 val ds = abcs.toDS()

 /* Compile-time type checking, but must pass a closure, so Spark can't optimize */
 val ds1 = ds.map(abc => CA(abc.b, abc.a * 2 + abc.a))

 /* Can be query optimized, but run-time type and field-name checking */
 val ds2 = ds.select($"b" as "c", ($"a" * 2 + $"a") as "a").as[CA]
  7. Transforms
  8. Goal
 • Add strong typing to Scala Spark Datasets: check field names and field types at compile time
 • Each transform maps one or more Datasets to a new Dataset
 • Dataset rows are compile-time types: Scala case classes
  9. Transform Example

 case class ABC(a: Int, b: String, c: String)
 case class CA(c: String, a: Int)

 val abc = ABC(3, "foo", "test")
 val abc1 = ABC(5, "xxx", "alpha")
 val abc3 = ABC(10, "aaa", "aaa")
 val abcs = Seq(abc, abc1, abc3)
 val ds = abcs.toDS()

 /* Compile-time type checking and query optimization */
 val smap = SqlMap[ABC, CA].act(cols => (cols.b, cols.a * 2 + cols.a))
 val ds3 = smap(ds)
  10. Current Transforms • Filter • Map • Sort • Join (combines two Datasets) • Aggregate (sum, count, max)
  11. Demos
  12. Demo
 • Dataset example: map, select
 • Transform examples: Map, Sort, Join, Filter, Aggregate
  13. Implementation
  14. Scala Macros
 • Scala code executed at compile time
 • Two kinds:
 • Black box: the single result type is specified in advance
 • White box: the result type is computed by the macro (the kind used here)
  15. Transform Implementation

 case class Person(name: String, age: Int)
 val p = Person("Sam", 30)

 • A Scala macro converts
 • from: an arbitrary case class type (e.g. Person)
 • to: a meta structure that encodes its field names and types:

 case class PersonM(name: StringCol, age: IntCol)
 val cols = PersonM(name = StringCol("name"), age = IntCol("age"))
  16. Column Operations
 • StrCol("A") === StrCol("B") => BoolCol("A === B")
 • IntCol("A") + IntCol("B") => IntCol("A + B")
 • IntCol("A").max => IntCol("A.max")
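The rewrite rules above can be sketched as ordinary Scala operator methods: each typed column wraps an expression string, operations concatenate the strings, and the Scala result types track the column types (Int + Int yields an Int column, === yields a Boolean column). This is an illustrative encoding, not the library's actual implementation.

```scala
// Sketch of the column-operation encoding from the slide (illustrative only).
object ColOpsSketch {
  case class BoolCol(expr: String)

  case class IntCol(expr: String) {
    // Int + Int => Int column; the expression string records the SQL to run.
    def +(other: IntCol): IntCol = IntCol(s"${expr} + ${other.expr}")
    def max: IntCol = IntCol(s"${expr}.max")
  }

  case class StrCol(expr: String) {
    // String === String => Boolean column.
    def ===(other: StrCol): BoolCol = BoolCol(s"${expr} === ${other.expr}")
  }

  val eqCol  = StrCol("A") === StrCol("B")  // BoolCol("A === B")
  val sumCol = IntCol("A") + IntCol("B")    // IntCol("A + B")
  val maxCol = IntCol("A").max              // IntCol("A.max")
}
```

Because the wrappers have distinct types, a nonsensical combination such as `IntCol("A") === StrCol("B")` is rejected at compile time, which is exactly the point of the encoding.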
  17. White Box Macro Restrictions
 • Works fine in SBT and Eclipse
 • Not supported in IntelliJ, but can still be used:
 • Reports type errors
 • Does not show available completions
  18. Getting the Code
  19. Transforms Code
 • https://github.com/nestorpersist/dataset-transform
 • Code, documentation, and examples
 • "com.persist" % "dataset-transforms_2.11" % "0.0.5"
  20. Questions
