More expressive types for Spark with Frameless
Miguel Pérez Pasalodos
@Kamugo
Raise your hand if...
● You use Spark in production
Raise your hand if...
● You use Spark in production
● You use Spark with Scala
Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
● You know what generic programming or Shapeless is
Raise your hand if...
● You use Spark in production
● You use Spark with Scala
● You know what the typeclass pattern is
● You know what generic programming or Shapeless is
● You’ve used Spark with Frameless before
Spark API evolution
RDDs
trait Person { val name: String }
case class Teacher(id: Int, name: String, salary: Double) extends Person
case class Student(id: Int, name: String) extends Person
RDDs
trait Person { val name: String }
case class Teacher(id: Int, name: String, salary: Double) extends Person
case class Student(id: Int, name: String) extends Person
val people: RDD[Person] = sc.parallelize(List(
Teacher(1, "Emma", 60000),
Student(2, "Steve"),
Student(3, "Arnold")
))
Lambdas are (almost) type-safe
val names = people.map(person => person.name)
val names = people.map {
case Teacher(_, name, _) => s"Teacher $name"
case Student(_, name) => s"Student $name"
}
Lambdas are (almost) type-safe
val names = people.map(person => person.name)
val names = people.map {
case Teacher(_, name, _) => s"Teacher $name"
case Student(_, name) => s"Student $name"
}
Possible MatchError at runtime
RDDs
● Basically, a lazy distributed immutable collection
● Compile-time type-safe
● Schema-less
● "How-to" transformations: nothing gets optimized (sketch after this list)
● Limited datasources
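A quick sketch of that "how-to" point, reusing the RDD above (assuming nothing beyond the code already shown):

// Two chained steps stay two opaque JVM functions; Spark pipelines them
// within a stage but cannot inspect, merge, or reorder them
val teacherNames = people
  .filter(_.isInstanceOf[Teacher]) // opaque to Spark: just a function
  .map(_.name)                     // same: Spark only knows how to run it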
Our model from now on
case class Person(id: Int, name: String, age: Short)
DataFrames
val people: DataFrame = List(
Person(1, "Miguel", 26),
Person(2, "Sarah", 28),
Person(2, "John", 32)
).toDF()
Mandatory schema
scala> people.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- age: short (nullable = false)
scala> people.filter($"age" !== 26).filter($"age" !== 27).explain(true)
== Parsed Logical Plan ==
'Filter NOT ('age = 27)
+- Filter NOT (cast(age#133 as int) = 26)
+- LocalRelation [id#131, name#132, age#133]
== Optimized Logical Plan ==
Filter (NOT (cast(age#133 as int) = 26) && NOT (cast(age#133 as int) = 27))
+- LocalRelation [id#131, name#132, age#133]
Query optimization
They’re not type-safe :(
val names: DataFrame = people.select("namee")
They’re not type-safe :(
AnalysisException: cannot resolve '`namee`'
given input columns: [id, name, age]
Runtime
val names: DataFrame = people.select("namee")
DataFrames
● Mandatory schema
● Declarative "what-to" specification that Spark can optimize
● Compatible with SQL (sketch after this list)
● Not type-safe
● Extensible DataSource API
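A short sketch of the SQL bullet, assuming the usual spark session in scope:

// The same DataFrame is queryable with plain SQL through a temp view
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name FROM people WHERE age > 27")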
Datasets
val people: Dataset[Person] = List(
Person(1, "Miguel", 26),
Person(2, "Sarah", 28),
Person(2, "John", 32)
).toDS()
Datasets
● Try to get the best of both worlds
● We can use lambdas as in RDDs!
○ What about performance?
● Full DataFrame API, since DataFrame = Dataset[Row]
● They seem type-safe
We can use the DataFrame API
val names: DataFrame = people.select("namee")
Still not type-safe :(
AnalysisException: cannot resolve '`namee`'
given input columns: [id, name, age]
Runtime
val names: DataFrame = people.select("namee")
But… we can cast them!
val names: Dataset[Int] = people.select("name").as[Int]
But… we can cast them! ...and fail :(
AnalysisException: Cannot up cast `name` from
string to int as it may truncate
Runtime
val names: Dataset[Int] = people.select("name").as[Int]
Lambdas...
val names: Dataset[String] = people.map(_.namee)
Lambdas… are type-safe!
Error: value namee is not a member of Person
Compile time
val names: Dataset[String] = people.map(_.namee)
What about performance?
● 2²⁵ randomly generated people
● 20 parquet files
● 4 cores
people.filter(_.age == 26).count() VS people.filter($"age" === 26).count()
What about performance?
[Benchmark chart: filter(_.age == 26) vs filter($"age" === 26)]
Encoders?
class Car(name: String)
spark.createDataset(List(
new Car("Tesla Model S")
))
Encoders?
Unable to find encoder for type stored in a Dataset
Compile time
class Car(name: String)
spark.createDataset(List(
new Car("Tesla Model S")
))
Encoders?
case class PersonCar(personId: Int, car: Car)
val cars: Dataset[PersonCar] = spark.createDataset(List(
PersonCar(1, new Car("Tesla Model S"))
))
Encoders?
UnsupportedOperationException: No Encoder found for
Car
- field (class: "Car", name: "car")
- root class: "PersonCar"
case class PersonCar(personId: Int, car: Car)
val cars: Dataset[PersonCar] = spark.createDataset(List(
PersonCar(1, new Car("Tesla Model S"))
))
Runtime
Frameless to the rescue!
Frameless
● Wraps the Spark API
● Type-safe non-lambda methods
● No run-time performance differences
● Provides a way to define custom encoders
● Actions are also lazy
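A sketch of the setup, assuming sbt; the version string is a placeholder to match against your Spark release:

// build.sbt
libraryDependencies += "org.typelevel" %% "frameless-dataset" % "<version>"
// in code: import frameless.syntax._ (should provide the .typed extension used below)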
Typed Datasets
val peopleFL: TypedDataset[Person] = people.typed
val names: TypedDataset[String] = peopleFL.select(peopleFL('namee))
Typed Datasets
No column Symbol with
shapeless.tag.Tagged[String("namee")] of type A in
Person
Compile time
val peopleFL: TypedDataset[Person] = people.typed
val names: TypedDataset[String] = peopleFL.select(peopleFL('namee))
Column operations are also supported
scala> val agesDivided = peopleFL.select(peopleFL('age)/2)
agesDivided: TypedDataset[Double]
Column operations are also supported
scala> val agesDivided = peopleFL.select(peopleFL('age)/2)
agesDivided: TypedDataset[Double]
val intToString = (x: Int) => x.toString
val udf = peopleFL.makeUDF(intToString)
scala> val result = peopleFL.select(udf(peopleFL('age)))
result: TypedDataset[String]
Aggregations
case class AvgAge(name: String, age: Double)
val ageByName: TypedDataset[AvgAge] = {
peopleFL.groupBy(peopleFL('name)).agg(avg(peopleFL('age)))
}.as[AvgAge]
Custom type encoders: Injection
sealed trait Gender
case object Female extends Gender
case object Male extends Gender
case object Other extends Gender
case class PersonGender(id: Int, gender: Gender)
TypedDataset.create(peopleGender)
Custom encoders: Injection
sealed trait Gender
case object Female extends Gender
case object Male extends Gender
case object Other extends Gender
case class PersonGender(id: Int, gender: Gender)
TypedDataset.create(peopleGender)
Cannot find implicit value for value encoder
Compile time
Custom encoders: Injection
implicit val genderToInt: Injection[Gender, Int] = Injection(
{
case Female => 1; case Male => 2; case Other => 3
},{
case 1 => Female; case 2 => Male; case 3 => Other
}
)
scala> TypedDataset.create(peopleGender)
res0: TypedDataset[PersonGender] = [id: int, gender: int]
Lazy actions
val numPeopleJob: Job[Long] = people.count().withDescription("...")
val num: Long = numPeopleJob.run()
Lazy actions
val numPeopleJob: Job[Long] = people.count().withDescription("...")
val num: Long = numPeopleJob.run()
val sampleJob = for {
num <- people.count()
sample <- people.take((num/10).toInt)
} yield sample
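Nothing touches the cluster until the composed job is run (a sketch; run() needs the same context as before):

val sample = sampleJob.run() // only now are both actions executed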
How?
Encoders are typeclasses
val peopleList = List(Person(1, "Miguel", 26))
val people = spark.createDataset(peopleList)
def createDataset[T : Encoder](data: Seq[T]): Dataset[T]
Encoders are typeclasses
val peopleList = List(Person(1, "Miguel", 26))
val people = spark.createDataset(peopleList)
def createDataset[T : Encoder](data: Seq[T]): Dataset[T]
// It’s the same as
def createDataset[T](data: Seq[T])(implicit encoder: Encoder[T])
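For anyone who didn't raise a hand on typeclasses, a toy sketch of the same pattern (not Spark code; all names here are made up):

trait Show[A] { def show(a: A): String }

object Show {
  // instances live in the companion object, where implicit search finds them
  implicit val intShow: Show[Int] = new Show[Int] {
    def show(a: Int): String = s"Int($a)"
  }
}

// [A : Show] desugars to an implicit parameter, just like [T : Encoder]
def describe[A: Show](a: A): String = implicitly[Show[A]].show(a)

describe(42) // "Int(42)": the compiler supplies Show.intShow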
Encoders are typeclasses
● Instances provided by SQLImplicits class
● That’s why we need import spark.implicits._ everywhere!
implicit def newSequenceEncoder[T <: Seq[_] : TypeTag]: Encoder[T] =
ExpressionEncoder() // <- Reflection at runtime!
Reflection is not our friend
class Car(name: String)
val cars = Seq(Car("Tesla"))
val ds: Dataset[Car] = spark.createDataset(cars)
Unable to find encoder for type stored in a Dataset.
Compile time
Reflection is not our friend
class Car(name: String)
val cars = Seq(new Car("Tesla"))
val ds: Dataset[Car] = spark.createDataset(cars)
val ds2: Dataset[Seq[Car]] = spark.createDataset(Seq(cars))
Dataset[Car] fails at compile time: Unable to find encoder for type stored in a Dataset.
Dataset[Seq[Car]] fails at runtime: No encoder found for Car
How different are the Frameless encoders?
def create[A](data: Seq[A])(
implicit
encoder: TypedEncoder[A],
sqlContext: SQLContext
): TypedDataset[A]
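With TypedEncoder resolved by the compiler, the earlier Car example fails before anything runs; a sketch (exact message may differ):

class Car(name: String) // not a case class: no TypedEncoder can be derived

// Does not compile, roughly:
// could not find implicit value for parameter encoder: TypedEncoder[Car]
TypedDataset.create(Seq(new Car("Tesla Model S")))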
Recursive implicit resolution!
implicit def mapEncoder[A: NotCatalystNullable, B](
implicit
encodeA: TypedEncoder[A],
encodeB: TypedEncoder[B]
): TypedEncoder[Map[A, B]]
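The same mechanism, sketched with a toy typeclass (made-up names):

trait Enc[A]
object Enc {
  implicit val intEnc: Enc[Int] = new Enc[Int] {}
  implicit val strEnc: Enc[String] = new Enc[String] {}
  // an Enc[Map[A, B]] exists only if Enc[A] and Enc[B] exist
  implicit def mapEnc[A, B](implicit a: Enc[A], b: Enc[B]): Enc[Map[A, B]] =
    new Enc[Map[A, B]] {}
}

implicitly[Enc[Map[Int, String]]]     // compiles: built from intEnc and strEnc
// implicitly[Enc[Map[Int, Boolean]]] // would not: no Enc[Boolean] instance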
How to know if our class has a column?
// We were calling people('name)
class TypedDataset[T] {
def apply[A](column: Witness.Lt[Symbol])(
implicit
exists: TypedColumn.Exists[T, column.T, A],
encoder: TypedEncoder[A]
): TypedColumn[T, A]
}
How to know if our class has a column?
// Inside object TypedColumn
trait Exists[T, K, V]
object Exists {
  implicit def deriveRecord[T, H <: HList, K, V](
    implicit
    lgen: LabelledGeneric.Aux[T, H],
    selector: Selector.Aux[H, K, V]
  ): Exists[T, K, V] = new Exists[T, K, V] {}
}
Concepts we need to understand first
● Generic programming and HList
● Literal types
● Phantom types
● Type tagging
● Dependent types
Generic programming!
HList = HNil | ::[A, H <: HList]
Generic programming!
val genericMe = 1 :: "Miguel" :: (26: Short) :: HNil
scala> :type genericMe
::[Int, ::[String, ::[Short, HNil]]]
HList = HNil | ::[A, H <: HList]
Shapeless Generic typeclass
val genericPerson = Generic[Person]
val genericMe = 1 :: "Miguel" :: (26: Short) :: HNil
scala> val me = genericPerson.from(genericMe)
me: Person = Person(1,Miguel,26)
scala> val genericMeAgain = genericPerson.to(me)
genericMeAgain: genericPerson.Repr = 1 :: Miguel :: 26 :: HNil
Literal types
● A type for each value!
● Gives the compiler power to know about values
import shapeless.syntax.singleton._ // provides .narrow (assuming shapeless 2.x)
var three = 3.narrow
three: Int(3) = 3
Literal types
scala> three+three
res8: Int = 6
scala> three = 4
<console>:38: error: type mismatch;
found : Int(4)
required: Int(3)
trait Increasable
def inc(x: Int with Increasable) = x + 1

scala> inc(3.asInstanceOf[Int with Increasable])
res0: Int = 4

scala> inc(3)
error: type mismatch; found: Int(3); required: Int with Increasable
Phantom types and type tagging
● Phantom type: no runtime behaviour
● Type tagging: assign a phantom type to other types
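Shapeless packages the unchecked cast from the previous slide behind its tag helper (its tagged type is Int with Tagged[Increasable], written Int @@ Increasable); a minimal sketch:

import shapeless.tag
import shapeless.tag.@@

trait Increasable // phantom: no values, no runtime footprint

val three: Int @@ Increasable = tag[Increasable](3)
def inc(x: Int @@ Increasable): Int = x + 1

inc(three) // 4
// inc(3)  // does not compile: plain Int is not tagged with Increasable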
All combined with Shapeless!
"name" ->> 1
res1: Int with KeyTag[String("name"),Int] = 1
All combined with Shapeless!
"name" ->> 1
res1: Int with KeyTag[String("name"),Int] = 1
val me = ("id" ->> 1) :: ("name" ->> "Miguel") :: ("age" ->> 26) :: HNil
::Int with KeyTag[String("id"),Int],
::String with KeyTag[String("name"),String],
::Short with KeyTag[String("age"),Short],
::HNil
LabelledGeneric
val genericPerson = LabelledGeneric[Person]
Int with KeyTag[Symbol with Tagged[String("id")], Int] ::
String with KeyTag[Symbol with Tagged[String("name")], String] ::
Short with KeyTag[Symbol with Tagged[String("age")], Short] ::
HNil
Dependent types
trait Generic[A] {
type Repr
def to(value: A): Repr
}
def getRepr[A](v: A)(gen: Generic[A]): gen.Repr = gen.to(v)
// Is it not the same as this?
def getRepr[A, R](v: A)(gen: Generic2[A, R]): R = ???
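Not quite: with two parameters, every signature must name R explicitly, while the dependent version lets the compiler compute it from the instance it finds. The Aux alias (a common shapeless convention) bridges the two styles; a sketch:

object Generic {
  // conventional alias exposing the type member as a parameter
  type Aux[A, R] = Generic[A] { type Repr = R }
}

// dependent: the result type gen.Repr is computed, never named by the caller
def getRepr[A](v: A)(implicit gen: Generic[A]): gen.Repr = gen.to(v)

// parameterized: works, but R is an extra unknown to thread through
// every intermediate signature
def getRepr2[A, R](v: A)(implicit gen: Generic.Aux[A, R]): R = gen.to(v)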
Shapeless Witness
trait Witness {
type T
val value: T
}
def getField[A,K,V](value: A with KeyTag[K,V])
(implicit witness: Witness.Aux[K]) = witness.value
// Aux[K] = Witness { type T = K }
scala> getField("name" ->> 1)
res0: String("name") = name
Shapeless Witness
Witness.Aux[A] = Witness { type T = A }
scala> val witness = Witness('name)
witness: Witness.Aux[Symbol with Tagged[String("name")]]
Witness.Lt[A] = Witness { type T <: A }
// Tagged Symbol is a subtype of Symbol. So previous line is also...
witness: Witness.Lt[Symbol]
Back to Frameless
// We were calling people('name)
class TypedDataset[T] {
def apply[A](column: Witness.Lt[Symbol])(
implicit
exists: TypedColumn.Exists[T, column.T, A],
encoder: TypedEncoder[A]
): TypedColumn[T, A]
}
Back to Frameless
// Inside object TypedColumn
trait Exists[T, K, V]
object Exists {
  implicit def deriveRecord[T, H <: HList, K, V](
    implicit
    lgen: LabelledGeneric.Aux[T, H],
    selector: Selector.Aux[H, K, V]
  ): Exists[T, K, V] = new Exists[T, K, V] {}
}
To use it or not to use it
Pros:
● Type-safe with the same performance
● Injections for custom types
● Lazy jobs with descriptions
Cons:
● Slower compilation
● Not yet stable, no official Spark backward compatibility
More expressive types for Spark with Frameless
Miguel Pérez Pasalodos
@Kamugo
