
Spark schema for free with David Szakallas


DataFrames are essential for high-performance code, but sadly lag behind RDDs in development experience in Scala. When we started migrating our existing Spark application from RDDs to DataFrames at Whitepages, we had to scratch our heads real hard to come up with a good solution. DataFrames come at the cost of compile-time type safety, and support for encoding JVM types is limited.

We wanted more descriptive types without the overhead of Dataset operations. The data binding API should be extensible. Schemas for input files should be generated from classes when we do not want inference. UDFs should be more type-safe. Spark does not provide any of this natively, but with the help of shapeless and type-level programming we found a solution to nearly all of our wishes. We migrated the RDD code without changing our domain entities, writing schema descriptions by hand, or breaking binary compatibility with our existing formats. Instead we derived the schema, data binding and UDFs, and tried to sacrifice as little type safety as possible while still enjoying the performance of DataFrames.
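As a taste of what "schema for free" means, Spark can already derive a StructType from a case class through its encoder machinery. A minimal sketch (the Person class stands in for one of the domain types used later in the deck):

    import org.apache.spark.sql.Encoders

    case class Person(id: Long, name: String, gender: String, age: Int)

    // Derive the schema from the class instead of spelling it out by hand.
    val schema = Encoders.product[Person].schema
    // schema.simpleString == "struct<id:bigint,name:string,gender:string,age:int>"

The rest of the talk pushes this idea further: deriving encoders, schemas and UDF bindings for types Spark does not handle out of the box.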


  1. Dávid Szakállas, Whitepages, @szdavid92. Spark Schema for Free #schema4free
  2. [Architecture diagram: Result Subgraph, IGraph Retrieval Service, Search Service, API]
  3. Whitepages Identity Graph™: 4B+ entities, 6B+ links
  4. Our story in a nutshell: provider batches, identity graph, > 30 000 SLOC, RDDs, rich domain types, 3rd-party libraries, functional style
  5. Pain points: distributed code bugs, platform shortcomings, increasing resource usage, unreliability, incorrect results, bugs in domain logic, infrastructure problems, wasteful recomputation, non-deterministic operations, network-heavy operations, memory-heavy encoding, CPU-heavy object serialization
  6. Switching to Spark SQL: 20% speedup just by s/RDD/Dataset/g
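      The mechanical part of that change is small. A sketch with a hypothetical pipeline (spark.implicits._ supplies both the encoders and the toDS conversion):

      import spark.implicits._

      case class Person(id: Long, name: String, gender: String, age: Int)

      val peopleRdd = spark.sparkContext.parallelize(
        Seq(Person(1, "Ada", "F", 36), Person(2, "Bob", "M", 17)))

      // Before: an RDD of Java-serialized objects
      val adultsRdd = peopleRdd.filter(_.age >= 18)

      // After: the same logic on a Dataset, stored in the compact Tungsten format
      val adultsDs = peopleRdd.toDS().filter(_.age >= 18)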
  7. Additional requirements:
      • keep compatibility with the existing output schema
      • use Scala
      • retain compile-time type safety
      • reuse existing domain types (~30 core types as case classes)
      • leverage the query optimizer
      • minimize memory footprint (spare object allocations where possible)
      • reduce boilerplate
  8. Two and a half APIs: Dataset[T], DataFrame (= Dataset[Row]), and SQL. What each one lets slip to runtime: Dataset[T] only logic errors; DataFrame also a mistyped column name; SQL even a syntax error.
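      To make the trade-off concrete, here is the same predicate in all three APIs (a sketch reusing the Person placeholder; spark.implicits._ is assumed for the $ syntax):

      val people = spark.read.parquet("data").as[Person]

      // Dataset[T]: a typo in the field name would not even compile
      people.filter(_.age >= 18)

      // DataFrame: a mistyped column name compiles and fails only at runtime
      // with an AnalysisException
      people.toDF().filter($"age" >= 18)

      // SQL: even a syntax error surfaces only when the string is parsed at runtime
      people.createOrReplaceTempView("people")
      spark.sql("SELECT name FROM people WHERE age >= 18")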
  9. Costs of Dataset[T] operations (S: serialization, D: deserialization, O: optimization barrier, U: untyped column referencing, C: checked cast):
      • filter(T => Boolean): D, O
      • map(T => U): S, D, O
      • mapPartitions(T => U): S, D, O
      • flatMap: S, D, O
      • groupByKey: S, D, O
      • reduce: S, D, O
      • joinWith: U
      • dropDuplicates(c: Column): U
      • orderBy, ...: U
      • as[T]: C
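      The cost letters become visible in the physical plan. Reusing the people dataset from the previous sketch: a lambda predicate forces deserialization and is opaque to Catalyst, while the equivalent Column expression can be pushed into the scan:

      // D + O: rows are decoded into Person objects, and no filter pushdown happens
      people.filter(_.age >= 18)

      // Stays in the Tungsten representation and shows up under PushedFilters
      people.filter($"age" >= 18)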
  10. Three problem areas: (1) operations with performance problems; (2) operations with type safety problems; (3) creating a dataset of an arbitrary type.
  11. Encoders convert between the JVM and Tungsten representations. Out of the box we get: integral, floating, datetime, string, product, option, seq.
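      These families compose automatically, so nested combinations need no extra work (a small sketch, assuming spark.implicits._ is in scope):

      spark.createDataset(Seq(1, 2, 3))                      // integral
      spark.createDataset(Seq("a" -> Some(1), "b" -> None))  // product + option
      spark.createDataset(Seq((1L, "x", Seq(1.0, 2.0))))     // product + seq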
  12. spark.emptyDataset[java.util.UUID]

      COMPILE TIME ERROR
      error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc)
      and Product types (case classes) are supported by importing spark.implicits._ Support for
      serializing other types will be added in future releases.
  13. A hand-written ExpressionEncoder for UUID:

      val jvmRepr = ObjectType(classOf[UUID])

      val serializer = CreateNamedStruct(Seq(
        Literal("msb"),
        Invoke(BoundReference(0, jvmRepr, false), "getMostSignificantBits", LongType),
        Literal("lsb"),
        Invoke(BoundReference(0, jvmRepr, false), "getLeastSignificantBits", LongType)
      )).flatten

      val deserializer = NewInstance(classOf[UUID],
        Seq(GetColumnByOrdinal(0, LongType), GetColumnByOrdinal(1, LongType)),
        ObjectType(classOf[UUID]))

      implicit val uuidEncoder = new ExpressionEncoder[UUID](
        schema = StructType(Array(
          StructField("msb", LongType, nullable = false),
          StructField("lsb", LongType, nullable = false)
        )),
        flat = false,
        serializer = serializer,
        deserializer = deserializer,
        clsTag = classTag[UUID])
  14. spark.emptyDataset[UUID] now works, but spark.emptyDataset[(UUID, UUID)] still fails:

      RUNTIME EXCEPTION
      Message: No Encoder found for java.util.UUID
      - field (class: "java.util.UUID", name: "_1")
      - root class: "scala.Tuple2"
      StackTrace: - field (class: "java.util.UUID", name: "_1")
      - root class: "scala.Tuple2"
      at org.apache.spark.sql.catalyst.ScalaReflection (...)
  15. The built-in encoder families (integral, floating, datetime, string, product, option, seq) and the UUID we would like to add. The product encoder:

      /**
       * An encoder for Scala's product type (tuples, case classes, etc).
       * @since 2.0.0
       */
      def product[T <: Product : TypeTag]: Encoder[T] = ExpressionEncoder()
  16. ExpressionEncoder() builds the serializer and deserializer through runtime reflection:

      object ExpressionEncoder {
        def apply[T : TypeTag](): ExpressionEncoder[T] = {
          // ...
          val serializer = ScalaReflection.serializerFor[T](nullSafeInput)
          val deserializer = ScalaReflection.deserializerFor[T]
  17. ...and ScalaReflection hard-codes the supported types in a pattern match:

      tpe.dealias match {
        case t if !dataTypeFor(t).isInstanceOf[ObjectType] => getPath

        case t if t <:< localTypeOf[Option[_]] =>
          val TypeRef(_, _, Seq(optType)) = t
          val className = getClassNameFromType(optType)
          val newTypePath = s"""- option value class: "$className"""" +: walkedTypePath
          WrapOption(deserializerFor(optType, path, newTypePath), dataTypeFor(optType))

        case t if t <:< localTypeOf[java.lang.Integer] =>
          val boxedType = classOf[java.lang.Integer]
          val objectType = ObjectType(boxedType)
          StaticInvoke(boxedType, objectType, "valueOf", getPath :: Nil, returnNullable = false)

        case t if t <:< localTypeOf[java.lang.Long] =>
          val boxedType = classOf[java.lang.Long]
          val objectType = ObjectType(boxedType)
          StaticInvoke(boxedType, objectType, "valueOf", getPath :: Nil, returnNullable = false)

        case t if t <:< localTypeOf[java.lang.Double] =>
          val boxedType = classOf[java.lang.Double]
          val objectType = ObjectType(boxedType)
          StaticInvoke(boxedType, objectType, "valueOf", getPath :: Nil, returnNullable = false)
  18. Encoders should be additive:
      • rewrite the encoders from scratch!
      • doesn't require changing Spark internals
      Do it in a robust, type-safe way:
      • generic programming with shapeless
  19. The encoder typeclass and its instances:

      trait TypedEncoder[T]

      object TypedEncoder {
        // derive Spark Encoder
        implicit def apply[T: TypedEncoder]: ExpressionEncoder[T] = ???
      }

      // Primitive types
      implicit val intEncoder: TypedEncoder[Int] = ???
      implicit val longEncoder: TypedEncoder[Long] = ???
      implicit val uuidEncoder: TypedEncoder[UUID] = ???

      // Recursive types
      implicit def optionTypedEncoder[T](implicit enc: TypedEncoder[T]): TypedEncoder[Option[T]] = ???

      implicit def seqTypedEncoder[C[X] <: Seq[X], T](implicit
        enc: TypedEncoder[T],
        classTag: ClassTag[C[T]]
      ): TypedEncoder[C[T]] = ???

      implicit def productEncoder[F, G <: HList, H <: HList](implicit
        i0: LabelledGeneric.Aux[F, G],
        i1: DropUnitValues.Aux[G, H],
        i2: IsHCons[H],
        i3: Lazy[RecordEncoderFields[H]],
        i4: Lazy[NewInstanceExprs[G]],
        i5: ClassTag[F]
      ): TypedEncoder[F] = ???
  20. Implicit resolution for TypedEncoder[(UUID, UUID)]: the compiler picks productEnc[Repr <: HList], which in turn needs a TypedEncoder[UUID] for each field, satisfied by uuidEnc (intEnc and the other instances stay unused). DONE
  21. What the typed layer buys us:
      • typesafe column referencing
      • enhanced type signatures for built-in functions
      • customizable, type-safe encoders
        – not fully compatible semantics
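      frameless ships exactly this kind of typeclass and keeps it open for extension. One lightweight way to plug in a type it does not know about is an Injection into a type it does know; this is our sketch, not the talk's code, and it assumes frameless's Injection helper and its injection-based TypedEncoder derivation:

      import java.util.UUID
      import frameless.{Injection, TypedEncoder}

      // Round-trip UUIDs through their canonical string form.
      implicit val uuidInjection: Injection[UUID, String] =
        Injection(_.toString, UUID.fromString)

      // With the injection in scope, TypedEncoder[UUID] (and products
      // containing UUIDs) can be derived.
      val uuidEnc = implicitly[TypedEncoder[UUID]]

      The string representation trades space for inspectability, which is exactly the trade-off the next slide quantifies.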
  22. Multiple encoders for UUID:
      • struct<msb: bigint, lsb: bigint>, e.g. [-7434309516683489675, -6054479752120675381]: 2.25x smaller in RAM (two 64-bit longs take 16 bytes vs 36 bytes for the canonical string), but not for humans (or SQL)
      • string, e.g. "98d406e2-1bc6-4e75-abfa-2abe530d53cb": large, but easy to inspect and needed for compatibility
  23. Strict parsing:
      • Spark infers the input schema by default
      • specifying the schema gives validation and spares the inference cost
      • the encoder schema can be used
      • parsing extensions

      import org.apache.spark.sql.types._

      val schema = StructType(
        StructField("id", LongType) ::
        StructField("name", StringType) ::
        StructField("gender", StringType) ::
        StructField("age", IntegerType) :: Nil
      )

      val persons = spark
        .read
        .schema(schema)
        .csv("data.csv")
        .as[Person]
  24. Strict parsing, continued: derive the schema from the encoder instead of writing it by hand:

      import org.apache.spark.sql.Encoder
      import org.apache.spark.sql.types._

      def encoder[T: Encoder] = implicitly[Encoder[T]]

      val persons = spark.read
        .schema(encoder[Person].schema)
        .csv("data.csv")
        .as[Person]
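      For truly strict parsing, the CSV reader's mode option can additionally be set to fail on malformed rows instead of silently nulling them out (a sketch on top of the encoder-derived schema above):

      val persons = spark.read
        .schema(encoder[Person].schema)
        .option("mode", "FAILFAST")   // throw on malformed rows
        .csv("data.csv")
        .as[Person]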
  25. Parsing extensions (GitHub: BenFradet/struct-type-encoder):
      • mostly for flat CSV files
      • nest repeated columns into sequences
      • group columns into objects

      Input columns: id | name | gender | age | t1 | t2 | t3 | t4

      import ste._

      case class ProviderFile(
        @Flatten person: Person,
        @Flatten(times = 4) telephones: Seq[String]
      )
  26. Recap: (1) operations with performance problems; (2) operations with type safety problems; (3) creating a dataset of an arbitrary type: custom encoder framework.
  27. A typed dataset flowing through untyped operations, cast back to a Dataset at the end:

      trait MatchingGrouperDS {
        def partitionByMatchingPairs[T: TypedEncoder](
            ds: Dataset[(Long, T)],
            pairs: Dataset[(Long, Long)])
            (implicit spark: SparkSession): (Dataset[(Long, T)], Dataset[(Long, T)]) = {
          val tcs = MatchPartitioner.createPartitionsDS(pairs).toDF("ids")
            .withColumn("groupId", monotonically_increasing_id())
            .parquetCheckpoint("matchingPairsWithGroupId")
            .select($"groupId", explode($"ids").as("id"))
          ds.toDF("id", "matches").join(tcs, Seq("id"), "leftouter")
            .selectTuple($"groupId", $"matches")
            .as[(Long, T)]
            .partition($"_1".isNotNull)
        }
      }
  28. Recap of the typed layer's benefits:
      • typesafe column referencing
      • enhanced type signatures for built-in functions
      • customizable, type-safe encoders
        – not fully compatible semantics
      Covered operations: filter, select, join, groupBy, orderBy, withColumn, ...
  29. case class Person(id: Long, name: String, gender: String, age: Int)

      spark.read.parquet("data").as[Person]
        .select($"name", $"age")
        .filter($"age" >= 18)
        .filter($"agd" <= 49)

      RUNTIME ERROR
      Name: org.apache.spark.sql.AnalysisException
      Message: cannot resolve '`agd`' given input columns: (...)

      case class NameAge(name: String, age: Int)

      val tds = TypedDataset.create(spark.read.parquet("data").as[Person])
      val pTds = tds.project[NameAge]
      val fTds = pTds.filter(pTds('age) >= 18)
        .filter(pTds('agd) <= 49)

      COMPILE TIME ERROR
      error: No column Symbol with shapeless.tag.Tagged[String("agd")] of type A in NameAge
        .filter(pTds('agd) <= 49)
  30. Both the plain Dataset version and the TypedDataset version yield the same physical plan:

      val tds = TypedDataset.create(
        spark.read.parquet("data").as[Person])
      val pTds = tds.project[NameAge]
      val fTds = pTds.filter(pTds('age) >= 18)
        .filter(pTds('age) <= 49)

      spark.read.parquet("data").as[Person]
        .select($"name", $"age")
        .filter($"age" >= 18)
        .filter($"age" <= 49)

      == Physical Plan ==
      *(1) Project
      +- *(1) Filter
         +- *(1) FileScan parquet
            PushedFilters: [IsNotNull(age), GreaterThanOrEqual(age,18), LessThanOrEqual(age,49)],
            ReadSchema: struct<name:string,age:int>
  31. Recap: (1) operations with performance problems; (2) operations with type safety problems: frameless.TypedDataset; (3) creating a dataset of an arbitrary type: custom encoder framework.
  32. λs to Catalyst exprs: compile simple closures to Catalyst expressions [SPARK-14083]

      ds.map(_.name)          // ~ df.select($"name")
      ds.groupByKey(_.gender) // ~ df.groupBy($"gender")
      ds.filter(_.age > 18)   // ~ df.where($"age" > 18)
      ds.map(_.age + 1)       // ~ df.select($"age" + 1)
  33. λs to Catalyst exprs: what we get today vs what [SPARK-14083] would give (not implemented):

      CURRENTLY:

      spark.read.parquet("data").as[Person]
        .map(x => NameAge(x.name, x.age))
        .filter(_.age >= 18)
        .filter(_.age <= 49)

      == Physical Plan ==
      *(1) SerializeFromObject
      +- *(1) Filter
         +- *(1) MapElements
            +- *(1) DeserializeToObject
               +- *(1) FileScan parquet
                  PushedFilters: [],
                  ReadSchema: struct<id:bigint,name:string,gender:string,age:int>

      [SPARK-14083] NOT IMPLEMENTED:

      spark.read.parquet("data")
        .select($"name", $"age")
        .filter($"age" >= 18)
        .filter($"age" <= 49)

      == Physical Plan ==
      *(1) Project
      +- *(1) Filter
         +- *(1) FileScan parquet
            PushedFilters: [IsNotNull(age), GreaterThanOrEqual(age,18), LessThanOrEqual(age,49)],
            ReadSchema: struct<name:string,age:int>
  34. UDF + tuples:

      import org.apache.spark.sql.functions._

      val processPerson = udf((p: Person) => p.name)

      providerFile
        .withColumn("results", processPerson($"person"))
        .collect()

      RUNTIME ERROR [SPARK-12823]
      Caused by: java.lang.ClassCastException:
        org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Person
        at (...)
      ... 16 more
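      Until [SPARK-12823] is resolved, a common workaround (not from the talk) is to hand the UDF plain columns instead of the whole struct, at the cost of losing the typed Person argument; the field names below are assumed from the ProviderFile layout:

      import org.apache.spark.sql.functions._

      // Take the needed fields as scalar columns instead of a struct.
      val processName = udf((name: String, age: Int) => s"$name ($age)")

      providerFile
        .withColumn("results", processName($"person.name", $"person.age"))
        .collect()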
  35. UDF + tuples with typedudf (GitHub: lesbroot/typedudf):

      import typedudf.TypedUdf
      import typedudf.ParamEncoder._

      val processPerson = TypedUdf((p: Person) => p.name)

      spark.read.parquet("provider").as[ProviderFile]
        .withColumn("results", processPerson($"person"))
        .collect()
  36. Recap: (1) operations with performance problems: only use them if no alternative exists; (2) operations with type safety problems: frameless.TypedDataset; (3) encoders are not extendable: custom encoder framework.
  37. Takeaways:
      • RDD -> Spark SQL:
        – offering the same level of type safety is hard
          • you need to dig into Catalyst
          • compile-time overhead
        – Spark 2.4 will support Scala 2.12
      • check out frameless, etc.
  38. Questions? #schema4free @szdavid92. Thank you for listening! https://www.whitepages.com/
