Dávid Szakállas, Whitepages
@szdavid92
Spark Schema for Free
#schema4free
2
[Service diagram: API, Search Service, IGraph Retrieval Service, Result Subgraph]
3
Whitepages Identity Graph™: 4B+ entities, 6B+ links
Our story in a nutshell
4
[Diagram: PROVIDER BATCHES feeding the IDENTITY GRAPH; the pipeline: > 30 000 SLOC, RDD-based, rich domain types, 3rd-party libraries]
5
[Problem map: bad quality distributed code and platform shortcomings causing performance degradation, unreliability, and incorrect results; contributing factors: bugs in domain logic, infrastructure problems, wasteful recomputation, non-deterministic operations, network-heavy operations, memory-heavy encoding, CPU-heavy object serialization]
6
[The same problem map with the remedies overlaid: moving to Databricks, switching to Spark SQL, and fixing logic and perf bugs address the distributed code bugs, platform shortcomings, bugs in domain logic, infrastructure problems, wasteful recomputation, non-deterministic operations, network-heavy operations, memory-heavy encoding, and CPU-heavy object serialization]
Requirements
• Use Scala
• Keep compatibility with
existing output format
• Retain compile-time type
safety
• Reuse existing domain types
– ~ 30 core types in case
classes
• Leverage the query optimizer
• Minimize memory footprint
– spare object allocations where
possible
• Reduce boilerplate
7
8
Two and a half APIs
Dataset[T] Dataset[Row] SQL
When is a mistake caught? With Dataset[T], a bug in the logic, a mistyped column name, and a syntax error all surface at compile time. With Dataset[Row], Scala-level errors are caught at compile time, but a mistyped column name only fails at runtime. With SQL strings, even a syntax error only surfaces at runtime.
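To make the comparison concrete, here is the same filter in all three styles (a sketch; assumes spark.implicits._ is imported and the Person case class from slide 21):

val people = spark.read.parquet("data").as[Person]
people.filter(_.age >= 18)          // Dataset[T]: predicate type-checked at compile time
people.filter($"age" >= 18)         // untyped column: a mistyped name only fails at runtime
people.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age >= 18")  // SQL string: even syntax errors surface at runtime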
Costs of Dataset[T] operations. S: serialization cost, D: deserialization cost, O: optimization barrier, U: untyped column referencing
9
filter(T => Boolean): D, O
map(T => U): S, D, O
mapPartitions(T => U): S, D, O
flatMap: S, D, O
groupByKey: S, D, O
reduce: S, D, O
joinWith: U
dropDuplicates: U
orderBy, ...: U
1. operations with performance problems
2. operations with type safety problems
3. encoders are not extendable
10
11
spark.emptyDataset[java.util.UUID]
error: Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case
classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future
releases. spark.emptyDataset[java.util.UUID]
COMPILE TIME ERROR
12
// Hand-rolled Catalyst encoder for java.util.UUID (Spark 2.x ExpressionEncoder API)
val jvmRepr = ObjectType(classOf[UUID])
// Serializer: UUID object -> struct<msb: bigint, lsb: bigint>
val serializer = CreateNamedStruct(Seq(
  Literal("msb"),
  Invoke(BoundReference(0, jvmRepr, nullable = false), "getMostSignificantBits", LongType),
  Literal("lsb"),
  Invoke(BoundReference(0, jvmRepr, nullable = false), "getLeastSignificantBits", LongType)
)).flatten
// Deserializer: the two long columns -> new UUID(msb, lsb)
val deserializer = NewInstance(classOf[UUID],
  Seq(GetColumnByOrdinal(0, LongType), GetColumnByOrdinal(1, LongType)),
  ObjectType(classOf[UUID]))
implicit val uuidEncoder = new ExpressionEncoder[UUID](
  schema = StructType(Array(
    StructField("msb", LongType, nullable = false),
    StructField("lsb", LongType, nullable = false)
  )),
  flat = false,
  serializer = serializer,
  deserializer = deserializer,
  clsTag = classTag[UUID])
13
spark.emptyDataset[java.util.UUID] // works with the custom encoder in scope
spark.emptyDataset[(UUID, UUID)] // still fails:
Message: No Encoder found for java.util.UUID
- field (class: "java.util.UUID", name: "_1")
- root class: "scala.Tuple2"
StackTrace: - field (class: "java.util.UUID", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection (...)
RUNTIME EXCEPTION!
Creating extendable encoders
• Types are trees of products, sequences, maps
with primitive serializable types as leaves
• Problem similar to JSON serialization
• Idea: use generic programming
– Generate schema, serializers and deserializers
• Type-level programming with shapeless
14
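The generic-programming idea rests on shapeless's Generic, which converts a case class to and from an HList of its fields; an encoder can then be assembled by recursing over that HList. A minimal illustration (the class name is ours, the API is shapeless's own):

import shapeless._

case class PersonId(id: Long, name: String)

val gen = Generic[PersonId]
val repr: Long :: String :: HNil = gen.to(PersonId(1L, "Ada"))  // 1L :: "Ada" :: HNil
val back: PersonId = gen.from(repr)                             // rebuilds PersonId(1, Ada)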
15
trait ComposableEncoder[T] {
// ???
}
object ComposableEncoder {
// derive Spark Encoder
implicit def getEncoder[T: ComposableEncoder]: Encoder[T] = ???
}
implicit val intEncoder: ComposableEncoder[Int] = ???
implicit val longEncoder: ComposableEncoder[Long] = ???
implicit val uuidEncoder: ComposableEncoder[UUID] = ???
// ...
// other primitive types
// compound types
implicit def productEncoder[G, Repr <: HList]: ComposableEncoder[G] = ???
implicit def arrayEncoder[T: ClassTag]: ComposableEncoder[Array[T]] = ???
// Option, Either, etc.
16
Implicit resolution for ComposableEncoder[(UUID, UUID)]: the tuple is a product, so productEnc[Repr <: HList] kicks in and demands an encoder for each field; both fields resolve to uuidEnc (a primitive field such as Int would resolve to intEnc). DONE
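Assuming the sketched API above (getEncoder and ComposableEncoder are the hypothetical names from slide 15), the derivation is then available wherever Spark expects an Encoder:

import ComposableEncoder.getEncoder  // makes Encoder[T] derivable for supported T
spark.emptyDataset[(UUID, UUID)]     // compiles, and no runtime reflection failure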
17
• Typesafe columns
referencing
• Enhanced type signature
for built-in functions
• Customizable, type safe
encoders
– Not fully compatible semantics (a caveat)
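For custom leaf types, frameless offers Injection: map the type to one it already encodes, and a TypedEncoder follows. A sketch for UUID via its string form (Injection and TypedEncoder are frameless APIs; the string mapping is our choice here):

import java.util.UUID
import frameless._

implicit val uuidInjection: Injection[UUID, String] =
  Injection(_.toString, UUID.fromString)
// With this in scope, TypedEncoder[UUID] is derivable, so UUID fields
// can appear in TypedDatasets.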
Multiple encoders
18
UUID can be encoded either way:
struct<msb: bigint, lsb: bigint>: 2.25x smaller; not for humans (or SQL)
[-7434309516683489675, -6054479752120675381]
string: large; easy to inspect; needed for compatibility
"98d406e2-1bc6-4e75-abfa-2abe530d53cb"
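The size claim checks out (a quick sketch, using the slide's own values): two longs occupy 16 bytes, while the canonical string form is 36 characters, and 36 / 16 = 2.25.

val u = java.util.UUID.fromString("98d406e2-1bc6-4e75-abfa-2abe530d53cb")
val msb = u.getMostSignificantBits   // -7434309516683489675
val lsb = u.getLeastSignificantBits  // -6054479752120675381
assert(u.toString.length == 36)      // vs. 2 * 8 bytes for the struct form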
1. operations with performance problems
2. operations with type safety problems
3. encoders are not extendable
19
20
• Typesafe columns
referencing
• Enhanced type signature
for built-in functions
• Customizable, type safe
encoders
– Not fully compatible semantics (a caveat)
21
case class Person(id: Long, name: String, gender: String, age: Int)
spark.read.parquet("data").as[Person]
.select($"name", $"age")
.filter($"age" >= 18)
.filter($"ag" <= 49)
Name: org.apache.spark.sql.AnalysisException
Message: cannot resolve '`ag`' given input
columns: (...)
RUNTIME ERROR
case class NameAge(name: String, age: Int)
val tds = TypedDataset.create(spark.read.parquet("data").as[Person])
val pTds = tds.project[NameAge]
val fTds = pTds.filter(pTds('age) >= 18)
.filter(pTds('ag) <= 49) COMPILE TIME ERROR
error: No column Symbol with shapeless.tag.Tagged[String("ag")]
of type A in NameAge
.filter(pTds('ag) <= 49)
22
val tds = TypedDataset.create(
spark.read.parquet("data").as[Person])
val pTds = tds.project[NameAge]
val fTds = pTds.filter(pTds('age) >= 18)
.filter(pTds('age) <= 49)
spark.read.parquet("data").as[Person]
.select($"name", $"age")
.filter($"age" >= 18)
.filter($"age" <= 49)
== Physical Plan ==
*(1) Project
+- *(1) Filter
+- *(1) FileScan parquet PushedFilters: [IsNotNull(age),
GreaterThanOrEqual(age,18), LessThanOrEqual(age,49)], ReadSchema:
struct<name:string,age:int>
23
filter
select
project
join
groupBy
orderBy
withColumn ...
1. operations with performance problems
2. operations with type safety problems
3. encoders are not extendable
24
λs to Catalyst exprs
Compile simple closures to Catalyst expressions
[SPARK-14083]
25
ds.map(_.name) // ~ df.select($"name")
ds.groupByKey(_.gender) // ~ df.groupBy($"gender")
ds.filter(_.age > 18) // ~ df.where($"age" > 18)
ds.map(_.age + 1) // ~ df.select($"age" + 1)
λs to Catalyst exprs
26
spark.read.parquet("data").as[Person]
.map(x => NameAge(x.name, x.age))
.filter(_.age >= 18)
.filter(_.age <= 49)
== Physical Plan ==
*(1) SerializeFromObject
+- *(1) Filter
+- *(1) MapElements
+- *(1) DeserializeToObject
+- *(1) FileScan parquet PushedFilters: [], ReadSchema:
struct<id:bigint,name:string,gender:string,age:int>
CURRENTLY
λs to Catalyst exprs
27
spark.read.parquet("data").as[Person]
.map(x => NameAge(x.name, x.age))
.filter(_.age >= 18)
.filter(_.age <= 49)
== Physical Plan ==
*(1) SerializeFromObject
+- *(1) Filter
+- *(1) MapElements
+- *(1) DeserializeToObject
+- *(1) FileScan parquet PushedFilters: [], ReadSchema:
struct<id:bigint,name:string,gender:string,age:int>
== Physical Plan ==
*(1) Project
+- *(1) Filter
+- *(1) FileScan parquet PushedFilters: [IsNotNull(age),
GreaterThanOrEqual(age,18), LessThanOrEqual(age,49)], ReadSchema:
struct<name:string,age:int>
spark.read.parquet("data")
.select($"name", $"age")
.filter($"age" >= 18)
.filter($"age" <= 49)
CURRENTLY the Dataset version produces the first plan; with [SPARK-14083] it could match the DataFrame plan
Inlining deserialization into λs
28
J. Wróblewski, K. Ishizaki, H. Inoue
and M. Ohara, "Accelerating Spark
Datasets by Inlining Deserialization,"
Spark Summit SF17 - Demystifying
DataFrame and Dataset
== Physical Plan ==
*(1) SerializeFromObject
+- *(1) Filter
+- *(1) DeserializeToObject
+- *(1) FileScan
ds.filter((p: Person) => p.age == 20)
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- gender: string (nullable = true)
|-- age: integer (nullable = true)
(age: Int) => age == 20
Analyzed and rewritten on bytecode level
29
A comparison of the two approaches (neither is implemented in Spark yet):
param inlining: + type safe interface; + no refactor needed; + handles more complex closures; o only elides deserialization cost; - no optimization of the closure body
λ → expr: + type safe interface; + Catalyst analysis & optimization; + no refactor needed; - only simple closures
Requirements
• Use Scala
• Keep compatibility with
existing output format
• Retain compile-time type
safety
• Reuse existing domain types
– ~ 30 core types in case
classes
• Leverage the query optimizer
• Minimize memory footprint
– spare object allocations where
possible
• Reduce boilerplate
30
Miscellaneous features
31
Strict parsing
• Spark infers the input schema by default
• Specify the schema instead
– validation
– spares the inference cost
• The schema can be generated
• Parsing extensions
32
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", LongType) ::
StructField("name", StringType) ::
StructField("gender", StringType) ::
StructField("age", IntegerType) ::
Nil
)
val persons = spark
.read
.schema(schema)
.csv("data.csv")
.as[Person]
// Alternatively, generate the schema from a frameless TypedEncoder:
import frameless._
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.types._
implicit def encoder[T: TypedEncoder] =
TypedExpressionEncoder[T]
val persons = spark.read
.schema(encoder[Person].schema)
.csv("data.csv")
.as[Person]
UDF + tuples = trouble
33
import org.apache.spark.sql.functions._
val processPerson = udf((p: Person) => p.name)
providerFile
.withColumn("results", processPerson($"person"))
.collect()
Caused by: java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be
cast to Person at (...)
... 16 more
RUNTIME ERROR
[SPARK-12823]
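Until [SPARK-12823] lands, one common workaround is to pass the struct's fields individually instead of the whole case class, so no Row-to-case-class cast is attempted (a sketch against the example above):

import org.apache.spark.sql.functions._
val processName = udf((name: String) => name.toUpperCase)
providerFile
  .withColumn("results", processName($"person.name"))  // select the field, not the struct
  .collect()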
UDF + tuples = solved
34
import typedudf.TypedUdf
import typedudf.ParamEncoder._
val processPerson = TypedUdf((p : Person) => p.name)
spark.read.parquet("provider").as[ProviderFile]
.withColumn("results", processPerson($"person"))
.collect()
GitHub: lesbroot/typedudf
Takeaways
• RDD -> Spark SQL: hard work
– (with our requirements)
• needed to dig into Catalyst
• compile-time overhead
– improves with Scala 2.12
• check out frameless, etc.
35
Questions
#schema4free
@szdavid92
36
Thank you for listening!
https://www.whitepages.com/