Apache Spark in your likeness - low and high level customization
1. Apache Spark in your likeness
User-Defined features and session extensions
Bartosz Konieczny
@waitingforcode
2. First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com
#@waitingforcode
#github.com/bartosz25
3. Apache Spark Community Feedback initiative
https://www.waitingforcode.com/static/spark-feedback
Why?
● a single place with all best practices
● community-driven
● open
● interactive
How?
● fill the form (https://forms.gle/sjSWPKmudhM6a3776)
● validate
● share
● learn
7. High level customization
● User-Defined Type (UDT)
● User-Defined Function (UDF)
● User-Defined Aggregate Functions (UDAF)
⇒ RDBMS in Apache Spark
⇒ no need to dive into the internals
8. User-Defined Type
● public API prior to 2.0 only - ongoing effort to bring it back for 3.0
● UDTRegistration
● Dataset substitution - your class in DataFrame
● examples: VectorUDT, MatrixUDT
def sqlType: DataType
def pyUDT: String = null
def serializedPyClass: String = null
def serialize(obj: UserType): Any
def deserialize(datum: Any): UserType
9. User-Defined Type - example
@SQLUserDefinedType(udt = classOf[CityUDT])
case class City(name: String, country: Countries) {
  def isFrench: Boolean = country == Countries.France
}

class CityUDT extends UserDefinedType[City] {
  override def sqlType: DataType = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)
  ))
  // ...
}

val cities = Seq(City("Paris", Countries.France), City("London", Countries.England)).toDF("city")
11. User-Defined Type - expression retrieval
cities.where("city.name == 'Paris'").show()
org.apache.spark.sql.AnalysisException: Can't extract value from city#3: need struct type
but got struct<name:string,region:string>; line 1 pos 0
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
12. User-Defined Function
● SQL's: CREATE FUNCTION
● blackbox
● vectorized UDFs for PySpark - ML purpose
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputTypes: Option[Seq[DataType]]) {
  def nullable: Boolean = _nullable
  def deterministic: Boolean = _deterministic
  def apply(exprs: Column*): Column = { … }
}
14. User-Defined Function
In the query
sparkSession.udf.register("EvenFlagResolver_registerTest", evenFlagResolver _)
val rows = letterNumbers.selectExpr("letter",
  "EvenFlagResolver_registerTest(number) as isEven")
Programmatically
val udfEvenResolver = udf(evenFlagResolver _)
val rows = letterNumbers.select($"letter",
  udfEvenResolver($"number") as "isEven")
15. User-Defined Function - StackOverflow overused
● if-else == CASE WHEN
● tokenize == LOWER(REGEXP_REPLACE(...))
● IN clause
● columns equality == abs(col1 - col2) < allowed_precision
● wrapping DataFrame execution == JOIN
val jobnameDF = jobnameSeq.toDF("jobid", "jobname")
sqlContext.udf.register("getJobname", (id: String) => (
  jobnameDF.filter($"jobid" === id).select($"jobname")
))
● testing ML model == MLlib
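A minimal sketch (not from the original slides) of the native alternatives to some of the UDFs above, assuming the letterNumbers DataFrame used later in the talk and a hypothetical measures DataFrame with col1/col2 columns:
import org.apache.spark.sql.functions.{abs, when}
// CASE WHEN instead of an if-else UDF
val flagged = letterNumbers.withColumn("isEven", when($"number" % 2 === 0, true).otherwise(false))
// IN clause instead of a membership-testing UDF
val filtered = letterNumbers.filter($"number".isin(1, 2, 3))
// approximate column equality without a UDF
val allowedPrecision = 0.001
val close = measures.filter(abs($"col1" - $"col2") < allowedPrecision)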
23. Parser - from text to AST example
SELECT id, login FROM users WHERE id > 1 AND active = true
"SELECT", "i", "d", ",", "l", "o", "g", "i", "n", "WHERE", "i", "d", ">", "1",
"AND", "a", "c", "t", "i", "v", "e", "=", "t", "r", "u", "e"
(whitespaces omitted for readability)
24. Resolution rules
● handles unresolved, i.e. unknown becomes known
● Example:
SELECT * FROM dataset_1
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dataset_1`
== Analyzed Logical Plan ==
letter: string, nr: int, a_flag: int
Project [letter#7, nr#8, a_flag#9]
+- SubqueryAlias `dataset_1`
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
+- LocalRelation [_1#3, _2#4, _3#5]
25. Post hoc resolution rules
● after resolution rules (post hoc)
● same as custom optimization rules
● order does matter
PreprocessTableCreation → PreprocessTableInsertion → DataSourceAnalysis
Examples:
● normalization - casting, renaming
● partitioning checks, e.g. "$partKey is not a partition column"
● generic LogicalPlan resolution, e.g.
CreateTable (with query) ⇒ CreateDataSourceTableAsSelectCommand
CreateTable (without query) ⇒ CreateDataSourceTableCommand
26. Check analysis rules
● plain assertions
Connection conn = getConnection();
assert conn != null : "Connection is null";
● clearer error messages:
"assertion failed: No plan for CreateTable CatalogTable" ⇒ ""Hive
support is required to use CREATE Hive TABLE AS SELECT"
● API:
class MyAnalysisRule extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    // throw new AnalysisException("Analysis error message")
  }
}
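A hedged sketch of how such a check rule could be plugged into the session through SparkSessionExtensions#injectCheckRule (the rule and its threshold are illustrative, not from the talk):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
// hypothetical check: refuse plans with a suspiciously high number of nodes
object PlanSizeCheck extends (LogicalPlan => Unit) {
  override def apply(plan: LogicalPlan): Unit = {
    val nodeCount = plan.collect { case node => node }.size
    // Spark's own checks call failAnalysis, which throws an AnalysisException
    if (nodeCount > 100) throw new IllegalStateException(s"Plan with $nodeCount nodes is too complex")
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectCheckRule(_ => PlanSizeCheck))
  .getOrCreate()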
27. Check analysis rules - PreWriteCheck
object PreWriteCheck extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    plan.foreach {
      case InsertIntoTable(l @ LogicalRelation(relation, _, _, _), partition, query, _, _) =>
        val srcRelations = query.collect { case LogicalRelation(src, _, _, _) => src }
        if (srcRelations.contains(relation)) {
          failAnalysis("Cannot insert into table that is also being read from.")
        } else {
          // ...
28. Logical optimization rule
● simplification:
(id > 0 OR login == 'test') AND id > 0 == id > 0
● collapse:
.repartition(10).repartition(20) == .repartition(20)
● dataset reduction:
columns pruning, predicate pushdown
● human mistakes:
trivial filters (2 > 1), execution tree cleaning (identity functions), redundancy (projection, aliases)
29. Logical optimization rule - diff transform vs resolve
General template:
def apply(plan: LogicalPlan): LogicalPlan = plan.{{TRANSFORMATION}} {
  case agg: Aggregate => …
  case projection: Project => ...
}
{{TRANSFORMATION}} = transformUp/transformDown, resolveOperatorsUp/resolveOperatorsDown
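To make the template concrete, a minimal optimizer rule sketch (the rule itself is hypothetical, not from the talk) registered through SparkSessionExtensions#injectOptimizerRule:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
// removes "trivial filters" whose condition is the literal true
object RemoveTrivialFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    case Filter(Literal(true, _), child) => child
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(_ => RemoveTrivialFilters))
  .getOrCreate()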
32. Catalog listeners
* Holder for injection points to the [[SparkSession]]. We make NO guarantee about the stability
* regarding binary compatibility and source compatibility of methods here.
* This current provides the following extension points:
* <ul>
...
* <li>(External) Catalog listeners.</li>
...
class SparkSessionExtensions {
33. Catalog listeners
● ExternalCatalogWithListener
val catalogEvents = new scala.collection.mutable.ListBuffer[ExternalCatalogEvent]()
TestedSparkSession.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = {
    event match {
      case externalCatalogEvent: ExternalCatalogEvent => catalogEvents.append(externalCatalogEvent)
      case _ => {}
    }
  }
})
// ExternalCatalogEvent = (CreateTablePreEvent, CreateTableEvent, AlterTableEvent, ...)
34. Lessons learned
● Apache Spark first ⇒ do not write UDF just to write one, prefer native API
● debug & log
● analyze
● disable rules - much easier
● start small
● find inspirations → NoSQL connectors, "extends SparkPlan", "extends Rule[LogicalPlan]"
● test at scale
explain the idea of the talk and how the series about customizing Apache Spark started
I will show it later in detail, but everything I present should be done only if you have no other choice, e.g. no existing SQL operator, no existing data source or optimization (give an example of the Cassandra join optimization I found the other day)
Moreover, SparkSessionExtensions is still a @DeveloperApi, so more or less "use at your own risk".
TODO: does it work in PySpark?
3 types, actually only 2 are still publicly usable, but the third is worth knowing
if you worked with an RDBMS before, you will find very similar principles
code-wise it's just a function or a class, no need to delve deep into the internals; much easier, but also a risk of overuse; I will show it later
TODO: explain the purpose of VectorUDT and MatrixUDT (Spark MLlib)
was made private in https://issues.apache.org/jira/browse/SPARK-14155 because it was supposed to be replaced by a new UDT API supporting vectorized (batch) data and working better with Datasets
e.g. enum
sqlType → only intended to represent the type at the Apache Spark storage level. It's not exposed to the end user, so you can't do df.filter("myUdt.field_a = 'a'").show()! The schema defined by sqlType is never exposed and is not intended to be accessed directly; it simply provides a way to express a complex data type using native Spark SQL types. To access these properties, either use row.getAs[MyUdt] or a UDF. See https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1
pyUDT = the paired Python UDT, if it exists
as of this writing (16.08.2019), the ticket intending to make the API public again (https://issues.apache.org/jira/browse/SPARK-7768) is still in progress and there is no information about how far along it is; the targeted release is 3.0 but it probably won't make it
How to use? You can directly access the properties of the given type in a map or filter function, see an example here https://stackoverflow.com/a/51957666/9726075
UDT - you can use it in `row.getAs[MyType]("column")` methods, so in any mapping, filter, groupBy function
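a minimal sketch reusing the City UDT and the cities DataFrame from the earlier slide:
// the typed filter receives a Row; getAs deserializes the UDT column back into a City
val frenchCities = cities.filter(row => row.getAs[City]("city").isFrench)
frenchCities.show()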
MatrixUDT & VectorUDT - both are private and should be used from org.apache.spark.ml.linalg.SQLDataTypes
https://issues.apache.org/jira/browse/SPARK-14155 https://issues.apache.org/jira/browse/SPARK-7768
I was still coding in Java; the code is a little bit longer, but I use a shorter version for presentation purposes https://www.waitingforcode.com/apache-spark-sql/used-defined-type/read
Vectorized UDF - a normal UDF operates on one row at a time; vectorized UDFs are mostly used for ML and, more exactly, as a @pandas_udf, which applies a function on a Pandas Series rather than row by row; for some cases the speedup is about 242x!
returns a Column, so you can't use it, for instance, inside an aggregation
deterministic - if executed multiple times for the same input, it always generates the same result; for non-deterministic UDFs the query planner must skip some optimizations, which can degrade the performance
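a short sketch of the related API (the UDF itself is hypothetical): marking a UDF with asNondeterministic() tells the planner not to assume it can freely re-execute or reorder it:
import org.apache.spark.sql.functions.udf
import scala.util.Random
// adds some random noise, so 2 calls with the same input may differ
val noisyNumber = udf((nr: Int) => nr + Random.nextInt(10)).asNondeterministic()
val withNoise = letterNumbers.select($"letter", noisyNumber($"number") as "noisy")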
TODO: add this to the link with resources https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql?view=sql-server-2017
https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5009.htm
blackbox ⇒ be careful about the implementation, Apache Spark doesn't know how to optimize them for you
udf.register for PySpark works too
you can later use "MyUDF" in the string expressions
you can't do that for udf(...), which simply transforms a Scala function into a UDF that you can use later in operations like letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven") for val udfEvenResolver = udf(evenFlagResolver _)
wrapping - a great anti-pattern and proof that a UDF, used in the wrong context, will perform worse than native Apache Spark code most of the time
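a sketch of the native alternative to that wrapping, assuming a hypothetical jobsDF input with a jobid column and the jobnameDF from the slide:
// enrich the rows with a join instead of querying jobnameDF from inside a UDF
val enriched = jobsDF.join(jobnameDF, Seq("jobid"), "left_outer")
  .select($"jobid", $"jobname")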
not to blame anyone, but simply to highlight that this simplicity is good and bad at the same time
I don't know why - to simplify? To write a unit test? But we can still write it with native Apache Spark
https://stackoverflow.com/questions/46464125/how-to-write-multiple-if-statements-in-spark-udf/46464610 https://stackoverflow.com/questions/55135347/how-to-pass-dataframe-to-spark-udf
https://stackoverflow.com/questions/35905273/using-a-udf-in-spark-data-frame-for-text-mining/35908115
https://stackoverflow.com/questions/47985382/how-to-use-udf-in-where-clause-in-scala-spark?rq=1
https://stackoverflow.com/questions/50760841/spark-sql-udf-cast-return-value?rq=1
ML: https://stackoverflow.com/questions/53551000/spark-create-dataframe-in-udf
In clause: https://stackoverflow.com/questions/57109478/filtering-a-datasetrow-if-month-is-in-list-of-integers
CREATE aggregate = PostgreSQL, SQL Server
use cases - any custom aggregates, like geometric mean, weighted mean
deterministic - 2 calls of the same function (with the same parameters) always return the same results. It's mostly used in plan optimization and sometimes during the analysis phase: at the analysis step, when the child node is not deterministic, it shouldn't appear in the aggregation:
if (!child.deterministic) {
  failAnalysis(
    s"nondeterministic expression ${expr.sql} should not " +
      s"appear in the arguments of an aggregate function.")
}

failAnalysis(
  s"""nondeterministic expressions are only allowed in
     |Project, Filter, Aggregate or Window, found:
     | ${o.expressions.map(_.sql).mkString(",")}
     |in operator ${operator.simpleString}
   """.stripMargin)
a custom aggregate can be called after the groupBy(...) method, exactly like built-in aggregates such as average, sum and so forth
evaluate - the final result
bufferSchema → the UDAF works on partial aggregates and this schema represents the intermediate results; that's why it's different from dataType, which describes the returned value
merge - merges partial results
update - adds a new value to the buffer
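to illustrate these 4 methods, a sketch of one of the custom aggregates mentioned above (geometric mean); it's only an illustration of the UserDefinedAggregateFunction contract, not code from the talk:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class GeometricMean extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(Seq(StructField("value", DoubleType)))
  // bufferSchema - the partial aggregate: product of the values and their count
  override def bufferSchema: StructType = StructType(Seq(
    StructField("product", DoubleType), StructField("count", LongType)))
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 1.0
    buffer(1) = 0L
  }
  // update - adds a new input value to the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) * input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  // merge - combines partial results computed on different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) * buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // evaluate - the final result
  override def evaluate(buffer: Row): Any = math.pow(buffer.getDouble(0), 1.0 / buffer.getLong(1))
}
// callable after groupBy, just like built-in aggregates:
// df.groupBy($"group").agg(new GeometricMean()($"value"))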
https://www.waitingforcode.com/apache-spark-sql/user-defined-aggregate-functions/read
examples: https://stackoverflow.com/questions/4421768/the-most-useful-user-defined-aggregate-functions
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDA.html
https://issues.apache.org/jira/browse/SPARK-18127 - adds support for extensions
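a sketch of how the SPARK-18127 extensions can be wired without touching the application code, via the spark.sql.extensions property (the extension class and the injected rule are illustrative):
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
// the configured class must implement Function1[SparkSessionExtensions, Unit]
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(_ => RemoveTrivialFilters) // e.g. the hypothetical rule sketched earlier
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", classOf[MyExtensions].getCanonicalName)
  .getOrCreate()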
say that "catalog listeners are not really there"
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
once the plan is parsed, it's still unresolved - Apache Spark simply knows which SQL operators and corresponding logical plan nodes should be executed
later it uses the metadata catalog in order to resolve the data - that creates a fully executable logical plan
that logical plan could be executed as is, but before that the engine also tries to optimize it by applying optimization rules
logical - reduce the # of operations and apply some of them at the data source level; optimizations executed iteratively in batch (spark.sql.optimizer.maxIterations, default = 100)
the parser is not called for methods invoking the API like map(...) or select("", ""), because they directly construct logical plan nodes
order is defined in org.apache.spark.sql.catalyst.analysis.Analyzer#batches:
resolution rules are first
post-hoc resolution rules are next
extended check rules (analysis rules)
they're executed after resolving the plan: org.apache.spark.sql.catalyst.analysis.Analyzer#executeAndCheck
org.apache.spark.sql.internal.BaseSessionStateBuilder#sqlParser for parser
org.apache.spark.sql.catalyst.rules.RuleExecutor#execute ⇒ executed logical optimizations; called by lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData) !
org.apache.spark.sql.internal.BaseSessionStateBuilder#analyzer has all rules applied during the analysis stage
org.apache.spark.sql.internal.BaseSessionStateBuilder#optimizer ⇒ all logical optimization rules
parsePlan → SQL query (SELECT * …)
parseExpression → nr > 1 (nr = col)
parseTableIdentifier → converts a table name into a TableIdentifier, e.g. DataFrameWriter.insertInto (.write.insertInto); just a case class holding table name and database attributes
AstBuilder: "The AstBuilder converts an ANTLR4 ParseTree into a catalyst Expression, LogicalPlan or TableIdentifier."
https://blog.octo.com/mythbuster-apache-spark-parsing-requete-sql/
https://www.slideshare.net/SandeepJoshi55/apache-spark-undocumented-extensions-78929290
"SELECT", "i" ⇒ tokens built from lexer phasis
AST later built from parser phasis
handles unresolved ⇒ I consider it as unknown. If you define an alias in your query, Apache Spark doesn't know whether the columns really exist. It has to resolve them
* - UnresolvedStar, resolves attributes directly from the SubqueryAlias dataset_1 when the UnresolvedStar#expand method is called
alias for: dataset1.sqlContext.sql("SELECT nr + 1 + 3 + 4, letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
relation: UnresolvedRelation - holds the name of a relation that has yet to be looked up in a catalog. UnresolvedRelation becomes
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
for:
val dataset1 = Seq(("A", 1, 1), ("B", 2, 1), ("C", 3, 1), ("D", 4, 1), ("E", 5, 1)).toDF("letter", "nr", "a_flag")
dataset1.createOrReplaceTempView("dataset_1")
dataset1.sqlContext.sql("SELECT letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
order does matter, e.g. for DataSourceAnalysis which "must be run after `PreprocessTableCreation` and `PreprocessTableInsertion`." ;
e.g. DataSourceAnalysis - replaces generic operations like InsertIntoTable by more specific (Spark SQL) operations, e.g. InsertIntoDataSourceCommand; another example InsertIntoDir ⇒ InsertIntoDataSourceDirCommand
TODO: show INSERT INTO TABLE tab1 SELECT 1, 2
TODO: generate an example with RunnableCommand
order does matter ⇒
/**
 * Replaces generic operations with specific variants that are designed to work with Spark
 * SQL Data Sources.
 *
 * Note that, this rule must be run after `PreprocessTableCreation` and
 * `PreprocessTableInsertion`.
 */
case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with CastSupport {
fail-fast approach - executed before physically running the query
mostly executed as a pattern matching on the LogicalPlan nodes
examples: PreWriteCheck (e.g. "Cannot insert into table that is also being read from"), PreReadCheck (the input_file_name function in Hive, https://issues.apache.org/jira/browse/SPARK-21354, which does not support more than one source)
see this: https://github.com/apache/spark/commit/2b10ebe6ac1cdc2c723cb47e4b88cfbf39e0de08#diff-73bd90660f41c12a87ee9fe8d35d856a for HiveSupport
override val extendedCheckRules: Seq[LogicalPlan => Unit] =
PreWriteCheck +:
PreReadCheck +:
* A rule to do various checks before inserting into or writing to a data source table.
* A rule to do various checks before reading a table.
e.g. do not allow to write the table used in source (INSERT INTO clause)
e.g. whether you do not execute Hive queries without Hive support enabled: " * A rule to check whether the functions are supported only when Hive support is enabled"
HiveOnlyCheck +:
here org.apache.spark.sql.execution.datasources.PreWriteCheck$#failAnalysis is your friend, only check
if you did some Java assert() or @
rules can be excluded from spark.sql.optimizer.excludedRules property
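for instance, a hedged one-liner to disable predicate pushdown while checking whether a given rule causes a regression:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")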
dataset reduction - the plan is rewritten to execute filters on the data source side, e.g. PushDownPredicate; it reverses Filter and Project:
case Filter(condition, project @ Project(fields, grandChild))
  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) =>
  // Create a map of Aliases to their values from the child projection.
  // e.g., 'SELECT a + b AS c, d ...' produces Map(c -> a + b).
  val aliasMap = AttributeMap(fields.collect {
    case a: Alias => (a.toAttribute, a.child)
  })
  project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
e.g.
*Project [amount#8L, id#9L]
+- *Filter (isnotnull(amount#8L) && (amount#8L > 10))
instead of Project → Filter (reading bottom-up), so that the filter is executed first
transform - recursively applies the rule on the AST; up or down - transformUp goes from the bottom up (children first, the current node at the end). E.g. operations are reversed for predicate pushdown (Filter swapped with Project); sometimes operations can be replaced (e.g. 2 Filter nodes replaced with 1 Filter containing both conditions, which you can later even remove if the condition is always true) or removed (e.g. when the filter is always true, or when the same SELECT is called twice)
resolve - similar to transform but skips already analyzed sub-trees; when resolve* is called, Apache Spark starts by checking the analyzed flag of the plan and, if the plan is already marked as analyzed, it simply skips the rule logic. Important point to note: even though you use resolve*, a transform* can create a completely new plan and invalidate the analyzed flag. Resolve applies mostly to the nodes that can be evaluated only once, like alias resolution, substitution methods (CTE plan inclusion, children plans substituted with window spec definitions), relations resolution
Node - kind of container; most of the time it will be interpreted in the custom logical optimizations
Expression - a stringified version of what you want to do with the operator, e.g. a list of columns, a filter expression (simple, IN statement). Globally it can be considered as a method taking some input and generating some output
different variants: unary (1 input, 1 output), named (e.g. alias), binary (2 inputs, 1 output), ternary (3 inputs, 1 output, e.g. months_between)
PART OF SPARKPLAN
sequential execution, one tree level at a time; doExecute can call leftInput.execute() and inside it mostly operates on RDD functions like map, mapPartitions, foreachPartition and so on
doExecuteBroadcast - used for instance in BroadcastHashJoinExec to broadcast a part of the query to the rest of the executors
doPrepare - if something must be initialized before the physical execution; e.g. subquery execution initializes here the subquery, which is defined as a lazy Future:
private lazy val relationFuture: Future[Array[InternalRow]]
and in doPrepare it's triggered as:
protected override def doPrepare(): Unit = {
  relationFuture
}
PART OF CODEGENSUPPORT trait
for doProduce ⇒ produces generated code to process
doConsume ⇒ processes rows or columns generated by the physical plan
inputRDDs → input rows for this plan
codegen optimizes CPU usage by generating a single optimized function in bytecode for the set of operators in a SQL query (when possible), instead of generating iterator code for each operator.
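a quick way to inspect that generated code (a sketch reusing the letterNumbers DataFrame from the UDF slides):
import org.apache.spark.sql.execution.debug._
// prints the code generated for each WholeStageCodegen subtree of the physical plan
letterNumbers.filter($"number" > 1).debugCodegen()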
https://www.slideshare.net/datamantra/anatomy-of-spark-sql-catalyst-part-2
https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
recall information about data catalogs
say that the comment is not true
but promising since catalog federation is an ongoing effort for Apache Spark → https://issues.apache.org/jira/browse/SPARK-15777
but despite the lack of support, you can still extend the catalogs
spark.sql.optimizer.excludedRules
find inspirations ⇒ not clearly documented