Apache Spark in your likeness - low and high level customization
1. Apache Spark in your likeness
User-Defined features and session extensions
Bartosz Konieczny
@waitingforcode
2. First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com
#@waitingforcode
#github.com/bartosz25
3. Apache Spark Community Feedback initiative
https://www.waitingforcode.com/static/spark-feedback
Why?
● a single place with all best practices
● community-driven
● open
● interactive
How?
● fill the form (https://forms.gle/sjSWPKmudhM6a3776)
● validate
● share
● learn
7. High level customization
● User-Defined Type (UDT)
● User-Defined Function (UDF)
● User-Defined Aggregate Functions (UDAF)
⇒ RDBMS in Apache Spark
⇒ no need to dive into the internals
8. User-Defined Type
● public API prior to 2.0 only - ongoing effort to bring it back for 3.0
● UDTRegistration
● Dataset substitution - your class in DataFrame
● examples: VectorUDT, MatrixUDT
def sqlType: DataType
def pyUDT: String = null
def serializedPyClass: String = null
def serialize(obj: UserType): Any
def deserialize(datum: Any): UserType
9. User-Defined Type - example
@SQLUserDefinedType(udt = classOf[CityUDT])
case class City(name: String, country: Countries) {
  def isFrench: Boolean = country == Countries.France
}

class CityUDT extends UserDefinedType[City] {
  override def sqlType: DataType = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)
  ))
  // ...
}

val cities = Seq(City("Paris", Countries.France), City("London", Countries.England)).toDF("city")
11. User-Defined Type - expression retrieval
cities.where("city.name == 'Paris'").show()
org.apache.spark.sql.AnalysisException: Can't extract value from city#3: need struct type
but got struct<name:string,region:string>; line 1 pos 0
  at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
12. User-Defined Function
● SQL's: CREATE FUNCTION
● blackbox
● vectorized UDFs for PySpark - ML purpose
case class UserDefinedFunction protected[sql] (
    f: AnyRef,
    dataType: DataType,
    inputTypes: Option[Seq[DataType]]) {
  def nullable: Boolean = _nullable
  def deterministic: Boolean = _deterministic
  def apply(exprs: Column*): Column = { … }
}
14. User-Defined Function
In the query
sparkSession.udf.register("EvenFlagResolver_registerTest", evenFlagResolver _)
val rows = letterNumbers.selectExpr("letter",
  "EvenFlagResolver_registerTest(number) as isEven")
Programmatically
val udfEvenResolver = udf(evenFlagResolver _)
val rows = letterNumbers.select($"letter",
  udfEvenResolver($"number") as "isEven")
15. User-Defined Function - StackOverflow overused
● if-else == CASE WHEN
● tokenize == LOWER(REGEXP_REPLACE(...))
● IN clause
● columns equality == abs(col1 - col2) < allowed_precision
● wrapping DataFrame execution == JOIN
val jobnameDF = jobnameSeq.toDF("jobid", "jobname")
sqlContext.udf.register("getJobname", (id: String) => (
  jobnameDF.filter($"jobid" === id).select($"jobname")
))
● testing ML model == MLlib
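A minimal sketch (not from the original slides) of the native alternatives to some of the UDFs above, assuming the letterNumbers DataFrame used later in the talk and a hypothetical measures DataFrame with col1/col2 columns:
import org.apache.spark.sql.functions.{abs, when}
// CASE WHEN instead of an if-else UDF
val flagged = letterNumbers.withColumn("isEven", when($"number" % 2 === 0, true).otherwise(false))
// IN clause instead of a membership-testing UDF
val filtered = letterNumbers.filter($"number".isin(1, 2, 3))
// approximate column equality without a UDF
val allowedPrecision = 0.001
val close = measures.filter(abs($"col1" - $"col2") < allowedPrecision)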
23. Parser - from text to AST example
SELECT id, login FROM users WHERE id > 1 AND active = true
"SELECT", "i", "d", ",", "l", "o", "g", "i", "n", "WHERE", "i", "d", ">", "1",
"AND", "a", "c", "t", "i", "v", "e", "=", "t", "r", "u", "e"
(whitespaces omitted for readability)
24. Resolution rules
● handles unresolved, i.e. unknown becomes known
● Example:
SELECT * FROM dataset_1
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dataset_1`
== Analyzed Logical Plan ==
letter: string, nr: int, a_flag: int
Project [letter#7, nr#8, a_flag#9]
+- SubqueryAlias `dataset_1`
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
+- LocalRelation [_1#3, _2#4, _3#5]
25. Post hoc resolution rules
● after resolution rules (post hoc)
● same as custom optimization rules
● order does matter
PreprocessTableCreation → PreprocessTableInsertion → DataSourceAnalysis
Examples:
● normalization - casting, renaming
● partitioning checks, e.g. "$partKey is not a partition column"
● generic LogicalPlan resolution, e.g.
CreateTable (with query) ⇒ CreateDataSourceTableAsSelectCommand
CreateTable (without query) ⇒ CreateDataSourceTableCommand
26. Check analysis rules
● plain assertions
Connection conn = getConnection();
assert conn != null : "Connection is null";
● clearer error messages:
"assertion failed: No plan for CreateTable CatalogTable" ⇒ ""Hive
support is required to use CREATE Hive TABLE AS SELECT"
● API:
class MyAnalysisRule extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    // throw new AnalysisException("Analysis error message")
  }
}
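A hedged sketch of how such a check rule could be plugged into the session through SparkSessionExtensions#injectCheckRule (the rule and its threshold are illustrative, not from the talk):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
// hypothetical check: refuse plans with a suspiciously high number of nodes
object PlanSizeCheck extends (LogicalPlan => Unit) {
  override def apply(plan: LogicalPlan): Unit = {
    val nodeCount = plan.collect { case node => node }.size
    // Spark's own checks call failAnalysis, which throws an AnalysisException
    if (nodeCount > 100) throw new IllegalStateException(s"Plan with $nodeCount nodes is too complex")
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectCheckRule(_ => PlanSizeCheck))
  .getOrCreate()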
27. Check analysis rules - PreWriteCheck
object PreWriteCheck extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    plan.foreach {
      case InsertIntoTable(l @ LogicalRelation(relation, _, _, _), partition, query, _, _) =>
        val srcRelations = query.collect { case LogicalRelation(src, _, _, _) => src }
        if (srcRelations.contains(relation)) {
          failAnalysis("Cannot insert into table that is also being read from.")
        } else {
          // ...
28. Logical optimization rule
● simplification:
(id > 0 OR login == 'test') AND id > 0 == id > 0
● collapse:
.repartition(10).repartition(20) == .repartition(20)
● dataset reduction:
columns pruning, predicate pushdown
● human mistakes:
trivial filters (2 > 1), execution tree cleaning (identity functions), redundancy (projection, aliases)
29. Logical optimization rule - diff transform vs resolve
General template:
def apply(plan: LogicalPlan): LogicalPlan = plan.{{TRANSFORMATION}} {
  case agg: Aggregate => …
  case projection: Project => ...
}
{{TRANSFORMATION}} = transformUp/transformDown, resolveOperatorsUp/resolveOperatorsDown
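To make the template concrete, a minimal optimizer rule sketch (the rule itself is hypothetical, not from the talk) registered through SparkSessionExtensions#injectOptimizerRule:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
// removes "trivial filters" whose condition is the literal true
object RemoveTrivialFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    case Filter(Literal(true, _), child) => child
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(_ => RemoveTrivialFilters))
  .getOrCreate()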
32. Catalog listeners
* Holder for injection points to the [[SparkSession]]. We make NO guarantee about the stability
* regarding binary compatibility and source compatibility of methods here.
* This current provides the following extension points:
* <ul>
...
* <li>(External) Catalog listeners.</li>
...
class SparkSessionExtensions {
33. Catalog listeners
● ExternalCatalogWithListener
val catalogEvents = new scala.collection.mutable.ListBuffer[ExternalCatalogEvent]()
TestedSparkSession.sparkContext.addSparkListener(new SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = {
    event match {
      case externalCatalogEvent: ExternalCatalogEvent => catalogEvents.append(externalCatalogEvent)
      case _ => {}
    }
  }
})
// ExternalCatalogEvent = (CreateTablePreEvent, CreateTableEvent, AlterTableEvent, ...)
34. Lessons learned
● Apache Spark first ⇒ do not write UDF just to write one, prefer native API
● debug & log
● analyze
● disable rules - much easier
● start small
● find inspirations → NoSQL connectors, "extends SparkPlan", "extends Rule[LogicalPlan]"
● test at scale
explain the idea of the talk and how the series about customizing Apache Spark started
I will show it later in detail, but everything I present should be done only if you have no other choice, e.g. no existing SQL operator, no existing data source or optimization (give an example of the Cassandra join optimization I found the other day)
Moreover, SparkSessionExtensions is still a @DeveloperApi, so more or less "use at your own risk".
TODO: does it work in PySpark?
3 types, actually only 2 are still publicly usable, but the third is worth knowing
if you worked with an RDBMS before, you will find very similar principles
code-wise it's just a function or a class, no need to delve deep into the internals; much easier, but also a risk of overuse; I will show it later
TODO: explain the purpose of VectorUDT and MatrixUDT (Spark MLlib)
was made private in https://issues.apache.org/jira/browse/SPARK-14155 because it was supposed to be replaced by a new UDT API supporting vectorized (batch) data and working better with Datasets
e.g. enum
sqlType → only intended to represent the type at the Apache Spark storage level. It's not exposed to the end user, so you can't do df.filter("myUdt.field_a = 'a'").show()! The schema defined by sqlType is never exposed and is not intended to be accessed directly; it simply provides a way to express a complex data type using native Spark SQL types. To access these properties, either use row.getAs[MyUdt] or a UDF. See https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1
pyUDT = the paired Python UDT, if it exists
as of this writing (16.08.2019), the ticket intending to make the API public again (https://issues.apache.org/jira/browse/SPARK-7768) is still in progress and there is no information about how far along it is; the targeted release is 3.0 but it probably won't make it
How to use? You can directly access the properties of the given type in a map or filter function, see an example here https://stackoverflow.com/a/51957666/9726075
UDT - you can use it in `row.getAs[MyType]("column")` methods, so in any mapping, filter, groupBy function
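a minimal sketch reusing the City UDT and the cities DataFrame from the earlier slide:
// the typed filter receives a Row; getAs deserializes the UDT column back into a City
val frenchCities = cities.filter(row => row.getAs[City]("city").isFrench)
frenchCities.show()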
MatrixUDT & VectorUDT - both are private and should be used from org.apache.spark.ml.linalg.SQLDataTypes
https://issues.apache.org/jira/browse/SPARK-14155 https://issues.apache.org/jira/browse/SPARK-7768
I was still coding in Java; the code is a little bit longer, but I use a shorter version for presentation purposes https://www.waitingforcode.com/apache-spark-sql/used-defined-type/read
Vectorized UDF - a normal UDF operates on one row at a time; vectorized UDFs are mostly used for ML and, more exactly, as a @pandas_udf, which applies a function on a Pandas Series rather than row by row; for some cases the speedup is about 242x!
returns a Column, so you can't use it, for instance, inside an aggregation
deterministic - if executed multiple times for the same input, it always generates the same result; for non-deterministic UDFs the query planner must skip some optimizations, which can degrade the performance
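a short sketch of the related API (the UDF itself is hypothetical): marking a UDF with asNondeterministic() tells the planner not to assume it can freely re-execute or reorder it:
import org.apache.spark.sql.functions.udf
import scala.util.Random
// adds some random noise, so 2 calls with the same input may differ
val noisyNumber = udf((nr: Int) => nr + Random.nextInt(10)).asNondeterministic()
val withNoise = letterNumbers.select($"letter", noisyNumber($"number") as "noisy")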
TODO: add this to the link with resources https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql?view=sql-server-2017
https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5009.htm
blackbox ⇒ be careful about the implementation, Apache Spark doesn't know how to optimize them for you
udf.register for PySpark works too
you can later use "MyUDF" in the string expressions
you can't do that for udf(...), which simply transforms a Scala function into a UDF that you can use later in operations like letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven") for val udfEvenResolver = udf(evenFlagResolver _)
wrapping - a great anti-pattern and proof that a UDF, used in the wrong context, will perform worse than native Apache Spark code most of the time
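a sketch of the native alternative to that wrapping, assuming a hypothetical jobsDF input with a jobid column and the jobnameDF from the slide:
// enrich the rows with a join instead of querying jobnameDF from inside a UDF
val enriched = jobsDF.join(jobnameDF, Seq("jobid"), "left_outer")
  .select($"jobid", $"jobname")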
not to blame anyone, but simply to highlight that this simplicity is good and bad at the same time
I don't know why - to simplify? To write a unit test? But we can still write it with native Apache Spark
https://stackoverflow.com/questions/46464125/how-to-write-multiple-if-statements-in-spark-udf/46464610 https://stackoverflow.com/questions/55135347/how-to-pass-dataframe-to-spark-udf
https://stackoverflow.com/questions/35905273/using-a-udf-in-spark-data-frame-for-text-mining/35908115
https://stackoverflow.com/questions/47985382/how-to-use-udf-in-where-clause-in-scala-spark?rq=1
https://stackoverflow.com/questions/50760841/spark-sql-udf-cast-return-value?rq=1
ML: https://stackoverflow.com/questions/53551000/spark-create-dataframe-in-udf
In clause: https://stackoverflow.com/questions/57109478/filtering-a-datasetrow-if-month-is-in-list-of-integers
CREATE aggregate = PostgreSQL, SQL Server
use cases - any custom aggregates, like geometric mean, weighted mean
deterministic - 2 calls of the same function (with the same parameters) always return the same results. It's mostly used in plan optimization and sometimes during the analysis phase: at the analysis step, when the child node is not deterministic, it shouldn't appear in the aggregation:
if (!child.deterministic) {
  failAnalysis(
    s"nondeterministic expression ${expr.sql} should not " +
      s"appear in the arguments of an aggregate function.")
}

failAnalysis(
  s"""nondeterministic expressions are only allowed in
     |Project, Filter, Aggregate or Window, found:
     | ${o.expressions.map(_.sql).mkString(",")}
     |in operator ${operator.simpleString}
   """.stripMargin)
a custom aggregate can be called after the groupBy(...) method, exactly like built-in aggregates such as average, sum and so forth
evaluate - the final result
bufferSchema → the UDAF works on partial aggregates and this schema represents the intermediate results; that's why it's different from dataType, which describes the returned value
merge - merges partial results
update - adds a new value to the buffer
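to illustrate these 4 methods, a sketch of one of the custom aggregates mentioned above (geometric mean); it's only an illustration of the UserDefinedAggregateFunction contract, not code from the talk:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class GeometricMean extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(Seq(StructField("value", DoubleType)))
  // bufferSchema - the partial aggregate: product of the values and their count
  override def bufferSchema: StructType = StructType(Seq(
    StructField("product", DoubleType), StructField("count", LongType)))
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 1.0
    buffer(1) = 0L
  }
  // update - adds a new input value to the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) * input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  // merge - combines partial results computed on different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) * buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // evaluate - the final result
  override def evaluate(buffer: Row): Any = math.pow(buffer.getDouble(0), 1.0 / buffer.getLong(1))
}
// callable after groupBy, just like built-in aggregates:
// df.groupBy($"group").agg(new GeometricMean()($"value"))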
https://www.waitingforcode.com/apache-spark-sql/user-defined-aggregate-functions/read
examples: https://stackoverflow.com/questions/4421768/the-most-useful-user-defined-aggregate-functions
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDA.html
https://issues.apache.org/jira/browse/SPARK-18127 - adds support for extensions
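a sketch of how the SPARK-18127 extensions can be wired without touching the application code, via the spark.sql.extensions property (the extension class and the injected rule are illustrative):
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
// the configured class must implement Function1[SparkSessionExtensions, Unit]
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(_ => RemoveTrivialFilters) // e.g. the hypothetical rule sketched earlier
  }
}
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", classOf[MyExtensions].getCanonicalName)
  .getOrCreate()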
say that "catalog listeners are not really there"
https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
once the plan is parsed, it's still unresolved - Apache Spark simply knows which SQL operators and corresponding logical plan nodes should be executed
later it uses the metadata catalog in order to resolve the data - that creates a fully executable logical plan
that logical plan could be executed as is, but before that the engine also tries to optimize it by applying optimization rules
logical - reduce the # of operations and apply some of them at the data source level; optimizations executed iteratively in batch (spark.sql.optimizer.maxIterations, default = 100)
the parser is not called for methods invoking the API like map(...) or select("", ""), because they directly construct logical plan nodes
order is defined in org.apache.spark.sql.catalyst.analysis.Analyzer#batches:
resolution rules are first
post-hoc resolution rules are next
extended check rules (analysis rules)
they're executed after resolving the plan: org.apache.spark.sql.catalyst.analysis.Analyzer#executeAndCheck
org.apache.spark.sql.internal.BaseSessionStateBuilder#sqlParser for parser
org.apache.spark.sql.catalyst.rules.RuleExecutor#execute ⇒ executed logical optimizations; called by lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData) !
org.apache.spark.sql.internal.BaseSessionStateBuilder#analyzer has all rules applied during the analysis stage
org.apache.spark.sql.internal.BaseSessionStateBuilder#optimizer ⇒ all logical optimization rules
parsePlan → SQL query (SELECT * …)
parseExpression → nr > 1 (nr = col)
parseTableIdentifier → converts a table name into a TableIdentifier, e.g. DataFrameWriter.insertInto (.write.insertInto); just a case class holding table name and database attributes
AstBuilder: "The AstBuilder converts an ANTLR4 ParseTree into a catalyst Expression, LogicalPlan or TableIdentifier."
https://blog.octo.com/mythbuster-apache-spark-parsing-requete-sql/
https://www.slideshare.net/SandeepJoshi55/apache-spark-undocumented-extensions-78929290
"SELECT", "i" ⇒ tokens built from lexer phasis
AST later built from parser phasis
handles unresolved ⇒ I consider it as unknown. If you define an alias in your query, Apache Spark doesn't know whether the columns really exist. It has to resolve them
* - UnresolvedStar, resolves attributes directly from the SubqueryAlias dataset_1 when the UnresolvedStar#expand method is called
alias for: dataset1.sqlContext.sql("SELECT nr + 1 + 3 + 4, letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
relation: UnresolvedRelation - holds the name of a relation that has yet to be looked up in a catalog. UnresolvedRelation becomes
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
   +- LocalRelation [_1#3, _2#4, _3#5]
for:
val dataset1 = Seq(("A", 1, 1), ("B", 2, 1), ("C", 3, 1), ("D", 4, 1), ("E", 5, 1)).toDF("letter", "nr", "a_flag")
dataset1.createOrReplaceTempView("dataset_1")
dataset1.sqlContext.sql("SELECT letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
order does matter, e.g. for DataSourceAnalysis which "must be run after `PreprocessTableCreation` and `PreprocessTableInsertion`." ;
e.g. DataSourceAnalysis - replaces generic operations like InsertIntoTable by more specific (Spark SQL) operations, e.g. InsertIntoDataSourceCommand; another example InsertIntoDir ⇒ InsertIntoDataSourceDirCommand
TODO: show INSERT INTO TABLE tab1 SELECT 1, 2
TODO: generate an example with RunnableCommand
order does matter ⇒
/**
 * Replaces generic operations with specific variants that are designed to work with Spark
 * SQL Data Sources.
 *
 * Note that, this rule must be run after `PreprocessTableCreation` and
 * `PreprocessTableInsertion`.
 */
case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with CastSupport {
fail-fast approach - executed before physically running the query
mostly executed as a pattern matching on the LogicalPlan nodes
examples: PreWriteCheck (e.g. "Cannot insert into table that is also being read from"), PreReadCheck (the input_file_name function in Hive, https://issues.apache.org/jira/browse/SPARK-21354, which does not support more than one source)
see this: https://github.com/apache/spark/commit/2b10ebe6ac1cdc2c723cb47e4b88cfbf39e0de08#diff-73bd90660f41c12a87ee9fe8d35d856a for HiveSupport
override val extendedCheckRules: Seq[LogicalPlan => Unit] =
PreWriteCheck +:
PreReadCheck +:
* A rule to do various checks before inserting into or writing to a data source table.
* A rule to do various checks before reading a table.
e.g. do not allow to write the table used in source (INSERT INTO clause)
e.g. whether you do not execute Hive queries without Hive support enabled: " * A rule to check whether the functions are supported only when Hive support is enabled"
HiveOnlyCheck +:
here org.apache.spark.sql.execution.datasources.PreWriteCheck$#failAnalysis is your friend, only check
if you did some Java assert() or @
rules can be excluded from spark.sql.optimizer.excludedRules property
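for instance, a hedged one-liner to disable predicate pushdown while checking whether a given rule causes a regression:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")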
dataset reduction - the plan is rewritten to execute filters on the data source side, e.g. PushDownPredicate; it reverses Filter and Project:
case Filter(condition, project @ Project(fields, grandChild))
  if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) =>
  // Create a map of Aliases to their values from the child projection.
  // e.g., 'SELECT a + b AS c, d ...' produces Map(c -> a + b).
  val aliasMap = AttributeMap(fields.collect {
    case a: Alias => (a.toAttribute, a.child)
  })
  project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild))
e.g.
*Project [amount#8L, id#9L]
+- *Filter (isnotnull(amount#8L) && (amount#8L > 10))
instead of Project → Filter (reading bottom-up), so that the filter is executed first
transform - recursively applies the rule on the AST; up or down - transformUp goes from the bottom up (children first, the current node at the end). E.g. operations are reversed for predicate pushdown (Filter swapped with Project); sometimes operations can be replaced (e.g. 2 Filter nodes replaced with 1 Filter containing both conditions, which you can later even remove if the condition is always true) or removed (e.g. when the filter is always true, or when the same SELECT is called twice)
resolve - similar to transform but skips already analyzed sub-trees; when resolve* is called, Apache Spark starts by checking the analyzed flag of the plan and, if the plan is already marked as analyzed, it simply skips the rule logic. Important point to note: even though you use resolve*, a transform* can create a completely new plan and invalidate the analyzed flag. Resolve applies mostly to the nodes that can be evaluated only once, like alias resolution, substitution methods (CTE plan inclusion, children plans substituted with window spec definitions), relations resolution
Node - kind of container; most of the time it will be interpreted in the custom logical optimizations
Expression - a stringified version of what you want to do with the operator, e.g. a list of columns, a filter expression (simple, IN statement). Globally it can be considered as a method taking some input and generating some output
different variants: unary (1 input, 1 output), named (e.g. alias), binary (2 inputs, 1 output), ternary (3 inputs, 1 output, e.g. months_between)
PART OF SPARKPLAN
sequential execution, one tree level at a time; doExecute can call leftInput.execute() and inside it mostly operates on RDD functions like map, mapPartitions, foreachPartition and so on
doExecuteBroadcast - used for instance in BroadcastHashJoinExec to broadcast a part of the query to the rest of the executors
doPrepare - if something must be initialized before the physical execution; e.g. subquery execution initializes here the subquery, which is defined as a lazy Future:
private lazy val relationFuture: Future[Array[InternalRow]]
and in doPrepare it's triggered as:
protected override def doPrepare(): Unit = {
  relationFuture
}
PART OF CODEGENSUPPORT trait
for doProduce ⇒ produces generated code to process
doConsume ⇒ processes rows or columns generated by the physical plan
inputRDDs → input rows for this plan
codegen optimizes CPU usage by generating a single optimized function in bytecode for the set of operators in a SQL query (when possible), instead of generating iterator code for each operator.
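a quick way to inspect that generated code (a sketch reusing the letterNumbers DataFrame from the UDF slides):
import org.apache.spark.sql.execution.debug._
// prints the code generated for each WholeStageCodegen subtree of the physical plan
letterNumbers.filter($"number" > 1).debugCodegen()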
https://www.slideshare.net/datamantra/anatomy-of-spark-sql-catalyst-part-2
https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
recall information about data catalogs
say that the comment is not true
but promising since catalog federation is an ongoing effort for Apache Spark → https://issues.apache.org/jira/browse/SPARK-15777
but despite the lack of support, you can still extend the catalogs
spark.sql.optimizer.excludedRules
find inspirations ⇒ not clearly documented