Apache Spark in your likeness
User-Defined features and session extensions
Bartosz Konieczny
@waitingforcode
First things first
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com
#@waitingforcode
#github.com/bartosz25
Apache Spark Community Feedback initiative
https://www.waitingforcode.com/static/spark-feedback
Why?
● a single place with all best practices
● community-driven
● open
● interactive
How?
● fill the form (https://forms.gle/sjSWPKmudhM6a3776)
● validate
● share
● learn
Why this talk?
Do it only if you have to
High-level
customization
High-level customization
● User-Defined Type (UDT)
● User-Defined Function (UDF)
● User-Defined Aggregate Functions (UDAF)
⇒ RDBMS-like concepts in Apache Spark
⇒ no need to know the internals
User-Defined Type
● public prior to 2.0 only - ongoing effort to bring it back for 3.0 (SPARK-7768)
● UDTRegistration
● Dataset substitution - your class in DataFrame
● examples: VectorUDT, MatrixUDT
def sqlType: DataType
def pyUDT: String = null
def serializedPyClass: String = null
def serialize(obj: UserType): Any
def deserialize(datum: Any): UserType
User-Defined Type - example
@SQLUserDefinedType(udt = classOf[CityUDT])
case class City(name: String, country: Countries) {
def isFrench: Boolean = country == Countries.France
}
class CityUDT extends UserDefinedType[City] {
  override def sqlType: DataType = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)))
  // serialize / deserialize omitted on the slide - see the sketch below
}
val cities = Seq(City("Paris", Countries.France), City("London", Countries.England)).toDF("city")
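The serialize and deserialize bodies are hidden behind the comment above. Here is a minimal, hedged sketch of what a complete CityUDT could look like, assuming Countries is a Java-style enum exposing valueOf; since UserDefinedType became a private API after 2.0, such a class would in practice have to live under an org.apache.spark.* package or go through UDTRegistration:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

class CityUDT extends UserDefinedType[City] {
  override def sqlType: DataType = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)))

  // store the case class as an InternalRow matching sqlType
  override def serialize(obj: City): Any = InternalRow(
    UTF8String.fromString(obj.name),
    UTF8String.fromString(obj.country.toString))

  // rebuild the case class from the stored row (Countries.valueOf is an assumption)
  override def deserialize(datum: Any): City = datum match {
    case row: InternalRow => City(row.getString(0), Countries.valueOf(row.getString(1)))
  }

  override def userClass: Class[City] = classOf[City]
}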
User-Defined Type - schema
root
|-- city: city (nullable = true)
cities.select("city").show()
+-------------------------------------+
| city|
+-------------------------------------+
| City(Paris,France) |
|City(London,England) |
+-------------------------------------+
cities.map(row => row.getAs[City]("city").isFrench).show()
+-------+
|value|
+-------+
| true |
|false |
+-------+
User-Defined Type - expression retrieval
cities.where("city.name == 'Paris'").show()
org.apache.spark.sql.AnalysisException: Can't extract value from city#3: need struct type
but got struct<name:string,region:string>; line 1 pos 0
at
org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala
:73)
User-Defined Function
● SQL equivalent: CREATE FUNCTION
● blackbox for the optimizer
● vectorized (pandas) UDFs for PySpark - mostly for ML purposes
case class UserDefinedFunction protected[sql] (
    f: AnyRef, dataType: DataType, inputTypes: Option[Seq[DataType]]) {
  def nullable: Boolean = _nullable
  def deterministic: Boolean = _deterministic
  def apply(exprs: Column*): Column = { … }
}
User-Defined Function
Registration methods:
● SparkSession#udf#register
sparkSession.udf.register("MyUDF", myUdf _)
● org.apache.spark.sql.functions.udf
udf(myUdf _)
User-Defined Function
In the query
sparkSession.udf.register("EvenFlagResolver_registerTest", evenFlagResolver
_)
val rows = letterNumbers.selectExpr("letter",
"EvenFlagResolver_registerTest(number) as isEven")
Programmatically
val udfEvenResolver = udf(evenFlagResolver _)
val rows = letterNumbers.select($"letter",
udfEvenResolver($"number") as "isEven")
● if-else == CASE WHEN
● tokenize == LOWER(REGEXP_REPLACE(...))
● IN clause
● columns equality == abs(col1 - col2) < allowed_precision
● wrapping DataFrame execution == JOIN
val jobnameDF = jobnameSeq.toDF("jobid", "jobname")
sqlContext.udf.register("getJobname", (id: String) =>
  jobnameDF.filter($"jobid" === id).select($"jobname")
)
● testing an ML model == MLlib
User-Defined Function
overused on StackOverflow
UDF in generated code
/* 047 */ mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal
*/).apply(mapelements_mutableStateArray_0[0]);
/* 094 */ private void serializefromobject_doConsume_0(scala.Tuple2 serializefromobject_expr_0_0, boolean
serializefromobject_exprIsNull_0_0) throws java.io.IOException {
/* 105 */ if (!serializefromobject_isNull_2) {
/* 106 */ Object serializefromobject_funcResult_0 = null;
/* 107 */ serializefromobject_funcResult_0 = serializefromobject_expr_0_0._1();
/* 108 */
/* 109 */ if (serializefromobject_funcResult_0 != null) {
/* 110 */ serializefromobject_value_2 = (java.lang.String) serializefromobject_funcResult_0;
/* 111 */ } else {
/* 112 */ serializefromobject_isNull_2 = true;
/* 113 */ }
/* 114 */
/* 115 */ }
/* 131 */ if (!false) {
/* 132 */ serializefromobject_isNull_5 = false;
/* 133 */ if (!serializefromobject_isNull_5) {
/* 134 */ Object serializefromobject_funcResult_1 = null;
/* 135 */ serializefromobject_funcResult_1 = serializefromobject_expr_0_0._2();
/* 136 */ serializefromobject_value_5 = (Integer) serializefromobject_funcResult_1;
/* 137 */
/* 138 */ }
● CREATE AGGREGATE - RDBMS, but NoSQL too (Apache Cassandra)
● 1 operation on n rows
● "custom MIN, MAX, SUM, …"
● registered like a UDF
User-Defined Aggregate Function
def inputSchema: StructType
def bufferSchema: StructType
def dataType: DataType
def deterministic: Boolean
def initialize(buffer: MutableAggregationBuffer): Unit
def update(buffer: MutableAggregationBuffer, input: Row): Unit
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
def evaluate(buffer: Row): Any
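To make the contract above concrete, here is a hedged stand-in: a trivial sum aggregate (the talk's SessionDurationAggregator is not shown, so this example only illustrates how the methods fit together):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumAggregator extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(Seq(StructField("value", LongType)))
  override def bufferSchema: StructType = StructType(Seq(StructField("sum", LongType)))
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true
  // the buffer holds the partial aggregate for one group on one partition
  override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  // partial aggregates coming from different partitions are combined here
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// registered like a UDF and then used after groupBy(...):
// sparkSession.udf.register("mySum", new SumAggregator)
// dataset.groupBy("user").agg(expr("mySum(time)"))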
UDAF physical plan
== Physical Plan ==
HashAggregate(keys=[user#7], functions=[sessiondurationaggregator(cast(time#8 as
bigint), com.waitingforcode.sql.SessionDurationAggregator@3f0c6b3c, 0, 0)],
output=[user#7, sessionlength_registertest(time)#16L])
+- Exchange hashpartitioning(user#7, 200)
+- HashAggregate(keys=[user#7],
functions=[partial_sessiondurationaggregator(cast(time#8 as bigint),
com.waitingforcode.sql.SessionDurationAggregator@3f0c6b3c, 0, 0)], output=[user#7,
first_log_time#44L, last_log_time#45L])
+- LocalTableScan [user#7, time#8]
Low-level
customization
Low-level customization
● extensions:
○ analyzer rules
○ check analysis rules
○ custom parser
○ optimizations
○ physical execution
● withExtensions(...) or
spark.sql.extensions
def withExtensions(f: SparkSessionExtensions => Unit): Builder = synchronized {
  f(extensions)
  this
}
SparkSession.builder().withExtensions(extensions => {
extensions.injectOptimizerRule(_ => MyRule)
})
-----------------------------------------------------
SparkSession.builder().config("spark.sql.extensions", "org.example.MyExtensions")

class MyExtensions extends Function1[SparkSessionExtensions, Unit] {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(_ => MyRule)
  }
}
Back to the basics
● resolution rules
● post-hoc resolution rules
● check analysis rules
● logical optimization rules
● parsers
● planner strategies
Parser
● String → lexer → parser → AST
● AstBuilder
● dedicated exception: ParseException
● parse*
trait ParserInterface {
def parsePlan(sqlText: String): LogicalPlan
def parseExpression(sqlText: String): Expression
def parseTableIdentifier(sqlText: String): TableIdentifier
}
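Before injecting a custom parser it can help to call the existing one directly; a small sketch, assuming a sparkSession in scope (sessionState is an internal, unstable API):

val parser = sparkSession.sessionState.sqlParser   // the session's ParserInterface
val plan = parser.parsePlan("SELECT id, login FROM users WHERE id > 1 AND active = true")
val expression = parser.parseExpression("id > 1 AND active = true")

println(plan.treeString)   // unresolved logical plan built from the AST
println(expression.sql)    // the expression rendered back as SQL text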
Parser - from text to AST example
SELECT id, login FROM users WHERE id > 1 AND active = true
"SELECT", "i", "d", ",", "l", "o", "g", "i", "n", "WHERE", "i", "d", ">", "1",
"AND", "a", "c", "t", "i", "v", "e", "=", "t", "r", "u", "e"
(whitespaces omitted for readability)
Resolution rules
● handles unresolved, i.e. unknown becomes known
● Example:
SELECT * FROM dataset_1
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `dataset_1`
== Analyzed Logical Plan ==
letter: string, nr: int, a_flag: int
Project [letter#7, nr#8, a_flag#9]
+- SubqueryAlias `dataset_1`
+- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9]
+- LocalRelation [_1#3, _2#4, _3#5]
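A hedged sketch of plugging in your own analyzer rule: MyResolutionRule is hypothetical and only logs the plans it sees; the post-hoc rules covered next are injected the same way, just through injectPostHocResolutionRule.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// hypothetical no-op rule: observes the plan during analysis, changes nothing
case class MyResolutionRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"resolution rule saw: ${plan.nodeName}")
    plan
  }
}

val sparkSession = SparkSession.builder()
  .withExtensions(_.injectResolutionRule(MyResolutionRule))   // analyzer extension point
  .master("local[*]").getOrCreate()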
Post hoc resolution rules
● after resolution rules (post hoc)
● same as custom optimization rules
● order does matter
PreprocessTableCreation → PreprocessTableInsertion → DataSourceAnalysis
Examples:
● normalization - casting, renaming
● partitioning checks, e.g. "$partKey is not a partition column"
● generic LogicalPlan resolution, e.g. CreateTable(with query) ⇒
CreateDataSourceTableAsSelectCommand,
CreateTable(without query) ⇒
CreateDataSourceTableCommand
Check analysis rules
● plain assertions
Connection conn = getConnection();
assert conn != null : "Connection is null";
● clearer error messages:
"assertion failed: No plan for CreateTable CatalogTable" ⇒ "Hive
support is required to use CREATE Hive TABLE AS SELECT"
● API:
object MyAnalysisRule extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    // throw new AnalysisException("Analysis error message")
  }
}
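Wiring such a rule into the session is a one-liner, sketched here under the assumption that MyAnalysisRule is the object above (injectCheckRule expects a SparkSession => LogicalPlan => Unit builder):

SparkSession.builder()
  .withExtensions(_.injectCheckRule(session => MyAnalysisRule))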
Check analysis rules - PreWriteCheck
object PreWriteCheck extends (LogicalPlan => Unit) {
  def apply(plan: LogicalPlan): Unit = {
    plan.foreach {
      case InsertIntoTable(l @ LogicalRelation(relation, _, _, _), partition, query, _, _) =>
        val srcRelations = query.collect { case LogicalRelation(src, _, _, _) => src }
        if (srcRelations.contains(relation)) {
          failAnalysis("Cannot insert into table that is also being read from.")
        } else {
          // ...
Logical optimization rule
● simplification:
(id > 0 OR login == 'test') AND id > 0 == id > 0
● collapse:
.repartition(10).repartition(20) == .repartition(20)
● dataset reduction:
columns pruning, predicate pushdown
● human mistakes:
trivial filters (2 > 1), execution tree cleaning (identity functions), redundancy
(projection, aliases)
Logical optimization rule - diff transform vs resolve
General template:
def apply(plan: LogicalPlan): LogicalPlan = plan.{{TRANSFORMATION}} {
case agg: Aggregate => …
case projection: Project => ...
}
{{TRANSFORMATION}} = transformUp/transformDown,
resolveOperatorsUp/resolveOperatorsDown
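Filling that template with a concrete, hedged example: an optimizer rule that drops the trivial always-true filters mentioned on the previous slide (RemoveTrivialFilters is illustrative, not a built-in rule).

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveTrivialFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    // a WHERE clause that is literally `true` adds nothing: keep only the child node
    case Filter(Literal(true, _), child) => child
  }
}

// registered through the extension point shown earlier:
// SparkSession.builder().withExtensions(_.injectOptimizerRule(_ => RemoveTrivialFilters))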
Logical optimization rule
Classes to know:
● Node
● Expression
import sparkSession.implicits._
val dataset1 = Seq(("A", 1, 1), ("B", 2, 1), ("C", 3, 1), ("D", 4, 1), ("E", 5, 1))
  .toDF("letter", "nr", "a_flag")
dataset1.filter("nr > 1")
  .explain(true)
Physical plan
● physical execution: code generation, RDD
● WholeStageCodeGen
● *Exec: BroadcastHashJoinExec, DataSourceV2ScanExec,
SortExec, … ⇒ SparkPlan {LeafExecNode, UnaryExecNode,
BinaryExecNode} + CodegenSupport
protected def doExecute(): RDD[InternalRow]
protected def doPrepare(): Unit = {}
protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T]
protected def doProduce(ctx: CodegenContext): String
def doConsume(ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String
def inputRDDs(): Seq[RDD[InternalRow]]
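Custom physical execution is plugged in through a planner strategy. A hedged skeleton (MyStrategy is hypothetical) that declines every plan, so Spark falls back to the built-in strategies:

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // match the logical nodes you want to execute and return your own *Exec nodes here
    case _ => Nil   // empty Seq means "not handled", the next strategy is tried
  }
}

val sparkSession = SparkSession.builder()
  .withExtensions(_.injectPlannerStrategy(_ => MyStrategy))
  .master("local[*]").getOrCreate()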
Catalog listeners
* Holder for injection points to the [[SparkSession]]. We make NO
guarantee about the stability
* regarding binary compatibility and source compatibility of methods here.
* This current provides the following extension points:
* <ul>
...
* <li>(External) Catalog listeners.</li>
...
class SparkSessionExtensions {
Catalog listeners
● ExternalCatalogWithListener
val catalogEvents = new scala.collection.mutable.ListBuffer[ExternalCatalogEvent]()
TestedSparkSession.sparkContext.addSparkListener(new SparkListener {
override def onOtherEvent(event: SparkListenerEvent): Unit = {
event match {
case externalCatalogEvent: ExternalCatalogEvent =>
catalogEvents.append(externalCatalogEvent)
case _ => {}
}
}
})
// ExternalCatalogEvent = (CreateTablePreEvent, CreateTableEvent, AlterTableEvent, ...)
Lessons learned
● Apache Spark first ⇒ do not write a UDF just to write one, prefer the native API
● debug & log
● analyze
● disable built-in rules (spark.sql.optimizer.excludedRules) - much easier
● start small
● find inspiration → NoSQL connectors, "extends SparkPlan", "extends Rule[LogicalPlan]"
● test at scale
Before I let you go
© https://static.thenounproject.com/png/159676-200.png
© https://www.kisspng.com/png-gift-festival-clip-art-vector-lovely-hand-painted-498066
https://www.waitingforcode.com/static/spark-meetup
...still not yet
© https://static.thenounproject.com/png/159676-200.png
Apache Spark JIRAs mentioned in this talk
https://issues.apache.org/jira/browse/SPARK-14155
https://issues.apache.org/jira/browse/SPARK-7768
https://issues.apache.org/jira/browse/SPARK-15777
https://issues.apache.org/jira/browse/SPARK-27969
StackOverflow questions for UDF examples
https://stackoverflow.com/a/46464610/9726075
https://stackoverflow.com/a/55136918/9726075
https://stackoverflow.com/a/35908115/9726075
https://stackoverflow.com/a/57110158/9726075
https://stackoverflow.com/a/48007884/9726075
https://stackoverflow.com/a/50764291/9726075
https://stackoverflow.com/q/53551000/9726075
...still not yet
ANTLR tutorial
https://tomassetti.me/antlr-mega-tutorial/
Other presentations
https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-
catalyst-optimizer-with-yin-huai
https://www.slideshare.net/SandeepJoshi55/apache-spark-
undocumented-extensions-78929290
https://www.slideshare.net/datamantra/anatomy-of-spark-sql-catalyst-
part-2
My series about Apache Spark custom optimization
https://www.waitingforcode.com/tags/spark-sql-customization
Other resources
https://stackoverflow.com/questions/38296609/spark-functions-vs-
udf-performance/49103325
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-
for-pyspark.html
© https://static.thenounproject.com/png/159676-200.png
Thank you!
Editor's Notes
  1. explain the idea of the talk and how the series about customizing Apache Spark started
  2. I will show it later in detail but everything I will present, do it only if you have no other choice, like no existing SQL operator, data source or optimization (give an example of the Cassandra join optimization I found the other day). Moreover, SparkSessionExtensions are still a @DeveloperApi, so more or less "use at your own risk".
  3. TODO: does it work in PySpark? 3 types, actually 2 but it's worth knowing if you did RDBMS before, you will retrieve very similar principles code - a function, a class, no need to deep delve into the details; much easier, but also the risk of an overuse; I will show it later
  4. TODO: explain the purpose of VectorUDT and MatrixUDT (Spark MLib) was made private in https://issues.apache.org/jira/browse/SPARK-14155 because it supposed to be a new API for UDT supporting vectorized (batch) data and working better with Datasets e.g. enum sqlType → only intended to represent the type at Apache Spark storage level. It's not exposed to the end user so you can't do df.filter("myUdt.field_a = 'a'").show() ! See here for more information: You get this errors because schema defined by sqlType is never exposed and is not intended to be accessed directly. It simply provides a way to express a complex data types using native Spark SQL types. To access these properties, either use row.getAs[MyUdt] or an UDF https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1 pyUDT = paired Python UDT if exists as of this saying (16.08.2019), the ticket intending to bring back the API public (https://issues.apache.org/jira/browse/SPARK-7768) is still in progress and there is no information about how far the progress is; targeted release is 3.0 but probably it won't be the case How to use? You can directly access the property of given type in map or filter function, see an example here https://stackoverflow.com/a/51957666/9726075 UDT - you can use it in `row.getAs[MyType]("column")` methods, so in any mapping, filter, groupBy function MatrixUDT & VectorUDT - both are private and should be used from org.apache.spark.ml.linalg.SQLDataTypes https://issues.apache.org/jira/browse/SPARK-14155 https://issues.apache.org/jira/browse/SPARK-7768
  5. I was still coding in Java; the code is a little bit longer but I use a shorter version for presentation purposes https://www.waitingforcode.com/apache-spark-sql/used-defined-type/read
  6. https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1 https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  7. https://stackoverflow.com/questions/33747851/spark-sql-referencing-attributes-of-udt?lq=1 https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  8. Vectorized UDF - normal UDF operates on one row at a time; mostly used in MLib and more exactly, as a @pandas_udf where it applies a function on Panda's Series rather than row by row; for some cases the accelerate rate is about 242 times! returns a column, so you can't use it for instance, inside an aggregation determinsitic - sometimes query planer can skip some optimizations and degrade the performance ; if executed multiple times for the same input, always generates the sasme query TODO: add this to the link with resources https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html https://docs.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql?view=sql-server-2017 https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5009.htm blackbox ⇒ be careful about the implementation, Apache Spark doesn't know how to optimize them for you
  9. udf.register for PySpark works too you can later use "MyUDF" in the string expressions you can't do that for udf(...) which simply transforms a Scala function into the uDF that you can use later in the operations like letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven") for val udfEvenResolver = udf(evenFlagResolver _)
  10. udf.register for PySpark works too you can later use "MyUDF" in the string expressions you can't do that for udf(...) which simply transforms a Scala function into the uDF that you can use later in the operations like letterNumbers.select($"letter", udfEvenResolver($"number") as "isEven") for val udfEvenResolver = udf(evenFlagResolver _)
  11. >>> wrapping - a great anti-pattern and proof that UDF will perform worse than native Apache Spark code most of the time - if used in wrong context not to blame but simply to highlight the fact of simplicity which is good and bad at the same time I don't know why ? To simplify ? To write a UT ? But we can still write it with Apache Spark https://stackoverflow.com/questions/46464125/how-to-write-multiple-if-statements-in-spark-udf/46464610 https://stackoverflow.com/questions/55135347/how-to-pass-dataframe-to-spark-udf https://stackoverflow.com/questions/35905273/using-a-udf-in-spark-data-frame-for-text-mining/35908115 https://stackoverflow.com/questions/47985382/how-to-use-udf-in-where-clause-in-scala-spark?rq=1 https://stackoverflow.com/questions/50760841/spark-sql-udf-cast-return-value?rq=1 ML: https://stackoverflow.com/questions/53551000/spark-create-dataframe-in-udf In clause: https://stackoverflow.com/questions/57109478/filtering-a-datasetrow-if-month-is-in-list-of-integers
  12. https://www.waitingforcode.com/apache-spark-sql/user-defined-functions/read
  13. CREATE aggregate = PostgreSQL, SQL Server use cases - any custom aggregates, like geometric mean, weighted mean deterministic - if 2 calls of the same function (with the same parameters) always return the same results It's mostly used in the plan optimization and sometimes during the phase of analysis: analysis step, when the child node is not deterministic, then it shouldn't appear in the aggregation: if (!child.deterministic) { failAnalysis( s"nondeterministic expression ${expr.sql} should not " + s"appear in the arguments of an aggregate function.") } failAnalysis( s"""nondeterministic expressions are only allowed in |Project, Filter, Aggregate or Window, found: | ${o.expressions.map(_.sql).mkString(",")} |in operator ${operator.simpleString} """.stripMargin) custom aggregate can be called after groupBy(...) method, exactly like the aggregates like average, sum and so forth evaluate - final result bufferSchema → UDAF works on partial aggregates and this schema represents intermediate results. That's why it's different from DataType which is used to return things. merge - partialr esults update - adds new value to the buffer https://www.waitingforcode.com/apache-spark-sql/user-defined-aggregate-functions/read examples: https://stackoverflow.com/questions/4421768/the-most-useful-user-defined-aggregate-functions https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDA.html
  14. https://issues.apache.org/jira/browse/SPARK-18127 - adds support for extensions
  15. say that "catalog listeners are not really there"
  16. https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/ once the plan is parsed, is still unresolved - Apache Spark simply know which SQL operators and corresponding nodes of the logical plan should be executed later it uses the metadata catalog in order to resolve the data - that creates a fully executable logical plan that logical plan could be executed as it but before that, the engine also tries to optimize it by applying optimization rules logical - reduce the # of operations and apply some of them at the data source level; optimizations executed iteratively in batch (spark.sql.optimizer.maxIterations, default = 100) parser is not called for any methods invoking the API like map(...), select("", "") because it directly constructs logical plan nodes order is defined in org.apache.spark.sql.catalyst.analysis.Analyzer#batches: resolution rules are first post-hoc resolution rules are next extended check rules (analysis rules) they're executed after resolving the plan: org.apache.spark.sql.catalyst.analysis.Analyzer#executeAndCheck org.apache.spark.sql.internal.BaseSessionStateBuilder#sqlParser for parser org.apache.spark.sql.catalyst.rules.RuleExecutor#execute ⇒ executed logical optimizations; called by lazy val optimizedPlan: LogicalPlan = sparkSession.sessionState.optimizer.execute(withCachedData) ! org.apache.spark.sql.internal.BaseSessionStateBuilder#analyzer has all rules applied during the analysis stage org.apache.spark.sql.internal.BaseSessionStateBuilder#optimizer ⇒ all logical optimization rules
  17. parserPlan → SQL query (SELECT * …) parseExpression → nr > 1 (nr = col) parseTableIdentifier → converts a table name into a TableIdentifier, e.g. DataFrameWriter.insertInto (.write.insertInto); just a case class holding table name and database attributes AstBuilder: The AstBuilder converts an ANTLR4 ParseTree into a catalyst Expression, LogicalPlan or * TableIdentifier. https://blog.octo.com/mythbuster-apache-spark-parsing-requete-sql/ https://www.slideshare.net/SandeepJoshi55/apache-spark-undocumented-extensions-78929290
  18. "SELECT", "i" ⇒ tokens built from lexer phasis AST later built from parser phasis
  19. handles unresolved ⇒ I consider it as unknown. If you define an alias in your query, Apache Spark doesn't know whether the columns really exist. It has to resolve them * - UnresolvedStart, resolves attributes directly from the SubqueryAlias dataset_1 when UnresolvedStart#expand method is called alias for: dataset1.sqlContext.sql("SELECT nr + 1 + 3 + 4, letter AS letter2, nr AS nr2 FROM dataset_1").explain(true) relation: UnresolvedRelation, Holds the name of a relation that has yet to be looked up in a catalog. UnresolvedRelation becomes +- Project [_1#3 AS letter#7, _2#4 AS nr#8, _3#5 AS a_flag#9] +- LocalRelation [_1#3, _2#4, _3#5] for val dataset1 = Seq(("A", 1, 1), ("B", 2, 1), ("C", 3, 1), ("D", 4, 1), ("E", 5, 1)).toDF("letter", "nr", "a_flag") dataset1.createOrReplaceTempView("dataset_1") dataset1.sqlContext.sql("SELECT letter AS letter2, nr AS nr2 FROM dataset_1").explain(true)
  20. order does matter, e.g. for DataSourceAnalysis which "must be run after `PreprocessTableCreation` and `PreprocessTableInsertion`." ; e.g. DataSourceAnalysis - replaces generic operations like InsertIntoTable by more specific (Spark SQL) operations, e.g. InsertIntoDataSourceCommand; another example InsertIntoDir ⇒ InsertIntoDataSourceDirCommand TODO: show INSERT INTO TABLE tab1 SELECT 1, 2 INSERT INTO TABLE tab1 SELECT 1, 2 TODO: generate an example with RunnableCommand order does matter ⇒ /** * Replaces generic operations with specific variants that are designed to work with Spark * SQL Data Sources. * * Note that, this rule must be run after `PreprocessTableCreation` and * `PreprocessTableInsertion`. */ case class DataSourceAnalysis(conf: SQLConf) extends Rule[LogicalPlan] with CastSupport {
  21. fail-fast approach - executed before physically running the query mostly executed as a pattern matching on the LogicalPlan nodes examples: PreWriteCheck (e.g. Cannot insert into table that is also being read from) , PreReadCheck (input_file_name function in Hive https://issues.apache.org/jira/browse/SPARK-21354 that does not support more than one sources) see this: https://github.com/apache/spark/commit/2b10ebe6ac1cdc2c723cb47e4b88cfbf39e0de08#diff-73bd90660f41c12a87ee9fe8d35d856a for HiveSupport override val extendedCheckRules: Seq[LogicalPlan => Unit] = PreWriteCheck +: PreReadCheck +: * A rule to do various checks before inserting into or writing to a data source table. * A rule to do various checks before reading a table. e.g. do not allow to write the table used in source (INSERT INTO clause) e.g. whether you do not execute Hive queries without Hive support enabled: " * A rule to check whether the functions are supported only when Hive support is enabled" HiveOnlyCheck +: here org.apache.spark.sql.execution.datasources.PreWriteCheck$#failAnalysis is your friend, only check if you did some Java assert()) or @
  22. fail-fast approach - executed before physically running the query mostly executed as a pattern matching on the LogicalPlan nodes examples: PreWriteCheck (e.g. Cannot insert into table that is also being read from) , PreReadCheck (input_file_name function in Hive https://issues.apache.org/jira/browse/SPARK-21354 that does not support more than one sources) see this: https://github.com/apache/spark/commit/2b10ebe6ac1cdc2c723cb47e4b88cfbf39e0de08#diff-73bd90660f41c12a87ee9fe8d35d856a for HiveSupport override val extendedCheckRules: Seq[LogicalPlan => Unit] = PreWriteCheck +: PreReadCheck +: * A rule to do various checks before inserting into or writing to a data source table. * A rule to do various checks before reading a table. e.g. do not allow to write the table used in source (INSERT INTO clause) e.g. whether you do not execute Hive queries without Hive support enabled: " * A rule to check whether the functions are supported only when Hive support is enabled" HiveOnlyCheck +: here org.apache.spark.sql.execution.datasources.PreWriteCheck$#failAnalysis is your friend, only check if you did some Java assert()) or @
  23. rules can be excluded from spark.sql.optimizer.excludedRules property dataset reduction - plan is rewritten to execute filters on data source side, eg. PushDownPredicate; it reverses filter and project: case Filter(condition, project @ Project(fields, grandChild)) if fields.forall(_.deterministic) && canPushThroughCondition(grandChild, condition) => // Create a map of Aliases to their values from the child projection. // e.g., 'SELECT a + b AS c, d ...' produces Map(c -> a + b). val aliasMap = AttributeMap(fields.collect { case a: Alias => (a.toAttribute, a.child) }) project.copy(child = Filter(replaceAlias(condition, aliasMap), grandChild)) e.g. // *Project [amount#8L, id#9L] // +- *Filter (isnotnull(amount#8L) && (amount#8L > 10)) insead of Project → Filter (bottom up read So that filter is executed before
  24. transform - recursively applies the rule on the AST; up or bottom - up goes down to up (children and at the end the current node) eg. operations are reversed for predicatepushdown (filter with project), sometimes the operations can be replaced (e.g. 2 Filter nodes with 1 Filter node containing both conditions, and later you can even remove it when the filter returns always true), removed (e;g when filter is always true, when the same SELECT is called twice) resolve - similar to transform but skips already analyzed sub-trees ; when resolve* is called, Apache Spark will start by checking the analyzed flag of the plan. In the case of a false value, it will simply skip the rule logic Important point to note: even though you use resolve*, a transform* can create a completely new plan and invalidate the value of analyzed flag Resolve applies mostly on the nodes that can be evaluted only once, like aliases resolution, subtitution methods (inclusion CTE plan, children plan substituted with window spec definitions), relations resolution
  25. Node - kind of container; most of the time it will be interpreted in the custom logical optimizations Expression - a stringified version of what do you want to do with the operator, e.g. list of columns, filter expression (simple, IN statement). Globally can be considered as a method taking some input and generating some output different variants: Unary (1 input, 1 output), named (e.g. alias), binary (2 inputs, 1 output), ternary (3 in, 1 out, e.g. months between)
  26. PART OF SPARKPLAN sequential execution, one tree level at a time ; doExecute can call leftInput.execute() and inside it operates mostly on the RDD functions like map, mapPartitions, foreachPartitions and so on doExecuteBroadcast - used for intance in BroadcastHasJoinExec to broadcast a part of the query to the rest of executors doPrepare - if something must be initialized before the physical execution; eg. subquery execution initializes here the subquery which is defined as a lazy val Future → private lazy val relationFuture: Future[Array[InternalRow]] and in doPrepare it's called as protected override def doPrepare(): Unit = { relationFuture } PART OF CODEGENSUPPORT trait for doProduce ⇒ produces generated code to process doConsume ⇒ processes rows or columns generated by the physical plan inputRDDs → input rows for this plan codegen optimizes CPU usage by generating a single optimized function in bytecode for the set of operators in a SQL query (when possible), instead of generating iterator code for each operator. https://www.slideshare.net/datamantra/anatomy-of-spark-sql-catalyst-part-2 https://www.slideshare.net/databricks/a-deep-dive-into-spark-sqls-catalyst-optimizer-with-yin-huai https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
  27. recall information about data catalogs say that the comment is not true but promising since catalog federation is an ongoing effort for Apache Spark → https://issues.apache.org/jira/browse/SPARK-15777
  28. but despite the lack of support, you can still extend the catalogs
  29. spark.sql.optimizer.excludedRules find inspirations ⇒ not clearly documented