Apache Spark
undocumented extensions
Sandeep Joshi (DCEngines)
17 August 2017
Well-known APIs
● Define Data Source
● Define Streaming Sink/Source
● Add UDF
● Custom Encoder (convert objects to/from JVM to Spark)
Unknown extensions
● Customize the Spark shell
● Customize the UI
● Add a new DDL command
● Create a Custom Spark Executor
● Customizing the Cluster Manager, Catalog Manager, Block Manager, Shuffle Manager, Cache Manager, Catalyst Expressions, Memory Manager
This talk covers the first four.
Lego blocks analogy
Code for this talk is online
● Customize the Spark shell (directory = new_repl)
● Customize the UI (webui)
● Add a new DDL command (new_command)
● Create a Custom Spark Executor (custom_executor)
https://github.com/sanjosh/scala/tree/master/spark_extensions
http://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
Spark architecture
JVMs that run with Apache Spark
Customizing the Spark shell
Spark shell
When you run “bin/spark-shell”, it internally loads the REPL jar and executes “org.apache.spark.repl.Main”.
REPL stands for “read, evaluate, print loop” (Scala).
repl.Main internally invokes “SparkILoop”, which runs the interpreter.
Customizing the Spark shell
[Diagram] Class hierarchy: MySparkILoop (yours) extends SparkILoop (spark), which extends ILoop (scala lib). Methods you can override include initializeSpark, commands, loadFiles, resetCommand, printWelcome, ...
1. Create a new Scala project
2. Keep the package name the same: “org.apache.spark.repl”
3. Inherit from SparkILoop and override the methods you need
4. Create a new main class “MySparkIMain.scala”
5. Invoke “bin/spark-shell” with “--jars your.jar” and “--class org.apache.spark.repl.MyMain”
Customizing the Spark shell
class MySparkILoop(...) extends SparkILoop(...) {
  // override ILoop/SparkILoop hooks, e.g. the welcome banner
  override def printWelcome(): Unit = { ... }
}

object MyMain {
  def main(...) {
    val interp = new MySparkILoop()
    interp.process(...)
  }
}
Customizing the Spark shell
Voilà! Welcome to the custom shell
Customizing the UI
This is Spark Master UI
This is Spark Worker UI
This is Spark Driver UI
WebUI classes
[Diagram] The Driver, the Master, and each Worker (1..n) run their own Jetty HTTP server on different ports. The base class WebUI holds WebUITab and WebUIPage objects, attached via attachTab and attachPage. The Driver's SparkUI is owned by SparkContext; the Master and Workers serve MasterWebUI and WorkerWebUI respectively.
Customizing the Driver UI
1. Create a new Scala project
2. Retain the same package name as the UI classes: “org.apache.spark”
3. Extend SparkUITab and WebUIPage to create your own tabs and pages, and define their render() functions
4. Create a function which calls “sparkContext.ui.attachTab(new TabYouCreated)”
5. Invoke bin/spark-shell with “--jars your.jar”
6. Call your function with “sparkContext” as the argument (a sketch follows this list)
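A minimal sketch of steps 3 and 4, assuming Spark 2.x internals. The names MyTab and MyPage, the URL prefix “custom”, and the page content are hypothetical placeholders:

package org.apache.spark  // same package, so the private[spark] UI classes are accessible

import javax.servlet.http.HttpServletRequest
import scala.xml.Node

import org.apache.spark.ui.{SparkUI, SparkUITab, WebUIPage}

// A new tab served under the "custom" URL prefix of the driver UI
class MyTab(parent: SparkUI) extends SparkUITab(parent, "custom") {
  attachPage(new MyPage(this))
}

// The single page of the tab; render() returns the HTML body
class MyPage(parent: MyTab) extends WebUIPage("") {
  override def render(request: HttpServletRequest): Seq[Node] =
    <h3>Hello from a custom driver UI tab</h3>
}

object MyTab {
  // step 4: call this from the shell with the active SparkContext
  def install(sc: SparkContext): Unit =
    sc.ui.foreach(ui => ui.attachTab(new MyTab(ui)))
}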
Customizing the Driver UI
[Screenshot] The new tab has been added to the driver UI.
Adding a new DDL Command
Spark Driver versus Executor
Parser - creates the AST (syntax tree)
Analyzer - resolves the AST against the Catalog etc.
Optimizer - converts the AST into a logical plan and optimizes it
SparkPlanner - produces the physical plan (code generation using Janino)
DAG Scheduler - breaks the plan into stages and tasks and dispatches them to Executors
[Diagram] Streams and Data Sources feed the Spark Driver; tasks (Java bytecode) are shipped to the ExecutorBackend on each Executor.
Inside Spark Driver
Spark Parser classes
[Diagram] Parser class hierarchy: ParserInterface, AbstractSqlParser, SparkSqlParser (held by SessionState); visitor classes: SqlBaseVisitor, ASTBuilder, SparkASTBuilder
Adding a new DDL command - steps
1. Derive from AbstractSqlParser. If parsePlan() of the default class fails, fall back to another parser.
2. The new parser can be written using StandardTokenParsers. Here you define which LogicalPlan node (PrintCommand) to create when the parser sees your newly added command.
3. Add a new DDL Strategy to SessionState which matches PrintCommand and creates a PrintRunnableCommand (physical plan). The run() method of PrintRunnableCommand should execute whatever you want. (Both classes are sketched after the Strategy code below.)
Parser for new command
class SparkExtendedSqlParser extends StandardTokenParsers {
  lexical.reserved += "PRINTME"

  def parse(input: String): LogicalPlan = { ... }  // tokenize the input and run `start` (omitted)

  protected lazy val start: Parser[LogicalPlan] = printCommand

  // create the LogicalPlan node after parsing PRINTME <identifier>
  protected lazy val printCommand: Parser[LogicalPlan] =
    "PRINTME" ~> ident ^^ {
      case name => PrintCommand(name)
    }
}
Spark Strategy for new command
class DDLStrategy(sparkSession: SparkSession) extends SparkStrategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    plan match {
      case printcmd @ PrintCommand(parameter) =>
        ExecutedCommandExec(PrintRunnableCommand(printcmd)) :: Nil
      case _ => Nil  // (other plans omitted)
    }
  }
}
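The PrintCommand and PrintRunnableCommand classes referenced above are not shown on the slides. A minimal sketch, assuming Spark 2.x internals; the println body is only a placeholder for whatever the command should actually do:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.command.RunnableCommand

// Logical plan node produced by the parser for "PRINTME <identifier>"
case class PrintCommand(parameter: String) extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
}

// Command wrapped in ExecutedCommandExec by the DDLStrategy above
case class PrintRunnableCommand(cmd: PrintCommand) extends RunnableCommand {
  // run() is executed on the driver when the query runs
  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(s"PRINTME received: ${cmd.parameter}")
    Seq.empty[Row]
  }
}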
Parse Command execution
[Sequence diagram] SparkSqlParser's printCommand() rule creates a new PrintCommand instance (a Command / LogicalPlan node); the DDL Strategy creates a new PrintRunnableCommand instance (a RunnableCommand); Query Execution then calls its run() method.
Custom Executor
http://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
Spark Executor framework
[Diagram] Inside the Worker, the ExecutorBackend registers with the Driver (registerExecutor, via the CoarseGrainedClusterMessages) and creates the Executor. The Executor builds its class loader (createClassLoader); when the Driver sends launchTask, the Executor starts a TaskRunner, which uses the class loader (loadClass) to load the task code.
Customizing Executor
1. Create a new project with the same package name “org.apache.spark.executor”.
2. Inherit from ExecutorBackend, using the CoarseGrainedExecutorBackend code as a reference (see the sketch below).
3. Inherit from Executor and override methods as required.
4. Find the driver port and worker port of the running processes.
5. Run “java -jar custom_executor.jar --driver-url --executor-id --app-id --worker-url”
Check the full example online:
https://github.com/sanjosh/scala/tree/master/spark_extensions/custom_executor
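For orientation, a minimal sketch of the ExecutorBackend shape, assuming Spark 2.x; the class name MyExecutorBackend is hypothetical, and a working backend additionally needs the RPC registration logic that CoarseGrainedExecutorBackend implements:

package org.apache.spark.executor  // same package, so the private[spark] trait is accessible

import java.nio.ByteBuffer

import org.apache.spark.TaskState.TaskState

class MyExecutorBackend extends ExecutorBackend {
  // The Executor's TaskRunner calls this to report task state back to the driver
  override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
    println(s"task $taskId -> $state (${data.remaining()} bytes of serialized result)")
  }
}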
New executor added
[Screenshot] The new custom executor is detected.
General principles - how to extend
1. Override SparkSession + (SessionState or SharedState) or SparkContext to point to your extended classes. You could also use the new, experimental SparkSessionExtensions in Spark 2.2 (see the sketch after this list).
2. Your extensions must use the same package name (org.apache.spark.whatever) in order to access protected methods in the package or base classes.
3. Supply your jar and class to “spark-submit” or “spark-shell” via “--jars” and “--class” instead of creating fat or shaded jars.
4. Access the new functions through your derived Session object.
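A minimal sketch of the SparkSessionExtensions hook from point 1, assuming Spark 2.2+. MyParser stands for your own ParserInterface implementation (for example, one that falls back to the SparkExtendedSqlParser shown earlier) and is not defined here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extensions-demo")
  .withExtensions { extensions =>
    // inject a parser that wraps the session's default parser (the delegate);
    // MyParser is a hypothetical ParserInterface implementation
    extensions.injectParser((session, delegate) => new MyParser(session, delegate))
  }
  .getOrCreate()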
Conclusion
Backup/Misc
Cluster manager API
ExternalClusterManager
● createTaskScheduler
● createSchedulerBackend
● initialize
TaskScheduler
● start
● stop
● submitTasks
● cancelTasks
● setDAGScheduler
● createTaskSetManager
SchedulerBackend
● start
● stop
● reviveOffers
● requestExecutors
● killExecutors
● createDriverEndpoint
Examples:
1. Mesos
2. Yarn
3. Geode
4. Kubernetes
[Diagram] Spark Driver, Cluster Manager, Spark Workers; the ExternalClusterManager creates the TaskScheduler and the SchedulerBackend.
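A minimal skeleton of the ExternalClusterManager hook, assuming Spark 2.x. The “mycluster://” master URL scheme and the MyClusterManager name are invented, and createSchedulerBackend is left unimplemented:

package org.apache.spark.scheduler  // same package, since ExternalClusterManager is private[spark]

import org.apache.spark.SparkContext

class MyClusterManager extends ExternalClusterManager {
  // Spark discovers cluster managers via ServiceLoader and asks each one
  // whether it can handle the --master URL
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(sc: SparkContext, masterURL: String,
                                      scheduler: TaskScheduler): SchedulerBackend =
    ???  // return your SchedulerBackend implementation here

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}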
external catalog API
ExternalCatalog
● createDatabase
● dropDatabase
● createTable
● dropTable
[Diagram] Implementations of ExternalCatalog: HiveExternalCatalog, InMemoryCatalog, YourCatalog. SparkSession holds SharedState, which owns the external catalog; YourSession and YourSharedState are your extended versions that plug in YourCatalog.
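A minimal sketch of a custom catalog that delegates to InMemoryCatalog, assuming Spark 2.2 (the ExternalCatalog method signatures changed in later releases). YourCatalog mirrors the name on the slide and simply logs the DDL call:

package org.apache.spark.sql.catalyst.catalog  // same package as the catalog classes

class YourCatalog extends InMemoryCatalog {
  // log the DDL call, then let the in-memory implementation do the work
  override def createTable(tableDefinition: CatalogTable, ignoreIfExists: Boolean): Unit = {
    println(s"createTable: ${tableDefinition.identifier}")
    super.createTable(tableDefinition, ignoreIfExists)
  }
}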
