Apache Spark
undocumented extensions
Sandeep Joshi (DCEngines)
17 August 2017
Well-known APIs
● Define Data Source
● Define Streaming Sink/Source
● Add UDF
● Custom Encoder (convert objects to/from JVM to Spark)
Unknown extensions
● Customize the Spark shell
● Customize the UI
● Add a new DDL command
● Create a Custom Spark Executor
● Customizing the Cluster Manager, Catalog Manager, Block Manager, Shuffle Manager, Cache Manager, Catalyst Expressions, Memory Manager
This talk covers the first four.
Lego blocks analogy
Code for this talk is online
● Customize the Spark shell (directory = new_repl)
● Customize the UI (webui)
● Add a new DDL command (new_command)
● Create a Custom Spark Executor (custom_executor)
https://github.com/sanjosh/scala/tree/master/spark_extensions
http://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
Spark architecture
JVMs that run with Apache Spark
Customizing the Spark shell
Spark shell
When you run “bin/spark-shell”, it internally loads the REPL jar and executes “org.apache.spark.repl.Main”.
REPL stands for “read, evaluate, print loop” (Scala).
repl.Main internally invokes “SparkILoop”, which runs the interpreter.
Customizing the Spark shell
[Diagram] Class hierarchy: MySparkILoop (yours) extends SparkILoop (spark), which extends ILoop (scala lib). Methods you can override include initializeSpark, commands, loadFiles, resetCommand, printWelcome, ...
1. Create a new Scala project
2. Keep the package name the same: “org.apache.spark.repl”
3. Inherit from SparkILoop and override the methods you need
4. Create a new main class “MySparkIMain.scala”
5. Invoke “bin/spark-shell” with “--jars your.jar” and “--class org.apache.spark.repl.MyMain”
Customizing the Spark shell
class MySparkILoop(...) extends SparkILoop(...) {
  // override ILoop/SparkILoop hooks, e.g. the welcome banner
  override def printWelcome(): Unit = { ... }
}

object MyMain {
  def main(...) {
    val interp = new MySparkILoop()
    interp.process(...)
  }
}
Customizing the Spark shell
Voilà! Welcome to the custom shell
Customizing the UI
This is Spark Master UI
This is Spark Worker UI
This is Spark Driver UI
WebUI classes
[Diagram] The Driver, the Master, and each Worker (1..n) run their own Jetty HTTP server on different ports. The base class WebUI holds WebUITab and WebUIPage objects, attached via attachTab and attachPage. The Driver's SparkUI is owned by SparkContext; the Master and Workers serve MasterWebUI and WorkerWebUI respectively.
Customizing the Driver UI
1. Create a new Scala project
2. Retain the same package name as the UI classes: “org.apache.spark”
3. Extend SparkUITab and WebUIPage to create your own tabs and pages, and define their render() functions
4. Create a function which calls “sparkContext.ui.attachTab(new TabYouCreated)”
5. Invoke bin/spark-shell with “--jars your.jar”
6. Call your function with “sparkContext” as the argument (a sketch follows this list)
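A minimal sketch of steps 3 and 4, assuming Spark 2.x internals. The names MyTab and MyPage, the URL prefix “custom”, and the page content are hypothetical placeholders:

package org.apache.spark  // same package, so the private[spark] UI classes are accessible

import javax.servlet.http.HttpServletRequest
import scala.xml.Node

import org.apache.spark.ui.{SparkUI, SparkUITab, WebUIPage}

// A new tab served under the "custom" URL prefix of the driver UI
class MyTab(parent: SparkUI) extends SparkUITab(parent, "custom") {
  attachPage(new MyPage(this))
}

// The single page of the tab; render() returns the HTML body
class MyPage(parent: MyTab) extends WebUIPage("") {
  override def render(request: HttpServletRequest): Seq[Node] =
    <h3>Hello from a custom driver UI tab</h3>
}

object MyTab {
  // step 4: call this from the shell with the active SparkContext
  def install(sc: SparkContext): Unit =
    sc.ui.foreach(ui => ui.attachTab(new MyTab(ui)))
}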
Customizing the Driver UI
[Screenshot] The new tab has been added to the driver UI.
Adding a new DDL Command
Spark Driver versus Executor
Parser - creates the AST (syntax tree)
Analyzer - resolves the AST against the Catalog etc.
Optimizer - converts the AST into a logical plan and optimizes it
SparkPlanner - produces the physical plan (code generation using Janino)
DAG Scheduler - breaks the plan into stages and tasks and dispatches them to Executors
[Diagram] Streams and Data Sources feed the Spark Driver; tasks (Java bytecode) are shipped to the ExecutorBackend on each Executor.
Inside Spark Driver
Spark Parser classes
[Diagram] Parser class hierarchy: ParserInterface, AbstractSqlParser, SparkSqlParser (held by SessionState); visitor classes: SqlBaseVisitor, ASTBuilder, SparkASTBuilder
Adding a new DDL command - steps
1. Derive from AbstractSqlParser. If parsePlan() of the default class fails, fall back to another parser.
2. The new parser can be written using StandardTokenParsers. Here you define which LogicalPlan node (PrintCommand) to create when the parser sees your newly added command.
3. Add a new DDL Strategy to SessionState which matches PrintCommand and creates a PrintRunnableCommand (physical plan). The run() method of PrintRunnableCommand should execute whatever you want. (Both classes are sketched after the Strategy code below.)
Parser for new command
class SparkExtendedSqlParser extends StandardTokenParsers {
  lexical.reserved += "PRINTME"

  def parse(input: String): LogicalPlan = { ... }  // tokenize the input and run `start` (omitted)

  protected lazy val start: Parser[LogicalPlan] = printCommand

  // create the LogicalPlan node after parsing PRINTME <identifier>
  protected lazy val printCommand: Parser[LogicalPlan] =
    "PRINTME" ~> ident ^^ {
      case name => PrintCommand(name)
    }
}
Spark Strategy for new command
class DDLStrategy(sparkSession: SparkSession) extends SparkStrategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = {
    plan match {
      case printcmd @ PrintCommand(parameter) =>
        ExecutedCommandExec(PrintRunnableCommand(printcmd)) :: Nil
      case _ => Nil  // (other plans omitted)
    }
  }
}
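The PrintCommand and PrintRunnableCommand classes referenced above are not shown on the slides. A minimal sketch, assuming Spark 2.x internals; the println body is only a placeholder for whatever the command should actually do:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.command.RunnableCommand

// Logical plan node produced by the parser for "PRINTME <identifier>"
case class PrintCommand(parameter: String) extends LeafNode {
  override def output: Seq[Attribute] = Seq.empty
}

// Command wrapped in ExecutedCommandExec by the DDLStrategy above
case class PrintRunnableCommand(cmd: PrintCommand) extends RunnableCommand {
  // run() is executed on the driver when the query runs
  override def run(sparkSession: SparkSession): Seq[Row] = {
    println(s"PRINTME received: ${cmd.parameter}")
    Seq.empty[Row]
  }
}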
Parse Command execution
[Sequence diagram] SparkSqlParser's printCommand() rule creates a new PrintCommand instance (a Command / LogicalPlan node); the DDL Strategy creates a new PrintRunnableCommand instance (a RunnableCommand); Query Execution then calls its run() method.
Custom Executor
http://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
Spark Executor framework
[Diagram] Inside the Worker, the ExecutorBackend registers with the Driver (registerExecutor, via the CoarseGrainedClusterMessages) and creates the Executor. The Executor builds its class loader (createClassLoader); when the Driver sends launchTask, the Executor starts a TaskRunner, which uses the class loader (loadClass) to load the task code.
Customizing Executor
1. Create a new project with the same package name “org.apache.spark.executor”.
2. Inherit from ExecutorBackend, using the CoarseGrainedExecutorBackend code as a reference (see the sketch below).
3. Inherit from Executor and override methods as required.
4. Find the driver port and worker port of the running processes.
5. Run “java -jar custom_executor.jar --driver-url --executor-id --app-id --worker-url”
Check the full example online:
https://github.com/sanjosh/scala/tree/master/spark_extensions/custom_executor
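For orientation, a minimal sketch of the ExecutorBackend shape, assuming Spark 2.x; the class name MyExecutorBackend is hypothetical, and a working backend additionally needs the RPC registration logic that CoarseGrainedExecutorBackend implements:

package org.apache.spark.executor  // same package, so the private[spark] trait is accessible

import java.nio.ByteBuffer

import org.apache.spark.TaskState.TaskState

class MyExecutorBackend extends ExecutorBackend {
  // The Executor's TaskRunner calls this to report task state back to the driver
  override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer): Unit = {
    println(s"task $taskId -> $state (${data.remaining()} bytes of serialized result)")
  }
}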
New executor added
[Screenshot] The new custom executor is detected.
General principles - how to extend
1. Override SparkSession + (SessionState or SharedState) or SparkContext to point to your extended classes. You could also use the new, experimental SparkSessionExtensions in Spark 2.2 (see the sketch after this list).
2. Your extensions must use the same package name (org.apache.spark.whatever) in order to access protected methods in the package or base classes.
3. Supply your jar and class to “spark-submit” or “spark-shell” via “--jars” and “--class” instead of creating fat or shaded jars.
4. Access the new functions through your derived Session object.
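A minimal sketch of the SparkSessionExtensions hook from point 1, assuming Spark 2.2+. MyParser stands for your own ParserInterface implementation (for example, one that falls back to the SparkExtendedSqlParser shown earlier) and is not defined here:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("extensions-demo")
  .withExtensions { extensions =>
    // inject a parser that wraps the session's default parser (the delegate);
    // MyParser is a hypothetical ParserInterface implementation
    extensions.injectParser((session, delegate) => new MyParser(session, delegate))
  }
  .getOrCreate()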
Conclusion
Backup/Misc
Cluster manager API
ExternalClusterManager
● createTaskScheduler
● createSchedulerBackend
● initialize
TaskScheduler
● start
● stop
● submitTasks
● cancelTasks
● setDAGScheduler
● createTaskSetManager
SchedulerBackend
● start
● stop
● reviveOffers
● requestExecutors
● killExecutors
● createDriverEndpoint
Examples:
1. Mesos
2. Yarn
3. Geode
4. Kubernetes
[Diagram] Spark Driver, Cluster Manager, Spark Workers; the ExternalClusterManager creates the TaskScheduler and the SchedulerBackend.
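A minimal skeleton of the ExternalClusterManager hook, assuming Spark 2.x. The “mycluster://” master URL scheme and the MyClusterManager name are invented, and createSchedulerBackend is left unimplemented:

package org.apache.spark.scheduler  // same package, since ExternalClusterManager is private[spark]

import org.apache.spark.SparkContext

class MyClusterManager extends ExternalClusterManager {
  // Spark discovers cluster managers via ServiceLoader and asks each one
  // whether it can handle the --master URL
  override def canCreate(masterURL: String): Boolean =
    masterURL.startsWith("mycluster://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  override def createSchedulerBackend(sc: SparkContext, masterURL: String,
                                      scheduler: TaskScheduler): SchedulerBackend =
    ???  // return your SchedulerBackend implementation here

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}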
external catalog API
ExternalCatalog
● createDatabase
● dropDatabase
● createTable
● dropTable
[Diagram] Implementations of ExternalCatalog: HiveExternalCatalog, InMemoryCatalog, YourCatalog. SparkSession holds SharedState, which owns the external catalog; YourSession and YourSharedState are your extended versions that plug in YourCatalog.
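A minimal sketch of a custom catalog that delegates to InMemoryCatalog, assuming Spark 2.2 (the ExternalCatalog method signatures changed in later releases). YourCatalog mirrors the name on the slide and simply logs the DDL call:

package org.apache.spark.sql.catalyst.catalog  // same package as the catalog classes

class YourCatalog extends InMemoryCatalog {
  // log the DDL call, then let the in-memory implementation do the work
  override def createTable(tableDefinition: CatalogTable, ignoreIfExists: Boolean): Unit = {
    println(s"createTable: ${tableDefinition.identifier}")
    super.createTable(tableDefinition, ignoreIfExists)
  }
}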
