Supporting running Spark scripts directly from a browser would improve the user experience: everybody has a Web browser, the command line can be avoided, and built-in graphing and visualization make it easy to explore and understand data with just a few clicks. It also simplifies administration, since everything is centralized in one service that is accessible to non-native clients. For this purpose, an open source Spark Job Server was developed to provide Scala, SQL and Python in a Web shell. The main Hadoop components of the platform are integrated in the same interface. This talk describes the architecture of the Spark Server and its main features:
• Scala, Python, SQL submissions
• Impersonation
• Security
• Job progress / canceling
• YARN / HDFS / Hive integration
The server also ships with a friendly user interface built as a Hue app. We will focus on explaining how they were built, how to use the API and which lessons were learned. The final end-user interaction will be demoed live.
16. LIVY SPARK SERVER
• REST Web server in Scala
• Interactive Spark Sessions and Batch Jobs
• Type Introspection for Visualization
• Running sessions in YARN or locally
• Backends: Scala, Python, R
• Open Source:
https://github.com/cloudera/hue/tree/master/apps/spark/java
17. LIVY WEB SERVER ARCHITECTURE
[Architecture diagram: the Livy Server (Scalatra, Session Manager, Session) uses a Spark Client to talk to the YARN Master; one YARN Node hosts the Spark Interpreter and its Spark Context, which drives Spark Workers on the other YARN Nodes.]
18. LIVY WEB SERVER ARCHITECTURE
[Same architecture diagram, with the first step of the request flow highlighted: from the Livy Server (Scalatra, Session Manager, Session) through the Spark Client to the YARN Master, and on to the YARN Nodes running the Spark Interpreter, Spark Context and Spark Workers.]
27. INTERPRETERS
• Pipe stdin/stdout to a running shell
• Execute the code / send to Spark workers
• Perform magic operations
• One interpreter per language
• “Swappable” with other kernels (python, spark..)
[Diagram: code such as println(1 + 1) is piped into the Interpreter, which returns the output 2.]
29. INTERPRETER FLOW CHART
[Flow chart: receive lines → split lines → for each line, check “Magic line?”; if yes, run the magic, otherwise execute the line. On Success, move on to the next line if any are left; on Incomplete, merge the line with the next one; on Error, stop. When no lines are left, send the output back to the server.]
30. LIVY INTERPRETERS
trait Interpreter {
def state: State
def execute(code: String): Future[JValue]
def close(): Unit
}
sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State
31. LIVY INTERPRETERS (same Interpreter trait and State definitions as the previous slide)
32. SPARK INTERPRETER
class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    sparkIMain = new SparkIMain(new Settings(), new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...

Highlighted: new SparkIMain(new Settings(), new JPrintWriter(outputStream, true))
33. SPARK INTERPRETER
private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }
  _state = Idle()
}

Highlighted: sparkIMain.bind("sc", "org.apache.spark.SparkContext", sparkContext, List("""@transient"""))
34. EXECUTING SPARK
private def executeLine(code: String): ExecuteResult = {
  code match {
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error => ExecuteError(readStdout())
        }
      ...

Highlighted: case MAGIC_REGEX(magic, rest) => and case _ =>
35. INTERPRETER MAGIC
private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json" => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _ => ExecuteError(f"Unknown magic command $magic")
  }
}

Highlighted: the "json", "table" and unknown-command cases of the match
36. INTERPRETER MAGIC
private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(value: RDD[_]) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value.asInstanceOf[RDD[_]].take(10))))
    case Some(value) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value)))
    case None => ExecuteError(f"Value $name does not exist")
  }
}

Highlighted: the RDD case and the plain-value case of the match
Why do we want to do this? Currently it’s difficult to visualize results from Spark. Spark has a great interactive tool called “spark-shell” that allows you to interact with large datasets on the command line. For example, here is a session where we are counting the words used by Shakespeare. Running this computation is easy, but spark-shell doesn’t provide any tools for visualizing the results.
One option is to save the output to a file, then use a tool like Hue to import it into a Hive table and visualize it. We are obviously big fans of Hue, but there are still too many steps to go through to get to this point. If we want to change the script, say to filter out words like “the” and “and”, we need to go back to the shell, rerun our code snippet, save it to a file, then reimport it into the UI. It’s a slow process.
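For reference, a word count along these lines might look like the sketch below; the input path and the exact transformations are illustrative assumptions, not taken from the talk.

// A hedged sketch of the Shakespeare word count described above, runnable in
// spark-shell where the SparkContext is available as sc. The HDFS path is
// hypothetical, and the filter shows the "the"/"and" tweak mentioned above.
val words = sc.textFile("hdfs:///tmp/shakespeare.txt")
  .flatMap(line => line.toLowerCase.split("""\s+"""))
  .filter(word => word.nonEmpty && word != "the" && word != "and")
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

// Print the ten most frequent words.
words.take(10).foreach(println)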
Multiple languages.
It inherits Hue’s sharing and export/import.
Hello, I’m Erick Tryzelaar, and I’m going to talk about the Livy Spark Server, which is our backend for Hue’s Notebook application.
Livy is a REST web server that allows a tool like Hue to interactively execute Scala and Spark commands, just like spark-shell. It goes beyond spark-shell by adding type introspection, which allows a frontend like Hue to render results as interactive visualizations. Furthermore, it allows sessions to run inside YARN, so it can scale out horizontally to hundreds of active sessions. It also supports Python and R backends. Finally, it’s fully open source, and currently being developed in Hue.
The Livy server is built upon Scalatra and Jetty. Creating a session is as simple as POSTing to a particular URL. Behind the scenes, Livy communicates with the YARN master to allocate some nodes to launch the interactive sessions. This is all done asynchronously, as there’s no telling when resources will be available to run the sessions. Once the nodes have been allocated, Livy starts an interpreter on one of the nodes, which takes care of creating the Spark Context that actually runs the Spark operations. After it’s set up, the session signals to the Livy Server that it’s ready for commands. At that point, the client can simply POST its code to a URL on the Livy server.
Let’s see it in action. On the left we create a “spark” session. You could also specify “pyspark” or “sparkR” here if you want those sessions. On the right we execute simple math in the session itself.
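As a minimal sketch of what those two calls could look like from code: the sketch below assumes a Livy server on localhost:8998 and /sessions and /sessions/{id}/statements endpoints; the exact port, paths and payload fields are assumptions for illustration, not taken from the talk.

// Hypothetical sketch of the two REST calls from the demo, with no error
// handling or polling of the session state.
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivyRestSketch {
  private val client = HttpClient.newHttpClient()

  private def post(url: String, json: String): String = {
    val request = HttpRequest.newBuilder(URI.create(url))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(request, HttpResponse.BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // Create an interactive Scala/Spark session ("pyspark" or "sparkR" would
    // select the other backends).
    println(post("http://localhost:8998/sessions", """{"kind": "spark"}"""))

    // Once the session is ready, submit a statement to it (session id 0 assumed).
    println(post("http://localhost:8998/sessions/0/statements", """{"code": "1 + 1"}"""))
  }
}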
We don’t have too much time to drill down into the code, but we did want to take this moment to at least dive into how the interpreters work.
Livy’s interpreters are conceptually very simple devices. They take in one or more lines of code and execute them in a shell environment. These shells perform the computation and interact with the spark environment. They’re also abstract. As I mentioned earlier, Livy currently has 3 languages built into it: Scala, Python and R, with more to come.
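To make the “swappable kernels” idea concrete, here is a hypothetical toy backend written against the Interpreter trait from slide 30; it is not part of Livy, and simply echoes the submitted code back as its result.

// Assumes the Interpreter trait and the State case classes from slide 30 are
// in scope. A real backend would run the code in a shell; this one echoes it.
import scala.concurrent.Future
import org.json4s.JValue
import org.json4s.JsonDSL._

class EchoInterpreter extends Interpreter {
  private var _state: State = Idle()

  override def state: State = _state

  override def execute(code: String): Future[JValue] = {
    _state = Busy()
    // Key the result by mimetype, like the Spark interpreter's magics do.
    val result: JValue = "text/plain" -> code
    _state = Idle()
    Future.successful(result)
  }

  override def close(): Unit = { _state = Dead() }
}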
Here is the interpreter loop that Livy manages. First it splits up the lines and feeds them one at a time into the interpreter. If the line is a regular, non-magic line, it gets executed and the result can be one of three states: Success, where we continue with the next line; Incomplete, where the input is not a complete statement, such as an “if” statement with an open bracket; or Error, which stops the execution of the remaining lines. The other case is magic lines, which are special commands to the interpreter itself, for example asking the interpreter to convert a value into a JSON type.
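The loop could be sketched roughly as below, assuming executeLine returns the same ExecuteResponse type used by the magic methods (the slides use both the ExecuteResult and ExecuteResponse names); the real loop in Livy also tracks output and bookkeeping that is omitted here.

// A rough sketch of the flow chart from slide 29, not the actual Livy code.
// Called initially as executeLines(lines, ExecuteComplete("")).
private def executeLines(lines: List[String], lastResult: ExecuteResponse): ExecuteResponse = {
  lines match {
    case Nil => lastResult
    case line :: tail =>
      executeLine(line) match {
        case ExecuteIncomplete(_) =>
          // Not a complete statement (e.g. an open brace): merge it with the
          // next line and try again.
          tail match {
            case next :: rest => executeLines((line + "\n" + next) :: rest, lastResult)
            case Nil => ExecuteIncomplete("")
          }
        case error: ExecuteError =>
          // An error stops execution of the remaining lines.
          error
        case result =>
          // Success (or a magic result): continue with the next line.
          executeLines(tail, result)
      }
  }
}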
Now for some code. As we saw earlier, the interpreter is a simple state machine that executes code and eventually produces JSON responses by way of a Future.
In order to implement this interface, the Spark interpreter first needs to create the real interpreter, SparkIMain. It’s pretty simple to create: we just need to construct it with a buffer that acts as the interpreter’s standard output.
Once the SparkIMain has been initialized, we need to create the Spark Context that communicates with all of the spark workers. Injecting this variable into the interpreter is quite simple with this “bind” method.
Now that the session is up and running we can execute code inside of it. I’ve skipped some of the other bookkeeping in order to show the actual heart of the execution here. Ignore the magic case for the moment. Execution is also quite simple: we first temporarily replace standard output with our buffer, and then have the interpreter execute the code. There are three possible responses. First, the command executed successfully. Second, the code is incomplete, because maybe it has an open parenthesis. Finally, an error if some exception occurred. Altogether quite simple, and it doesn’t require any changes to Spark.
And now the magic. I mentioned earlier that Livy supports type introspection. The way it does this is through in-band magic commands which start with a percent sign. The Spark interpreter currently supports two magic commands, “json” and “table”. The “json” magic converts any value into a JSON value, and “table” converts any value into a table-like object that’s used for our visualizations.
Here is our json magic. It takes advantage of json4s’s Extraction.decompose to try to convert values. We special-case RDDs since they can’t be directly transformed into JSON; instead we just pull out the first 10 items so we can at least show something.
The table magic does something similar, but it’s a bit too large to compress into slides. We’ll see its results next.
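To give a rough idea of its shape, a hypothetical table magic (not the actual Livy code) could follow the same valueOfTerm / Extraction.decompose pattern as the json magic; the mimetype and column naming below are illustrative assumptions.

// A minimal sketch of a table magic: sample a few rows from an RDD and
// describe them as headers plus data so the frontend can render a table.
private def executeTableMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(rdd: RDD[_]) =>
      // Flatten case classes / tuples into columns, one list per row.
      val rows = rdd.take(10).toList.map {
        case product: Product => product.productIterator.toList
        case other => List(other)
      }
      // Generate placeholder column names from the width of the first row.
      val headers = rows.headOption.toList.flatMap(_.indices.map(i => s"col_$i"))
      ExecuteMagic(Extraction.decompose(Map(
        "application/vnd.livy.table.v1+json" -> Map(
          "headers" -> headers,
          "data" -> rows))))
    case Some(_) => ExecuteError(f"Value $name is not a table-like value")
    case None => ExecuteError(f"Value $name does not exist")
  }
}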
Finally, here it is in action. We’re taking our Shakespeare code from earlier. If we run this snippet inside Livy, it returns an output mimetype of application/json, with the results inlined in the output without extra encoding.
Fingers crossed for a lot of reasons: this demo is running off master, and the VM was broken until 4 AM.
Next: learn more