Supporting running Spark scripts directly from a browser takes the user experience up a notch: everybody has a Web browser, the command line can be avoided, and built-in graphing and visualization make it easy to explore and understand data in just a few clicks. It also simplifies administration, since everything is centralized in one service and accessible to non-native clients. For this purpose, an open source Spark Job Server was developed in order to provide Scala, SQL and Python in a Web shell. The main Hadoop components of the platform are also integrated into the same interface. This talk describes the architecture of the Spark Server and its main features:
- Scala, Python, SQL submissions
- Impersonation
- Security
- Job progress / canceling
- YARN / HDFS / Hive integration
The server also ships with a friendly user interface built as a Hue app. We will focus on explaining how they were built, how to use the API and which lessons were learned. The final end-user interaction will be demoed live.
Hadoop Summit - Interactive Big Data Analysis with Solr, Spark and Hue (gethue)
Open up your user base to the data! Almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop with Apache Solr. The session will detail how to get started with data indexing in just a few clicks and then explore several data analysis scenarios. The open source Hue search dashboard builder, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns. Attendees of this talk will learn how to get started with interactive search visualization in their Hadoop cluster.
Interactively Search and Visualize Your Big Data (gethue)
Open up your user base to the data! Unlike programming and SQL, almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios. Through a web browser, attendees will be shown how to explore and visualize data for quick answers. The new search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Hadoop cluster.
Apache Solr makes it easy to interactively visualize and explore your data. Create a dashboard, add some facets, select some values, cross them with time and just look at the results. Apache Spark is a fast-growing framework for streaming computations, which makes it ideal for real-time indexing. Solr also comes with the new Analytics Facets, a major weapon added to the arsenal of the data explorer. They bring another dimension: calculations. We can now do the equivalent of SQL, just in a much simpler and faster way. These calculations can operate over buckets of data.
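To make the idea of calculations over buckets concrete, here is a minimal SolrJ sketch in Scala issuing a JSON Facet query; the endpoint, collection and field names are hypothetical, but json.facet is Solr's real JSON Facet API parameter.

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient

object FacetSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical Solr endpoint and collection.
    val client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()

    val query = new SolrQuery("*:*")
    // Bucket documents by day and compute an average inside each bucket:
    // the SQL-like "calculation over buckets" described above.
    query.setParam("json.facet",
      """{
        |  by_day: {
        |    type: range, field: created_at,
        |    start: "NOW-7DAY", end: "NOW", gap: "+1DAY",
        |    facet: { avg_amount: "avg(amount)" }
        |  }
        |}""".stripMargin)

    val response = client.query("transactions", query)
    println(response.getResponse.get("facets"))
    client.close()
  }
}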
SF Solr Meetup - Interactively Search and Visualize Your Big Data (gethue)
Open up your user base to the data! Unlike programming and SQL, almost everybody knows how to search. This talk describes, through an interactive demo based on open source Hue, how users can graphically search their data in Hadoop. The underlying technical details of the application and its interaction with Apache Solr will be clarified.
The session will detail how to get started with data indexing in just a few clicks as well as explore several data analysis scenarios with the latest Solr Analytics Facets and Spark Streaming. Through a Web browser, attendees will be shown how to explore and visualize data for quick answers. The search dashboard in Hue, with its draggable charts and dynamic interface, lets any non-technical user look for documents or patterns.
Attendees of this talk will learn how to get started with interactive search visualization in their Solr cluster.
Hue: Big Data Web applications for Interactive Hadoop at Big Data Spain 2014 (gethue)
This talk describes how open source Hue was built in order to provide a better Hadoop User Experience. The underlying technical details of its architecture, the lessons learned, and how it integrates with Impala, Search and Spark under the covers will be explained.
The presentation continues with real life analytics business use cases. It will show how data can be easily imported into the cluster and then queried interactively with SQL or through a visual search dashboard. All through your Web Browser or your own custom Web application!
This talk is aimed at organizations trying to put a friendly “face” on Hadoop and get productive. Anybody looking to be more effective with Hadoop will also learn best practices and how to quickly get ramped up on the main data scenarios. Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. We cover details on how Hue interacts with the ecosystem and leverages the existing authentication and security model of your company.
To sum up, attendees of this talk will learn how Hadoop can be made more accessible and why Hue is the ideal gateway for using it more efficiently, or the starting point for your own Big Data Web application.
Faster Data Analytics with Apache Spark using Apache Solr (Chitturi Kiran)
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. Spark SQL allows users to execute relational queries in Spark with distributed in-memory computations. Though Spark gives us faster in-memory computations, Solr is blazing fast for some analytic queries. In this talk, we will take a deep dive into how to optimize the SQL queries from Spark to Solr by plugging into the Spark LogicalPlanner using pushdown strategies. The key takeaways from the talk will be (a minimal usage sketch follows this list):
How to perform Spark SQL queries with Apache Solr?
What happens inside a Spark SQL query?
How to plug into Spark Logical Planner?
What type of push-down strategies are optimal with Solr?
Examples of push-down strategies
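A minimal usage sketch, assuming the open source spark-solr connector (its "solr" data source with zkhost/collection options); the collection and field names here are made up.

import org.apache.spark.sql.SparkSession

object SolrPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("solr-pushdown").getOrCreate()

    // Read a Solr collection as a DataFrame through the spark-solr data source.
    val events = spark.read.format("solr")
      .option("zkhost", "localhost:9983")  // hypothetical ZooKeeper address
      .option("collection", "events")      // hypothetical collection
      .load()

    events.createOrReplaceTempView("events")
    // The filter and projection below are pushdown candidates: instead of
    // shipping the whole collection into Spark, the planner can translate
    // them into Solr query parameters.
    spark.sql("SELECT status, count(*) FROM events WHERE status = 'ERROR' GROUP BY status").show()
  }
}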
Presented at Lucene Revolution - http://sched.co/BAwV
Ingesting and Manipulating Data with JavaScript (Lucidworks)
Data in the wild isn’t always in the right format for search, or even for mere usability. Lucidworks Fusion offers powerful pipelines, parsers, and stages to wrangle your data into the right format and make it more findable and friendly. However, there are some cases where more obscure data will require the power of scripting.
Your data may need a complex transformation, a custom decryption algorithm, or you may already have existing code for handling a piece of data. Even in these more complex cases, Fusion’s JavaScript capabilities have got you covered.
This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Burn down the silos! Helping dev and ops gel on high availability websites (Lindsay Holmwood)
HA websites are where the rubber meets the road - at 200km/h. Traditional separation of dev and ops just doesn't cut it.
Everything is related to everything. Code relies on performant and resilient infrastructure, but highly performant infrastructure will only get a poorly written application so far. Worse still, root cause analysis in HA sites will more often than not identify problems that don't clearly belong to either devs or ops.
The two options are collaborate or die.
This talk will introduce 3 core principles for improving collaboration between operations and development teams: consistency, repeatability, and visibility. These principles will be investigated with real world case studies and associated technologies audience members can start using now. In particular, there will be a focus on:
- fast provisioning of test environments with configuration management
- reliable and repeatable automated deployments
- application and infrastructure visibility with statistics collection, logging, and visualisation
We describe the features of Oak Lucene indexes and how they can be used to make your queries perform better. In the second part we will talk about how asynchronous indexing works in general and how it can be monitored.
This was presented as part of the AEM Gem Series - http://dev.day.com/content/ddc/en/gems/oak-lucene-indexes.html
Parse is a suite of cloud-based APIs, services and libraries that focus on letting developers spend more time building out rich applications and less time dealing with the overhead of setting up and managing databases, push notifications, social sign-on, analytics, and even hosting and servers.
In this series I'll give an overview of the options for developing an application that leverages Parse, including using Cloud Code to deploy your Node.js app to Parse's own hosting service.
Spark Summit Europe: Building a REST Job Server for Interactive Spark as a Service (gethue)
Livy is a new open source Spark REST server for submitting and interacting with your Spark jobs from anywhere. Livy is conceptually based on the incredibly popular IPython/Jupyter, but implemented to better integrate into the Hadoop ecosystem with multiple users. Spark can now be offered as a service to anyone in a simple way: Spark shells in Python or Scala can be run by Livy in the cluster while the end users manipulate them at their own convenience through a REST API. Regular non-interactive applications can also be submitted. The output of the jobs can be introspected and returned in a tabular format, which makes it visualizable in charts. Livy can point to a unique Spark cluster and create several contexts for its users. With YARN impersonation, jobs will be executed with the actual permissions of the users submitting them. Livy also enables the development of Spark Notebook applications, which are ideal for quickly doing interactive Spark visualizations and collaboration from a Web browser! This talk is technical and details the architecture and design decisions taken for developing this server, as well as its internals. It also describes the alternatives we tried and the challenges that were faced. The capabilities of Livy will then be demoed live in Hue’s Notebook Application through a real life scenario.
https://spark-summit.org/eu-2015/events/building-a-rest-job-server-for-interactive-spark-as-a-service/
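To give a flavor of the API described above, here is a minimal Scala sketch of the REST interaction using the Java 11 HttpClient; Livy's /sessions endpoints and default port 8998 are as documented, while the host and the session id are assumptions made for the example.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.net.http.HttpRequest.BodyPublishers
import java.net.http.HttpResponse.BodyHandlers

object LivySketch {
  val livy = "http://localhost:8998"  // assumed local Livy server, default port
  val client = HttpClient.newHttpClient()

  def post(path: String, json: String): String = {
    val request = HttpRequest.newBuilder(URI.create(livy + path))
      .header("Content-Type", "application/json")
      .POST(BodyPublishers.ofString(json))
      .build()
    client.send(request, BodyHandlers.ofString()).body()
  }

  def main(args: Array[String]): Unit = {
    // 1. Create an interactive Scala session ("pyspark" and "sparkr" also work).
    println(post("/sessions", """{"kind": "spark"}"""))
    // 2. Once the session is idle, submit code to it; the result can then be
    //    polled as JSON at /sessions/0/statements/0.
    println(post("/sessions/0/statements", """{"code": "1 + 1"}"""))
  }
}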
Agenda:
• Brief overview of the Spark-provided spark-shell and spark-submit
• Overview of Spark Context
• Overview of Zeppelin and Jupyter notebooks for Spark
• Introduction to IBM Spark Kernel
• Introduction to Cloudera Livy and Spark JobServer
Github Link:
Previous meetups:
1) Introduction to Resilient Distributed Dataset and deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-and-resilient-distributed-dataset-basics-and-deep-dive
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/225159947/
Video: https://www.youtube.com/watch?v=MkeRWyF1y_0
Github: https://github.com/SatyaNarayan1/spark_meetup
2) Introduction to Spark DataFrames/SQL and Deep dive
Slides: http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/226419828/
Video: https://www.youtube.com/watch?v=h71MNWRv99M
Github: https://github.com/parmarsachin/spark-dataframe-demo
3) Apache Spark - Introduction to Spark Streaming and Deep dive
Slides: http://www.slideshare.net/differentsachin/apache-spark-introduction-to-spark-streaming-and-deep-dive-57671774
Meetup: http://www.meetup.com/Big-Data-Developers-in-Bangalore/events/227008581/
Video:
Github: https://github.com/agsachin/spark-meetup
Looking forward to a great interactive session. Do provide feedback.
Apache Spark Streaming: Architecture and Fault Tolerance (Sachin Aggarwal)
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Recent Developments In SparkR For Advanced Analytics (Databricks)
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
A Journey into Databricks' Pipelines: Journey and Lessons Learned (Databricks)
With components like Spark SQL, MLlib, and Streaming, Spark is a unified engine for building data applications. In this talk, we will take a look at how we use Spark on our own Databricks platform throughout our data pipeline for use cases such as ETL, data warehousing, and real time analysis. We will demonstrate how these applications empower engineering and data analytics. We will also share some lessons learned from building our data pipeline around security and operations. This talk will include examples on how to use Structured Streaming (a.k.a. Streaming DataFrames) for online analysis, SparkR for offline analysis, and how we connect multiple sources to achieve a Just-In-Time Data Warehouse.
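This is not Databricks' actual pipeline code, just a minimal Structured Streaming sketch of the online-analysis pattern the abstract mentions; the input path and schema are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()
    import spark.implicits._

    // Treat a directory of JSON events as an unbounded table (path is made up).
    val events = spark.readStream
      .schema("ts TIMESTAMP, level STRING")
      .json("/data/incoming")

    // A streaming aggregation: error counts per minute, updated as data arrives.
    val counts = events
      .where($"level" === "ERROR")
      .groupBy(window($"ts", "1 minute"))
      .count()

    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()
  }
}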
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
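To illustrate why monoids promise parallelism, here is a tiny self-contained Scala sketch (not Scalding's or Algebird's actual Monoid trait, though it is close in spirit): word-count maps form a monoid, so chunks can be counted on different machines and merged in any order.

// A monoid: an associative combine operation with an identity element.
// Associativity is what licenses splitting work across machines and
// combining partial results in any grouping.
trait Monoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

object WordCountMonoid extends Monoid[Map[String, Int]] {
  def zero = Map.empty
  def plus(a: Map[String, Int], b: Map[String, Int]) =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  def main(args: Array[String]): Unit = {
    // Count each chunk independently (this could happen on separate mappers)...
    val chunks = Seq("to be or", "not to be")
      .map(_.split(" ").groupBy(identity).map { case (w, ws) => w -> ws.length })
    // ...then reduce; the result is the same no matter how the reduce is grouped.
    println(chunks.reduce(plus))  // Map(to -> 2, be -> 2, or -> 1, not -> 1)
  }
}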
Introduction to Apache Flink - Fast and reliable big data processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Big Data Day LA 2016 / Hadoop / Spark / Kafka track - Iterative Spark Developmen... (Data Con LA)
This presentation will explore how Bloomberg uses Spark, with its formidable computational model for distributed, high-performance analytics, to take this process to the next level, and look into one of the innovative practices the team is currently developing to increase efficiency: the introduction of a logical signature for datasets.
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” - Vad... (Provectus)
In this demo-based talk with live coding, we’ll present a functional, typeful framework for developing Apache Spark applications. We’ll walk through the following key topics:
- turning unmanageable Spark scripts into typeful Spark functions
- serverless deployment of Spark functions into the cloud
- unit testing Spark functions to save cluster resources and developer time
- seamless Spark session management between concurrent Spark jobs in exclusive or shared modes
Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses with real-time faculty in Bangalore.
These are the slides for what I shared at the JS Group meetup, 2014, Taiwan. They cover what JavaScript can do to make programs more "functional": the benefits, the price and the limitations.
You may all know that JSON is a subset of JavaScript, but… Did you know that HTML5 implements NoSQL databases? Did you know that JavaScript was recommended for REST by HTTP co-creator Roy T. Fielding himself? Did you know that map & reduce are part of the native JavaScript API? Did you know that most NoSQL solutions integrate a JavaScript engine? CouchDB, MongoDB, WakandaDB, ArangoDB, OrientDB, Riak…. And when they don’t, they have a shell client which does. The story of NoSQL and JavaScript goes beyond your expectations and opens more opportunities than you might imagine… What better match could you find than a flexible and dynamic language for schemaless databases? Isn’t an event-driven language what you’ve been waiting for to manage consistency? When NoSQL doesn’t come to JavaScript, JavaScript comes to NoSQL. And does it very well.
Testing batch and streaming Spark applications (Łukasz Gawron)
Apache Spark is a general engine for processing data on a large scale. Employing this tool in a distributed environment to process large data sets is undeniably beneficial.
But what about fast feedback loop while developing such application with Apache Spark? Testing it on a cluster is essential, but it does not seem to be what most developers accustomed to TDD workflow would like to do.
In the talk, Łukasz will share with you some tips on how to write the unit and integration tests, and how Docker can be applied to test a Spark application on a local machine.
Examples will be presented within the ScalaTest framework, and it should be easy to grasp by people who know Scala and other JVM languages.
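A minimal sketch of the unit-testing approach within ScalaTest, assuming a local-mode SparkSession; the transformation under test is hypothetical.

import org.apache.spark.sql.{Dataset, SparkSession}
import org.scalatest.funsuite.AnyFunSuite

// Logic written against Datasets rather than a concrete cluster runs the
// same way locally and on YARN, which is what makes it unit-testable.
object Transform {
  def errorsOnly(lines: Dataset[String]): Dataset[String] =
    lines.filter(_.contains("ERROR"))
}

class TransformSpec extends AnyFunSuite {
  // local[2] gives a self-contained two-thread "cluster" for fast feedback.
  private val spark = SparkSession.builder()
    .master("local[2]").appName("transform-spec").getOrCreate()
  import spark.implicits._

  test("keeps only error lines") {
    val input = Seq("INFO ok", "ERROR boom").toDS()
    assert(Transform.errorsOnly(input).collect().toSeq == Seq("ERROR boom"))
  }
}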
Hadoop became the most common system for storing big data.
With Hadoop, many supporting systems emerged to fill in the aspects that are missing from Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
Since one presentation cannot cover too many of them, I tried to focus on the most famous/popular ones and on the most interesting ones.
Similar to Big Data Scala by the Bay: Interactive Spark in your Browser
Learn about the HBase browser in Hue, the UI for Apache Hadoop.
Presented by Abraham Elmahrek at Hadoop Israel www.meetup.com/HadoopIsrael/events/161701092/
Find out everything you need about Hue at http://gethue.com
Integrate Hue with your Hadoop cluster - Yahoo! Hadoop Meetup (gethue)
This talk will describe how Hue can be integrated with existing Hadoop deployments with minimal changes/disturbances. Romain will cover details on how Hue can leverage the existing authentication system and security model of your company. He will also cover the Hive/Shark/Pig/Oozie best practice setup for Hue.
http://www.meetup.com/hadoop/events/125191612/
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the Hadoop Singapore meetup.
Find out everything you need about Hue at http://gethue.com
Hue is an open source Hadoop Web UI that lets users be more productive, while also providing a framework for building new apps quickly. Get a tour of Hue features and learn how to re-use the APIs for submitting Hive queries, listing HDFS files, and submitting MapReduce jobs.
Hue: The Hadoop UI - Where we stand, Hue Meetup SF (gethue)
Learn about all the new features of Hue 3.5+ that are included in Cloudera CDH 5.
Presented in San Francisco @ Cloudera by Romain Rigaux and Abe Elmahrek
Learn about the HBase browser in Hue, the UI for Apache Hadoop.
Presented by Abraham Elmahrek at the LA HBase user meetup http://www.meetup.com/Los-Angeles-HBase-User-group/events/152073322/
Find out everything you need about Hue at http://gethue.com
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the HUG France meetup http://hugfrance.fr/meetup-le-11-decembre-2013/
Find out everything you need about Hue at http://gethue.com
Learn about Hue, the UI for Apache Hadoop.
Presented by Enrico Berti at the HUG Stockholm meetup.
Find out everything you need about Hue at http://gethue.com
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
The Building Blocks of QuestDB, a Time Series Database (Javier Ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
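As a rough illustration of one of the ideas above, skipping computation on vertices that have already converged, here is a simplified plain-Scala sketch; it is not the STICD implementation, and freezing vertices one by one like this is a heuristic rather than an exact method.

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    // Tiny directed graph as adjacency lists: node -> out-neighbours (made up).
    val out = Map(0 -> Seq(1, 2), 1 -> Seq(2), 2 -> Seq(0), 3 -> Seq(2))
    val in = (0 to 3).map(v => v -> out.collect { case (u, ns) if ns.contains(v) => u }.toSeq).toMap
    val n = 4; val d = 0.85; val tol = 1e-10

    var rank = Array.fill(n)(1.0 / n)
    val converged = Array.fill(n)(false)
    var changed = true
    while (changed) {
      changed = false
      val next = rank.clone()
      for (v <- 0 until n if !converged(v)) {  // skip already-converged vertices
        val incoming = in(v).map(u => rank(u) / out(u).size).sum
        next(v) = (1 - d) / n + d * incoming
        if (math.abs(next(v) - rank(v)) < tol) converged(v) = true else changed = true
      }
      rank = next
    }
    println(rank.toSeq)
  }
}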
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, need an efficient graph representation; Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
16. LIVY SPARK SERVER
• REST Web server in Scala
• Interactive Spark Sessions and Batch Jobs
• Type Introspection for Visualization
• Running sessions in YARN local
• Backends: Scala, Python, R
• Open Source: https://github.com/cloudera/hue/tree/master/apps/spark/java
17. LIVY WEB SERVER ARCHITECTURE
[Diagram: the Livy Server (Scalatra, Session Manager, Session) talks through a Spark Client to the YARN Master; YARN nodes host the Spark Interpreter with its Spark Context, alongside Spark Workers]
18. LIVY WEB SERVER ARCHITECTURE
[Same diagram, with step 1 of the request flow highlighted: a request arrives at the Livy Server]
27. INTERPRETERS
• Pipe stdin/stdout to a running shell
• Execute the code / send to Spark workers
• Perform magic operations
• One interpreter per language
• “Swappable” with other kernels (python, spark..)
Example exchange with the interpreter:
> println(1 + 1)
2
29. INTERPRETER FLOW CHART
[Flow chart: receive lines, split lines, then for each line check whether it is a magic line. Magic lines run the magic; regular lines are executed. On Success, continue with the lines left; on Incomplete, merge the line with the next one; on Error, stop. In all terminal cases, send the output to the server.]
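A simplified Scala sketch of the loop in the flow chart (the real implementation lives in the Livy sources linked on slide 16); the names here are illustrative.

object InterpreterLoop {
  sealed trait LineResult
  case class Success(output: String) extends LineResult
  case object Incomplete extends LineResult
  case class Error(output: String) extends LineResult

  // executeLine is assumed to handle both magic lines (%json, %table) and
  // regular code, as in the executeLine shown on slide 34.
  def run(lines: Seq[String], executeLine: String => LineResult): Either[String, String] = {
    var pending = ""
    val outputs = Seq.newBuilder[String]
    for (line <- lines) {
      val code = if (pending.isEmpty) line else pending + "\n" + line
      executeLine(code) match {
        case Success(out) => outputs += out; pending = ""  // success: next line
        case Incomplete   => pending = code                // incomplete: merge with next line
        case Error(out)   => return Left(out)              // error: stop execution
      }
    }
    Right(outputs.result().mkString("\n"))
  }
}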
30. LIVY INTERPRETERS
trait Interpreter {
  def state: State
  def execute(code: String): Future[JValue]
  def close(): Unit
}

sealed trait State
case class NotStarted() extends State
case class Starting() extends State
case class Idle() extends State
case class Running() extends State
case class Busy() extends State
case class Error() extends State
case class ShuttingDown() extends State
case class Dead() extends State
32. SPARK INTERPRETER
class SparkInterpreter extends Interpreter {
  …
  private var _state: State = NotStarted()
  private val outputStream = new ByteArrayOutputStream()
  private var sparkIMain: SparkIMain = _

  def start() = {
    ...
    _state = Starting()
    sparkIMain = new SparkIMain(new Settings(), new JPrintWriter(outputStream, true))
    sparkIMain.initializeSynchronous()
    ...
33. SPARK INTERPRETER
private var sparkContext: SparkContext = _

def start() = {
  ...
  val sparkConf = new SparkConf(true)
  sparkContext = new SparkContext(sparkConf)
  sparkIMain.beQuietDuring {
    sparkIMain.bind("sc", "org.apache.spark.SparkContext",
      sparkContext, List("""@transient"""))
  }
  _state = Idle()
}
34. EXECUTING SPARK
private def executeLine(code: String): ExecuteResult = {
  code match {
    case MAGIC_REGEX(magic, rest) =>
      executeMagic(magic, rest)
    case _ =>
      scala.Console.withOut(outputStream) {
        sparkIMain.interpret(code) match {
          case Results.Success => ExecuteComplete(readStdout())
          case Results.Incomplete => ExecuteIncomplete(readStdout())
          case Results.Error => ExecuteError(readStdout())
        }
  ...
35. INTERPRETER MAGIC
private val MAGIC_REGEX = "^%(\\w+)\\W*(.*)".r

private def executeMagic(magic: String, rest: String): ExecuteResponse = {
  magic match {
    case "json" => executeJsonMagic(rest)
    case "table" => executeTableMagic(rest)
    case _ => ExecuteError(f"Unknown magic command $magic")
  }
}
36. INTERPRETER MAGIC
private def executeJsonMagic(name: String): ExecuteResponse = {
  sparkIMain.valueOfTerm(name) match {
    case Some(value: RDD[_]) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value.asInstanceOf[RDD[_]].take(10))))
    case Some(value) => ExecuteMagic(Extraction.decompose(Map(
      "application/json" -> value)))
    case None => ExecuteError(f"Value $name does not exist")
  }
}
Why do we want to do this? Currently it’s difficult to visualize results from Spark. Spark has a great interactive tool called “spark-shell” that allows you to interact with large datasets on the command line. For example, here is a session where we are counting the words used by Shakespeare. Running this computation is easy, but spark-shell doesn’t provide any tools for visualizing the results.
One option is to save the output to a file, then use a tool like Hue to import it into a Hive table and visualize it. We are obviously big fans of Hue, but there are still too many steps to go through to get to this point. If we want to change the script, say to filter out words like “the” and “and”, we need to go back to the shell, rerun our code snippet, save it to a file, then reimport it into the UI. It’s a slow process.
Multiple languages
Inherit Hue’s sharing, export/import
Hello, I’m Erick Tryzelaar, and I’m going to talk about the Livy Spark Server, which is our backend for Hue’s Notebook application.
Livy is a REST web server that allows a tool like Hue to interactively execute Scala and Spark commands, just like spark-shell. It goes beyond spark-shell by adding type introspection, which allows a frontend like Hue to render results in interactive visualizations. Furthermore, it allows sessions to be run inside YARN to support horizontally scaling out to hundreds of active sessions. It also supports a Python and R backend. Finally, it’s fully open source, and currently being developed in Hue.
The Livy server is built upon Scalatra and Jetty. Creating a session is as simple as POSTing to a particular URL. Behind the scenes, Livy will communicate with the YARN master to allocate some nodes to launch the interactive sessions. This is all done asynchronously, as there’s no telling when there will be resources available to run the sessions. Once the nodes have been allocated, Livy will start an interpreter on one of the nodes, which takes care of creating the Spark Context that actually runs the Spark operations. After it’s set up, the session signals to the Livy Server that it’s ready for commands. At that point, the client can simply POST their code to a URL on the Livy server.
Let’s see it in action. On the left we see creating a “spark” session. You could also fill in “pyspark” and “sparkR” here if you want those sessions. On the right is us executing simple math in the session itself.
We don’t have too much time to drill down into the code, but we did want to take this moment to at least dive into how the interpreters work.
Livy’s interpreters are conceptually very simple devices. They take in one or more lines of code and execute them in a shell environment. These shells perform the computation and interact with the spark environment. They’re also abstract. As I mentioned earlier, Livy currently has 3 languages built into it: Scala, Python and R, with more to come.
Here is the interpreter loop that Livy manages. First, split up the lines and feed them one at a time into the interpreter. If the line is a regular, non-magic line, it gets executed and the result can be in one of three states: Success, where we’ll continue to execute the next line; Incomplete, where the input is not a complete statement, such as an “if” statement with an open bracket; or an Error, which stops the execution of these lines. The other case is magic lines, which are special commands to the interpreter itself, for example asking the interpreter to convert a value into a JSON type.
Now for some code. As we saw earlier, the interpreter is a simple state machine that executes code and eventually produces JSON responses by way of a Future.
In order to implement this interface, the Spark interpreter needs to first create the real interpreter, SparkIMain. It’s pretty simple to create. We just need to construct it with a buffer that acts as the interpreter’s standard output.
Once the SparkIMain has been initialized, we need to create the Spark Context that communicates with all of the spark workers. Injecting this variable into the interpreter is quite simple with this “bind” method.
Now that the session is up and running we can execute code inside of it. I’ve skipped some of the other bookkeeping in order to show the actual heart of the execution here. Ignore the magic case for the moment. Execution is also quite simple: we first temporarily replace standard out with our buffer, and then have the interpreter execute the code. There are three possible outcomes. First, the command executed. Second, the code is incomplete, maybe because it has an open parenthesis. Finally, an error if some exception occurred. Altogether quite simple, and it doesn’t require any changes to Spark to do this.
And now the magic. I mentioned earlier that Livy supports type introspection. The way it does it is through these in-band magic commands, which start with a percent sign. The Spark interpreter currently supports two magic commands, “json” and “table”. The “json” magic will convert any value into a JSON value, and “table” will convert any value into a table-ish object that’s used for our visualization.
Here is our json magic. it takes advantage of json4s’s Extraction.decompose to try to convert values. We special case RDDs since they can’t be directly transformed into json. Instead we just pull out the first 10 items so we can at least show something.
The table magic does something similar, but it’s a bit large to compress into slides. We’ll see its results next.
Finally, here it is in action. Here we’re taking our Shakespeare code from earlier. If we run this snippet inside Livy, it returns an output mimetype of application/json, with the results inlined without encoding in the output.
Fingers crossed for a lot of reasons, it’s master and the VM was broken till 4 AM.
Next: learn more