7. Mist
• Mist is a thin service on top of Spark that makes it possible to execute Scala and Python Spark
jobs from application layers, to get synchronous, asynchronous, and reactive results, and to
expose an API to external clients.
• It implements Spark as a Service and creates a unified API layer for building enterprise solutions
and services on top of a Big Data lake.
www.provectus.com
8. Mist
● HTTP and Messaging (MQTT) API
● Scala & Python Spark job execution
● Works with Standalone, Mesos, and YARN in any Spark configuration
● Support for Spark SQL and Hive
● High Availability and Fault Tolerance
● Persist job state for self-healing
● Async and sync API, JSON job results
Why We Needed Mist
9. Mist
Build the project
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly
Create a configuration file
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
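The application.conf passed via --driver-java-options can start from the settings shown on the configuration slides; a minimal sketch, using the values from those slides:

```
# minimal application.conf, assembled from the configuration slides
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003
```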
10. Mist
Configuration
# Spark master URL, one of: local, yarn, mesos (local by default)
mist.spark.master = "local[*]"
# number of threads: one thread for one job
mist.settings.threadNumber = 16
# http interface (off by default)
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003
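With the HTTP interface enabled as above, a client can submit jobs by POSTing JSON to the Mist host. The route ("/jobs") and the payload field names below are assumptions for illustration only, not taken from the Mist docs:

```python
import json
import urllib.request

# The route ("/jobs") and payload field names are hypothetical;
# consult the Mist README for the real request schema.
def build_job_request(path, class_name, parameters):
    """JSON body describing the job to run."""
    return {
        "path": path,                # jar (or .py) containing the job
        "className": class_name,     # object extending MistJob
        "parameters": parameters,    # arguments handed to doStuff
    }

def submit_job(host, port, body):
    """POST the request to the Mist HTTP interface and return the JSON reply."""
    req = urllib.request.Request(
        "http://%s:%d/jobs" % (host, port),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. submit_job("192.168.10.13", 2003,
#                 build_job_request("mist-assembly-0.2.0.jar",
#                                   "SimpleContext", {"digits": [1, 2, 3]}))
```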
11. Mist
Configuration
# MQTT interface (off by default)
mist.mqtt.on = true
mist.mqtt.host = "192.168.10.33"
mist.mqtt.port = 1883
# mist listens on this topic for incoming requests
mist.mqtt.subscribeTopic = "foo"
# mist publishes results to this topic
mist.mqtt.publishTopic = "foo"
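A request over the MQTT interface is just a message published to the subscribe topic; results come back on the publish topic. A minimal sketch, assuming the third-party paho-mqtt client and a hypothetical payload shape (check the Mist README for the actual schema):

```python
import json

# The payload shape is an assumption for illustration, not Mist's
# documented MQTT request schema.
def build_mqtt_request(path, class_name, parameters):
    return json.dumps({
        "path": path,
        "className": class_name,
        "parameters": parameters,
    })

def send_request(payload):
    # paho-mqtt is an assumed client choice; any MQTT client works.
    import paho.mqtt.publish as publish
    publish.single(
        topic="foo",                  # mist.mqtt.subscribeTopic
        payload=payload,
        hostname="192.168.10.33",     # mist.mqtt.host
        port=1883,                    # mist.mqtt.port
    )
```

To receive results, subscribe to the publishTopic ("foo" above) before sending the request.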
13. Mist
Configuration
# default settings for all contexts
# timeout for each job in context
mist.contextDefaults.timeout = 100 days
# mist can kill context after job finished (off by default)
mist.contextDefaults.disposable = false
# settings for SparkConf
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}
14. Mist
Configuration
# settings can be overridden for each context
mist.contexts.foo.timeout = 100 days
mist.contexts.foo.sparkConf = {
  spark.scheduler.mode = "FIFO"
}
mist.contexts.bar.timeout = 1000 second
mist.contexts.bar.disposable = true
# mist can create context on start, so we don't waste time on first request
mist.contextSettings.onstart = ["foo"]
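The override rules above can be sketched in a few lines: per-context values shadow contextDefaults, and the foo context ends up with a FIFO scheduler while keeping every other default. The merge function and its key-by-key sparkConf merging are illustrative, not Mist's actual implementation:

```python
# Illustrative model of how per-context settings shadow contextDefaults.
def effective_context_config(defaults, overrides):
    """Per-context values win; sparkConf maps are merged key by key."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if key == "sparkConf":
            spark_conf = dict(defaults.get("sparkConf", {}))
            spark_conf.update(value)
            merged["sparkConf"] = spark_conf
        else:
            merged[key] = value
    return merged

context_defaults = {
    "timeout": "100 days",
    "disposable": False,
    "sparkConf": {
        "spark.default.parallelism": 128,
        "spark.driver.memory": "10g",
        "spark.scheduler.mode": "FAIR",
    },
}

# The foo context from the slide: only the scheduler mode changes.
foo = effective_context_config(
    context_defaults,
    {"sparkConf": {"spark.scheduler.mode": "FIFO"}},
)
```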
15. Mist
Spark Job at Mist
Mist Scala Spark Job
In order to prepare your job to run on Mist, extend a Scala object from MistJob and implement the abstract method
doStuff:
def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: SQLContext, parameters: Map[String, Any]): Map[String, Any] = ???
def doStuff(context: HiveContext, parameters: Map[String, Any]): Map[String, Any] = ???
www.provectus.com
15
16. Mist
Spark Job at Mist
Example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Building Mist jobs
Add Mist as a dependency in your build.sbt:
libraryDependencies += "io.hydrosphere" % "mist" % "0.2.0"
17. Mist
Spark Job at Mist
Mist Python Spark Job
Import mist and implement the method doStuff.
The following Spark context aliases are provided for convenience:
job.sc = SparkContext
job.sqlc = SQLContext
job.hc = HiveContext
18. Mist
Spark Job at Mist
For example:

import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        scala_list = val.head()
        pylist = []
        # capture the size first: tail() returns an ever-shorter list, so
        # comparing count against a shrinking size() would stop early
        size = scala_list.size()
        count = 0
        while count < size:
            pylist.append(scala_list.head())
            count = count + 1
            scala_list = scala_list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
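The while loop above copies a Scala List into a Python list through repeated head/tail calls before parallelizing it. A self-contained model of that traversal, where ScalaListStub is a hypothetical stand-in for the bridged JVM object:

```python
# Plain-Python stand-in for the Scala List proxy seen by the job above.
# ScalaListStub is hypothetical; in Mist the real object arrives from the JVM.
class ScalaListStub:
    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)

    def head(self):
        return self._items[0]

    def tail(self):
        return ScalaListStub(self._items[1:])

def to_pylist(scala_list):
    """The head/tail traversal from the job, written against the stub."""
    pylist = []
    size = scala_list.size()  # fixed size, taken before consuming the list
    count = 0
    current = scala_list
    while count < size:
        pylist.append(current.head())
        count = count + 1
        current = current.tail()
    return pylist
```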
22. Mist
Road Map
● Super-parallel mode: multi-JVM support
● Cluster mode and node framework
● Add logging
● RESTful API
● Support for streaming contexts/jobs
● Apache Kafka support
● AMQP support
● Web UI
Your contributions are very welcome on GitHub!
https://github.com/Hydrospheredata/mist