Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vadim Chelyshov, Software engineer at Provectus

MIST
Serverless proxy to Apache Spark

VADIM CHELYSHOV
github.com/dos65
hydrosphere.io
Scalalaz podcast (RUS)
Scala-digest (RUS)

HYDROSPHERE MIST
1.0.0-RC16
https://github.com/Hydrospheredata/mist

WHAT PROBLEM DOES IT SOLVE?

Getting Big Data Projects to Production Is a Challenge
Only 15 percent of businesses reported deploying their
big data project to production, effectively unchanged
from last year (14 percent).
https://www.gartner.com/newsroom/id/3466117

Data engineers vs. data scientists
A common starting point is 2-3 data engineers for every
data scientist. For some organizations with more
complex data engineering requirements, this can be 4-5
data engineers per data scientist
https://www.oreilly.com/ideas/data-engineers-vs-data-scientists

The lack of tooling?

The lack of the right tooling?

Getting Spark Projects to Production
How to
provide a way to run our job?
return a result?
report an error?
run multiple jobs in parallel?

LET'S COMPARE WITH USUAL THINGS

Databases
PostgreSQL - 1995
CREATE OR REPLACE FUNCTION foo(param VARCHAR)
RETURNS TABLE
...

Web applications
Java servlet - 1997
public class NewServlet extends HttpServlet {
    @Override
    protected void doGet(
            HttpServletRequest request,
            HttpServletResponse response) throws ServletException, IOException {
        ...
    }
}

Big data processing
Hadoop - 2005
Apache Spark - 2011
object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(...)
    val sc = new SparkContext(conf)
    ...
  }
}
./bin/spark-submit ...

Run
Databases         SELECT * FROM foo(param)
Web applications  curl
Spark             Shell + Trigger

Return result
Databases         built-in protocol
Web applications  built-in protocol
Spark             Read from fs/storages

Result error
Databases         explicit error information
Web applications  HTTP code + explicit error information
Spark             exit code

Parallelism
Databases         by default
Web applications  by default
Spark             Good luck!

What the hell is going on?
These things aren't about big data!

It isn't a normal development experience
A lot of actions over an SSH session to test, deploy and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces

Just a special service!
Artifacts/settings CRUD
API for launching jobs and receiving their status
Spark driver launcher
A library to describe operations using Spark

What is Mist?
Serverless proxy to Apache Spark

Serverless proxy ?
Apache Spark ✔

It's a combination of
Programming framework
HTTP/async interface to deploy and run Spark programs
Spark contexts / driver manager

Concepts
Function
user code with Spark program to be deployed on Mist
Artifact
file (.jar or .py) that contains a Function
Context
settings for Mist Worker where Function is being
executed
Job
a result of the Function execution

Mist instances
Mist Worker
Spark driver application which invokes functions
Mist Master
exposes HTTP/async API
stores functions/artifacts/contexts
runs/manages workers
runs functions on workers

Mist Master - HTTP API
/v2/api/functions
/v2/api/artifacts
/v2/api/contexts
/v2/api/jobs

Vanilla Spark
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(...)
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  }
}

MistFn
object WordCount extends MistFn {
  override def handle: Handle = {
    arg[String]("path").onSparkContext((path: String, sc: SparkContext) => {
      val counts = sc.textFile(path)
        .flatMap(_.split(" "))
        .map(w => w -> 1)
        .reduceByKey(_ + _)
      // counts.saveAsTextFile("hdfs://...")
      counts.collect().toMap
    }).asHandle
  }
}

Vanilla Spark
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Mist-cli config
model = Context
name = mycontext
data {
  spark-conf {
    spark.master = <master url>
  }
}
model = Artifact
name = wordcount
data.file-path = "./target/scala-2.11/<application-jar>"
model = Function
name = word-count
data {
  path = <artifact_jar>
  class-name = "HelloMist$"
  context = mycontext
}

Mist-cli apply & run
$ mist-cli -f apply conf
$ curl -X POST -d <json-input> \
  'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
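The same call can be made from code instead of curl. Below is a minimal sketch using the JDK 11+ HTTP client. The endpoint is the one from the slide; the object and method names are hypothetical, and the JSON body shape ({"path": ...}, matching the "path" argument of the WordCount function) is an assumption.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RunMistJob {
  // Endpoint copied from the slide; host, port and function name depend on your setup.
  val endpoint = "http://localhost:2004/v2/api/functions/word-count-example/jobs"

  // Builds the POST request. The JSON body shape is an assumption for illustration.
  def buildRequest(jsonInput: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(endpoint))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(jsonInput))
      .build()

  // Sends the request and returns the raw response body (the slide shows a JSON object with an "id").
  def submit(jsonInput: String): String =
    HttpClient.newHttpClient()
      .send(buildRequest(jsonInput), HttpResponse.BodyHandlers.ofString())
      .body()
}
```

Calling RunMistJob.submit("""{"path": "hdfs://..."}""") against a running Mist Master should return the job id, as in the curl output above.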

A lot of actions over an SSH session to test, deploy and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces

Parallelism/Multitenancy
Usual 1 req ~= thread
Spark 1 req ~= process

Context
Mist Context describes parameters of Mist worker and
Spark context
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
    spark.submit.deployMode = cluster
    spark.executor.instances = 2
    spark.executor.cores = 2
    spark.executor.memory = 1G
  }
}

Mist instances
Mist Worker
receives request -> invokes function -> returns result back to master
Mist Master
queues requests
sends them to workers

Run
$ curl -X POST -d <json-input> \
  'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}

Scaling
model = Context
name = cluster_ctx
data {
  precreated = false
  max-parallel-jobs = 2
  ...
}
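To illustrate what a cap such as max-parallel-jobs means in practice, here is a minimal sketch in plain Scala. It is not Mist's implementation: it only shows queued jobs being drained by a fixed number of runners, and all names (BoundedJobQueue, runAll) are hypothetical.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object BoundedJobQueue {
  // Runs `jobs` on at most `maxParallelJobs` threads; the rest wait in the executor's queue.
  // Returns the highest number of jobs observed running at the same time.
  def runAll(maxParallelJobs: Int, jobs: Seq[() => Unit]): Int = {
    val pool = Executors.newFixedThreadPool(maxParallelJobs)
    val lock = new Object
    var running = 0
    var peak = 0
    jobs.foreach { job =>
      pool.execute(new Runnable {
        def run(): Unit = {
          // Track how many jobs are active right now and remember the maximum.
          lock.synchronized { running += 1; peak = math.max(peak, running) }
          try job() finally lock.synchronized { running -= 1 }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    lock.synchronized(peak)
  }
}
```

With maxParallelJobs = 2 and six queued jobs, the returned peak never exceeds 2; the remaining jobs simply wait their turn.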

Worker modes
Exclusive
starts new worker for every request
Shared
reuses worker instance
model = Context
name = cluster_ctx
data {
  precreated = false
  max-parallel-jobs = 2
  worker-mode = "shared"
  // or
  worker-mode = "exclusive"
  ...
}

Multi cluster
Contexts may be configured to work on different clusters
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
  }
}
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = spark://

Pi Estimation
val count = sc.parallelize(1 to samples).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / samples}")

Pi Estimation
What signature is better?
def estimatePi1(samples: Int): Unit
// vs
def estimatePi2(samples: Int): Double
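A minimal sketch of the difference between the two signatures. To keep it runnable without a cluster it uses a local Scala range instead of sc.parallelize, which is a deliberate swap and not the code from the slide; the optional rnd parameter is also an addition for reproducibility.

```scala
import scala.util.Random

object PiEstimation {
  // Monte Carlo core on a local range; with Spark this would be sc.parallelize(1 to samples).
  private def estimate(samples: Int, rnd: Random): Double = {
    val inside = (1 to samples).count { _ =>
      val x = rnd.nextDouble()
      val y = rnd.nextDouble()
      x * x + y * y < 1
    }
    4.0 * inside / samples
  }

  // Unit version: the result only reaches stdout, so a caller has to scrape logs.
  def estimatePi1(samples: Int): Unit =
    println(s"Pi is roughly ${estimate(samples, new Random())}")

  // Double version: the caller gets the value and can serialize or test it.
  def estimatePi2(samples: Int, rnd: Random = new Random()): Double =
    estimate(samples, rnd)
}
```

Only the second form can be returned to an HTTP client or checked in a unit test, which is the point the slide is making.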

Word count again
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Word count again
What signature is better?
def wordCount(from: String, to: String): Unit
// vs
def wordCount(from: String, to: String): Map[String, Int]
// or
sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile
def wordCount(from: String, to: String): OutputFile
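A small sketch of why the typed OutputFile result is convenient for the caller. The OutputFile hierarchy is from the slide; the helpers outputFor and describe, and the rule of choosing the type by URI prefix, are hypothetical.

```scala
object WordCountResult {
  sealed trait OutputFile
  final case class HdfsFile(path: String) extends OutputFile
  final case class S3File(path: String) extends OutputFile

  // Hypothetical helper: picks the output type from the URI scheme of `to`.
  def outputFor(to: String): OutputFile =
    if (to.startsWith("s3://") || to.startsWith("s3a://")) S3File(to) else HdfsFile(to)

  // A caller can turn the typed result into a message without parsing job logs.
  def describe(out: OutputFile): String = out match {
    case HdfsFile(p) => s"result written to HDFS at $p"
    case S3File(p)   => s"result written to S3 at $p"
  }
}
```

Because the trait is sealed, the compiler warns if a new storage type is added and a match does not handle it.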

HOW TO REPRESENT A SPARK JOB?

Spark Application
def main(args: Array[String]): Unit
Array[String] => Unit

MistFn
Context is already described
Json input instead of arguments
Json output instead of Unit
// vs
(Json, SparkContext) => Json

SparkContext types
SparkContext
SparkSession
StreamingContext
SQLContext
// vs
(Json, ?) => Json

Generalize
// developer
// pi estimation: samples => pi
(Int, SparkContext) => Double
// word count - input path => output path
(String, SparkSession) => String
...
(A, ?) => B
(A, B, ?) => C
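One way to connect the typed form (A, ?) => B with the wire-level form (Json, ?) => Json is a pair of type classes. The sketch below is an illustration under assumptions, not Mist's actual API: Json, Decoder, Encoder and lift are all hypothetical names, and the context type is left as a type parameter.

```scala
object TypedFn {
  // Tiny stand-in for a JSON value; a real project would use a JSON library.
  sealed trait Json
  final case class JNum(value: Double) extends Json
  final case class JStr(value: String) extends Json

  trait Decoder[A] { def decode(js: Json): Either[String, A] }
  object Decoder {
    implicit val intDecoder: Decoder[Int] = new Decoder[Int] {
      def decode(js: Json): Either[String, Int] = js match {
        case JNum(v) if v.isValidInt => Right(v.toInt)
        case other                   => Left(s"expected an integer, got $other")
      }
    }
  }

  trait Encoder[B] { def encode(b: B): Json }
  object Encoder {
    implicit val doubleEncoder: Encoder[Double] = new Encoder[Double] {
      def encode(b: Double): Json = JNum(b)
    }
  }

  // Lifts a typed (A, Ctx) => B into (Json, Ctx) => Either[error, Json].
  def lift[A, Ctx, B](f: (A, Ctx) => B)(
      implicit dec: Decoder[A], enc: Encoder[B]): (Json, Ctx) => Either[String, Json] =
    (js, ctx) => dec.decode(js) match {
      case Right(a)  => Right(enc.encode(f(a, ctx)))
      case Left(err) => Left(err)
    }
}
```

The developer writes only the typed function, for example (Int, SparkContext) => Double for Pi estimation, while decoding of the input and encoding of the result, including a readable error for bad input, are derived from the types.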

Future plans
Release 1.0.0
AWS EMR integration - on-demand EMR clusters
...
