5. Getting Big Data Projects to Production Is a Challenge
Only 15 percent of businesses reported deploying their
big data project to production, effectively unchanged
from last year (14 percent).
https://www.gartner.com/newsroom/id/3466117
6. Data engineers vs. data scientists
A common starting point is 2-3 data engineers for every
data scientist. For some organizations with more
complex data engineering requirements, this can be 4-5
data engineers per data scientist
https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
9. The lack of the right tooling?
10. Getting Spark Projects to Production
How to
provide a way to run our job?
return a result?
report an error?
run multiple jobs in parallel?
13. Web applications
Java servlet - 1997
public class NewServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest request,
                       HttpServletResponse response)
      throws ServletException, IOException {
    ...
  }
}
14. Big data processing
Hadoop - 2005
Apache Spark - 2011
object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(...)
    val sc = new SparkContext(conf)
    ...
  }
}
./bin/spark-submit ...
19. What the hell is going on?!
These things aren't about big data!
20. It isn't a normal development experience
A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
21. Just a special service!
Artifacts/settings CRUD
API for launching jobs and receiving their status
Spark driver launcher
A library to describe operations using Spark
25. What is Mist?
It's a combination of
a programming framework
an HTTP/async interface to deploy and run Spark programs
a Spark context / driver manager
26. Concepts
Function
user code with a Spark program to be deployed on Mist
Artifact
a file (.jar or .py) that contains a Function
Context
settings for the Mist Worker where a Function is executed
Job
the result of a Function execution
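For illustration, the Artifact with its Function might be registered with a config in the same style as the Context configs shown later; the field names used here (path, class-name, context) are an assumption for the sketch, not a quote from the Mist docs:
model = Function
name = pi-estimate
data {
  // assumed fields: where the artifact file lives, which class to invoke,
  // and which Context to run it in
  path = "pi-estimate.jar"
  class-name = "PiEstimate$"
  context = "cluster_ctx"
}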
27. Mist instances
Mist Worker
a Spark driver application that invokes functions
Mist Master
exposes the HTTP/async API
stores functions/artifacts/contexts
runs/manages workers
runs functions on workers
35. A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
39. Context
A Mist Context describes the parameters of a Mist Worker and its Spark context
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
    spark.submit.deployMode = cluster
    spark.executor.instances = 2
    spark.executor.cores = 2
    spark.executor.memory = 1G
  }
}
40. Mist instances
Mist Worker
receives a request -> invokes the function -> returns the result back to the master
Mist Master
queues requests
sends them to workers
47. Worker modes
Exclusive
starts a new worker for every request
Shared
reuses a worker instance
model = Context
name = cluster_ctx
data {
  precreated = false
  max-parallel-jobs = 2
  worker-mode = "shared"
  // or
  worker-mode = "exclusive"
  ...
}
48. Multi cluster
Contexts may be configured to work on different
clusters
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
  }
}

model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = spark://
50. A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
52. Pi Estimation
def main(args: Array[String]): Unit = {
  // sc is the SparkContext built from SparkConf, as on the earlier slide;
  // samples comes from the command-line arguments
  val samples = args(0).toInt
  val count = sc.parallelize(1 to samples).filter { _ =>
    val x = math.random
    val y = math.random
    x*x + y*y < 1
  }.count()
  println(s"Pi is roughly ${4.0 * count / samples}")
}
53. Pi Estimation
What signature is better?
def estimatePi1(samples: Int): Unit
// vs
def estimatePi2(samples: Int): Double
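A minimal sketch of the second shape, assuming the SparkContext is passed in explicitly rather than created inside main:
import org.apache.spark.SparkContext

// Returns the estimate instead of printing it, so the caller can actually use the result
def estimatePi2(sc: SparkContext, samples: Int): Double = {
  val count = sc.parallelize(1 to samples).filter { _ =>
    val x = math.random
    val y = math.random
    x * x + y * y < 1
  }.count()
  4.0 * count / samples
}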
54. Word count again
def main(args: Array[String]): Unit = {
  // sc is the SparkContext built from SparkConf, as on the earlier slide
  val textFile = sc.textFile("hdfs://...")
  val counts = textFile.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://...")
}
55. Word count again
What signature is better?
def wordCount(from: String, to: String): Unit
// vs
def wordCount(from: String, to: String): Map[String, Int]
// or
sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile
def wordCount(from: String, to: String): OutputFile
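A minimal sketch of the last variant, assuming the SparkContext is passed in and the destination is an HDFS path:
import org.apache.spark.SparkContext

sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile

// Writes the counts and returns a typed description of where the result landed,
// instead of Unit, so the caller gets something back
def wordCount(sc: SparkContext, from: String, to: String): OutputFile = {
  val counts = sc.textFile(from)
    .flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile(to)
  HdfsFile(to)
}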
58. MistFn
The Context is already described
JSON input instead of arguments
JSON output instead of Unit
Array[String] => Unit
// vs
(Json, SparkContext) => Json
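A rough sketch of that shape in plain Scala (not the exact Mist API; the Json stand-in and the "samples"/"pi" field names are illustrative):
import org.apache.spark.SparkContext

object PiFunction {
  // Illustrative stand-in for a JSON value; the real Mist library ships its own types
  type Json = Map[String, Any]

  // Takes parsed JSON arguments plus a ready SparkContext and returns JSON,
  // so the result goes back to the caller instead of being printed to stdout
  def estimatePi(args: Json, sc: SparkContext): Json = {
    val samples = args("samples").asInstanceOf[Int]
    val count = sc.parallelize(1 to samples).filter { _ =>
      val x = math.random
      val y = math.random
      x * x + y * y < 1
    }.count()
    Map("pi" -> 4.0 * count / samples)
  }
}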
62. A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces