MIST
Serverless proxy to Apache Spark
VADIM CHELYSHOV
github.com/dos65
hydrosphere.io
Scalalaz podcast (RUS)
Scala-digest (RUS)
HYDROSPHERE MIST
1.0.0-RC16
https://github.com/Hydrospheredata/mist
WHAT PROBLEM DOES IT SOLVE?
Getting Big Data Projects to Production Is a Challenge
Only 15 percent of businesses reported deploying their
big data project to production, effectively unchanged
from last year (14 percent).
https://www.gartner.com/newsroom/id/3466117
Data engineers vs. data scientists
A common starting point is 2-3 data engineers for every
data scientist. For some organizations with more
complex data engineering requirements, this can be 4-5
data engineers per data scientist
https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
The lack of tooling?
The lack of right tooling?
Getting Spark Projects to Production
How to
provide a way to run our job?
return a result?
report an error?
run multiple jobs in parallel?
LET'S COMPARE WITH USUAL THINGS
Databases
PostgreSQL - 1995
CREATE OR REPLACE FUNCTION foo(param VARCHAR)
RETURNS TABLE
...
Web applications
Java servlet - 1997
public class NewServlet extends HttpServlet {
@Override
protected void doGet(
HttpServletRequest request,
HttpServletResponse response) throws ServletException, IOException {
...
}
}
Big data processing
Hadoop - 2005
Apache Spark - 2011
object MyApp {
def main(arg: Array[String]): Unit = {
val conf = new SparkConf().setAppName(...)
val sc = new SparkContext(conf)
...
}
}
./bin/spark-submit ...
Run
Databases SELECT * FROM foo(param)
Web applications curl
Spark Shell + Trigger
Return result
Databases built-in protocol
Web applications built-in protocol
Spark Read from fs/storages
Result error
Databases explicit error information
Web applications HTTP code + explicit error information
Spark exit code
Parallelism
Databases by default
Web applications by default
Spark Good luck!
What the hell is going on!
These things aren't about big data!
It isn't a normal development experience
A lot of actions over an ssh session to test, deploy and
run your code
Impossible to run a job without a direct shell
command
Processing more than one job in parallel is difficult
Poor interfaces
Just a special service!
Artifacts/settings CRUD
API for launching jobs and receiving their status
Spark driver launcher
A library to describe operations using Spark
MIST BASICS
What is Mist?
Serverless proxy to Apache Spark
What is Mist?
Serverless proxy      ?
Apache Spark      ✔
What is Mist?
It's a combination of
Programming framework
HTTP/async interface to deploy and run Spark programs
Spark contexts / driver manager
Concepts
Function
 user code containing a Spark program, deployed on Mist
Artifact
 a file (.jar or .py) that contains a Function
Context
 settings for the Mist Worker where a Function is executed
Job
 the result of a Function execution
Mist instances
Mist Worker
 a Spark driver application which invokes functions
Mist Master
 exposes an HTTP/async API
 stores functions/artifacts/contexts
 runs/manages workers
 runs functions on workers
Mist Master - HTTP API
/v2/api/functions
/v2/api/artifacts
/v2/api/contexts
/v2/api/jobs
WORD COUNT
Vanilla Spark
object WordCount {
def main(arg: Array[String]): Unit = {
val conf = new SparkConf().setAppName(...)
val sc = new SparkContext(conf)
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
}
}
MistFn
object WordCount extends MistFn {
override def handle: Handle = {
arg[String]("path").onSparkContext((path: String, sc: SparkContext) => {
val counts = sc.textFile(path)
.flatMap(_.split(" "))
.map(w => w -> 1)
.reduceByKey(_ + _ )
// counts.saveAsTextFile("hdfs://...")
counts.collect().toMap
}).asHandle
}
}
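Minus the Spark types, the handler's transformation is plain collection code. A minimal pure-Scala sketch of the same pipeline (the `wordCount` helper is illustrative, not part of Mist's API):

```scala
object WordCountLocal {
  // Same flatMap -> group -> count pipeline as the Spark version,
  // expressed on ordinary Scala collections.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, `WordCountLocal.wordCount(Seq("a b", "a"))` yields `Map("a" -> 2, "b" -> 1)`.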
Vanilla Spark
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
Mist-cli config
model = Context
name = mycontext
data {
spark-conf {
spark.master = <master url>
}
}
model = Artifact
name = wordcount
data.file-path = "./target/scala-2.11/<application-jar>"
model = Function
name = word-count
data {
path = <artifact_jar>
class-name = "HelloMist$"
context = mycontext
}
Mist-cli apply & run
$ mist-cli -f apply conf
$ curl -X POST -d <json-input> \
'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
A lot of actions over an ssh session to test, deploy and
run your code
Impossible to run a job without a direct shell
command
Processing more than one job in parallel is difficult
Poor interfaces
SERVERLESS?
Parallelism/Multitenancy
Usual 1 req ~= thread
Spark 1 req ~= process
Serverless
Context
A Mist Context describes the parameters of a Mist worker and
its Spark context
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = yarn
spark.submit.deployMode = cluster
spark.executor.instances = 2
spark.executor.cores = 2
spark.executor.memory = 1G
}
}
Mist instances
Mist Worker
 receives req -> invokes functions -> returns result
back to master
Mist Master
 queues requests
 sends them to workers
Apply
Run
$ curl -X POST -d <json-input> \
'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
Scaling
model = Context
name = cluster_ctx
data {
precreated = false
max-parallel-jobs = 2
...
}
Worker modes
Exclusive
 starts a new worker for every request
Shared
 reuses a worker instance
model = Context
name = cluster_ctx
data {
precreated = false
max-parallel-jobs = 2
worker-mode = "shared"
// or
worker-mode = "exclusive"
...
}
Multi cluster
Contexts may be configured to work on different
clusters
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = yarn
}
}
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = spark://...
}
}
Multi cluster
A lot of actions over an ssh session to test, deploy and
run your code
Impossible to run a job without a direct shell
command
Processing more than one job in parallel is difficult
Poor interfaces
SPARK FUNCTION
Pi Estimation
def main(arg: Array[String]): Unit = {
val count = sc.parallelize(1 to samples).filter { _ =>
val x = math.random
val y = math.random
x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / samples}")
}
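The same estimate also works without a cluster. A seeded, Spark-free sketch (the `estimatePi` name is illustrative) that returns the value instead of printing it:

```scala
object PiLocal {
  // Monte Carlo: the fraction of random points inside the unit
  // quarter-circle approaches pi/4 as the sample count grows.
  def estimatePi(samples: Int): Double = {
    val rnd = new scala.util.Random(42) // fixed seed for reproducibility
    val inside = (1 to samples).count { _ =>
      val x = rnd.nextDouble()
      val y = rnd.nextDouble()
      x * x + y * y < 1
    }
    4.0 * inside / samples
  }
}
```

Returning a `Double` rather than printing sets up the signature question on the next slide.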
Pi Estimation
What signature is better?
def estimatePi1(samples: Int): Unit
// vs
def estimatePi2(samples: Int): Double
Word count again
def main(arg: Array[String]): Unit = {
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
}
Word count again
What signature is better?
def wordCount(from: String, to: String): Unit
// vs
def wordCount(from: String, to: String): Map[String, Int]
// or
sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile
def wordCount(from: String, to: String): OutputFile
HOW TO REPRESENT A SPARK JOB?
Spark Application
def main(args: Array[String]): Unit
Array[String] => Unit
MistFn
Context is already described
Json input instead of arguments
Json output instead of Unit
Array[String] => Unit
// vs
(Json, SparkContext) => Json
SparkContext types
SparkContext
SparkSession
StreamingContext
SQLContext
Array[String] => Unit
// vs
(Json, ?) => Json
Generalize
// developer
// pi estimation: samples => pi
(Int, SparkContext) => Double
// word count - input path => output path
(String, SparkSession) => String
...
(A, ?) => B
(A, B, ?) => C
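One way to read `(A, ?) => B` is as an interface parameterized by the context type. A hypothetical sketch (the `Handle2` and `LocalCtx` names are illustrative, not Mist's actual API):

```scala
// A function from input A to result B, given some context C
// (SparkContext, SparkSession, ... in the real framework).
trait Handle2[A, C, B] {
  def run(a: A, ctx: C): B
}

object Handle2Demo {
  // Dummy context standing in for SparkContext in this sketch.
  final case class LocalCtx(parallelism: Int)

  // String input, Int output, context type fixed to LocalCtx.
  val wordLen: Handle2[String, LocalCtx, Int] =
    new Handle2[String, LocalCtx, Int] {
      def run(a: String, ctx: LocalCtx): Int = a.length
    }
}
```

For example, `Handle2Demo.wordLen.run("spark", Handle2Demo.LocalCtx(1))` returns 5; fixing `C` per worker is what lets the framework supply the context instead of the caller.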
SPARK FUNCTION
A lot of actions over an ssh session to test, deploy and
run your code
Impossible to run a job without a direct shell
command
Processing more than one job in parallel is difficult
Poor interfaces
Future plans
Release 1.0.0
AWS EMR integration - on-demand EMR clusters
...
