MIST
Serverless proxy to Apache Spark
VADIM CHELYSHOV
github.com/dos65
hydrosphere.io
Scalalaz podcast (RUS)
Scala-digest (RUS)
HYDROSPHERE MIST
1.0.0-RC16
https://github.com/Hydrospheredata/mist
WHAT PROBLEM DOES IT SOLVE?
Getting Big Data Projects to Production Is a Challenge
Only 15 percent of businesses reported deploying their
big data project to production, effectively unchanged
from last year (14 percent).
https://www.gartner.com/newsroom/id/3466117
Data engineers vs. data scientists
A common starting point is 2-3 data engineers for every
data scientist. For some organizations with more
complex data engineering requirements, this can be 4-5
data engineers per data scientist
https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
The lack of tooling?
The lack of right tooling?
Getting Spark Projects to Production
How to
provide a way to run our job?
return a result?
report an error?
run multiple jobs in parallel?
LET'S COMPARE WITH USUAL THINGS
Databases
PostgreSQL - 1995
CREATE OR REPLACE FUNCTION foo(param VARCHAR)
RETURNS TABLE
...
Web applications
Java servlet - 1997
public class NewServlet extends HttpServlet {
    @Override
    protected void doGet(
            HttpServletRequest request,
            HttpServletResponse response) throws ServletException, IOException {
        ...
    }
}
Big data processing
Hadoop - 2005
Apache Spark - 2011
object MyApp {
  def main(arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(...)
    val sc = new SparkContext(conf)
    ...
  }
}

./bin/spark-submit ...
Run
  Databases:        SELECT * FROM foo(param)
  Web applications: curl
  Spark:            Shell + Trigger
Return result
  Databases:        built-in protocol
  Web applications: built-in protocol
  Spark:            Read from fs/storages
Result error
  Databases:        explicit error information
  Web applications: HTTP code + explicit error information
  Spark:            exit code
Parallelism
  Databases:        by default
  Web applications: by default
  Spark:            Good luck!
What the hell is going on?
These things aren't about big data!
It isn't a normal development experience
A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
Just a special service!
Artifacts/settings CRUD
API for launching jobs and receiving their status
Spark driver launcher
A library to describe operations using Spark
MIST BASICS
What is Mist?
Serverless proxy to Apache Spark
What is Mist?
Serverless proxy      ?
Apache Spark      ✔
What is Mist?
It's a combination of
Programming framework
HTTP/async interface to deploy and run Spark programs
Spark contexts / driver manager
Concepts
Function
 user code with Spark program to be deployed on Mist
Artifact
 file (.jar or .py) that contains a Function
Context
 settings for Mist Worker where Function is being
executed
Job
 a result of the Function execution
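The relationship between the four concepts can be pictured as a small data model. This is only a sketch for orientation: the case classes, field names, and the `local[*]` master value below are hypothetical and are not Mist's actual classes. The sample values reuse names that appear in the mist-cli config and curl response elsewhere in this deck.

```scala
// Hypothetical data model for the four concepts; not Mist's real classes.
object MistConcepts {
  // Artifact: a file (.jar or .py) that contains a Function.
  final case class Artifact(name: String, filePath: String)

  // Context: settings for the Mist Worker (including Spark conf) where a Function runs.
  final case class Context(name: String, sparkConf: Map[String, String])

  // Function: user code, located by artifact and class name, bound to a context.
  final case class FunctionDef(name: String, artifact: Artifact, className: String, context: Context)

  // Job: the result of one Function execution (Left = error message, Right = result payload).
  final case class Job(id: String, function: FunctionDef, result: Either[String, String])

  // "local[*]" is an assumed value used only to make the example concrete.
  val ctx: Context = Context("mycontext", Map("spark.master" -> "local[*]"))
  val artifact: Artifact = Artifact("wordcount", "./target/scala-2.11/<application-jar>")
  val wordCount: FunctionDef = FunctionDef("word-count", artifact, "HelloMist$", ctx)
  val job: Job = Job("00e5294d-77f4-463b-85b6-4b73185eab1a", wordCount, Right("{}"))
}
```

Reading it top-down: a Job points at the Function that produced it, and the Function points at both the Artifact holding its code and the Context it runs in.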
Mist instances
Mist Worker
 Spark driver application which invokes functions
Mist Master
 exposes HTTP/async API
 stores functions/artifacts/contexts
 runs/manages workers
 runs functions on workers
Mist Master - HTTP API
/v2/api/functions
/v2/api/artifacts
/v2/api/contexts
/v2/api/jobs
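As a rough illustration of how a client might address these endpoints, the helper below assembles URLs from the four resource paths. The object and method names are hypothetical; the job-launch pattern (`/functions/<name>/jobs`), the host `localhost`, and the port `2004` are taken from the curl example in this deck, not from a full API reference.

```scala
// Hypothetical URL helper; only the paths themselves come from the slides.
object MistEndpoints {
  // Base path shared by the four resources listed above.
  val ApiBase = "/v2/api"

  val Functions: String = s"$ApiBase/functions"
  val Artifacts: String = s"$ApiBase/artifacts"
  val Contexts: String  = s"$ApiBase/contexts"
  val Jobs: String      = s"$ApiBase/jobs"

  // URL that a POST request would target to start a job for the given function.
  def runJobUrl(host: String, port: Int, functionName: String): String =
    s"http://$host:$port$Functions/$functionName/jobs"
}
```

For example, `MistEndpoints.runJobUrl("localhost", 2004, "word-count-example")` yields the URL used in the curl call shown later.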
WORD COUNT
Vanilla Spark
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(arg: Array[String]): Unit = {
    val conf = new SparkConf().setAppName(...)
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("hdfs://...")
    // Split lines into words, pair each word with 1, then sum per word
    val counts = textFile.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  }
}
MistFn
object WordCount extends MistFn {
  override def handle: Handle = {
    arg[String]("path").onSparkContext((path: String, sc: SparkContext) => {
      val counts = sc.textFile(path)
        .flatMap(_.split(" "))
        .map(w => w -> 1)
        .reduceByKey(_ + _)
      // counts.saveAsTextFile("hdfs://...")
      counts.collect().toMap
    }).asHandle
  }
}
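The key difference from the vanilla version is that the function returns a `Map` to the caller instead of writing to storage. To make that visible without a Spark cluster, here is a local stand-in that applies the same split-and-count logic to an in-memory `Seq[String]`. It is an assumption-laden sketch: `WordCountLogic` is a hypothetical name, plain Scala collections replace the RDD, and empty tokens are filtered out, which the slide code does not do.

```scala
// Local stand-in for the Spark word count; uses Scala collections, not an RDD.
object WordCountLogic {
  // Splits each line on spaces and counts how often each word occurs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty) // assumption: ignore empty tokens from repeated spaces
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Because the result is an ordinary value, it can be checked directly, for instance by comparing `WordCountLogic.wordCount(Seq("a b a", "b c"))` against an expected map.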
Vanilla Spark
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Mist-cli config
model = Context
name = mycontext
data {
  spark-conf {
    spark.master = <master url>
  }
}

model = Artifact
name = wordcount
data.file-path = "./target/scala-2.11/<application-jar>"

model = Function
name = word-count
data {
  path = <artifact_jar>
  class-name = "HelloMist$"
  context = mycontext
}
Mist-cli apply & run
$ mist-cli -f apply conf
$ curl -X POST -d <json-input> \
    'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
SERVERLESS?
Parallelism/Multitenancy
  Usual: 1 request ~= 1 thread
  Spark: 1 request ~= 1 process
Serverless
Context
Mist Context describes parameters of Mist worker and
Spark context
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
    spark.submit.deployMode = cluster
    spark.executor.instances = 2
    spark.executor.cores = 2
    spark.executor.memory = 1G
  }
}
Mist instances
Mist Worker
 receives request -> invokes function -> returns result back to master
Mist Master
 queues requests
 sends them to workers
Apply
Run
$ curl -X POST -d <json-input> \
    'http://localhost:2004/v2/api/functions/word-count-example/jobs'
{"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
Scaling
model = Context
name = cluster_ctx
data {
  precreated = false
  max-parallel-jobs = 2
  ...
}
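One way to picture what a `max-parallel-jobs = 2` limit means is a fixed-size thread pool: extra jobs wait in a queue until a slot frees up. The sketch below is an analogy on the JVM only and does not reproduce Mist's scheduler; `BoundedParallelism` and `peakConcurrency` are hypothetical names, and the 20 ms sleep stands in for real work.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Analogy only: a fixed thread pool caps how many submitted jobs run at once.
object BoundedParallelism {
  // Submits `jobs` tasks to a pool of `maxParallelJobs` threads and returns
  // the highest number of tasks observed running at the same time.
  def peakConcurrency(jobs: Int, maxParallelJobs: Int): Int = {
    val pool = Executors.newFixedThreadPool(maxParallelJobs)
    val lock = new Object
    var running = 0
    var peak = 0
    (1 to jobs).foreach { _ =>
      pool.execute(new Runnable {
        def run(): Unit = {
          lock.synchronized { running += 1; peak = math.max(peak, running) }
          Thread.sleep(20) // pretend to do some work
          lock.synchronized { running -= 1 }
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    lock.synchronized(peak)
  }
}
```

With six jobs and a limit of two, the observed peak never exceeds two; the remaining jobs are simply queued, which matches the "queues requests" role described for the Mist Master.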
Worker modes
Exclusive
 starts new worker for every request
Shared
 reuses worker instance
model = Context
name = cluster_ctx
data {
  precreated = false
  max-parallel-jobs = 2
  worker-mode = "shared"
  // or
  worker-mode = "exclusive"
  ...
}
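The difference between the two modes can be shown with a toy allocation function: given a number of incoming requests, which worker handles each one? This is a simplified model under stated assumptions, not Mist's implementation; `WorkerModes`, `assignWorkers`, and the `worker-N` ids are hypothetical.

```scala
// Toy model of worker allocation; not Mist's real logic.
object WorkerModes {
  sealed trait WorkerMode
  case object Exclusive extends WorkerMode
  case object Shared extends WorkerMode

  // Returns the id of the worker assigned to each of `requests` incoming requests.
  def assignWorkers(mode: WorkerMode, requests: Int): Seq[String] =
    mode match {
      case Exclusive => (1 to requests).map(i => s"worker-$i") // a new worker per request
      case Shared    => Seq.fill(requests)("worker-1")          // one worker instance reused
    }
}
```

Exclusive mode gives each request full isolation at the cost of starting a new Spark driver each time, while shared mode amortizes startup across requests.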
Multi cluster
Contexts may be configured to work on different clusters
model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = yarn
  }
}

model = Context
name = cluster_ctx
data {
  precreated = false
  spark-conf {
    spark.master = spark://
  }
}
A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
SPARK FUNCTION
Pi Estimation
def main(arg: Array[String]): Unit = {
  val count = sc.parallelize(1 to samples).filter { _ =>
    val x = math.random
    val y = math.random
    x*x + y*y < 1
  }.count()
  println(s"Pi is roughly ${4.0 * count / samples}")
}
Pi Estimation
What signature is better?
def estimatePi1(samples: Int): Unit
// vs
def estimatePi2(samples: Int): Double
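To see why the `Double`-returning signature is easier to work with, consider a local version of the same Monte Carlo estimate. This sketch is an assumption: it swaps `sc.parallelize` for a plain Scala range so it runs without Spark, and the `seed` parameter is added here only to make runs repeatable.

```scala
import scala.util.Random

// Local stand-in for the Spark job; a plain range replaces sc.parallelize.
object PiEstimation {
  // Returns the estimate instead of printing it, so a caller or a test can use the value.
  def estimatePi(samples: Int, seed: Long = 42L): Double = {
    val rnd = new Random(seed)
    // Count random points in the unit square that fall inside the quarter circle.
    val inside = (1 to samples).count { _ =>
      val x = rnd.nextDouble()
      val y = rnd.nextDouble()
      x * x + y * y < 1
    }
    4.0 * inside / samples
  }
}
```

A `Unit` version could only be verified by scraping stdout, whereas this one can be compared against `math.Pi` within a tolerance.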
Word count again
def main(arg: Array[String]): Unit = {
  val textFile = sc.textFile("hdfs://...")
  val counts = textFile.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://...")
}
Word count again
What signature is better?
def wordCount(from: String, to: String): Unit
// vs
def wordCount(from: String, to: String): Map[String, Int]
// or
sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile
def wordCount(from: String, to: String): OutputFile
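A typed `OutputFile` result lets the caller handle each storage kind explicitly, and the compiler can check that no case is forgotten. The sketch below repeats the sealed trait from the slide and adds a hypothetical `toUri` helper; the assumption that `path` excludes the URI scheme is made here purely for illustration.

```scala
object WordCountOutput {
  sealed trait OutputFile
  final case class HdfsFile(path: String) extends OutputFile
  final case class S3File(path: String) extends OutputFile

  // Hypothetical helper: turns the typed result into a URI a client could follow.
  // Assumes `path` does not already include a scheme.
  def toUri(out: OutputFile): String = out match {
    case HdfsFile(path) => s"hdfs://$path"
    case S3File(path)   => s"s3://$path"
  }
}
```

Because the trait is sealed, adding a third storage type later makes the compiler warn about every `match` that does not cover it.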
HOW TO REPRESENT A SPARK JOB?
Spark Application
def main(args: Array[String]): Unit
Array[String] => Unit
MistFn
Context is already described
JSON input instead of arguments
JSON output instead of Unit
Array[String] => Unit
// vs
(Json, SparkContext) => Json
SparkContext types
SparkContext
SparkSession
StreamingContext
SQLContext
Array[String] => Unit
// vs
(Json, ?) => Json
Generalize
// developer
// pi estimation: samples => pi
(Int, SparkContext) => Double
// word count - input path => output path
(String, SparkSession) => String
...
(A, ?) => B
(A, B, ?) => C
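The generalization can be sketched as an adapter: the developer writes a typed body `(A, Ctx) => B`, and a wrapper turns it into a text-in, text-out function that a service can expose. Everything named below (`GenericFn`, `Body`, `lift`, `FakeContext`, `doubler`) is hypothetical and is not Mist's API; plain strings stand in for JSON, and `FakeContext` stands in for whichever Spark context type fills the `?` slot.

```scala
// Hypothetical sketch of the (A, ?) => B idea; not Mist's actual API.
object GenericFn {
  // A job body: a typed argument A plus some context Ctx produces a typed result B.
  type Body[A, Ctx, B] = (A, Ctx) => B

  // Wraps a typed body so the outside world sees only text in and text out.
  def lift[A, Ctx, B](decode: String => A, encode: B => String)(
      body: Body[A, Ctx, B]): (String, Ctx) => String =
    (input, ctx) => encode(body(decode(input), ctx))

  // Stand-in for SparkContext / SparkSession so the sketch runs without Spark.
  final case class FakeContext(appName: String)

  // Example: a typed Int => Int body exposed through the text interface.
  val doubler: (String, FakeContext) => String =
    lift[Int, FakeContext, Int](_.toInt, _.toString)((n, _) => n * 2)
}
```

The developer only deals with `Int` and the context, while decoding and encoding live in one place; calling `GenericFn.doubler("21", GenericFn.FakeContext("demo"))` goes through decode, body, and encode in that order.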
SPARK FUNCTION
A lot of actions over an SSH session to test, deploy, and run your code
Impossible to run a job without a direct shell command
Processing more than one job in parallel is difficult
Poor interfaces
Future plans
Release 1.0.0
AWS EMR integration - on-demand EMR clusters
...

More Related Content

What's hot

今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?Sho Yoshida
 
apidays LIVE Australia - Building distributed systems on the shoulders of gia...
apidays LIVE Australia - Building distributed systems on the shoulders of gia...apidays LIVE Australia - Building distributed systems on the shoulders of gia...
apidays LIVE Australia - Building distributed systems on the shoulders of gia...apidays
 
A Modest Introduction To Swift
A Modest Introduction To SwiftA Modest Introduction To Swift
A Modest Introduction To SwiftJohn Anderson
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Jean-Georges Perrin
 
My Top 5 APEX JavaScript API's
My Top 5 APEX JavaScript API'sMy Top 5 APEX JavaScript API's
My Top 5 APEX JavaScript API'sRoel Hartman
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive HookMinwoo Kim
 
Reactive Programming - ReactFoo 2020 - Aziz Khambati
Reactive Programming - ReactFoo 2020 - Aziz KhambatiReactive Programming - ReactFoo 2020 - Aziz Khambati
Reactive Programming - ReactFoo 2020 - Aziz KhambatiAziz Khambati
 
InterConnect: Java, Node.js and Swift - Which, Why and When
InterConnect: Java, Node.js and Swift - Which, Why and WhenInterConnect: Java, Node.js and Swift - Which, Why and When
InterConnect: Java, Node.js and Swift - Which, Why and WhenChris Bailey
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraRustam Aliyev
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangSri Ambati
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Sematext Group, Inc.
 
A Lifecycle Of Code Under Test by Robert Fornal
A Lifecycle Of Code Under Test by Robert FornalA Lifecycle Of Code Under Test by Robert Fornal
A Lifecycle Of Code Under Test by Robert FornalQA or the Highway
 
Why You Should Use TAPIs
Why You Should Use TAPIsWhy You Should Use TAPIs
Why You Should Use TAPIsJeffrey Kemp
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharpergranicz
 
Server Side Swift
Server Side SwiftServer Side Swift
Server Side SwiftJens Ravens
 

What's hot (20)

今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?今時なウェブ開発をSmalltalkでやってみる?
今時なウェブ開発をSmalltalkでやってみる?
 
apidays LIVE Australia - Building distributed systems on the shoulders of gia...
apidays LIVE Australia - Building distributed systems on the shoulders of gia...apidays LIVE Australia - Building distributed systems on the shoulders of gia...
apidays LIVE Australia - Building distributed systems on the shoulders of gia...
 
A Modest Introduction To Swift
A Modest Introduction To SwiftA Modest Introduction To Swift
A Modest Introduction To Swift
 
Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)Spark hands-on tutorial (rev. 002)
Spark hands-on tutorial (rev. 002)
 
My Top 5 APEX JavaScript API's
My Top 5 APEX JavaScript API'sMy Top 5 APEX JavaScript API's
My Top 5 APEX JavaScript API's
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Apache Hive Hook
Apache Hive HookApache Hive Hook
Apache Hive Hook
 
Reactive Programming - ReactFoo 2020 - Aziz Khambati
Reactive Programming - ReactFoo 2020 - Aziz KhambatiReactive Programming - ReactFoo 2020 - Aziz Khambati
Reactive Programming - ReactFoo 2020 - Aziz Khambati
 
InterConnect: Java, Node.js and Swift - Which, Why and When
InterConnect: Java, Node.js and Swift - Which, Why and WhenInterConnect: Java, Node.js and Swift - Which, Why and When
InterConnect: Java, Node.js and Swift - Which, Why and When
 
Lightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and CassandraLightning fast analytics with Spark and Cassandra
Lightning fast analytics with Spark and Cassandra
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
 
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
 
A Lifecycle Of Code Under Test by Robert Fornal
A Lifecycle Of Code Under Test by Robert FornalA Lifecycle Of Code Under Test by Robert Fornal
A Lifecycle Of Code Under Test by Robert Fornal
 
Why You Should Use TAPIs
Why You Should Use TAPIsWhy You Should Use TAPIs
Why You Should Use TAPIs
 
Apache spark basics
Apache spark basicsApache spark basics
Apache spark basics
 
Embuk internals
Embuk internalsEmbuk internals
Embuk internals
 
API Performance
API PerformanceAPI Performance
API Performance
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
 
Terraform day02
Terraform day02Terraform day02
Terraform day02
 
Server Side Swift
Server Side SwiftServer Side Swift
Server Side Swift
 

Similar to Mist - Serverless proxy to Apache Spark

PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Michael Rys
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Burn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websitesBurn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websitesLindsay Holmwood
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling WaterSri Ambati
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/KuduChris George
 
Writing robust Node.js applications
Writing robust Node.js applicationsWriting robust Node.js applications
Writing robust Node.js applicationsTom Croucher
 
Building Testable PHP Applications
Building Testable PHP ApplicationsBuilding Testable PHP Applications
Building Testable PHP Applicationschartjes
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Infrastructure-as-code: bridging the gap between Devs and Ops
Infrastructure-as-code: bridging the gap between Devs and OpsInfrastructure-as-code: bridging the gap between Devs and Ops
Infrastructure-as-code: bridging the gap between Devs and OpsMykyta Protsenko
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 

Similar to Mist - Serverless proxy to Apache Spark (20)

Spark ML Pipeline serving
Spark ML Pipeline servingSpark ML Pipeline serving
Spark ML Pipeline serving
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
Burn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websitesBurn down the silos! Helping dev and ops gel on high availability websites
Burn down the silos! Helping dev and ops gel on high availability websites
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
H2O PySparkling Water
H2O PySparkling WaterH2O PySparkling Water
H2O PySparkling Water
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
High concurrency,
Low latency analytics
using Spark/Kudu
 High concurrency,
Low latency analytics
using Spark/Kudu High concurrency,
Low latency analytics
using Spark/Kudu
High concurrency,
Low latency analytics
using Spark/Kudu
 
Writing robust Node.js applications
Writing robust Node.js applicationsWriting robust Node.js applications
Writing robust Node.js applications
 
Building Testable PHP Applications
Building Testable PHP ApplicationsBuilding Testable PHP Applications
Building Testable PHP Applications
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Infrastructure-as-code: bridging the gap between Devs and Ops
Infrastructure-as-code: bridging the gap between Devs and OpsInfrastructure-as-code: bridging the gap between Devs and Ops
Infrastructure-as-code: bridging the gap between Devs and Ops
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup

Mist - Serverless proxy to Apache Spark

  • 4. WHAT PROBLEM DOES IT SOLVE?
  • 5. Getting Big Data Projects to Production Is a Challenge
      Only 15 percent of businesses reported deploying their big data project to production, effectively unchanged from last year (14 percent).
      https://www.gartner.com/newsroom/id/3466117
  • 6. Data engineers vs. data scientists
      A common starting point is 2-3 data engineers for every data scientist. For some organizations with more complex data engineering requirements, this can be 4-5 data engineers per data scientist.
      https://www.oreilly.com/ideas/data-engineers-vs-data-scientists
  • 7. The lack of tooling?
  • 9. The lack of the right tooling?
  • 10. Getting Spark Projects to Production
      How to
        provide a way to run our job?
        return a result?
        report an error?
        run multiple jobs in parallel?
  • 11. LET'S COMPARE WITH USUAL THINGS
  • 12. Databases
      PostgreSQL - 1995
        CREATE OR REPLACE FUNCTION foo(param VARCHAR)
        RETURNS TABLE
        ...
  • 13. Web applications
      Java servlet - 1997
        public class NewServlet extends HttpServlet {
          @Override
          protected void doGet(
              HttpServletRequest request,
              HttpServletResponse response) throws ServletException, IOException {
            ...
          }
        }
  • 14. Big data processing
      Hadoop - 2005
      Apache Spark - 2011
        object MyApp {
          def main(arg: Array[String]): Unit = {
            val conf = new SparkConf().setAppName(...)
            val sc = new SparkContext(conf)
            ...
          }
        }
        ./bin/spark-submit ...
  • 15. Run
      Databases: SELECT * FROM foo(param)
      Web applications: curl
      Spark: Shell + Trigger
  • 16. Return result
      Databases: built-in protocol
      Web applications: built-in protocol
      Spark: read from fs/storage
  • 17. Result error
      Databases: explicit error information
      Web applications: HTTP code + explicit error information
      Spark: exit code
  • 18. Parallelism
      Databases: by default
      Web applications: by default
      Spark: Good luck!
  • 19. What the hell is going on! These things aren't about big data!
  • 20. It isn't a normal development experience
      A lot of actions over an ssh session to test, deploy and run your code
      Impossible to run a job without a direct shell command
      Processing more than one job in parallel is difficult
      Poor interfaces
  • 21. Just a special service!
      Artifacts/settings CRUD
      API for launching jobs and receiving their status
      Spark driver launcher
      A library to describe operations using Spark
  • 23. What is Mist?
      Serverless proxy to Apache Spark
  • 24. What is Mist?
      Serverless proxy ?
      Apache Spark ✔
  • 25. What is Mist?
      It's a combination of
        Programming framework
        HTTP/Async interface to deploy and run Spark programs
        Spark contexts / driver manager
  • 26. Concepts
      Function: user code with a Spark program to be deployed on Mist
      Artifact: file (.jar or .py) that contains a Function
      Context: settings for the Mist Worker where the Function is executed
      Job: a result of the Function execution
  • 27. Mist instances
      Mist Worker
        Spark driver application which invokes functions
      Mist Master
        exposes http/async api
        stores functions/artifacts/contexts
        runs/manages workers
        runs functions on workers
  • 28. Mist Master - Http Api
      /v2/api/functions
      /v2/api/artifacts
      /v2/api/contexts
      /v2/api/jobs
  • 30. Vanilla Spark
        object WordCount {
          def main(arg: Array[String]): Unit = {
            val conf = new SparkConf().setAppName(...)
            val sc = new SparkContext(conf)
            val textFile = sc.textFile("hdfs://...")
            val counts = textFile.flatMap(line => line.split(" "))
              .map(word => (word, 1))
              .reduceByKey(_ + _)
            counts.saveAsTextFile("hdfs://...")
          }
        }
  • 31. MistFn
        object WordCount extends MistFn {
          override def handle: Handle = {
            arg[String]("path").onSparkContext((path: String, sc: SparkContext) => {
              val counts = sc.textFile(path)
                .flatMap(_.split(" "))
                .map(w => w -> 1)
                .reduceByKey(_ + _)
              // counts.saveAsTextFile("hdfs://...")
              counts.collect().toMap
            }).asHandle
          }
        }
  • 32. Vanilla Spark
        ./bin/spark-submit \
          --class <main-class> \
          --master <master-url> \
          --deploy-mode <deploy-mode> \
          --conf <key>=<value> \
          ... # other options
          <application-jar> \
          [application-arguments]
  • 33. Mist-cli config
        model = Context
        name = mycontext
        data {
          spark-conf {
            spark.master = <master url>
          }
        }

        model = Artifact
        name = wordcount
        data.file-path = "./target/scala-2.11/<application-jar>"

        model = Function
        name = word-count
        data {
          path = <artifact_jar>
          class-name = "HelloMist$"
          context = mycontext
        }
  • 34. Mist-cli apply & run
        $ mist-cli -f apply conf
        $ curl -X POST -d <json-input> 'http://localhost:2004/v2/api/functions/word-count-example/jobs'
        {"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
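The curl call on slide 34 can also be issued from code. The following is a minimal sketch using the JDK 11+ `java.net.http` client; the URL is the one shown on the slide, while the object name `SubmitJob`, the `Content-Type` header and the shape of the JSON input are assumptions made for illustration, not part of Mist's documented interface.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SubmitJob {
  // Endpoint as shown on the slide; host, port and function name depend on your setup.
  val JobsUrl = "http://localhost:2004/v2/api/functions/word-count-example/jobs"

  // Builds the POST request. The JSON content type is an assumption.
  def buildRequest(jsonInput: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(JobsUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(jsonInput))
      .build()

  // Sends the request and returns the raw response body
  // (on the slide this is a JSON object carrying the job id).
  def submit(jsonInput: String): String = {
    val client = HttpClient.newHttpClient()
    client.send(buildRequest(jsonInput), HttpResponse.BodyHandlers.ofString()).body()
  }
}
```

Calling `SubmitJob.submit(...)` requires a running Mist Master at that address; `buildRequest` can be inspected without one.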
  • 35. A lot of actions over an ssh session to test, deploy and run your code
      Impossible to run a job without a direct shell command
      Processing more than one job in parallel is difficult
      Poor interfaces
  • 39. Context
      A Mist Context describes the parameters of a Mist Worker and its Spark context
        model = Context
        name = cluster_ctx
        data {
          precreated = false
          spark-conf {
            spark.master = yarn
            spark.submit.deployMode = cluster
            spark.executor.instances = 2
            spark.executor.cores = 2
            spark.executor.memory = 1G
          }
        }
  • 40. Mist instances
      Mist Worker
        receives a request -> invokes functions -> returns the result back to the master
      Mist Master
        queues requests
        sends them to workers
  • 42. Run
        $ curl -X POST -d <json-input> 'http://localhost:2004/v2/api/functions/word-count-example/jobs'
        {"id":"00e5294d-77f4-463b-85b6-4b73185eab1a"}
  • 45. Scaling
        model = Context
        name = cluster_ctx
        data {
          precreated = false
          max-parallel-jobs = 2
          ...
        }
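To make the idea behind `max-parallel-jobs` concrete, here is a small sketch of a queue that never runs more than a fixed number of jobs at once. It is not how Mist implements scheduling; `BoundedRunner` and `runAll` are hypothetical names, and a JDK fixed-size thread pool stands in for the master/worker machinery.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object BoundedRunner {
  // Runs every job, with at most maxParallelJobs executing at the same time.
  // Returns the highest concurrency that was actually observed.
  def runAll(maxParallelJobs: Int, jobs: Seq[() => Unit]): Int = {
    val pool = Executors.newFixedThreadPool(maxParallelJobs)
    val running = new AtomicInteger(0)
    val peak = new AtomicInteger(0)
    jobs.foreach { job =>
      // Jobs beyond the limit wait in the pool's internal queue.
      pool.execute(new Runnable {
        def run(): Unit = {
          val now = running.incrementAndGet()
          peak.synchronized { if (now > peak.get()) peak.set(now) }
          try job() finally running.decrementAndGet()
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    peak.get()
  }
}
```

With `maxParallelJobs = 2`, submitting six jobs keeps four of them queued until a slot frees up, which mirrors the "queues requests, sends them to workers" behaviour described for the Mist Master.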
  • 47. Worker modes
      Exclusive: starts a new worker for every request
      Shared: reuses the worker instance
        model = Context
        name = cluster_ctx
        data {
          precreated = false
          max-parallel-jobs = 2
          worker-mode = "shared" // or worker-mode = "exclusive"
          ...
        }
  • 48. Multi cluster
      Contexts may be configured to work on different clusters
        model = Context
        name = cluster_ctx
        data {
          precreated = false
          spark-conf {
            spark.master = yarn
          }
        }

        model = Context
        name = cluster_ctx
        data {
          precreated = false
          spark-conf {
            spark.master = spark://
  • 50. A lot of actions over an ssh session to test, deploy and run your code
      Impossible to run a job without a direct shell command
      Processing more than one job in parallel is difficult
      Poor interfaces
  • 52. Pi Estimation
        def main(arg: Array[String]): Unit = {
          val count = sc.parallelize(1 to samples).filter { _ =>
            val x = math.random
            val y = math.random
            x*x + y*y < 1
          }.count()
          println(s"Pi is roughly ${4.0 * count / samples}")
        }
  • 53. Pi Estimation
      What signature is better?
        def estimatePi1(samples: Int): Unit
        // vs
        def estimatePi2(samples: Int): Double
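One way to see why the `Double`-returning signature is attractive: the result can be returned to a caller or checked in a test instead of being printed. The sketch below is an illustration only; it uses a plain Scala range in place of `sc.parallelize` so that it runs without Spark, and the explicit `rng` parameter is an addition that makes the run reproducible.

```scala
import scala.util.Random

object PiEstimation {
  // Monte Carlo estimate of Pi: the fraction of random points in the unit
  // square that fall inside the quarter circle, multiplied by 4.
  def estimatePi(samples: Int, rng: Random): Double = {
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y < 1
    }
    4.0 * inside / samples
  }
}
```

A caller can now write `val pi = PiEstimation.estimatePi(100000, new Random())` and decide what to do with the value, which is not possible with the `Unit` variant.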
  • 54. Word count again
        def main(arg: Array[String]): Unit = {
          val textFile = sc.textFile("hdfs://...")
          val counts = textFile.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
          counts.saveAsTextFile("hdfs://...")
        }
  • 55. Word count again
      What signature is better?
        def wordCount(from: String, to: String): Unit
        // vs
        def wordCount(from: String, to: String): Map[String, Int]
        // or
        sealed trait OutputFile
        case class HdfsFile(path: String) extends OutputFile
        case class S3File(path: String) extends OutputFile
        def wordCount(from: String, to: String): OutputFile
  • 56. HOW TO REPRESENT A SPARK JOB?
  • 57. Spark Application
        def main(args: Array[String]): Unit
        Array[String] => Unit
  • 58. MistFn
      Context is already described
      Json input instead of arguments
      Json output instead of Unit
        Array[String] => Unit
        // vs
        (Json, SparkContext) => Json
  • 60. Generalize
        // developer
        // pi estimation: samples => pi
        (Int, SparkContext) => Double
        // word count - input path => output path
        (String, SparkSession) => String
        ...
        (A, ?) => B
        (A, B, ?) => C
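The shapes `(A, ?) => B` and `(A, B, ?) => C` can be sketched as typed argument extractors that lift an ordinary function into one taking untyped input. This is a simplified imitation of the `arg[String]("path").onSparkContext(...)` style from slide 31, not Mist's actual implementation: `FakeContext` is a hypothetical stand-in for `SparkContext`, a `Map[String, Any]` stands in for JSON, and all names below are invented for illustration.

```scala
object JobSketch {
  // Hypothetical stand-in for SparkContext, so the sketch runs without Spark.
  final case class FakeContext(appName: String)

  // Untyped input, standing in for a JSON object.
  type Input = Map[String, Any]

  // Describes how to pull one typed argument out of the untyped input.
  final case class Arg[A](name: String, read: Any => Option[A]) {
    def extract(in: Input): Either[String, A] =
      in.get(name).flatMap(read).toRight(s"missing or invalid argument: $name")

    // (A, ?) => B lifted to (Input, ?) => Either[error, B]
    def onContext[B](f: (A, FakeContext) => B): (Input, FakeContext) => Either[String, B] =
      (in, ctx) => extract(in) match {
        case Right(a)  => Right(f(a, ctx))
        case Left(err) => Left(err)
      }
  }

  def intArg(name: String): Arg[Int] =
    Arg[Int](name, (v: Any) => v match { case i: Int => Some(i); case _ => None })

  def stringArg(name: String): Arg[String] =
    Arg[String](name, (v: Any) => v match { case s: String => Some(s); case _ => None })

  // (A, B, ?) => C lifted the same way; the first failing argument is reported.
  def onContext2[A, B, C](a: Arg[A], b: Arg[B])(
      f: (A, B, FakeContext) => C): (Input, FakeContext) => Either[String, C] =
    (in, ctx) => (a.extract(in), b.extract(in)) match {
      case (Right(x), Right(y)) => Right(f(x, y, ctx))
      case (Left(err), _)       => Left(err)
      case (_, Left(err))       => Left(err)
    }
}
```

The developer writes only the typed function; validation of the input and error reporting live in the extractor, which is the separation the slide's "Json input instead of arguments" points at.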
  • 62. A lot of actions over an ssh session to test, deploy and run your code
      Impossible to run a job without a direct shell command
      Processing more than one job in parallel is difficult
      Poor interfaces
  • 63. Future plans
      Release 1.0.0
      AWS EMR integration - on-demand EMR clusters
      ...