Productionizing Spark
and the Spark REST Job Server
Evan Chan
Distinguished Engineer
@TupleJump
Who am I
• Distinguished Engineer, Tuplejump
• @evanfchan
• http://github.com/velvia
• User and contributor to Spark since 0.9
• Co-creator and maintainer of Spark Job Server
2
TupleJump
✤ Tuplejump is a big data technology leader providing solutions and
development partnership.
✤ FiloDB - Spark-based analytics database for time series and event
data (github.com/tuplejump/FiloDB)
✤ Calliope - the first Spark-Cassandra integration
✤ Stargate - an open source Lucene indexer for Cassandra
✤ SnackFS - open source HDFS for Cassandra
3
TupleJump - Big Data Dev Partners
4
Deploying Spark
5
Choices, choices, choices
• YARN, Mesos, Standalone?
• With a distribution?
• What environment?
• How should I deploy?
• Hosted options?
• What about dependencies?
6
Basic Terminology
• The Spark documentation is really quite good.
7
What all the clusters have in common
• YARN, Mesos, and Standalone all support the following
features:
–Running the Spark driver app in cluster mode
–Restarts of the driver app upon failure
–UI to examine state of workers and apps
8
Spark Standalone Mode
• The easiest clustering mode to deploy
–Use make-distribution.sh to package, copy to all nodes
–sbin/start-master.sh on master node, then start slaves
–Test with spark-shell
• HA Master through ZooKeeper election
• Must dedicate the whole cluster to Spark
• In the latest survey, used by almost half of Spark users
9
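The steps above can be sketched as a few commands; hostnames and paths here are placeholders, and the exact scripts vary slightly by Spark version:

```shell
# Package a distribution and copy it to every node (hosts are hypothetical)
./make-distribution.sh --tgz
scp spark-*.tgz node1:/opt/ node2:/opt/

# On the master node:
sbin/start-master.sh

# On each worker node, point the worker at the master:
sbin/start-slave.sh spark://master-host:7077

# Smoke test with the shell:
bin/spark-shell --master spark://master-host:7077
```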
Apache Mesos
• Co-created by Matei Zaharia and others at UC Berkeley before he worked on Spark!
• Can run your entire company on Mesos, not just big data
–Great support for micro services - Docker, Marathon
–Can run non-JVM workloads like MPI
• Commercial backing from Mesosphere
• Heavily used at Twitter and Airbnb
• The Mesosphere DCOS will revolutionize Spark et al deployment - "dcos package
install spark" !!
10
Mesos vs YARN
• Mesos is a two-level resource manager, with pluggable schedulers
–You can run YARN on Mesos, with YARN delegating resource offers
to Mesos (Project Myriad)
–You can run multiple schedulers within Mesos, and write your own
• If you’re already a Hadoop / Cloudera etc shop, YARN is easy choice
• If you’re starting out, go 100% Mesos
11
Mesos Coarse vs Fine-Grained
• Spark offers two modes to run Mesos Spark apps in (and you can
choose per driver app):
–coarse-grained: Spark allocates fixed number of workers for
duration of driver app
–fine-grained (default): Dynamic executor allocation per task, but
higher overhead per task
• Use coarse-grained if you run low-latency jobs
12
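Choosing the mode is a per-app spark-submit setting; a coarse-grained sketch (the ZooKeeper master URL and core count are placeholders):

```shell
# Coarse-grained Mesos mode: Spark holds a fixed set of executors for the
# whole lifetime of the driver app, avoiding per-task launch overhead
spark-submit \
  --master mesos://zk://zk1:2181,zk2:2181/mesos \
  --conf spark.mesos.coarse=true \
  --conf spark.cores.max=8 \
  myapp.jar
```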
What about Datastax DSE?
• Cassandra, Hadoop, Spark all bundled in one distribution, collocated
• Custom cluster manager and HA/failover logic for Spark Master,
using Cassandra gossip
• Can use CFS (Cassandra-based HDFS), SnackFS, or plain
Cassandra tables for storage
–or use Tachyon to cache, then no need to collocate (use
Mesosphere DCOS)
13
Hosted Apache Spark
• Spark on Amazon EMR - first class citizen now
–Direct S3 access!
• Google Compute Engine - “Click to Deploy” Hadoop+Spark
• Databricks Cloud
• Many more coming
• What you notice about the different environments:
–Everybody has their own way of starting: spark-submit vs dse spark vs aws
emr … vs dcos spark …
14
Mesosphere DCOS
• Automates deployment to AWS,
Google, etc.
• Common API and UI, better cost
and control, cloud
• Load balancing and routing,
Mesos for resource sharing
• dcos package install spark
15
Configuring Spark
16
Building Spark
• Make sure you build for the right Hadoop version
• e.g. mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
• Make sure you build for the right Scala version - Spark supports
both 2.10 and 2.11
17
Jars schmars
• Dependency conflicts are the worst part of Spark dev
• Every distro has slightly different jars - e.g. CDH < 5.4 packaged a different version of Akka
• Leave out Hive if you don’t need it
• Use the Spark UI “Environment” tab to check jars and how they got there
• spark-submit --jars / --packages forwards jars to every executor (unless it's an
HDFS / HTTP path)
• spark-env.sh SPARK_CLASSPATH - include dep jars you've deployed to every node
18
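A sketch of the --jars option in practice (paths and class name are hypothetical); note that an HDFS path is fetched by each executor directly rather than copied from the driver:

```shell
# Ship dependency jars to the driver and every executor
spark-submit \
  --jars /opt/libs/dep1.jar,hdfs:///libs/dep2.jar \
  --class com.example.MyApp \
  myapp.jar
```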
Jars schmars
• You don’t need to package every dependency with your Spark application!
• spark-streaming is included in the distribution
• spark-streaming includes some Kafka jars already
• etc.
19
ClassPath Configuration Options
• spark.driver.userClassPathFirst, spark.executor.userClassPathFirst
• One way to solve dependency conflicts - make sure YOUR jars are loaded first, ahead of
Spark’s jars
• Client mode: use spark-submit options
• --driver-class-path, --driver-library-path
• Spark Classloader order of resolution
• Classes specified via --jars, --packages first (if above flag is set)
• Everything else in SPARK_CLASSPATH
20
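Putting the flags together, a sketch of forcing your own jar (for example a conflicting Akka build; the paths are placeholders) ahead of Spark's:

```shell
# Resolve classes from your jars before Spark's own classpath
spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --driver-class-path /opt/libs/my-akka.jar \
  --jars /opt/libs/my-akka.jar \
  myapp.jar
```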
Some useful config options
21
spark.serializer: org.apache.spark.serializer.KryoSerializer
spark.default.parallelism: default # of partitions for shuffle/reduce tasks (or pass as second arg)
spark.scheduler.mode: FAIR - enable parallelism within apps (multi-tenant or low-latency apps like SQL server)
spark.shuffle.memoryFraction, spark.storage.memoryFraction: fraction of the Java heap to allocate for shuffle and RDD caching, respectively, before spilling to disk
spark.cleaner.ttl: enables periodic cleanup of cached RDDs, good for long-lived jobs
spark.akka.frameSize: increase the default of 10 (MB) to send back very large results to the driver app (code smell)
spark.task.maxFailures: the number of task retries allowed is this value - 1
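These options could all live in conf/spark-defaults.conf; a sketch with illustrative values only - tune for your own workload:

```
# conf/spark-defaults.conf (values are illustrative, not recommendations)
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.default.parallelism       64
spark.scheduler.mode            FAIR
spark.shuffle.memoryFraction    0.3
spark.storage.memoryFraction    0.5
spark.cleaner.ttl               3600
spark.akka.frameSize            64
spark.task.maxFailures          8
```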
Control Spark SQL Shuffles
• By default, Spark SQL / DataFrames will use 200
partitions when doing any groupBy / distinct operations
• sqlContext.setConf("spark.sql.shuffle.partitions", "16")
22
Prevent temp files from filling disks
• (Spark Standalone mode only)
• spark.worker.cleanup.enabled = true
• spark.worker.cleanup.interval
• Configuring executor log file retention/rotation
spark.executor.logs.rolling.maxRetainedFiles = 90
spark.executor.logs.rolling.strategy = time
23
Tuning Spark GC
• Lots of cached RDDs = huge old gen GC cycles, stop-the-world GC
• Know which operations consume more memory (sorts, shuffles)
• Try the new G1GC … avoids whole-heap scans
• -XX:+UseG1GC
• https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
24
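Passing GC flags through spark-submit might look like this; the pause-time target is a starting point for experimentation, not a recipe:

```shell
# Enable G1GC on the executors via extra JVM options
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
  myapp.jar
```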
Running Spark Applications
25
Run your apps in the cluster
• spark-submit: --deploy-mode cluster
• Spark Job Server: deploy SJS to the cluster
• Drivers and executors are very chatty - want to reduce latency and decrease
chance of networking timeouts
• Want to avoid running jobs on your local machine
26
Automatic Driver Restarts
• Standalone: --deploy-mode cluster --supervise
• YARN: --deploy-mode cluster
• Mesos: use Marathon to restart dead slaves
• Periodic checkpointing: important for recovering data
• RDD checkpointing helps reduce long RDD lineages
27
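For Standalone mode, the full incantation is a sketch like this (master URL and class name are placeholders):

```shell
# Run the driver inside the cluster and restart it automatically on failure
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster --supervise \
  --class com.example.MyApp \
  myapp.jar
```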
Speeding up application startup
• spark-submit's --packages option is super convenient for downloading
dependencies, but avoid it in production
• Downloads tons of jars from Maven when driver starts up, then executors
copy all the jars from driver
• Deploy frequently used dependencies to worker nodes yourself
• For really fast Spark jobs, use the Spark Job Server and share a SparkContext
amongst jobs!
28
Spark(Context) Metrics
• Spark’s built in MetricsSystem has sources (Spark info, JVM, etc.) and sinks
(Graphite, etc.)
• Configure metrics.properties (template in spark conf/ dir) and use these
params to spark-submit
--files=/path/to/metrics.properties \
--conf spark.metrics.conf=metrics.properties
• See http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
29
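A minimal metrics.properties sketch for a Graphite sink - the host, port, and period are placeholders to adapt:

```
# metrics.properties (Graphite host/port are placeholders)
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10

# Also report JVM stats from each component
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```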
Application Metrics
• Missing Hadoop counters? Use Spark Accumulators
• https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23
• Above registers accumulators as a source to Spark’s
MetricsSystem
30
Watch how RDDs are cached
• RDDs cached to disk could slow down computation
31
Are your jobs stuck?
• First check cluster resources - does a job have enough CPU/mem?
• Take a thread dump of executors:
32
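One way to grab an executor thread dump from a worker node is with the standard JDK tools; the PID shown is a placeholder you'd read from jps output:

```shell
# Find the executor JVM (its main class is CoarseGrainedExecutorBackend)
jps | grep CoarseGrainedExecutorBackend

# Dump its threads to a file for inspection
jstack <executor-pid> > executor-threads.txt
```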
The Worst Killer - Classpath
• Classpath / jar versioning issues may cause Spark to hang silently. Debug
using the Environment tab of the UI:
33
Spark Job Server
34
35
http://github.com/spark-jobserver/spark-jobserver
Open Source!!
Also find it on spark-packages.org
Spark Job Server - What
• REST Interface for your Spark jobs
• Streaming, SQL, extendable
• Job history and configuration logged to a database
• Enable interactive low-latency queries (SQL/DataFrames work too) of cached
RDDs and tables
36
Spark Job Server - Where
37
Kafka
Spark
Streaming
Datastore Spark
Spark Job
Server
Internal users Internet
HTTP/HTTPS
Spark Job Server - Why
• Spark as a service
• Share Spark across the Enterprise
• HTTPS and LDAP Authentication
• Enterprises - easy integration with other teams, any language
• Share in-memory RDDs across logical jobs
• Low-latency queries
38
Used in Production - Worldwide
39
Used in Production
• As of last month, officially included in
DataStax Enterprise 4.8!
40
Active Community
• Large number of contributions from community
• HTTPS/LDAP contributed by team at KNIME
• Multiple committers
• Gitter IM channel, active Google group
41
Platform Independent
• Spark Standalone
• Mesos
• YARN
• Docker
• Example: platform-independent LDAP auth, HTTPS, can be
used as a portal
42
Example Workflow
43
Creating a Job Server Project
• sbt assembly -> fat jar -> upload to job server
• Mark the job-server-api dependency "provided" so sbt assembly doesn't bundle
the job server itself into your fat jar
• Java projects should be possible too
• Java projects should be possible too
44
• In your build.sbt, add this:
resolvers += "Job Server Bintray" at "https://dl.bintray.com/spark-jobserver/maven"
libraryDependencies += "spark.jobserver" % "job-server-api" % "0.5.0" % "provided"
/**
 * A super-simple Spark job example that implements the SparkJob trait and
 * can be submitted to the job server.
 */
object WordCountExample extends SparkJob {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
    Try(config.getString("input.string"))
      .map(x => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string"))
  }

  override def runJob(sc: SparkContext, config: Config): Any = {
    val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
    dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
  }
}
Example Job Server Job
45
What’s Different?
• Job does not create Context, Job Server does
• Decide when I run the job: in own context, or in pre-created context
• Allows for very modular Spark development
• Break up a giant Spark app into multiple logical jobs
• Example:
• One job to load DataFrames tables
• One job to query them
• One job to run diagnostics and report debugging information
46
Submitting and Running a Job
47
✦ curl --data-binary @../target/mydemo.jar localhost:8090/jars/demo
OK
✦ curl -d "input.string = A lazy dog jumped mean dog" \
  'localhost:8090/jobs?appName=demo&classPath=WordCountExample&sync=true'
{
  "status": "OK",
  "result": {
    "lazy": 1,
    "jumped": 1,
    "A": 1,
    "mean": 1,
    "dog": 2
  }
}
Retrieve Job Statuses
48
curl 'localhost:8090/jobs?limit=2'
[{
"duration": "77.744 secs",
"classPath": "ooyala.cnd.CreateMaterializedView",
"startTime": "2013-11-26T20:13:09.071Z",
"context": "8b7059dd-ooyala.cnd.CreateMaterializedView",
"status": "FINISHED",
"jobId": "9982f961-aaaa-4195-88c2-962eae9b08d9"
}, {
"duration": "58.067 secs",
"classPath": "ooyala.cnd.CreateMaterializedView",
"startTime": "2013-11-26T20:22:03.257Z",
"context": "d0a5ebdc-ooyala.cnd.CreateMaterializedView",
"status": "FINISHED",
"jobId": "e9317383-6a67-41c4-8291-9c140b6d8459"
}]
Use Case: Fast Query Jobs
49
Spark as a Query Engine
• Goal: Spark jobs that run in under a second and answer queries on shared
RDD data
• Query params passed in as job config
• Need to minimize context creation overhead
–Thus many jobs sharing the same SparkContext
• On-heap RDD caching means no serialization loss
• Need to consider concurrent jobs (fair scheduling)
50
51
[Diagram: the REST Job Server calls new SparkContext to create a query context, runs a job to load data from Cassandra into a cached RDD on the Spark executors, then serves repeated query jobs that return query results over REST.]
Sharing Data Between Jobs
• RDD Caching
–Benefit: no need to serialize data. Especially useful for indexes etc.
–Job server provides a NamedRdds trait for thread-safe CRUD of cached
RDDs by name
• (Compare to SparkContext’s API which uses an integer ID and is not
thread safe)
• For example, at Ooyala a number of fields are multiplexed into the RDD
name: timestamp:customerID:granularity
52
Data Concurrency
• With fair scheduler, multiple Job Server jobs can run simultaneously on one
SparkContext
• Managing multiple updates to RDDs
–Cache keeps track of which RDDs are being updated
–Example: thread A spark job creates RDD “A” at t0
–thread B fetches RDD “A” at t1 > t0
–Both threads A and B, using NamedRdds, will get the RDD at time t2 when
thread A finishes creating the RDD “A”
53
Spark SQL/Hive Query Server
✤ Start a context based on SQLContext:
curl -d "" '127.0.0.1:8090/contexts/sql-context?context-factory=spark.jobserver.context.SQLContextFactory'
✤ Run a job for loading and caching tables in DataFrames
curl -d "" '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlLoaderJob&context=sql-context&sync=true'
✤ Supply a query to a Query Job. All queries are logged in the database by
Spark Job Server.
curl -d 'sql="SELECT count(*) FROM footable"' '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlQueryJob&context=sql-context&sync=true'
54
Example: Combining Streaming And Spark SQL
55
[Diagram: inside the Spark Job Server, a Kafka streaming job populates DataFrames in a shared SparkSQLStreamingContext, and SQL query jobs answer SQL queries against those DataFrames.]
SparkSQLStreamingJob
56
trait SparkSqlStreamingJob extends SparkJobBase {
  type C = SQLStreamingContext
}

class SQLStreamingContext(c: SparkContext) {
  val streamingContext = new StreamingContext(c, ...)
  val sqlContext = new SQLContext(c)
}
Now you have access to both StreamingContext and
SQLContext, and it can be shared across jobs!
SparkSQLStreamingContext
57
To start this context:
curl -d "" 'localhost:8090/contexts/stream_sqltest?context-factory=com.abc.SQLStreamingContextFactory'
class SQLStreamingContextFactory extends SparkContextFactory {
  import SparkJobUtils._
  type C = SQLStreamingContext with ContextLike

  def makeContext(config: Config, contextConfig: Config, contextName: String): C = {
    val batchInterval = contextConfig.getInt("batch_interval")
    val conf = configToSparkConf(config, contextConfig, contextName)
    new SQLStreamingContext(new SparkContext(conf), Seconds(batchInterval)) with ContextLike {
      def sparkContext: SparkContext = this.streamingContext.sparkContext
      def isValidJob(job: SparkJobBase): Boolean = job.isInstanceOf[SparkSqlStreamingJob]
      // Stop the streaming context, but not the SparkContext, so that it can be re-used
      // to create another streaming context if required:
      def stop() { this.streamingContext.stop(false) }
    }
  }
}
Future Work
58
Future Plans
• PR: Forked JVMs for supporting many concurrent
contexts
• True HA operation
• Swagger API documentation
59
HA for Job Server
[Diagram: a load balancer routes GET /jobs/<id> requests to Job Server 1 and Job Server 2, which gossip with each other and share a common database; the active job context runs on one server.]
60
HA and Hot Failover for Jobs
[Diagram: the active job context on Job Server 1 checkpoints to HDFS; Job Server 2 holds a standby job context and takes over via gossip if the active server fails.]
61
Thanks for your contributions!
• All of these were community contributed:
–HTTPS and Auth
–saving and retrieving job configuration
–forked JVM per context
• Your contributions are very welcome on Github!
62
63
And Everybody is Hiring!!
Thank you!
Why We Needed a Job Server
• Our vision for Spark is as a multi-team big data service
• What gets repeated by every team:
• Bastion box for running Hadoop/Spark jobs
• Deploys and process monitoring
• Tracking and serializing job status, progress, and job results
• Job validation
• No easy way to kill jobs
• Polyglot technology stack - Ruby scripts run jobs, Go services
Architecture
Completely Async Design
✤ http://spray.io - probably the fastest JVM HTTP
microframework
✤ Akka Actor based, non blocking
✤ Futures used to manage individual jobs. (Note that
Spark is using Scala futures to manage job stages now)
✤ Single JVM for now, but easy to distribute later via
remote Actors / Akka Cluster
Async Actor Flow
[Diagram: the Spray web API forwards requests to a Request actor, which works with a Local Supervisor and a Job Manager; each job runs as a Future (Job 1, Job 2), reporting to the Job Status and Job Result actors.]
Message flow fully documented
Production Usage
Metadata Store
✤ JarInfo, JobInfo, ConfigInfo
✤ JobSqlDAO. Store metadata to SQL database by JDBC interface.
✤ Easily configured by spark.sqldao.jdbc.url
✤ jdbc:mysql://dbserver:3306/jobserverdb
✤ Multiple Job Servers can share the same MySQL.
✤ Jars uploaded once but accessible by all servers.
✤ The default will be JobSqlDAO and H2.
✤ Single H2 DB file. Serialization and deserialization are handled by H2.
Deployment and Metrics
✤ spark-jobserver repo comes with a full suite of tests and
deploy scripts:
✤ server_deploy.sh for regular server pushes
✤ server_package.sh for Mesos and Chronos .tar.gz
✤ /metricz route for codahale-metrics monitoring
✤ /healthz route for health check
Challenges and Lessons
• Spark is based around contexts - we need a Job Server oriented around logical
jobs
• Running multiple SparkContexts in the same process
• Better long term solution is forked JVM per SparkContext
• Workaround: spark.driver.allowMultipleContexts = true
• Dynamic jar and class loading is tricky
• Manage threads carefully - each context uses lots of threads

Apache Spark Tutorial
Ahmet Bulut
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
Spark Summit
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
clairvoyantllc
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
Chris Fregly
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
Spark Summit
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
Dorian Beganovic
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Databricks
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
trihug
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 

Similar to Productionizing Spark and the REST Job Server- Evan Chan (20)

Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Lessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On DockerLessons Learned From Running Spark On Docker
Lessons Learned From Running Spark On Docker
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
TriHUG talk on Spark and Shark
TriHUG talk on Spark and SharkTriHUG talk on Spark and Shark
TriHUG talk on Spark and Shark
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
rightmanforbloodline
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
Grant McAlister
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
lenjisoHussein
 
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
sheetal singh$A17
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
revolutionary575
 
Semantic Web and organizational data .pptx
Semantic Web and organizational data .pptxSemantic Web and organizational data .pptx
Semantic Web and organizational data .pptx
Kanchana Weerasinghe
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
AnujaGaikwad28
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
satpalsheravatmumbai
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
Jongwook Woo
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
weiwchu
 
PyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache ArrowPyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache Arrow
Uwe Korn
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
sheetal singh$A17
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Samuel Jackson
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
huseindihon
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
tanupasswan6
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
ginni singh$A17
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
tanupasswan6
 
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
kinni singh$A17
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
bhupeshkumar0889
 
Cyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & PricingCyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & Pricing
BaraDaniel1
 

Recently uploaded (20)

Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
AWS re:Invent 2023 - Deep dive into Amazon Aurora and its innovations DAT408
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
Female Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service An...
 
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
Celebrity Girls Call Andheri 9930245274 Unlimited Short Providing Girls Servi...
 
Semantic Web and organizational data .pptx
Semantic Web and organizational data .pptxSemantic Web and organizational data .pptx
Semantic Web and organizational data .pptx
 
Celonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptxCelonis Busniess Analyst Virtual Internship.pptx
Celonis Busniess Analyst Virtual Internship.pptx
 
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
VIP Kanpur Girls Call Kanpur 0X0000000X Doorstep High-Profile Girl Service Ca...
 
History and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big DataHistory and Application of LLM Leveraging Big Data
History and Application of LLM Leveraging Big Data
 
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ...
 
PyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache ArrowPyData Sofia May 2024 - Intro to Apache Arrow
PyData Sofia May 2024 - Intro to Apache Arrow
 
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
Exclusive Girls Call Noida 🎈🔥9873940964 🔥💋🎈 Provide Best And Top Girl Service...
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
potential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in generalpotential usefulness of multi-agent maze-solving in general
potential usefulness of multi-agent maze-solving in general
 
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
Celebrity Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service...
 
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
Celebrity Girls Call Noida 9873940964 Unlimited Short Providing Girls Service...
 
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
Busty Girls Call Delhi 🎈🔥9711199171 🔥💋🎈 Provide Best And Top Girl Service And...
 
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
New Girls Call Noida 9873940964 Unlimited Short Providing Girls Service Avail...
 
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...🚂🚘 Premium Girls Call Bangalore  🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
🚂🚘 Premium Girls Call Bangalore 🛵🚡000XX00000 💃 Choose Best And Top Girl Serv...
 
Cyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & PricingCyber Insurance Mathematical Model & Pricing
Cyber Insurance Mathematical Model & Pricing
 

Productionizing Spark and the REST Job Server- Evan Chan

  • 1. Productionizing Spark and the Spark REST Job Server
    Evan Chan
    Distinguished Engineer
    @TupleJump
  • 2. Who am I
    • Distinguished Engineer, Tuplejump
    • @evanfchan
    • http://github.com/velvia
    • User and contributor to Spark since 0.9
    • Co-creator and maintainer of Spark Job Server
  • 3. TupleJump
    ✤ Tuplejump is a big data technology leader providing solutions and development partnership.
    ✤ FiloDB - Spark-based analytics database for time series and event data (github.com/tuplejump/FiloDB)
    ✤ Calliope - the first Spark-Cassandra integration
    ✤ Stargate - an open source Lucene indexer for Cassandra
    ✤ SnackFS - open source HDFS for Cassandra
  • 4. TupleJump - Big Data Dev Partners
  • 6. Choices, choices, choices
    • YARN, Mesos, Standalone?
    • With a distribution?
    • What environment?
    • How should I deploy?
    • Hosted options?
    • What about dependencies?
  • 7. Basic Terminology
    • The Spark documentation is really quite good.
  • 8. What all the clusters have in common
    • YARN, Mesos, and Standalone all support the following features:
      – Running the Spark driver app in cluster mode
      – Restarts of the driver app upon failure
      – UI to examine the state of workers and apps
  • 9. Spark Standalone Mode
    • The easiest clustering mode to deploy**
      – Use make-distribution.sh to package, copy to all nodes
      – Run sbin/start-master.sh on the master node, then start the slaves
      – Test with spark-shell
    • HA master through ZooKeeper election
    • Must dedicate the whole cluster to Spark
    • In the latest survey, used by almost half of Spark users
  • 10. Apache Mesos
    • Started by Matei Zaharia in 2007, before he worked on Spark!
    • Can run your entire company on Mesos, not just big data
      – Great support for microservices - Docker, Marathon
      – Can run non-JVM workloads like MPI
    • Commercial backing from Mesosphere
    • Heavily used at Twitter and Airbnb
    • The Mesosphere DCOS will revolutionize Spark et al deployment - "dcos package install spark" !!
  • 11. Mesos vs YARN
    • Mesos is a two-level resource manager, with pluggable schedulers
      – You can run YARN on Mesos, with YARN delegating resource offers to Mesos (Project Myriad)
      – You can run multiple schedulers within Mesos, and write your own
    • If you're already a Hadoop / Cloudera etc. shop, YARN is the easy choice
    • If you're starting out, go 100% Mesos
  • 12. Mesos Coarse vs Fine-Grained
    • Spark offers two modes to run Mesos Spark apps in (and you can choose per driver app):
      – coarse-grained: Spark allocates a fixed number of workers for the duration of the driver app
      – fine-grained (default): dynamic executor allocation per task, but higher overhead per task
    • Use coarse-grained if you run low-latency jobs
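The per-app choice above comes down to a single flag. A minimal spark-defaults.conf sketch for the coarse-grained case (the ZooKeeper master URL and resource numbers are placeholders, not recommendations):

```properties
# spark-defaults.conf (or pass each line via spark-submit --conf)
spark.master            mesos://zk://zk1:2181/mesos
# coarse-grained: hold a fixed set of workers for the app's whole lifetime
spark.mesos.coarse      true
# cap what the coarse-grained app grabs, so it doesn't starve other frameworks
spark.cores.max         8
spark.executor.memory   4g
```

Leaving spark.mesos.coarse unset (false) gives the fine-grained default, where executors come and go per task.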
  • 13. What about Datastax DSE?
    • Cassandra, Hadoop, Spark all bundled in one distribution, collocated
    • Custom cluster manager and HA/failover logic for the Spark Master, using Cassandra gossip
    • Can use CFS (Cassandra-based HDFS), SnackFS, or plain Cassandra tables for storage
      – or use Tachyon to cache, then no need to collocate (use Mesosphere DCOS)
  • 14. Hosted Apache Spark
    • Spark on Amazon EMR - a first-class citizen now
      – Direct S3 access!
    • Google Compute Engine - "Click to Deploy" Hadoop+Spark
    • Databricks Cloud
    • Many more coming
    • What you notice about the different environments:
      – Everybody has their own way of starting: spark-submit vs dse spark vs aws emr ... vs dcos spark ...
  • 15. Mesosphere DCOS
    • Automates deployment to AWS, Google, etc.
    • Common API and UI, better cost and control, cloud
    • Load balancing and routing, Mesos for resource sharing
    • dcos package install spark
  • 17. Building Spark
    • Make sure you build for the right Hadoop version
      – e.g. mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
    • Make sure you build for the right Scala version - Spark supports both 2.10 and 2.11
  • 18. Jars schmars
    • Dependency conflicts are the worst part of Spark dev
    • Every distro has slightly different jars - e.g. CDH < 5.4 packaged a different version of Akka
    • Leave out Hive if you don't need it
    • Use the Spark UI "Environment" tab to check jars and how they got there
    • spark-submit --jars / --packages forwards jars to every executor (unless it's an HDFS / HTTP path)
    • spark-env.sh SPARK_CLASSPATH - include dependency jars you've deployed to every node
  • 19. Jars schmars
    • You don't need to package every dependency with your Spark application!
    • spark-streaming is included in the distribution
    • spark-streaming includes some Kafka jars already
    • etc.
  • 20. ClassPath Configuration Options
    • spark.driver.userClassPathFirst, spark.executor.userClassPathFirst
    • One way to solve dependency conflicts - make sure YOUR jars are loaded first, ahead of Spark's jars
    • Client mode: use spark-submit options
      – --driver-class-path, --driver-library-path
    • Spark classloader order of resolution:
      – Classes specified via --jars, --packages first (if the above flag is set)
      – Everything else in SPARK_CLASSPATH
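The classpath-first options above can be sketched as a spark-defaults.conf fragment (a hedged sketch; only reach for these flags when your dependency versions genuinely conflict with the distro's, since they change class resolution for everything):

```properties
# Resolve the application's jars ahead of Spark's own jars
spark.driver.userClassPathFirst     true
spark.executor.userClassPathFirst   true
# Client-mode alternative for the driver JVM:
#   spark-submit --driver-class-path /path/to/your-deps.jar ...
```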
  • 21. Some useful config options
    • spark.serializer: org.apache.spark.serializer.KryoSerializer
    • spark.default.parallelism: or pass the # of partitions for shuffle/reduce tasks as a second argument
    • spark.scheduler.mode: FAIR - enables parallelism within apps (multi-tenant or low-latency apps like a SQL server)
    • spark.shuffle.memoryFraction, spark.storage.memoryFraction: fraction of the Java heap to allocate for shuffle and RDD caching, respectively, before spilling to disk
    • spark.cleaner.ttl: enables periodic cleanup of cached RDDs, good for long-lived jobs
    • spark.akka.frameSize: increase the default of 10 (MB) to send back very large results to the driver app (code smell)
    • spark.task.maxFailures: the number of retries for a failed task is this value - 1
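The same options expressed as a spark-defaults.conf fragment. The values here are illustrative placeholders, not recommendations; tune each one per workload:

```properties
spark.serializer              org.apache.spark.serializer.KryoSerializer
spark.default.parallelism     200
spark.scheduler.mode          FAIR
spark.shuffle.memoryFraction  0.3
spark.storage.memoryFraction  0.5
spark.cleaner.ttl             3600
spark.akka.frameSize          64
spark.task.maxFailures        4
```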
  • 22. Control Spark SQL Shuffles
    • By default, Spark SQL / DataFrames will use 200 partitions when doing any groupBy / distinct operations
    • sqlContext.setConf("spark.sql.shuffle.partitions", "16")
  • 23. Prevent temp files from filling disks
    • (Spark Standalone mode only)
    • spark.worker.cleanup.enabled = true
    • spark.worker.cleanup.interval
    • Configuring executor log file retention/rotation:
      spark.executor.logs.rolling.maxRetainedFiles = 90
      spark.executor.logs.rolling.strategy = time
  • 24. Tuning Spark GC
    • Lots of cached RDDs = huge old gen GC cycles, stop-the-world GC
    • Know which operations consume more memory (sorts, shuffles)
    • Try the new G1GC ... avoids whole-heap scans
      – -XX:+UseG1GC
    • https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
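If G1 helps your workload, the flag is usually passed through the extraJavaOptions settings. A hedged sketch (the extra GC-logging flags are optional additions for diagnosing the old-gen cycles mentioned above):

```properties
# spark-defaults.conf: enable G1GC on executors (and optionally the driver)
spark.executor.extraJavaOptions   -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.driver.extraJavaOptions     -XX:+UseG1GC
```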
  • 26. Run your apps in the cluster
    • spark-submit: --deploy-mode cluster
    • Spark Job Server: deploy SJS to the cluster
    • Drivers and executors are very chatty - want to reduce latency and decrease the chance of networking timeouts
    • Want to avoid running jobs on your local machine
  • 27. Automatic Driver Restarts
    • Standalone: --deploy-mode cluster --supervise
    • YARN: --deploy-mode cluster
    • Mesos: use Marathon to restart dead slaves
    • Periodic checkpointing: important for recovering data
    • RDD checkpointing helps reduce long RDD lineages
  • 28. Speeding up application startup
    • spark-submit's --packages option is super convenient for downloading dependencies, but avoid it in production
    • Downloads tons of jars from Maven when the driver starts up, then executors copy all the jars from the driver
    • Deploy frequently used dependencies to worker nodes yourself
    • For really fast Spark jobs, use the Spark Job Server and share a SparkContext amongst jobs!
  • 29. Spark(Context) Metrics
    • Spark's built-in MetricsSystem has sources (Spark info, JVM, etc.) and sinks (Graphite, etc.)
    • Configure metrics.properties (template in the Spark conf/ dir) and pass these params to spark-submit:
      --files=/path/to/metrics.properties
      --conf spark.metrics.conf=metrics.properties
    • See http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
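A minimal metrics.properties sketch wiring the built-in Graphite sink and JVM source together (the host, port, and prefix are placeholders for your environment):

```properties
# Send all Spark metrics to Graphite every 10 seconds
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark

# Also expose JVM metrics from the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```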
  • 30. Application Metrics
    • Missing Hadoop counters? Use Spark Accumulators
    • https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23
    • The above registers accumulators as a source to Spark's MetricsSystem
  • 31. Watch how RDDs are cached
    • RDDs cached to disk could slow down computation
  • 32. Are your jobs stuck?
    • First check cluster resources - does a job have enough CPU/mem?
    • Take a thread dump of executors
  • 33. The Worst Killer - Classpath
    • Classpath / jar versioning issues may cause Spark to hang silently. Debug using the Environment tab of the UI.
  • 36. Spark Job Server - What
    • REST interface for your Spark jobs
    • Streaming, SQL, extendable
    • Job history and configuration logged to a database
    • Enables interactive low-latency queries (SQL/DataFrames work too) of cached RDDs and tables
  • 37. Spark Job Server - Where
    [Architecture diagram: Kafka feeds Spark Streaming; Spark persists to a datastore; internal users and the Internet reach Spark through the Spark Job Server over HTTP/HTTPS]
  • 38. Spark Job Server - Why
    • Spark as a service
    • Share Spark across the enterprise
    • HTTPS and LDAP authentication
    • Enterprises - easy integration with other teams, any language
    • Share in-memory RDDs across logical jobs
    • Low-latency queries
  • 39. Used in Production - Worldwide
  • 40. Used in Production
    • As of last month, officially included in Datastax Enterprise 4.8!
  • 41. Active Community
    • Large number of contributions from the community
    • HTTPS/LDAP contributed by a team at KNIME
    • Multiple committers
    • Gitter IM channel, active Google group
  • 42. Platform Independent
    • Spark Standalone
    • Mesos
    • YARN
    • Docker
    • Example: platform-independent LDAP auth, HTTPS, can be used as a portal
  • 44. Creating a Job Server Project
    • In your build.sbt, add this:
      resolvers += "Job Server Bintray" at "https://dl.bintray.com/spark-jobserver/maven"
      libraryDependencies += "spark.jobserver" % "job-server-api" % "0.5.0" % "provided"
    • sbt assembly -> fat jar -> upload to the job server
    • "provided" is used because we don't want sbt assembly to include the whole job server jar
    • Java projects should be possible too
  • 45. Example Job Server Job
    /**
     * A super-simple Spark job example that implements the SparkJob trait and
     * can be submitted to the job server.
     */
    object WordCountExample extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = {
        Try(config.getString("input.string"))
          .map(x => SparkJobValid)
          .getOrElse(SparkJobInvalid("No input.string"))
      }

      override def runJob(sc: SparkContext, config: Config): Any = {
        val dd = sc.parallelize(config.getString("input.string").split(" ").toSeq)
        dd.map((_, 1)).reduceByKey(_ + _).collect().toMap
      }
    }
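To see what runJob computes without a cluster, here is a plain-Scala equivalent of the RDD pipeline, with local collections standing in for the RDD. This is an illustrative sketch, not job server code; WordCountLocal is a hypothetical name:

```scala
object WordCountLocal {
  // Same logic as dd.map((_, 1)).reduceByKey(_ + _).collect().toMap,
  // but on a local Seq: pair each word with 1, group by word, sum the counts.
  def count(input: String): Map[String, Int] =
    input.split(" ").toSeq
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(count("A lazy dog jumped mean dog"))
}
```

For the input shown, "dog" maps to 2 and every other word to 1, matching the job result on slide 47.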
  • 46. What's Different?
    • The job does not create the Context; the Job Server does
    • Decide when I run the job: in its own context, or in a pre-created context
    • Allows for very modular Spark development
    • Break up a giant Spark app into multiple logical jobs
    • Example:
      – One job to load DataFrames tables
      – One job to query them
      – One job to run diagnostics and report debugging information
  • 47. Submitting and Running a Job
    ✦ curl --data-binary @../target/mydemo.jar localhost:8090/jars/demo
    OK
    ✦ curl -d "input.string = A lazy dog jumped mean dog" 'localhost:8090/jobs?appName=demo&classPath=WordCountExample&sync=true'
    {
      "status": "OK",
      "RESULT": {
        "lazy": 1,
        "jumped": 1,
        "A": 1,
        "mean": 1,
        "dog": 2
      }
    }
  • 48. Retrieve Job Statuses
    curl 'localhost:8090/jobs?limit=2'
    [{
      "duration": "77.744 secs",
      "classPath": "ooyala.cnd.CreateMaterializedView",
      "startTime": "2013-11-26T20:13:09.071Z",
      "context": "8b7059dd-ooyala.cnd.CreateMaterializedView",
      "status": "FINISHED",
      "jobId": "9982f961-aaaa-4195-88c2-962eae9b08d9"
    }, {
      "duration": "58.067 secs",
      "classPath": "ooyala.cnd.CreateMaterializedView",
      "startTime": "2013-11-26T20:22:03.257Z",
      "context": "d0a5ebdc-ooyala.cnd.CreateMaterializedView",
      "status": "FINISHED",
      "jobId": "e9317383-6a67-41c4-8291-9c140b6d8459"
    }]
  • 49. Use Case: Fast Query Jobs
  • 50. Spark as a Query Engine
    • Goal: Spark jobs that run in under a second and answer queries on shared RDD data
    • Query params passed in as job config
    • Need to minimize context creation overhead
      – Thus many jobs sharing the same SparkContext
    • On-heap RDD caching means no serialization loss
    • Need to consider concurrent jobs (fair scheduling)
  • 51. [Workflow diagram: the REST Job Server creates a SparkContext on the Spark executors, a Load Data job caches an RDD from Cassandra into the query context, and subsequent Query Jobs run against the cached RDD and return Query Results]
  • 52. Sharing Data Between Jobs
    • RDD caching
      – Benefit: no need to serialize data. Especially useful for indexes etc.
      – The job server provides a NamedRdds trait for thread-safe CRUD of cached RDDs by name
        • (Compare to SparkContext's API, which uses an integer ID and is not thread safe)
    • For example, at Ooyala a number of fields are multiplexed into the RDD name: timestamp:customerID:granularity
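The multiplexed-name convention can be captured in a pair of helper functions. The field order follows the slide; the RddNames object and its methods are hypothetical illustrations, not job server APIs:

```scala
object RddNames {
  // Build a cache key like "1448566389:customer42:1hour"
  def make(timestamp: Long, customerId: String, granularity: String): String =
    s"$timestamp:$customerId:$granularity"

  // Recover the fields from a key produced by make()
  def parse(name: String): (Long, String, String) = name.split(":") match {
    case Array(ts, cust, gran) => (ts.toLong, cust, gran)
  }
}
```

Since parse inverts make, a job can recover the timestamp, customer, and granularity from any cached RDD's name.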
  • 53. Data Concurrency
    • With the fair scheduler, multiple Job Server jobs can run simultaneously on one SparkContext
    • Managing multiple updates to RDDs:
      – The cache keeps track of which RDDs are being updated
      – Example: thread A's Spark job creates RDD "A" at t0
      – Thread B fetches RDD "A" at t1 > t0
      – Both threads A and B, using NamedRdds, will get the RDD at time t2, when thread A finishes creating RDD "A"
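The create-once, wait-for-the-creator behavior described above can be sketched with a plain concurrent map, with ordinary values standing in for RDDs. NamedValues and its method names are illustrative only, not the job server's actual NamedRdds API:

```scala
import java.util.concurrent.ConcurrentHashMap

// Thread-safe CRUD of cached values by name, in the spirit of NamedRdds.
class NamedValues[T] {
  private val cache = new ConcurrentHashMap[String, T]()

  // Only one caller evaluates `create` per name; concurrent callers for the
  // same name block until the value exists, then all see the same instance.
  def getOrElseCreate(name: String)(create: => T): T =
    cache.computeIfAbsent(name, _ => create)

  def get(name: String): Option[T] = Option(cache.get(name))
  def destroy(name: String): Unit = cache.remove(name)
}
```

computeIfAbsent gives exactly the t0/t1/t2 story from the slide: thread B's lookup at t1 blocks inside the map until thread A's create finishes at t2.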
  • 54. Spark SQL/Hive Query Server
    ✤ Start a context based on SQLContext:
      curl -d "" '127.0.0.1:8090/contexts/sql-context?context-factory=spark.jobserver.context.SQLContextFactory'
    ✤ Run a job for loading and caching tables in DataFrames:
      curl -d "" '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlLoaderJob&context=sql-context&sync=true'
    ✤ Supply a query to a Query Job. All queries are logged in the database by Spark Job Server:
      curl -d 'sql="SELECT count(*) FROM footable"' '127.0.0.1:8090/jobs?appName=test&classPath=spark.jobserver.SqlQueryJob&context=sql-context&sync=true'
  • 55. Example: Combining Streaming And Spark SQL
    [Diagram: Kafka feeds a Streaming Job running in a SparkSQLStreamingContext inside the Spark Job Server; the job populates DataFrames, which a SQL Query Job serves in response to SQL queries]
  • 56. SparkSQLStreamingJob
    trait SparkSqlStreamingJob extends SparkJobBase {
      type C = SQLStreamingContext
    }

    class SQLStreamingContext(c: SparkContext) {
      val streamingContext = new StreamingContext(c, ...)
      val sqlContext = new SQLContext(c)
    }
    Now you have access to both StreamingContext and SQLContext, and they can be shared across jobs!
  • 57. SparkSQLStreamingContext
    To start this context:
      curl -d "" "localhost:8090/contexts/stream_sqltest?context-factory=com.abc.SQLStreamingContextFactory"
    class SQLStreamingContextFactory extends SparkContextFactory {
      import SparkJobUtils._
      type C = SQLStreamingContext with ContextLike

      def makeContext(config: Config, contextConfig: Config, contextName: String): C = {
        val batchInterval = contextConfig.getInt("batch_interval")
        val conf = configToSparkConf(config, contextConfig, contextName)
        new SQLStreamingContext(new SparkContext(conf), Seconds(batchInterval)) with ContextLike {
          def sparkContext: SparkContext = this.streamingContext.sparkContext
          def isValidJob(job: SparkJobBase): Boolean = job.isInstanceOf[SparkSqlStreamingJob]
          // Stop the streaming context, but not the SparkContext, so that it can be re-used
          // to create another streaming context if required:
          def stop() { this.streamingContext.stop(false) }
        }
      }
    }
  • 59. Future Plans
    • PR: forked JVMs for supporting many concurrent contexts
    • True HA operation
    • Swagger API documentation
  • 60. HA for Job Server
    [Diagram: a load balancer routes GET /jobs/<id> requests to Job Server 1 or Job Server 2; the two servers gossip with each other, share a database, and one of them hosts the active job context]
  • 61. HA and Hot Failover for Jobs
    [Diagram: Job Server 1 runs the active job context and checkpoints to HDFS; Job Server 2 holds a standby job context and detects failure via gossip]
  • 62. Thanks for your contributions!
    • All of these were community contributed:
      – HTTPS and auth
      – Saving and retrieving job configuration
      – Forked JVM per context
    • Your contributions are very welcome on Github!
  • 63. And Everybody is Hiring!! Thank you!
  • 64. Why We Needed a Job Server
    • Our vision for Spark is as a multi-team big data service
    • What gets repeated by every team:
      – Bastion box for running Hadoop/Spark jobs
      – Deploys and process monitoring
      – Tracking and serializing job status, progress, and job results
      – Job validation
      – No easy way to kill jobs
      – Polyglot technology stack - Ruby scripts run jobs, Go services
  • 66. Completely Async Design
    ✤ http://spray.io - probably the fastest JVM HTTP microframework
    ✤ Akka Actor based, non-blocking
    ✤ Futures used to manage individual jobs. (Note that Spark is using Scala Futures to manage job stages now)
    ✤ Single JVM for now, but easy to distribute later via remote Actors / Akka Cluster
  • 67. Async Actor Flow
    [Diagram: the Spray web API sends requests to a Request actor; a Local Supervisor oversees the Job Manager, which runs Job 1 and Job 2 as Futures and reports to the Job Status and Job Result actors]
  • 68. Message flow fully documented
  • 70. Metadata Store
    ✤ JarInfo, JobInfo, ConfigInfo
    ✤ JobSqlDAO: stores metadata to a SQL database via the JDBC interface
    ✤ Easily configured by spark.sqldao.jdbc.url
      ✤ e.g. jdbc:mysql://dbserver:3306/jobserverdb
    ✤ Multiple Job Servers can share the same MySQL
      ✤ Jars are uploaded once but accessible by all servers
    ✤ The default is JobSqlDAO and H2
      ✤ Single H2 DB file; serialization and deserialization are handled by H2
  • 71. Deployment and Metrics
    ✤ The spark-jobserver repo comes with a full suite of tests and deploy scripts:
      ✤ server_deploy.sh for regular server pushes
      ✤ server_package.sh for Mesos and Chronos .tar.gz
    ✤ /metricz route for codahale-metrics monitoring
    ✤ /healthz route for health check
  • 72. Challenges and Lessons
    • Spark is based around contexts - we need a Job Server oriented around logical jobs
    • Running multiple SparkContexts in the same process:
      – The better long-term solution is a forked JVM per SparkContext
      – Workaround: spark.driver.allowMultipleContexts = true
    • Dynamic jar and class loading is tricky
    • Manage threads carefully - each context uses lots of threads

Editor's Notes

  1. There are many ways to run Spark. There is a standalone cluster mode, and it can run under Mesos, or YARN.
  30. We will have 3 presentations at the upcoming Spark Summit, please be sure to watch for us.
  34. Assuming it’s a Scala project…..
  35. Every job has a Config object passed in….. 400….
  36. Have the option to choose at job start time, whether to run in own context or shared context. Pretty cool, I can submit a diagnostic jar to diagnose a problem.
  37. Complex queries GROUP BY, filters, etc.
  38. Here we have the workflow for low-latency jobs which share a single SparkContext. Unlike the ad-hoc job, here we explicitly create a context. With standalone / coarse mode, guarantee/fix CPU and memory resources. We run a job to load the data and populate the SparkContext RDDs. An API server submits query jobs against the job server, which executes them and returns JSON results in the same request. Jobs can be separate jars, written by separate teams, independently loaded, run, tested.