Simulate, Test, Compare, Exercise, and Yes, Even
Benchmark!
Spark-Bench
Emily May Curtin
The Weather Company
Atlanta, GA
@emilymaycurtin
@ecurtin
"S" + =
#SparkBench
Who Am I
• Artist, former documentary film editor
• 2017 Spark Summit East – Spark + Parquet
• Tabs, not spaces (at least for Scala) (where 1 tab = 2 spaces)
• Software Engineer, The Weather Company, an IBM Business
• Current lead developer of CODAIT/spark-bench
#SparkBench
Atlanta!!!!
#SparkBench
Spark-Bench
• History & Rewrite Motivation
• How Spark-Bench Works
• Detailed Example: Parquet vs. CSV Benchmarking
• More Examples
– Cluster Hammer
– Multiple Notebook Users
– Varying over Spark parameters
#SparkBench
2015 A.D.
#SparkBench
Spark Benchmarking circa 2015 A.D.
#SparkBench
Spark-Bench 2015
"SparkBench: A Comprehensive Benchmarking
Suite For In Memory Data Analytics Platform Spark"
Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina
Salapura IBM TJ Watson Research Center - APC 2015
#SparkBench
2017 A.D.
#SparkBench
#SparkBench
Watson Data Platform: Data Science Experience, IBM Analytics Engine (Beta), Watson Machine Learning, AND MUCH, MUCH MORE
#SparkBench
• Too high level: Hi-Bench, etc. (Spark vs. other big data engines)
• Too low level: Profilers, Tracers (fine-grain details of individual Spark jobs and setups)
• Juuuust right, sorta: 💀databricks/spark-perf (Spark vs. Spark, cluster performance)
As a researcher, I want to run a set
of standard benchmarks with
generated data against one
instance of Spark, one time.
--The user story that CODAIT/spark-bench version 1.0 fulfilled
#SparkBench
Good start, let’s keep going!
#SparkBench
The Vision
As a Spark developer, I want to
run regression tests with my
changes in order to provide some
authoritative performance stats
with my PR.
#SparkBench
As a Spark developer, I want to
simulate multiple notebook users
hitting the same cluster so that I
can test my changes to the
dynamic allocation algorithm.
Supporting Highly Multitenant Spark Notebook Workloads
Craig Ingram, Brad Kaiser
#SparkBench
As an enterprise user of Spark, I
want to make my daily batch
ingest job run faster by tuning
parameters.
#SparkBench
As a data scientist, I want to run
several machine learning
algorithms over the same set of
data in order to compare results.
#SparkBench
As anybody who has ever had to
make a graph about Spark, I want
to get numbers easily and quickly
so I can focus on making
beautiful graphs.
#SparkBench
How It Works
#SparkBench
Structural Details: Old Version → ✨New Version✨
• Project Design: series of shell scripts calling individually built jars → Scala project built using SBT
• Build/Release: manually / manually → SBT, TravisCI, auto-release to Github Releases on PR merge
• Configuration: scattered in shell script variables → centralized in one config file with many new capabilities
• Parallelism: none → three different levels!
• Custom Workloads: requires writing a new subproject with all accompanying bash → implement our abstract class, Bring Your Own Jar
#SparkBench
The Config File (a first glance)
spark-bench = {
spark-submit-config = [{
workload-suites = [
{
descr = "One run of SparkPi and that's it!"
benchmark-output = "console"
workloads = [
{
name = "sparkpi"
slices = 10
}
]
}
]
}]
}
#SparkBench
Workloads
• Atomic unit of spark-bench
• Optionally read from disk
• Optionally write results to disk
• All stages of the workload are timed
• Infinitely composable
#SparkBench
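For instance, a single workload stanza can point at an input dataset and write results out, with each stage (load, work, save) timed in the benchmark output. A minimal sketch reusing the sql workload that appears later in this deck (the paths here are hypothetical):
{
  name = "sql"
  query = "select * from input"
  input = "/tmp/some-input.parquet"
  output = "/tmp/some-results.csv"
}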
Workloads
• Machine Learning
– Kmeans
– Logistic Regression
– Other workloads ported from old version
• SQL queries
• Oddballs
– SparkPi
– Sleep
– Cache Test
#SparkBench
Workloads: Data Generators
• Graph data generator
• Kmeans data generator
• Linear Regression data generator
• More coming!
#SparkBench
What About Other Workloads?
1. Active development, more coming!
2. Contributions welcome :)
3. Bring Your Own Workload!
{
name = "custom"
class = "com.seriouscompany.HelloString"
str = ["Hello", "Hi"]
}
#SparkBench
Custom Workloads
#SparkBench
{
name = "custom"
class = "com.seriouscompany.OurGiganticIngestDailyIngestJob"
date = "Tues Oct 26 18:58:25 EDT 2017"
source = "s3://buncha-junk/2017-10-26/"
}
{
name = "custom”
class = "com.example.WordGenerator”
output = "console”
rows = 10
cols = 3
word = "Cool stuff!!"
}
← See our documentation on custom workloads!
#SparkBench
Workload
Could be Kmeans,
SparkPi, SQL, Sleep, etc.
{
name = "sparkpi"
slices = 500000000
}
#SparkBench
Workload
Could be Kmeans,
SparkPi, SQL, Sleep, etc.
Data Generation Workload
{
name = "sparkpi"
slices = 500000000
}
{
name = "data-generation-kmeans"
rows = 1000
cols = 24
output = "/tmp/kmeans-data.parquet"
k = 45
}
#SparkBench
Data Generation, Large Set
{
name = "data-generation-kmeans"
rows = 100000000
cols = 24
output = "/tmp/large-data.parquet"
}
Data Generation, Small Set
{
name = "data-generation-kmeans"
rows = 100
cols = 24
output = "/tmp/small-data.parquet"
}
#SparkBench
TIME
Workloads running
serially
#SparkBench
TIME
Workloads running
in parallel
#SparkBench
Workload Suites
• Logical group of workloads
• Run workloads serially or in parallel
• Control the repetition of workloads
• Control the output of benchmark results
• Can be composed with other workload suites to
run serially or in parallel
#SparkBench
One workload suite
with three workloads
running serially
workload-suites = [ {
descr = "Lol"
parallel = false
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#SparkBench
One workload suite
with three workloads
running in parallel
workload-suites = [ {
descr = "Lol"
parallel = true
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#SparkBench
Three workload suites running serially.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = false
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#SparkBench
Three workload suites running in parallel.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = true
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#SparkBench
./bin/spark-bench.sh my-amazing-config-file.conf
spark-bench-launch
spark-submit --jar spark-bench.jar
spark-submit --jar spark-bench.jar spark-submit --jar spark-bench.jar
spark-submit --jar spark-bench.jar
#SparkBench
Spark-Submit-Config
• Control any parameter present in a spark-submit
script
• Produce and launch multiple spark-submit scripts
• Vary over parameters like executor-memory
• Run against different clusters or builds of Spark
• Can run serially or in parallel
#SparkBench
One spark-submit
with one workload
suite with three
workloads running
serially
spark-bench = {
spark-submit-config = [{
spark-args = {
master = "yarn"
}
workload-suites = [{
descr = "One"
benchmark-output = "console"
workloads = [
// . . .
]
}]
}]
}
(config file is abbreviated for space)
#SparkBench
Two spark-submits running in
parallel, each with one
workload suite, each with three
workloads running serially
#SparkBench
Two spark-submits running serially,
each with one workload suite,
one with three workloads running serially,
one with three workloads running in parallel
#SparkBench
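The two scenarios above were diagrams in the original deck. A config sketch of roughly what they correspond to (abbreviated, values hypothetical): put two entries in the spark-submit-config array and toggle spark-submit-parallel.
spark-bench = {
  spark-submit-parallel = true // false for the serial scenario
  spark-submit-config = [
    {
      spark-args = { master = "yarn" }
      workload-suites = [{
        descr = "First submit"
        parallel = false
        workloads = [
          // three workloads here
        ]
      }]
    },
    {
      spark-args = { master = "yarn" }
      workload-suites = [{
        descr = "Second submit"
        parallel = false // or true, as in the second diagram
        workloads = [
          // three workloads here
        ]
      }]
    }
  ]
}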
One spark-submit with two workload
suites running serially.
The first workload suite generates
data in parallel.
The second workload suite runs
different workloads serially.
The second workload suite repeats.
#SparkBench
spark-bench = {
  spark-submit-parallel = false
  spark-submit-config = [{
    spark-args = {
      // . . .
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Data Generation"
        parallel = true
        repeat = 1 // generate once and done!
        workloads = [
          // . . .
        ]
      },
      {
        descr = "Actual workloads that we want to benchmark"
        parallel = false
        repeat = 100 // repeat for statistical validity!
        workloads = [
          // . . .
        ]
      }
    ]
  }]
}
#SparkBench
Detailed Example: Parquet vs. CSV
#SparkBench
Example: Parquet vs. CSV
Goal: benchmark two different SQL queries over the
same dataset stored in two different formats
1. Generate a big dataset in CSV format
2. Convert that dataset to Parquet
3. Run first query over both sets
4. Run second query over both sets
5. Make beautiful graphs
#SparkBench
spark-bench = {
spark-submit-config = [{
spark-home = "/usr/iop/current/spark2-client/"
spark-args = {
master = "yarn"
executor-memory = "28G"
num-executors = 144
}
conf = {
"spark.dynamicAllocation.enabled" = "false"
"spark.dynamicAllocation.monitor.enabled" = "false"
"spark.shuffle.service.enabled" = "true"
}
suites-parallel = false
// …CONTINUED…
#SparkBench
workload-suites = [
{
descr = "Generate a dataset, then transform it to Parquet format"
benchmark-output = "hdfs:///tmp/emily/results-data-gen.csv"
parallel = false
workloads = [
{
name = "data-generation-kmeans"
rows = 10000000
cols = 24
output = "hdfs:///tmp/emily/kmeans-data.csv"
},
{
name = "sql"
query = "select * from input"
input = "hdfs:///tmp/emily/kmeans-data.csv"
output = "hdfs:///tmp/emily/kmeans-data.parquet"
}
]
},
// …CONTINUED…
#SparkBench
{
descr = "Run two different SQL over two different formats"
benchmark-output = "hdfs:///tmp/emily/results-sql.csv"
parallel = false
repeat = 10
workloads = [
{
name = "sql"
input = [
"hdfs:///tmp/emily/kmeans-data.csv",
"hdfs:///tmp/emily/kmeans-data.parquet"
]
query = [
"select * from input",
"select `0`, `22` from input where `0` < -0.9"
]
cache = false
}
]
}
]
}]
}
You made it through the plaintext! Here's a puppy!
#SparkBench
Parquet vs CSV Results
name timestamp loadTime queryTime total_Runtime run cache queryStr input spark.executor.instances spark.submit.deployMode spark.master spa
sql 1.50714E+12 15895947762 8722036 15904669798 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 15232039388 41572417 15273611805 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 139462342 5652725 145115067 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 100552926 9120325 109673251 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 15221909508 6183625 15228093133 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 15634952815 6820382 15641773197 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 93264481 5243927 98508408 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 84948895 5573010 90521905 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 15270158275 8408445 15278566720 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 14924461363 7999643 14932461006 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 86821299 13952212 100773511 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 100789178 6525153 107314331 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 15115682178 7085577 15122767755 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 14829435321 6452434 14835887755 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 78654779 5291044 83945823 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 86702691 5641506 92344197 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 14881975292 7346304 14889321596 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 14953655628 6749366 14960404994 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 102374878 4925237 107300115 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 79458610 4697124 84155734 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 14908128878 7429576 14915558454 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 14922630998 4596120 14927227118 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 79144522 4191275 83335797 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 86207011 4111788 90318799 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 14909544529 5722627 14915267156 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 14919548742 5643308 14925192050 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144 client yarn 28G
sql 1.50714E+12 85241859 4417683 89659542 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
sql 1.50714E+12 93379964 4610765 97990729 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144 client yarn 28G
#SparkBench
Examples: Cluster Hammer
#SparkBench
Cluster Hammer
#SparkBench
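The cluster-hammer slides were screenshots in the original deck. As a rough, hypothetical sketch of the idea (the workload choices and sizes here are my assumption, not from the talk): saturate the cluster by running heavyweight workloads repeatedly and in parallel.
spark-bench = {
  spark-submit-config = [{
    spark-args = { master = "yarn" }
    suites-parallel = true
    workload-suites = [{
      descr = "Hammer the cluster"
      parallel = true
      repeat = 50
      workloads = [
        {
          name = "data-generation-kmeans"
          rows = 100000000
          cols = 24
          output = "hdfs:///tmp/hammer/kmeans-data.parquet"
        }
      ]
    }]
  }]
}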
Examples: Multi-User Notebooks
#SparkBench
Multi-User Notebook
• Data Generation: sample.csv and GIANT-SET.csv
• Workloads working with sample.csv
• Same workloads working with GIANT-SET.csv
{
name = "sleep"
distribution = "poisson"
mean = 30000
}
Sleep
Workload
#SparkBench
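Putting the diagram above into config form, a sketch of the pipeline (abbreviated; the contents of each suite are illustrative): generate the datasets first, then run the same suite of workloads against sample.csv and GIANT-SET.csv, with Poisson-distributed sleep workloads standing in for a notebook user's think time.
spark-bench = {
  spark-submit-config = [{
    spark-args = { master = "yarn" }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Data generation: sample.csv and GIANT-SET.csv"
        parallel = true
        workloads = [
          // data generators here
        ]
      },
      {
        descr = "Workloads over sample.csv, with sleeps in between"
        parallel = false
        workloads = [
          {
            name = "sleep"
            distribution = "poisson"
            mean = 30000
          }
          // real workloads here
        ]
      },
      {
        descr = "Same workloads over GIANT-SET.csv"
        parallel = false
        workloads = [
          // same workloads, bigger input
        ]
      }
    ]
  }]
}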
Examples of Easy Extensions
• Add more "users" (workload suites)
• Have a workload suite that simulates heavy batch
jobs while notebook users attempt to use the
same cluster
• Toggle settings for dynamic allocation, etc.
• Benchmark a notebook user on a heavily
stressed cluster (cluster hammer!)
#SparkBench
Examples: Varying Spark Parameters
#SparkBench
Different Versions of Spark
spark-bench = {
spark-submit-parallel = false
spark-submit-config = [{
spark-home = [
"/opt/official-spark-build/",
"/opt/my-branch-of-spark-with-new-changes/"
]
spark-args = {
master = "yarn"
}
workload-suites = [{// workload suites}]
}]
}
This will produce 2 spark-submits
#SparkBench
Varying More Than One Param
spark-bench = {
spark-submit-parallel = false
spark-submit-config = [{
spark-home = [
"/opt/official-spark-build/",
"/opt/my-fork-of-spark-with-new-changes/"
]
spark-args = {
master = "yarn"
executor-memory = ["2G", "4G", "8G", "16G"]
}
workload-suites = [{// workload suites}]
}]
}
This will produce 2 × 4 = 8 spark-submits
#SparkBench
Final Thoughts
#SparkBench
Installation
1. Grab the latest release from the releases page on Github
2. Unpack the tar
3. Set the $SPARK_HOME environment variable.
4. Run the examples!
./bin/spark-bench.sh examples/minimal-example.conf
#SparkBench
Future Work & Contributions
• More workloads and data generators
• TPC-DS (on top of existing SQL query benchmark)
• Streaming workloads
• Python
Contributions very welcome from all levels!
Check out our ZenHub Board for details
#SparkBench
Shout-Out to My Team
Matthew Schauer
@showermat
Craig Ingram
@cin
Brad Kaiser
@brad-kaiser
*not an actual photo of my team
#SparkBench
Summary
Spark-Bench is a flexible, configurable framework for
Testing, Benchmarking, Simulating, Comparing,
and more!
#SparkBench
https://codait.github.io/spark-bench/
#SparkBench
CODAIT/spark-bench
Emily May Curtin
IBM Watson Data Platform
Atlanta, GA
@emilymaycurtin
@ecurtin
#EUeco8
#sparkbench
https://github.com/CODAIT/spark-bench
https://CODAIT.github.io/spark-bench/
Project
Documentation
Come See Me At DevNexus 2018!
#SparkBench