SlideShare a Scribd company logo
Simulate, Test, Compare, Exercise, and Yes, Even
Benchmark!
Spark-Bench
Emily May Curtin
IBM Watson Data Platform
Atlanta, GA
@emilymaycurtin
@ecurtin
"S" + =
#EUeco8
#sparkbench
Who Am I
• Artist, former documentary film editor
• 2017 Spark Summit East – Spark + Parquet
• Tabs, not spaces (at least for Scala)(where 1 tab = 2 spaces)
• Software Engineer, IBM Watson Data Platform
• Current lead developer of SparkTC/spark-bench
#EUeco8 #sparkbench
Atlanta!!!!
#EUeco8 #sparkbench
Spark-Bench
• History & Rewrite Motivation
• How Spark-Bench Works
• Detailed Example: Parquet vs. CSV Benchmarking
• More Examples
– Cluster Hammer
– Multiple Notebook Users
– Varying over Spark parameters
#EUeco8 #sparkbench
2015 A.D.
#EUeco8 #sparkbench
#EUeco8 #sparkbench
2015 A.D.
#EUeco8 #sparkbench
2015 A.D.
#EUeco8 #sparkbench
2015 A.D.
Spark Benchmarking circa 2015 A.D.
#EUeco8 #sparkbench
Spark-Bench 2015
"SparkBench: A Comprehensive Benchmarking
Suite For In Memory Data Analytics Platform Spark"
Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina
Salapura IBM TJ Watson Research Center - APC 2015
#EUeco8 #sparkbench
Spark-Bench 2015
"SparkBench: A Comprehensive Benchmarking
Suite For In Memory Data Analytics Platform Spark"
Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina
Salapura IBM TJ Watson Research Center - APC 2015
#EUeco8 #sparkbench
2017 A.D.
#EUeco8 #sparkbench
2017 A.D.
#EUeco8 #sparkbench
2017 A.D.
#EUeco8 #sparkbench
2017 A.D.
#EUeco8 #sparkbench
#EUeco8 #sparkbench
Data Science
Experience
IBM Analytics
Engine (Beta)
AND MUCH,
MUCH MORE
Watson Machine
Learning
Watson Data Platform
As a researcher, I want to run a set
of standard benchmarks with
generated data against one
instance of Spark, one time.
--The user story that SparkTC/spark-bench version 1.0 fulfilled
#EUeco8 #sparkbench
Good start, let’s keep going!
#EUeco8 #sparkbench
#EUeco8 #sparkbench
The Vision
As a Spark developer, I want to
run regression tests with my
changes in order to provide some
authoritative performance stats
with my PR.
#EUeco8 #sparkbench
As a Spark developer, I want to
see regression tests run against
Spark nightly builds.
#EUeco8 #sparkbench
As a Parquet developer, I want to
compare performance between
Spark built with several different
versions of Parquet.
#EUeco8 #sparkbench
As a Spark developer, I want to
simulate multiple notebook users
hitting the same cluster so that I
can test my changes to the
dynamic allocation algorithm.
Supporting Highly Multitenant Spark Notebook Workloads
Craig Ingram, Brad Kaiser
#EUeco8 #sparkbench
As an enterprise user of Spark, I
want to make my daily batch
ingest job run faster by tuning
parameters.
#EUeco8 #sparkbench
As a DevOps professional, I want
to see how performance changes
for our daily jobs as a I add more
nodes to our prod cluster.
#EUeco8 #sparkbench
As a data scientist, I want to run
several machine learning
algorithms over the same set of
data in order to compare results.
#EUeco8 #sparkbench
As anybody who has ever had to
make a graph about Spark, I want
to get numbers easily and quickly
so I can focus on making
beautiful graphs.
#EUeco8 #sparkbench
How It Works
#EUeco8 #sparkbench
Structural Details
Old Version ✨ New Version✨
Project Design Series of shell scripts calling
individually built jars
Scala project built using SBT
Build/Release Manually/Manually SBT, TravisCI, auto-release to
Github Releases on PR merge
Configuration Scattered in shell script variables Centralized in one config file with
many new capabilities
Parallelism None Three different levels!
Custom
Workloads
Requires writing a new subproject with
all accompanying bash
Implement our abstract class, Bring
Your Own Jar
#EUeco8 #sparkbench
The Config File (a first glance)
spark-bench = {
spark-submit-config = [{
workload-suites = [
{
descr = "One run of SparkPi and that's it!"
benchmark-output = "console"
workloads = [
{
name = "sparkpi"
slices = 10
}
]
}
]
}]
}
#EUeco8 #sparkbench
Workloads
• Atomic unit of spark-bench
• Optionally read from disk
• Optionally write results to disk
• All stages of the workload are timed
• Infinitely composable
#EUeco8 #sparkbench
Workloads
• Machine Learning
– Kmeans
– Logistic Regression
– Other workloads ported from old version
• SQL queries
• Oddballs
– SparkPi
– Sleep
– Cache Test
#EUeco8 #sparkbench
Workloads: Data Generators
• Graph data generator
• Kmeans data generator
• Linear Regression data generator
• More coming!
#EUeco8 #sparkbench
What About Other Workloads?
1. Active development, more coming!
2. Contributions welcome J
3. Bring Your Own Workload!
{
name = "custom"
class = "com.seriouscompany.HelloString"
str = ["Hello", "Hi"]
}
#EUeco8 #sparkbench
Custom Workloads
#EUeco8 #sparkbench
{
name = "custom"
class = "com.seriouscompany.OurGiganticIngestDailyIngestJob"
date = "Tues Oct 26 18:58:25 EDT 2017"
source = "s3://buncha-junk/2017-10-26/"
}
{
name = "custom”
class = "com.example.WordGenerator”
output = "console”
rows = 10
cols = 3
word = "Cool stuff!!"
}
ß See our documentation on
custom workloads!
#EUeco8 #sparkbench
Workload
Could be Kmeans,
SparkPi, SQL, Sleep, etc.
{
name = "sparkpi"
slices = 500000000
}
#EUeco8 #sparkbench
#EUeco8 #sparkbench
Workload
Could be Kmeans,
SparkPi, SQL, Sleep, etc.
Data Generation Workload{
name = "sparkpi"
slices = 500000000
}
{
name = "data-generation-kmeans"
rows = 1000
cols = 24
output = "/tmp/kmeans-data.parquet"
k = 45
}
#EUeco8 #sparkbench
Data Generation,
Small Set
Data Generation, Large Set
{
name = "data-generation-kmeans"
rows = 100000000
cols = 24
output = "/tmp/large-data.parquet"
}
{
name = "data-generation-kmeans"
rows = 100
cols = 24
output = "/tmp/small-data.parquet"
}
#EUeco8 #sparkbench
TIME
Workloads running
serially
#EUeco8 #sparkbench
TIME
Workloads running
in parallel
#EUeco8 #sparkbench
Workload Suites
• Logical group of workloads
• Run workloads serially or in parallel
• Control the repetition of workloads
• Control the output of benchmark results
• Can be composed with other workload suites to
run serially or in parallel
#EUeco8 #sparkbench
One workload suite
with three workloads
running serially
workload-suites = [ {
descr = "Lol"
parallel = false
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#EUeco8 #sparkbench
One workload suite
with three workloads
running in parallel
workload-suites = [ {
descr = "Lol"
parallel = true
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#EUeco8 #sparkbench
Three workload suites running serially.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = false
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#EUeco8 #sparkbench
Three workload suites running in parallel.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = true
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#EUeco8 #sparkbench
./bin/spark-bench.sh my-amazing-config-file.conf
spark-bench-launch
spark-submit –jar spark-bench.jar
spark-submit –jar spark-bench.jar spark-submit –jar spark-bench.jar
spark-submit –jar spark-bench.jar
#EUeco8 #sparkbench
Spark-Submit-Config
• Control any parameter present in a spark-submit
script
• Produce and launch multiple spark-submit scripts
• Vary over parameters like executor-mem
• Run against different clusters or builds of Spark
• Can run serially or in parallel
#EUeco8 #sparkbench
One spark-submit
with one workload
suite with three
workloads running
serially
spark-bench = {
spark-submit-config = [{
spark-args = {
master = "yarn"
}
workload-suites = [{
descr = "One"
benchmark-output = "console"
workloads = [
// . . .
]
}]
}]
}
(config file is abbreviated for space)
#EUeco8 #sparkbench
Two spark-submits running in
parallel, each with one
workload suite, each with three
workloads running serially
#EUeco8 #sparkbench
Two spark-submits running serially,
each with one workload suite,
one with three workloads running serially,
one with three workloads running in parallel
#EUeco8 #sparkbench
One spark-submit with two workload
suites running serially.
The first workload suite generates
data in parallel.
The second workload suite runs
different workloads serially.
The second workload suite repeats.
#EUeco8 #sparkbench
spark-bench = {
spark-submit-parallel = false
spark-submit-config = [{
spark-args = { // . . . }
suites-parallel = false
workload-suites = [{
descr = "Data Generation"
parallel = true
repeat = 1 // generate once and done!
workloads = { // . . . }
},
workload-suites = [{
descr = "Actual workloads that we want to benchmark"
parallel = false
repeat = 100 // repeat for statistical validity!
workloads = { // . . . }
}]
}]
}
#EUeco8 #sparkbench
#EUeco8 #sparkbench
Detailed Example: Parquet vs. CSV
#EUeco8 #sparkbench
Example: Parquet vs. CSV
Goal: benchmark two different SQL queries over the
same dataset stored in two different formats
1. Generate a big dataset in CSV format
2. Convert that dataset to Parquet
3. Run first query over both sets
4. Run second query over both sets
5. Make beautiful graphs
#EUeco8 #sparkbench
spark-bench = {
spark-submit-config = [{
spark-home = "/usr/iop/current/spark2-client/"
spark-args = {
master = "yarn"
executor-memory = "28G"
num-executors = 144
}
conf = {
"spark.dynamicAllocation.enabled" = "false"
"spark.dynamicAllocation.monitor.enabled" = "false"
"spark.shuffle.service.enabled" = "true"
}
suites-parallel = false
// …CONTINUED…
#EUeco8 #sparkbench
workload-suites = [
{
descr = "Generate a dataset, then transform it to Parquet format"
benchmark-output = "hdfs:///tmp/emily/results-data-gen.csv"
parallel = false
workloads = [
{
name = "data-generation-kmeans"
rows = 10000000
cols = 24
output = "hdfs:///tmp/emily/kmeans-data.csv"
},
{
name = "sql"
query = "select * from input"
input = "hdfs:///tmp/emily/kmeans-data.csv"
output = "hdfs:///tmp/emily/kmeans-data.parquet"
}
]
},
// …CONTINUED…
#EUeco8 #sparkbench
{
descr = "Run two different SQL over two different formats"
benchmark-output = "hdfs:///tmp/emily/results-sql.csv"
parallel = false
repeat = 10
workloads = [
{
name = "sql"
input = [
"hdfs:///tmp/emily/kmeans-data.csv",
"hdfs:///tmp/emily/kmeans-data.parquet"
]
query = [
"select * from input",
"select `0`, `22` from input where `0` < -0.9"
]
cache = false
}
]
}
]
}]
} You made it through the plaintext! Here’s a puppy!#EUeco8 #sparkbench
#EUeco8 #sparkbench
Parquet vs CSV Resultsname timestamp loadTime queryTime total_Runtime run cache queryStr input spark.executor.instances spark.submit.deployMode spark.master spa
sql 1.50714E+12 15895947762 8722036 15904669798 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 15232039388 41572417 15273611805 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 139462342 5652725 145115067 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 100552926 9120325 109673251 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 15221909508 6183625 15228093133 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 15634952815 6820382 15641773197 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 93264481 5243927 98508408 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 84948895 5573010 90521905 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 15270158275 8408445 15278566720 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 14924461363 7999643 14932461006 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 86821299 13952212 100773511 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 100789178 6525153 107314331 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 15115682178 7085577 15122767755 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 14829435321 6452434 14835887755 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 78654779 5291044 83945823 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 86702691 5641506 92344197 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 14881975292 7346304 14889321596 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 14953655628 6749366 14960404994 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 102374878 4925237 107300115 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 79458610 4697124 84155734 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 14908128878 7429576 14915558454 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 14922630998 4596120 14927227118 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 79144522 4191275 83335797 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 86207011 4111788 90318799 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 14909544529 5722627 14915267156 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 14919548742 5643308 14925192050 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G
sql 1.50714E+12 85241859 4417683 89659542 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
sql 1.50714E+12 93379964 4610765 97990729 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
#EUeco8 #sparkbench
Examples: Cluster Hammer
#EUeco8 #sparkbench
Cluster Hammer
#EUeco8 #sparkbench
Cluster Hammer
#EUeco8 #sparkbench
Cluster Hammer
#EUeco8 #sparkbench
Examples: Multi-User Notebooks
#EUeco8 #sparkbench
Multi-User Notebook
Data Generation,
sample.csv and
GIANT-SET.csv
Workloads,
working with the
sample.csv
Same workloads,
working with
GIANT-SET.csv
{
name = "sleep”
distribution = "poisson”
mean = 30000
}
Sleep
Workload
Examples of Easy Extensions
• Add more "users" (workload suites)
• Have a workload suite that simulates heavy batch
jobs while notebook users attempt to use the
same cluster
• Toggle settings for dynamic allocation, etc.
• Benchmark a notebook user on a heavily
stressed cluster (cluster hammer!)
Examples: Varying Spark Parameters
#EUeco8 #sparkbench
Different Versions of Spark
spark-bench = {
spark-submit-parallel = false
spark-submit-config = [{
spark-home = [
"/opt/official-spark-build/",
"/opt/my-branch-of-spark-with-new-changes/"
]
spark-args = {
master = "yarn"
}
workload-suites = [{// workload suites}]
}]
}
#EUeco8 #sparkbench
This will produce 2 spark-submits
Varying More Than One Param
spark-bench = {
spark-submit-parallel = false
spark-submit-config = [{
spark-home = [
"/opt/official-spark-build/",
"/opt/my-fork-of-spark-with-new-changes/"
]
spark-args = {
master = "yarn"
executor-mem = ["2G", "4G", "8G", "16G"]
}
workload-suites = [{// workload suites}]
}]
}
#EUeco8 #sparkbench
This will produce 2 X 4 = 8 spark-submits
Final Thoughts
#EUeco8 #sparkbench
Installation
1. Grab the latest release from the releases page on Github
2. Unpack the tar
3. Set the environment variable for $SPARK_HOME.
4. Run the examples!
./bin/spark-bench.sh examples/minimal-example.conf
#EUeco8 #sparkbench
Future Work & Contributions
• More workloads and data generators
• TPC-DS (on top of existing SQL query benchmark)
• Streaming workloads
• Python
Contributions very welcome from all levels!
Check out our ZenHub Board for details
#EUeco8 #sparkbench
Shout-Out to My Team
Matthew Schauer
@showermat
Craig Ingram
@cin
Brad Kaiser
@brad-kaiser
*not an actual photo of my team
#EUeco8 #sparkbench
Summary
Spark-Bench is a flexible, configurable framework for
Testing, Benchmarking, Simulating, Comparing,
and more!
#EUeco8 #sparkbench
https://sparktc.github.io/spark-bench/
SparkTC/spark-bench
Emily May Curtin
IBM Watson Data Platform
Atlanta, GA
@emilymaycurtin
@ecurtin
#EUeco8
#sparkbench
https://github.com/SparkTC/spark-bench
https://sparktc.github.io/spark-bench/
Project
Documentation

More Related Content

What's hot

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
Holden Karau
 
Learn Big Data & Hadoop
Learn Big Data & Hadoop Learn Big Data & Hadoop
Learn Big Data & Hadoop
Edureka!
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Databricks
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Get Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAMGet Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAM
Jonathan Katz
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
[자바카페] Elasticsearch Aggregation (2018)
[자바카페] Elasticsearch Aggregation (2018)[자바카페] Elasticsearch Aggregation (2018)
[자바카페] Elasticsearch Aggregation (2018)
용호 최
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
Databricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxData
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayDataWorks Summit
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Summit
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
Brendan Gregg
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
NAVER Engineering
 

What's hot (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Spark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New YorkSpark Autotuning Talk - Strata New York
Spark Autotuning Talk - Strata New York
 
Learn Big Data & Hadoop
Learn Big Data & Hadoop Learn Big Data & Hadoop
Learn Big Data & Hadoop
 
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote...
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Get Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAMGet Your Insecure PostgreSQL Passwords to SCRAM
Get Your Insecure PostgreSQL Passwords to SCRAM
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
[자바카페] Elasticsearch Aggregation (2018)
[자바카페] Elasticsearch Aggregation (2018)[자바카페] Elasticsearch Aggregation (2018)
[자바카페] Elasticsearch Aggregation (2018)
 
Understanding and Improving Code Generation
Understanding and Improving Code GenerationUnderstanding and Improving Code Generation
Understanding and Improving Code Generation
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
InfluxDB IOx Tech Talks: Replication, Durability and Subscriptions in InfluxD...
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Linux Profiling at Netflix
Linux Profiling at NetflixLinux Profiling at Netflix
Linux Profiling at Netflix
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
[네이버오픈소스세미나] Contribution, 전쟁의 서막 : Apache OpenWhisk 성능 개선 - 김동경
 

Similar to Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark with Emily Curtain

CODAIT/Spark-Bench
CODAIT/Spark-BenchCODAIT/Spark-Bench
CODAIT/Spark-Bench
Emily Curtin
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Holden Karau
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow software
Anubhav Jain
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkAnu Shetty
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
Databricks
 
Road to sbt 1.0 paved with server
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with server
Eugene Yokota
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Databricks
 
Functional Reactive Programming / Compositional Event Systems
Functional Reactive Programming / Compositional Event SystemsFunctional Reactive Programming / Compositional Event Systems
Functional Reactive Programming / Compositional Event SystemsLeonardo Borges
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
Jen Aman
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Tim Callaghan
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
ScyllaDB
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
Databricks
 
TestUpload
TestUploadTestUpload
TestUpload
ZarksaDS
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Iulian Pintoiu
 

Similar to Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark with Emily Curtain (20)

CODAIT/Spark-Bench
CODAIT/Spark-BenchCODAIT/Spark-Bench
CODAIT/Spark-Bench
 
Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014Spark with Elasticsearch - umd version 2014
Spark with Elasticsearch - umd version 2014
 
FireWorks workflow software
FireWorks workflow softwareFireWorks workflow software
FireWorks workflow software
 
Spark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing sparkSpark summit2014 techtalk - testing spark
Spark summit2014 techtalk - testing spark
 
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
Road to sbt 1.0 paved with server
Road to sbt 1.0   paved with serverRoad to sbt 1.0   paved with server
Road to sbt 1.0 paved with server
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Functional Reactive Programming / Compositional Event Systems
Functional Reactive Programming / Compositional Event SystemsFunctional Reactive Programming / Compositional Event Systems
Functional Reactive Programming / Compositional Event Systems
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Performance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons LearnedPerformance Benchmarking: Tips, Tricks, and Lessons Learned
Performance Benchmarking: Tips, Tricks, and Lessons Learned
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Testing Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with SherlockTesting Persistent Storage Performance in Kubernetes with Sherlock
Testing Persistent Storage Performance in Kubernetes with Sherlock
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
TestUpload
TestUploadTestUpload
TestUpload
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 

Recently uploaded

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 

Recently uploaded (20)

1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 

Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark with Emily Curtain

  • 1. Simulate, Test, Compare, Exercise, and Yes, Even Benchmark! Spark-Bench Emily May Curtin IBM Watson Data Platform Atlanta, GA @emilymaycurtin @ecurtin "S" + = #EUeco8 #sparkbench
  • 2. Who Am I • Artist, former documentary film editor • 2017 Spark Summit East – Spark + Parquet • Tabs, not spaces (at least for Scala)(where 1 tab = 2 spaces) • Software Engineer, IBM Watson Data Platform • Current lead developer of SparkTC/spark-bench #EUeco8 #sparkbench
  • 4. Spark-Bench • History & Rewrite Motivation • How Spark-Bench Works • Detailed Example: Parquet vs. CSV Benchmarking • More Examples – Cluster Hammer – Multiple Notebook Users – Varying over Spark parameters #EUeco8 #sparkbench
  • 9. Spark Benchmarking circa 2015 A.D. #EUeco8 #sparkbench
  • 10. Spark-Bench 2015 "SparkBench: A Comprehensive Benchmarking Suite For In Memory Data Analytics Platform Spark" Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina Salapura IBM TJ Watson Research Center - APC 2015 #EUeco8 #sparkbench
  • 11. Spark-Bench 2015 "SparkBench: A Comprehensive Benchmarking Suite For In Memory Data Analytics Platform Spark" Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina Salapura IBM TJ Watson Research Center - APC 2015 #EUeco8 #sparkbench
  • 16. #EUeco8 #sparkbench Data Science Experience IBM Analytics Engine (Beta) AND MUCH, MUCH MORE Watson Machine Learning Watson Data Platform
  • 17. As a researcher, I want to run a set of standard benchmarks with generated data against one instance of Spark, one time. --The user story that SparkTC/spark-bench version 1.0 fulfilled #EUeco8 #sparkbench
  • 18. Good start, let’s keep going! #EUeco8 #sparkbench
  • 20. As a Spark developer, I want to run regression tests with my changes in order to provide some authoritative performance stats with my PR. #EUeco8 #sparkbench
  • 21. As a Spark developer, I want to see regression tests run against Spark nightly builds. #EUeco8 #sparkbench
  • 22. As a Parquet developer, I want to compare performance between Spark built with several different versions of Parquet. #EUeco8 #sparkbench
  • 23. As a Spark developer, I want to simulate multiple notebook users hitting the same cluster so that I can test my changes to the dynamic allocation algorithm. Supporting Highly Multitenant Spark Notebook Workloads Craig Ingram, Brad Kaiser #EUeco8 #sparkbench
  • 24. As an enterprise user of Spark, I want to make my daily batch ingest job run faster by tuning parameters. #EUeco8 #sparkbench
  • 25. As a DevOps professional, I want to see how performance changes for our daily jobs as a I add more nodes to our prod cluster. #EUeco8 #sparkbench
  • 26. As a data scientist, I want to run several machine learning algorithms over the same set of data in order to compare results. #EUeco8 #sparkbench
  • 27. As anybody who has ever had to make a graph about Spark, I want to get numbers easily and quickly so I can focus on making beautiful graphs. #EUeco8 #sparkbench
  • 28. How It Works #EUeco8 #sparkbench
  • 29. Structural Details Old Version ✨ New Version✨ Project Design Series of shell scripts calling individually built jars Scala project built using SBT Build/Release Manually/Manually SBT, TravisCI, auto-release to Github Releases on PR merge Configuration Scattered in shell script variables Centralized in one config file with many new capabilities Parallelism None Three different levels! Custom Workloads Requires writing a new subproject with all accompanying bash Implement our abstract class, Bring Your Own Jar #EUeco8 #sparkbench
  • 30. The Config File (a first glance) spark-bench = { spark-submit-config = [{ workload-suites = [ { descr = "One run of SparkPi and that's it!" benchmark-output = "console" workloads = [ { name = "sparkpi" slices = 10 } ] } ] }] } #EUeco8 #sparkbench
  • 31. Workloads • Atomic unit of spark-bench • Optionally read from disk • Optionally write results to disk • All stages of the workload are timed • Infinitely composable #EUeco8 #sparkbench
  • 32. Workloads • Machine Learning – Kmeans – Logistic Regression – Other workloads ported from old version • SQL queries • Oddballs – SparkPi – Sleep – Cache Test #EUeco8 #sparkbench
  • 33. Workloads: Data Generators • Graph data generator • Kmeans data generator • Linear Regression data generator • More coming! #EUeco8 #sparkbench
  • 34. What About Other Workloads? 1. Active development, more coming! 2. Contributions welcome J 3. Bring Your Own Workload! { name = "custom" class = "com.seriouscompany.HelloString" str = ["Hello", "Hi"] } #EUeco8 #sparkbench
  • 35. Custom Workloads #EUeco8 #sparkbench { name = "custom" class = "com.seriouscompany.OurGiganticIngestDailyIngestJob" date = "Tues Oct 26 18:58:25 EDT 2017" source = "s3://buncha-junk/2017-10-26/" } { name = "custom” class = "com.example.WordGenerator” output = "console” rows = 10 cols = 3 word = "Cool stuff!!" } ß See our documentation on custom workloads!
  • 37. Workload Could be Kmeans, SparkPi, SQL, Sleep, etc. { name = "sparkpi" slices = 500000000 } #EUeco8 #sparkbench
  • 39. Workload Could be Kmeans, SparkPi, SQL, Sleep, etc. Data Generation Workload{ name = "sparkpi" slices = 500000000 } { name = "data-generation-kmeans" rows = 1000 cols = 24 output = "/tmp/kmeans-data.parquet" k = 45 } #EUeco8 #sparkbench
  • 40. Data Generation, Small Set Data Generation, Large Set { name = "data-generation-kmeans" rows = 100000000 cols = 24 output = "/tmp/large-data.parquet" } { name = "data-generation-kmeans" rows = 100 cols = 24 output = "/tmp/small-data.parquet" } #EUeco8 #sparkbench
  • 43. Workload Suites • Logical group of workloads • Run workloads serially or in parallel • Control the repetition of workloads • Control the output of benchmark results • Can be composed with other workload suites to run serially or in parallel #EUeco8 #sparkbench
  • 44. One workload suite with three workloads running serially workload-suites = [ { descr = "Lol" parallel = false repeat = 1 benchmark-output = "console" workloads = [ { // Three workloads here } ] }] #EUeco8 #sparkbench
  • 45. One workload suite with three workloads running in parallel workload-suites = [ { descr = "Lol" parallel = true repeat = 1 benchmark-output = "console" workloads = [ { // Three workloads here } ] }] #EUeco8 #sparkbench
  • 46. Three workload suites running serially. Two have workloads running serially, one has workloads running in parallel. suites-parallel = false workload-suites = [ { descr = "One" parallel = false }, { descr = "Two" parallel = false }, { descr = "Three" parallel = true }] (config file is abbreviated for space) #EUeco8 #sparkbench
  • 47. Three workload suites running in parallel. Two have workloads running serially, one has workloads running in parallel. suites-parallel = true workload-suites = [ { descr = "One" parallel = false }, { descr = "Two" parallel = false }, { descr = "Three" parallel = true }] (config file is abbreviated for space) #EUeco8 #sparkbench
  • 48. ./bin/spark-bench.sh my-amazing-config-file.conf spark-bench-launch spark-submit –jar spark-bench.jar spark-submit –jar spark-bench.jar spark-submit –jar spark-bench.jar spark-submit –jar spark-bench.jar #EUeco8 #sparkbench
  • 49. Spark-Submit-Config • Control any parameter present in a spark-submit script • Produce and launch multiple spark-submit scripts • Vary over parameters like executor-mem • Run against different clusters or builds of Spark • Can run serially or in parallel #EUeco8 #sparkbench
  • 50. One spark-submit with one workload suite with three workloads running serially spark-bench = { spark-submit-config = [{ spark-args = { master = "yarn" } workload-suites = [{ descr = "One" benchmark-output = "console" workloads = [ // . . . ] }] }] } (config file is abbreviated for space) #EUeco8 #sparkbench
  • 51. Two spark-submits running in parallel, each with one workload suite, each with three workloads running serially #EUeco8 #sparkbench
  • 52. Two spark-submits running serially, each with one workload suite, one with three workloads running serially, one with three workloads running in parallel #EUeco8 #sparkbench
  • 53. One spark-submit with two workload suites running serially. The first workload suite generates data in parallel. The second workload suite runs different workloads serially. The second workload suite repeats. #EUeco8 #sparkbench
  • 54. spark-bench = { spark-submit-parallel = false spark-submit-config = [{ spark-args = { // . . . } suites-parallel = false workload-suites = [{ descr = "Data Generation" parallel = true repeat = 1 // generate once and done! workloads = { // . . . } }, workload-suites = [{ descr = "Actual workloads that we want to benchmark" parallel = false repeat = 100 // repeat for statistical validity! workloads = { // . . . } }] }] } #EUeco8 #sparkbench
  • 56. Detailed Example: Parquet vs. CSV #EUeco8 #sparkbench
  • 57. Example: Parquet vs. CSV Goal: benchmark two different SQL queries over the same dataset stored in two different formats 1. Generate a big dataset in CSV format 2. Convert that dataset to Parquet 3. Run first query over both sets 4. Run second query over both sets 5. Make beautiful graphs #EUeco8 #sparkbench
  • 58. spark-bench = { spark-submit-config = [{ spark-home = "/usr/iop/current/spark2-client/" spark-args = { master = "yarn" executor-memory = "28G" num-executors = 144 } conf = { "spark.dynamicAllocation.enabled" = "false" "spark.dynamicAllocation.monitor.enabled" = "false" "spark.shuffle.service.enabled" = "true" } suites-parallel = false // …CONTINUED… #EUeco8 #sparkbench
  • 59. workload-suites = [ { descr = "Generate a dataset, then transform it to Parquet format" benchmark-output = "hdfs:///tmp/emily/results-data-gen.csv" parallel = false workloads = [ { name = "data-generation-kmeans" rows = 10000000 cols = 24 output = "hdfs:///tmp/emily/kmeans-data.csv" }, { name = "sql" query = "select * from input" input = "hdfs:///tmp/emily/kmeans-data.csv" output = "hdfs:///tmp/emily/kmeans-data.parquet" } ] }, // …CONTINUED… #EUeco8 #sparkbench
  • 60. { descr = "Run two different SQL over two different formats" benchmark-output = "hdfs:///tmp/emily/results-sql.csv" parallel = false repeat = 10 workloads = [ { name = "sql" input = [ "hdfs:///tmp/emily/kmeans-data.csv", "hdfs:///tmp/emily/kmeans-data.parquet" ] query = [ "select * from input", "select `0`, `22` from input where `0` < -0.9" ] cache = false } ] } ] }] } You made it through the plaintext! Here’s a puppy!#EUeco8 #sparkbench
  • 61. #EUeco8 #sparkbench Parquet vs CSV Resultsname timestamp loadTime queryTime total_Runtime run cache queryStr input spark.executor.instances spark.submit.deployMode spark.master spa sql 1.50714E+12 15895947762 8722036 15904669798 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 15232039388 41572417 15273611805 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 139462342 5652725 145115067 0 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 100552926 9120325 109673251 0 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 15221909508 6183625 15228093133 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 15634952815 6820382 15641773197 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 93264481 5243927 98508408 1 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 84948895 5573010 90521905 1 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 15270158275 8408445 15278566720 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 14924461363 7999643 14932461006 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 86821299 13952212 100773511 2 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 100789178 6525153 107314331 2 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 15115682178 7085577 15122767755 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 14829435321 6452434 14835887755 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 78654779 5291044 83945823 3 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 86702691 5641506 92344197 3 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 14881975292 7346304 14889321596 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 14953655628 6749366 14960404994 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 102374878 4925237 107300115 4 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 79458610 4697124 84155734 4 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 14908128878 7429576 14915558454 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 14922630998 4596120 14927227118 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 79144522 4191275 83335797 5 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 86207011 4111788 90318799 5 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 14909544529 5722627 14915267156 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 14919548742 5643308 14925192050 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.csv 144client yarn 28G sql 1.50714E+12 85241859 4417683 89659542 6 FALSE select * from input hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G sql 1.50714E+12 93379964 4610765 97990729 6 FALSE select `0`, `22` from input where `0` < -0.9 hdfs:///tmp/emily/kmeans-data.parquet 144client yarn 28G
  • 68. Multi-User Notebook Data Generation, sample.csv and GIANT-SET.csv Workloads, working with the sample.csv Same workloads, working with GIANT-SET.csv { name = "sleep” distribution = "poisson” mean = 30000 } Sleep Workload
  • 69. Examples of Easy Extensions • Add more "users" (workload suites) • Have a workload suite that simulates heavy batch jobs while notebook users attempt to use the same cluster • Toggle settings for dynamic allocation, etc. • Benchmark a notebook user on a heavily stressed cluster (cluster hammer!)
  • 70. Examples: Varying Spark Parameters #EUeco8 #sparkbench
  • 71. Different Versions of Spark spark-bench = { spark-submit-parallel = false spark-submit-config = [{ spark-home = [ "/opt/official-spark-build/", "/opt/my-branch-of-spark-with-new-changes/" ] spark-args = { master = "yarn" } workload-suites = [{// workload suites}] }] } #EUeco8 #sparkbench This will produce 2 spark-submits
  • 72. Varying More Than One Param spark-bench = { spark-submit-parallel = false spark-submit-config = [{ spark-home = [ "/opt/official-spark-build/", "/opt/my-fork-of-spark-with-new-changes/" ] spark-args = { master = "yarn" executor-mem = ["2G", "4G", "8G", "16G"] } workload-suites = [{// workload suites}] }] } #EUeco8 #sparkbench This will produce 2 X 4 = 8 spark-submits
  • 73.
  • 75. Installation 1. Grab the latest release from the releases page on Github 2. Unpack the tar 3. Set the environment variable for $SPARK_HOME. 4. Run the examples! ./bin/spark-bench.sh examples/minimal-example.conf #EUeco8 #sparkbench
  • 76. Future Work & Contributions • More workloads and data generators • TPC-DS (on top of existing SQL query benchmark) • Streaming workloads • Python Contributions very welcome from all levels! Check out our ZenHub Board for details #EUeco8 #sparkbench
  • 77. Shout-Out to My Team Matthew Schauer @showermat Craig Ingram @cin Brad Kaiser @brad-kaiser *not an actual photo of my team #EUeco8 #sparkbench
  • 78. Summary Spark-Bench is a flexible, configurable framework for Testing, Benchmarking, Simulating, Comparing, and more! #EUeco8 #sparkbench https://sparktc.github.io/spark-bench/
  • 79.
  • 80. SparkTC/spark-bench Emily May Curtin IBM Watson Data Platform Atlanta, GA @emilymaycurtin @ecurtin #EUeco8 #sparkbench https://github.com/SparkTC/spark-bench https://sparktc.github.io/spark-bench/ Project Documentation